One of the most popular datasets in machine learning is the Boston Housing dataset (source: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/). It appears in practically every machine learning lecture or book. So why not take a look at how we can predict housing prices within Oracle Data Visualization (DV)?
As you probably know, Oracle DV is not exactly the tool of choice for data scientists, who would prefer working, for example, with Oracle Data Science cloud or any other coding environment for Python or R. Oracle DV is meant to be used primarily by business users who are generally not data scientists, but who would still like to include some machine learning capabilities in their analyses and data visualisations. Oracle DV supports this requirement and provides an easy-to-use and relatively capable environment for business users to perform machine learning tasks.
So let's take a look at how they could approach and execute this, at first glance, not-so-simple task.
Exploratory Data Analysis
Let's start with some exploratory data analysis. Once your initial dataset is uploaded to Oracle Analytics, it is stored in Data Set Storage.
The data set contains 14 features:
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PT: pupil-teacher ratio by town
- B: 1000 (Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MV: Median value of owner-occupied homes in $1000's
The target feature (what we plan to predict) is MV, the median home value in $1000's.
To get some more insight, let's check the correlations between individual features. We can see that there are no particularly strong correlations, which would be indicated by dark green or dark red colours in the correlation matrix below. This suggests (without any further exploration) that we don't need to remove any features from the data set.
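For readers who want to replicate this correlation check outside Oracle DV, the same matrix can be computed in a few lines of Python. This is only a sketch: the DataFrame below is filled with randomly generated stand-in data (with some structure built in so the output is non-trivial), not the actual Boston Housing CSV.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, 100),      # avg rooms per dwelling
    "LSTAT": rng.uniform(2, 35, 100),     # % lower status of the population
    "TAX": rng.uniform(180, 720, 100),    # property-tax rate
})
# MV loosely derived from RM and LSTAT so the example has some structure
df["MV"] = 5 * df["RM"] - 0.4 * df["LSTAT"] + rng.normal(0, 2, 100)

corr = df.corr()  # pairwise Pearson correlations, the matrix Oracle DV visualises
# Features most correlated with the target, strongest first
print(corr["MV"].drop("MV").abs().sort_values(ascending=False))
```

In a real run you would load the housing CSV into `df` instead; the rest of the logic is unchanged.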
Let's take a closer look at how MV is related to the rest of the attributes. We can use scatter graphs for this, and additionally add a trendline (an out-of-the-box, one-click feature of Oracle DV).
As you can see, these graphs confirm the findings from the correlation matrix. We could continue this exploratory exercise in more detail; one thing to try is the Explore feature, which surfaces additional statistics and patterns in the data set. Unfortunately, for this particular dataset it doesn't yield many insights, so I am skipping this quite cool feature of Oracle DV.
Data Preparation using Data Flows
After we have familiarised ourselves with the data set, we can now focus on data preparation. Oracle DV includes Data Flow feature which can be used for this task.
In the first preparation step, we will create two separate data sets: Train and Test. We will use Train to build our machine learning model, while Test will serve as an "unseen" data set, which we will use to compare predicted and actual values.
We can create a new data flow in which we split the initial data set in two. Data Flows is a functionality that can be seen as a "light-weight" ETL tool, and it supports most of the features required to produce the two separate data sets.
For this, we add a new column called SAMPLING. Based on the formula below, every sixth row is allocated to the Test data set.
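The exact Data Flow expression isn't reproduced here, but the underlying split logic is simple modulo arithmetic. A sketch in Python, using a frame that merely stands in for the real data set (the column names are illustrative, not the DV syntax):

```python
import pandas as pd

# Stand-in frame with the same row count as the Boston Housing data (506 rows)
df = pd.DataFrame({"ROW_NUM": range(1, 507)})

# SAMPLING column: every sixth row goes to Test, the rest to Train
df["SAMPLING"] = df["ROW_NUM"].apply(lambda r: "Test" if r % 6 == 0 else "Train")

train = df[df["SAMPLING"] == "Train"]
test = df[df["SAMPLING"] == "Test"]
print(len(train), len(test))  # 422 84
```

This yields roughly a 5:1 split, which is a common ratio for holding out evaluation data.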
Both final data sets are stored in Data Set Storage or in a database; in our example, we are using Oracle Autonomous Data Warehouse.
The data flow generates two data sets. Here are some details of the Train data set.
Building a new Machine Learning model
The second step is the creation of the machine learning model. Housing price prediction is an example of supervised learning on a continuous variable, hence it is a regression problem. In Data Flows, we can use the Train Numeric Prediction step to train the model.
Regression in Oracle DV is supported by four algorithms:
In our example, we decided to use Random Forest, but for the sake of the exercise, you can choose any of the algorithms.
We can see that the parameters we need to set before running the Data Flow are in fact very limited, and in most cases leaving them unchanged can still produce a decent model. For business users, this is one of the points where some basic machine learning knowledge comes in handy.
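Oracle DV hides the implementation details, but conceptually the Train Numeric Prediction step with Random Forest does something comparable to the scikit-learn sketch below. This is an assumption for illustration only: Oracle's internal implementation and defaults may well differ, and the data here is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in for the housing features and target
X = rng.uniform(0, 1, size=(500, 4))
y = 20 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 1, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A couple of knobs comparable to the few parameters Oracle DV exposes;
# the defaults are often a reasonable starting point
model = RandomForestRegressor(n_estimators=100, max_depth=None, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```

The point mirrored here is the one made above: with sensible defaults, very little tuning is needed to get a usable baseline model.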
Finally, we should save our machine learning model.
When it is saved, we can run our Data Flow to generate the new machine learning model. Once it finishes, we can check how well the model performs.
Machine Learning Model Evaluation
The Quality section in the Inspect window provides some key metrics, such as RMSE, R-squared, and others.
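Both metrics are easy to compute by hand if you want to sanity-check what Oracle DV reports. A minimal sketch with illustrative numbers (not values taken from the actual model):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: average prediction error in target units."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def r_squared(actual, predicted):
    """Share of the target's variance explained by the model (1.0 = perfect)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

actual = [24.0, 21.6, 34.7, 33.4]     # MV in $1000's (illustrative values)
predicted = [25.0, 20.0, 33.0, 35.0]
print(rmse(actual, predicted), r_squared(actual, predicted))
```

Lower RMSE and higher R-squared both indicate a better fit, which is how we compare candidate models below.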
You can now repeat these last few steps using a different algorithm or different parameter settings and compare the resulting models.
For example, you can compare the Random Forest regression results with those of the Linear Regression model created earlier. This can be done directly in Oracle DV.
We can see that Random Forest is slightly better.
Machine Learning Model Deployment
Now the model is trained. Let's assume our Random Forest-based model is the best, so we will use it for prediction. There are two options for deploying a model.
Option 1: Creating a new predicted data set
The first deployment option is to create and run another Data Flow which would result in a new PREDICTED data set.
This time we will use Apply Model step in Data Flow.
While building the Data Flow, you can already observe what the result will look like. There are three new columns created:
- Predicted Value,
- Prediction Confidence Percentage and
- Prediction Group or Segment, which explains the decision rule applied for the Predicted Value.
After the Data Flow is executed, the new Predicted data set is generated.
Productisation
Before we go and use Predicted data set in our analyses, let's check one more thing.
We can assume that these three Data Flows will need to run on a daily basis. To support this and make sure it is executed every day:
- Oracle DV gives us the option to set up a new Sequence, which runs all three Data Flows one after another.
- We can use Scheduler to schedule a job to execute this Sequence daily at specific time.
Option 2: Using Scenarios to deploy Machine Learning Model on the fly
The second option is to apply the machine learning model directly, on the fly, in the analysis itself. You can start your analysis with the Test (so far unseen) data set and then add the machine learning model to the analysis using a Scenario.
The added model then behaves the same as a stored data set.
For example, to calculate the variance between the MV value (from the Test data set) and the MV Prediction value from the machine learning model, simply create a new Calculation. When you add MV Prediction to the formula, you can observe the "raw" function call of the prediction model in R or Python (see EVALUATE_SCRIPT).
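The logic of such a Calculation is just a relative difference between actual and predicted values. In Python terms it would look roughly like this (the function name is hypothetical, and the actual DV expression syntax differs):

```python
def prediction_variance_pct(actual_mv, predicted_mv):
    """Relative difference between actual and predicted MV, in percent.

    Mirrors the kind of Calculation you would define in Oracle DV;
    positive means the model over-predicted, negative means under-predicted.
    """
    return 100.0 * (predicted_mv - actual_mv) / actual_mv

# e.g. an actual MV of 24.0 ($24,000) vs a predicted 25.2
print(round(prediction_variance_pct(24.0, 25.2), 1))  # 5.0
```

Plotting this variance across the Test rows is a quick visual check of where the model over- or under-shoots.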
Conclusion
With this we conclude our exercise of predicting housing prices. As this is a standard machine learning problem, we can see that it is possible to create and run such predictions using Oracle DV. The predictions won't necessarily be as good as those produced on dedicated data science platforms, but then again, Oracle DV is not meant for "hard core" data scientists; it is aimed at business users.