This is the third part of my blog post series on Telco Churn Prediction, in which I revisit, more than two years later, my original blog post Getting Started with Machine Learning in Oracle Data Visualization.
In the two previous posts in this series, I talked about Data Analysis and Data Preparation. Both steps are mandatory before any machine learning is applied.
My plan in this post is to demonstrate how to create a machine learning model, how to improve it by setting parameters or simply by replacing the algorithm, and finally, how to create a project in Oracle Analytics that compares all the created models so that I can decide which one is best for my prediction.
Creating a new Machine Learning model
A new machine learning model can be created by using (again) the Data Flows functionality in Oracle Analytics. This hasn't changed much since my initial blog post from two years ago. Basically, this is a 3-step process in which you need to:
- select your training dataset,
- choose a machine learning algorithm and set its parameters, and
- store the results in the form of a machine learning model and statistics datasets.
We will look at a neural network, but in general the process is the same for any algorithm you might choose.
So, let's start by creating a new Data Flow. The starting point is the training dataset which we prepared in the previous steps.
Because all required data preparation steps have already been done, no additional transformations on the dataset are required. Therefore, we can simply add a new Train Binary Classifier step to our data flow.
Next, we need to choose an algorithm from the list of available Binary Classifiers:
As mentioned, we will showcase model training with the Neural Network for Classification algorithm; however, any selection would do the job.
Two steps are automatically added:
- Train Binary Classifier and
- Save Model.
The last step is straightforward, as we only need to provide the model name. Train Binary Classifier, however, requires us to set some parameters that are used by the algorithm.
A data scientist might take issue with the list of parameters you can set, but let's keep in mind that we have a "business user" hat on, so we cannot expect really deep knowledge of neural networks from this type of user. However, by setting the parameters you might be able to create better prediction models and improve some of the key statistics that matter when evaluating the model.
The parameters that you can set for Neural Network are the following:
When defining any of the models, you need to specify your target column. In our case, this is Churn.
You also have to specify which value of the target is the positive class. It is Yes in our case.
Then you must decide which imputation strategy to use for Null or NA values, for both Categorical and Numerical features.
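To make the imputation setting more concrete, here is a minimal Python sketch of what such a strategy does. It is an illustration only, not the Oracle Analytics implementation: the function names, the sample values and the choice of "mean" and "most frequent value" as strategies are assumptions made for this example.

```python
from collections import Counter
from statistics import mean

def impute_numeric(values):
    """Fill missing (None) numeric values with the mean of the known ones."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def impute_categorical(values):
    """Fill missing (None) categories with the most frequent known category."""
    known = [v for v in values if v is not None]
    fill = Counter(known).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

# Hypothetical sample values, not taken from the churn dataset.
print(impute_numeric([20.0, None, 40.0]))              # [20.0, 30.0, 40.0]
print(impute_categorical(["Yes", None, "Yes", "No"]))  # ['Yes', 'Yes', 'Yes', 'No']
```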
Categorical features must be transformed into numerical values where needed. There are two options for doing that: the Indexer method, which assigns a number to each categorical value (i.e. 1, 2, 3, 4, ...), or the Onehot method, which we described in the previous blog post.
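The difference between the two encodings is easiest to see on a small example. The sketch below is a plain Python illustration and not the code that Oracle Analytics runs; the function names are made up, and the "OneYear" category is an assumption added only to have a third value.

```python
def index_encode(values):
    """Indexer-style encoding: map each distinct category to an integer."""
    categories = sorted(set(values))
    mapping = {cat: i + 1 for i, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values, prefix):
    """Onehot-style encoding: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat}": int(v == cat) for cat in categories}
        for v in values
    ]

# Hypothetical sample of the ContractType column.
contract = ["Month-to-month", "TwoYears", "OneYear", "Month-to-month"]

codes, mapping = index_encode(contract)
print(codes)  # [1, 3, 2, 1]

rows = one_hot_encode(contract, "ContractType")
print(rows[0])
```

The Indexer output implies an order between categories (1 < 2 < 3) that may not exist in the data, which is one reason why one-hot columns such as ContractType_Month-to-month are often preferred for algorithms that work with numeric distances or weights.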
You also have to define the maximum percentage of Null values at which it still makes sense to build a model.
And you need to specify what portion of the dataset will be used for training. In most cases this is 60% to 80%.
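These last two settings can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper names, the fixed random seed and the 80% default are hypothetical and say nothing about how Oracle Analytics performs the split internally.

```python
import random

def null_fraction(column):
    """Share of missing (None) values in a column, between 0.0 and 1.0."""
    return sum(v is None for v in column) / len(column)

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows and cut them into a training and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: 10 row identifiers and a column with two missing values.
train, test = train_test_split(list(range(10)), train_fraction=0.8)
print(len(train), len(test))                 # 8 2
print(null_fraction([1.0, None, 3.0, None]))  # 0.5
```

The rows that are not used for training are the ones the model quality statistics are calculated on, so a larger training portion leaves fewer rows for evaluation.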
These are more or less generic parameters that apply to any algorithm. In addition, there are three parameters that are specific to Neural Networks:
- Batch Size,
- Optimizer Method and
- Activation Function.
Batch size defines the number of training instances (examples) in one forward/backward propagation pass. Neural Networks tend to use quite a lot of memory, so by setting this parameter you can adjust the amount of memory required.
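As a rough illustration of what the batch size controls, the following Python sketch cuts a list of training examples into batches. The helper name and the numbers are hypothetical; the point is only that a smaller batch size means fewer examples held in memory per pass, at the price of more passes.

```python
import math

def iterate_batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

# Hypothetical training set of 10 examples, processed 4 at a time.
examples = list(range(10))
sizes = [len(batch) for batch in iterate_batches(examples, batch_size=4)]
print(sizes)                         # [4, 4, 2]
print(math.ceil(len(examples) / 4))  # 3 passes per sweep over the data
```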
Building a neural network is an iterative process in which a minimum of the cost function is sought. In each iteration the algorithm tries to get closer to that minimum. This process is called optimization, and the algorithm that drives it is called the Optimizer Method. There are several methods available, and Oracle Analytics implements three of them: L-BFGS, Stochastic Gradient Descent and Adam (in theory this one should be the best).
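The idea of stepping closer to the minimum in each iteration can be shown with plain gradient descent on a toy cost function. Note that this is the simplest relative of the optimizers listed above, not L-BFGS, Stochastic Gradient Descent or Adam themselves, and the cost function, learning rate and step count are assumptions chosen for this sketch.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly move against the gradient to approach a minimum."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

# Toy cost function: cost(w) = (w - 3)^2, whose minimum is at w = 3.
def cost_gradient(w):
    return 2.0 * (w - 3.0)

w_min = gradient_descent(cost_gradient, start=0.0)
print(round(w_min, 4))  # 3.0
```

In a real neural network, w is a large set of weights and the gradient is estimated from one batch of training examples at a time, which is where the "stochastic" in Stochastic Gradient Descent comes from.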
And finally, the Activation Function. A Neural Network is built from several layers, and every layer consists of several nodes. The activation function of a node defines the output of that node for a given input. The activation function adds non-linearity to the neural network. Without this non-linearity, the neural network would be able to perform only linear mappings from inputs to outputs. There are alternative activation functions available in Oracle Analytics: Logistic Sigmoid, TanH, ReLU and Identity:
[Figure: plots of the Sigmoid and TanH activation functions]
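For readers who prefer formulas to plots, the four activation functions can be written down in a few lines of Python. These are the standard textbook definitions; the function names are my own and are not Oracle Analytics identifiers.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any input into the range (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # this form avoids overflow for large negative x
    return e / (1.0 + e)

def tanh(x):
    """Hyperbolic tangent: squashes any input into the range (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes out the rest."""
    return max(0.0, x)

def identity(x):
    """Identity: returns the input unchanged, so it adds no non-linearity."""
    return x

for f in (sigmoid, tanh, relu, identity):
    print(f.__name__, [round(f(x), 3) for x in (-2.0, 0.0, 2.0)])
```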
Let's save this model and run the data flow.
We can now check the quality of the model by navigating to Machine Learning.
If we inspect the created model, we can observe the following evaluation metrics, which help us to assess the model's quality.
As you can see, the statistics are not very impressive. For more details on each of the statistics, please visit my original post Getting Started with Machine Learning in Oracle Data Visualization.
We can see that this model is somewhat accurate (72%), but all other metrics are far from good. That is why the model requires some adjustments. This usually turns into a lengthy, manual process of setting parameters one by one.
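To see how a model can reach 72% accuracy and still be weak on the other metrics, it helps to compute them from a confusion matrix. The sketch below uses the standard definitions of accuracy, precision, recall and F1; the counts are invented for illustration and are not the figures of the model above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 100 customers, 30 of whom actually churned (Yes).
metrics = classification_metrics(tp=10, fp=8, fn=20, tn=62)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these made-up counts the accuracy is 0.72, yet only one in three churners is caught (recall of about 0.33), because most of the correct predictions come from the larger No class.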
After some trials, we arrived at the following model parameters:
When we run the data flow and create a new model, the quality of the new model is better:
This model is not ideal, but it performs better than the previous one. It has improved in all metrics. We could have played further with the parameter settings, but any additional improvements would have been only marginal.
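A quick way to see why setting parameters one by one gets lengthy is to count the combinations. The Python sketch below enumerates a small grid over the three neural network parameters; the optimizer and activation names come from the lists above, while the batch sizes and the function name are hypothetical values chosen for this example.

```python
from itertools import product

optimizers = ["L-BFGS", "Stochastic Gradient Descent", "Adam"]
activations = ["Logistic Sigmoid", "TanH", "ReLU", "Identity"]
batch_sizes = [32, 64, 128]  # hypothetical candidate values

def parameter_grid(batch_sizes, optimizers, activations):
    """List every combination of the three neural network settings."""
    return [
        {"batch_size": b, "optimizer": o, "activation": a}
        for b, o, a in product(batch_sizes, optimizers, activations)
    ]

grid = parameter_grid(batch_sizes, optimizers, activations)
print(len(grid))  # 36
print(grid[0])
```

Each of these combinations would mean one more run of the data flow, and that is before the generic parameters such as the encoding method or the training portion are varied as well.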
In order to compare the Neural Network model with models created using other algorithms, we created several models. For each of them, we can try and play with parameter settings.
Finally, 5 machine learning models have been created:
Model Evaluation and Comparison
When a model is created in Oracle Analytics, additional "statistical" datasets are created. These can be used in a project to compare machine learning models among themselves.
In my case, I have created several data flows and a sequence with which I automated the process. As a result, I got one single Classification Report dataset that contains data for all machine learning models.
With a new project created, I can now present and compare all confusion matrices, model statistics and classification report results on one single canvas.
Looking at the different metrics, it turns out that Logistic Regression is slightly better than the Neural Network.
One would expect the Neural Network to perform best. However, the implementation of the Neural Network in Oracle Analytics is rather simple, with only one hidden layer. Even so, we can see that the results for the Neural Network aren't the worst.
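To give a feeling for how small a network with one hidden layer is, here is a minimal forward pass in plain Python. It is a generic textbook sketch, not the Oracle Analytics implementation: the weights, the two hidden nodes and the choice of ReLU in the hidden layer are all assumptions made for this example.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))  # fine for moderate values of x

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer with ReLU, then a single sigmoid output node."""
    hidden = [
        relu(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    z = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return sigmoid(z)  # read as the probability of the positive class

# Hypothetical weights for 2 input features and 2 hidden nodes.
p = forward(
    inputs=[1.0, 2.0],
    hidden_weights=[[0.5, -0.5], [1.0, 1.0]],
    hidden_biases=[0.0, 0.0],
    output_weights=[1.0, 1.0],
    output_bias=-1.0,
)
print(round(p, 3))
```

Training is then the search for the weights and biases that minimise the cost function, which is the job of the optimizer method chosen in the model parameters.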
For reference, here is the detailed list of parameters used in our Neural Network model:
On the other hand, the Logistic Regression model can be explained by looking at which features have the highest impact on the prediction. It turns out that ContractType_Month-to-month, ContractType_TwoYears, BaseCharges and TotalCharges are the key drivers of the prediction and have the highest correlation with the predicted Churn.
Conclusions and what to expect
Based on this exercise and the results obtained, we can conclude that how you prepare the data used for prediction is very important. If we compare the results obtained in the initial blog post with the results from this exercise, it clearly shows that data preparation plays the key role in model training and prediction.
With minor effort, most data preparation tasks can be done using the Data Flows functionality. For me personally, this exercise also shows that there is room for improvement in the functionality of Data Flows. If Oracle invests further in this tool, it could become one of the most powerful features of the Oracle Analytics stack.
We have clearly seen that playing with parameter settings during model training has a major impact on quality. Training a machine learning model in Oracle Analytics is a lengthy process, so I would expect more options for process automation to become available.
My expectation is that Oracle will continue to improve this functionality. We have also seen announcements that Oracle Analytics will support (or already supports) machine learning models created in Oracle Autonomous Database (OML4SQL). The idea is that a machine learning model is trained and created in Oracle Autonomous Database, as is already possible, and is then exposed to Oracle Analytics. This definitely gives an opportunity for "more serious" data models created by Data Scientists to be used with Oracle Analytics.
On the other hand, we have seen that Oracle develops its own Python libraries that are part of its Oracle Data Science infrastructure (based on Jupyter notebooks). One of them is the Oracle Accelerated Data Science SDK (ADS), a Python-based library. It is also worth knowing that Python runs machine learning behind the scenes in Oracle Data Visualization too.
In particular, the AutoML functionality, which is already available in Oracle Data Science, could be very interesting. The basic idea of AutoML is to simplify and automate the machine learning process. So we might look for something like that in Oracle Analytics too.