It's been more than two years now since my post Getting started with Machine Learning in Oracle Data Visualization (https://zigavaupot.blogspot.com/2018/03/getting-started-with-machine-learning.html). In that post I tried to explain how Oracle Data Visualization (Oracle Analytics) can be used to predict churn in a telecom company. For that exercise I used data files which I found on Kaggle. Interestingly enough, this data set is no longer available, but I still keep the original files.
The Test data files are actually New Data datasets (and not Test!) because they don't contain the target feature; they can therefore only be used for prediction, not for training. However, during data preparation the same transformations should be applied to both datasets.
The idea of the example below is to revisit that exercise and, if possible, improve it. To be honest, I didn't pay much attention to data analysis and preparation back then. As we all know, data preparation is arguably the most important step in the machine learning process. Therefore I will try to follow some guidelines and best practices in data preparation for machine learning, using the Data Flows functionality in Oracle Analytics.
Data Analysis
Let's start with the source data analysis. I have the following 8 files:
- Train
- TELCO_Train.csv
- TELCO_Train_AccountInfo.csv
- TELCO_Train_Demographics.csv
- Train_ServicesOptedFor.csv
- Test
- TELCO_Test.csv
- TELCO_Test_AccountInfo.csv
- TELCO_Test_Demographics.csv
- Test_ServicesOptedFor.csv
The original files can be downloaded from my Google Drive.
For the source data analysis, only Train dataset has been used.
TELCO_Train.csv
Once uploaded, this dataset contains the following attributes:
A {Row Count} column has been added alongside the two original columns, CustomerID and Churn. Its main purpose is to make it easy to visualise data counts.
Data Flows give users the possibility to add new columns, transform existing columns, or perform several other operations while data is uploaded and prepared for analysis.
For example, adding a new column called {Row Count} is relatively simple:
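Outside of Oracle Analytics, the same idea can be sketched in plain Python (a minimal illustration with made-up sample rows): the column simply holds the constant 1 for every row, so summing it yields row counts in any visualisation.

```python
# Sketch: add a constant {Row Count} column so that summing it gives row counts.
# The sample rows below are illustrative, not taken from the real dataset.
rows = [
    {"CustomerID": "C1", "Churn": "No"},
    {"CustomerID": "C2", "Churn": "Yes"},
    {"CustomerID": "C3", "Churn": "No"},
]

for row in rows:
    row["Row Count"] = 1  # constant 1 per row

total = sum(row["Row Count"] for row in rows)
print(total)  # 3
```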
As we progress with this data analysis, more columns will be added or transformed for easier presentation of data.
TELCO_Train_AccountInfo.csv
The second dataset contains information about contract types, payment methods, billing types, charges, and dates.
After the upload, you might find that two columns, TotalCharges and DOE, are treated as text, even though their content should be numeric and date, respectively.
You can see that Preparation Script is being built as you perform transformations. This is very useful because this script can be applied every time you change source data file.
The following transformations were carried out:
- Create Column: {Row Count}
- Create a new column DOE_DATE
- Edit Column: DOE_DATE with CASE WHEN LENGTH("DOE") = 9 THEN "DOE" ELSE CONCAT('0', "DOE") END
- Convert to Date: DOE_DATE
- Create a new column Total Charges
- Edit Column: Total Charges with IfNull(CAST("TotalCharges" AS Numeric), 0)
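The intent of the two expressions above can be sketched in plain Python. This is only an illustration: I am assuming a DD-Mon-YY date layout in which the day part sometimes lacks its leading zero (hence the length-9 check), and TotalCharges values that may be empty or non-numeric.

```python
def fix_doe(doe: str) -> str:
    # Mirrors: CASE WHEN LENGTH("DOE") = 9 THEN "DOE" ELSE CONCAT('0', "DOE") END
    # (assumes a DD-Mon-YY layout where the day may be missing a leading zero)
    return doe if len(doe) == 9 else "0" + doe

def fix_total_charges(value: str) -> float:
    # Mirrors: IfNull(CAST("TotalCharges" AS Numeric), 0)
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0

print(fix_doe("1-Jan-14"))         # 01-Jan-14
print(fix_total_charges("29.85"))  # 29.85
print(fix_total_charges(" "))      # 0.0
```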
TELCO_Train_Demographics.csv
The demographics data file contains demographic data about the customer, such as country, state, whether the customer is retired, whether he or she has a partner, and whether there are any dependents. This file also contains the customer's gender and education.
In this dataset all numeric columns (IsRetired, HasPartner, HasDependents) have been converted to String.
For Country and State, enrichment recommendations have been suggested (see the list on the right side) and applied to the dataset.
Based on this the following new columns have been created:
- Country_country_name,
- Lon,
- Lat
This enriched data helps us present a map of our customers' locations:
We can already conclude that all customers come from a single country and a single state. Therefore these two columns will not be needed in our prediction model.
Train_ServicesOptedFor.csv
The last file contains information about which services customers actually use. The structure of data is as follows:
- CustomerID
- TypeOfService
- ServiceDetails
We can add {Row Count} to this dataset as well, but the original columns are more interesting.
Data Visualization and Conclusions
To get familiar with the four datasets, let's create some additional visualisations and analyse the data further.
Basic overview can be seen in the first canvas of the project we created.
We can immediately observe that three datasets have the same number of rows, whereas the Train_ServicesOptedFor dataset contains a much higher number of rows. More careful examination shows that there are 9 times more rows in that dataset than in the other three: each customer in Train_ServicesOptedFor has 9 rows, one per individual service.
We will have to "denormalise" this dataset so that there is only one row per customer before we merge all four datasets into one training dataset.
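This long-to-wide pivot can be sketched in plain Python (the service names and values below are illustrative, not the actual ones from the dataset): each (TypeOfService, ServiceDetails) pair becomes a column on the customer's single row.

```python
# Sketch: pivot the long ServicesOptedFor layout (many rows per customer)
# into one row per customer, with one column per service.
# Sample data is illustrative only.
long_rows = [
    ("C1", "InternetServiceCategory", "Fiber optic"),
    ("C1", "TechnicalSupport", "No"),
    ("C2", "InternetServiceCategory", "DSL"),
    ("C2", "TechnicalSupport", "Yes"),
]

wide = {}
for customer_id, service, detail in long_rows:
    # Create the customer's row on first sight, then attach service columns.
    wide.setdefault(customer_id, {"CustomerID": customer_id})[service] = detail

print(wide["C1"])  # one row per customer, one column per service
```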
In the picture above, I am using Narrate view of the project where I can make additional comments (comments are in red).
Oracle Data Visualisation also supports Natural Language Generation. You can observe a "textual visualisation" describing the distribution of the Churn column values, stating that 75% of cases have the value "No" and 25% of cases have the value "Yes".
Or, let's check narrative for PaymentMethod:
I am using this same feature to present Account Info and Demographics data:
On the last page, Services, I am interested in how the distribution is split among the particular services and their details. We will have to apply several transformations to this specific dataset later, a one-hot transformation for example.
Based on this short analysis, we have found quite a lot of useful information that will be used later in the data preparation steps.
We now know that:
- some data columns will have to be transformed (e.g. Total Charges to Numeric or DOE to Date),
- we don't need some of the columns (Country and State),
- one of the datasets will have to be "pivoted" or "denormalised" (Train_ServicesOptedFor),
- some new columns will have to be created (categorical values such as PaymentMethod and some others will have to be transformed using one-hot encoding).
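The one-hot encoding mentioned above can be sketched in plain Python (the payment-method values here are illustrative, not the exact categories from the dataset): each category becomes its own 0/1 column, with a 1 only in the column matching the row's value.

```python
# Sketch: one-hot encode a categorical column such as PaymentMethod.
# Sample values are illustrative only.
values = ["Electronic check", "Mailed check", "Electronic check", "Credit card"]

# One output column per distinct category, in a stable (sorted) order.
categories = sorted(set(values))
encoded = [[1 if v == cat else 0 for cat in categories] for v in values]

print(categories)  # ['Credit card', 'Electronic check', 'Mailed check']
print(encoded[0])  # [0, 1, 0]
```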
We will apply all of this in the next blog post, Telecom Churn Prediction Case Revisited: Data Preparation.