Oracle Analytics 5.9: Frequent Itemsets
In my recent blog Market Basket Analysis with Oracle Machine Learning and Oracle Analytics, I was discussing an option when you create a Market Basket Analysis model with Oracle Autonomous Database and use the result of that model with Oracle Analytics.
Now, with Oracle Analytics version 5.9, Market Basket Analysis has been included within Oracle Analytics. Not all functionality seems to be available, as only Frequent Itemsets analysis is supported, but I was told Association Rules should follow (this blog is using Oracle Analytics 5.9 Early Release version). This really simplifies the process as business users can now start analysing frequent itemsets, combinations of items that are most frequently sold together.
Frequent Itemsets in 5.9
Transactions data is stored in Oracle Autonomous Data Warehouse (ADW). Originally these data can be found on Dunnhumby Source Files (Carbo-Loading data set):
CARBO_TRANSACTIONS database table contains 5M+ transactions and has the following structure:
- BASKET: basket/transaction number
- UPC: product code.
Creating a Frequent Item-set data set
After we create a connection to our ADW, we need to create a data set based on the original database table.
CARGO_TRANSACTIONS data set can be created as follows:
Based on data set created, we can get some basic statistics:
It seems this data set is quite reasonable to play with.
Creating Frequent Itemsets
We will exploit Data Flow functionality in order to create a new frequent itemset data set. So, let's begin with a new data flow, that we will call Create CARBO_FREQUENT_ITEMSET dataset.
Data flow contains three steps:
Step 1: Add DataIn the first step, we simply read transactions database table, CARBO_TRANSACTIONS in our case.
There is nothing spectacular in this first step. Simply use Add Data and bring database table in. The second step is basically where the magic happens.
Step 2: Database Analytics: Frequent Itemsets
When defining Frequent Itemset data flow step, two groups of attributes need to be defined:
- Outputs and
Under Outputs several attributes can be defined. These outputs will form the saved results.
- ItemsetId is an identifier which is assigned to each and every frequent itemsets.
- SupportPercentage is the key measure that is being observed. It tells us how many this specific itemset occurs in the total number itemsets, expressed as a percentage. SupportPercentage equals to NumberOfOccurances divided by TotalCountOfTransactions.
- NumberOfItems simply tells us how many items are in a specific itemset.
- NumberOfOccurances shows how many times a specific itemset has occured.
- TotalCountOfTransactions shows how many frequent itemsets are there in transactions data.
- Item1, Item2, Item3 and Item4 are items that are included in a specific itemset. We can see that there is a possibility to analyse frequents itemsets with up to four items. In our example, I have unchecked Item 3 and Item 4 since we are going to analyse only combinations of 2 items sold together.
There are six parameters we need to define before running a data flow:
- Transaction Id Column is basically transaction identifier, in our case this is BASKET number.
- Item Value Column is product identifier, in our case UPC.
- Maximum and Minimum Itemset Size define product combination sizes we are interested in. We are going to analyse only combinations of two items.
- Minimum Support Percent filters out all item combinations with SupportPercentage less than the value set by this parameter. We had to go quite low do to low SupportPercentage for our frequent itemsets.
- Top N Frequent Itemset specifies a number of top N itemsets ordered by SupportPercentage. In our case we left the default value of 50.
Step 3: Save Data
In the last step, we save a newly created data set back to the ADW.
We can run the flow. From my experience, data flow is executed in several seconds, which I consider reasonably well, assuming we are dealing 5M+ transactions.
Analysing Frequent Itemsets
When we have run our data flow and created a new frequent itemsets database table, we can complete our data visualisation project.
Oracle Analytics' machine learning has now been extended by another nice functionality, which can be easily used by business users, hence removing the requirement for a data scientist.