Working with Graphs in Oracle Analytics - Subgraph, Shortest Path, Clusters
In my previous post I described, in a bit more detail, how to perform graph analysis in the case of Node Ranking. The key tool for this in Oracle Analytics is Data Flows. The Graph Analytics step in Data Flows enables users to perform four graph analytics operations. Besides Node Ranking, these are Sub Graph, Clusters and Shortest Path.
For easier understanding and visualisation we are using the following Dolphins dataset.
The Sub Graph operation finds all nodes within a specified number of hops of a given node. In other words, if the number of hops is one, Sub Graph finds all direct neighbours of the given node. If the number of hops is two, Sub Graph returns all neighbours of the given node plus all neighbours of those neighbours, and so on.
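The hop-limited search described above can be sketched outside Oracle Analytics as a breadth-first search. This is only an illustration under my own assumptions, not how the Graph Analytics step is implemented: the function name `nodes_within_hops` and the toy edge list are hypothetical, and the edges are treated as undirected.

```python
from collections import deque

def nodes_within_hops(edges, source, max_hops):
    """Return {node: hop_distance} for every node within max_hops of source."""
    # Build an undirected adjacency map from (node_a, node_b) pairs (assumption).
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # hop limit reached, do not expand this node further
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance

# Hypothetical toy edge list, not the Dolphins data.
edges = [(1, 2), (2, 3), (3, 4), (1, 5)]
one_hop = nodes_within_hops(edges, source=1, max_hops=1)   # nodes 1, 2, 5
two_hops = nodes_within_hops(edges, source=1, max_hops=2)  # adds node 3
```

With one hop only the direct neighbours are returned; with two hops node 3 is added, while node 4, three hops away, stays outside the sub graph.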
This is useful, for example, in marketing, where we can find the friends of a customer who has bought a specific product. We might assume that the customer presented that product to their friends, and it is also possible that those friends would talk about it to their own friends, who are two hops away.
To find the Sub Graph of a given node, the Sub Graph operation should be selected in a data flow:
There are only three mandatory parameters. Source Vertex is the given node; in this case it is dolphin 58 from the source column DOLPHIN1. DOLPHIN2 is the Destination Column. Number of Hops defines how many hops are considered in the Sub Graph.
In our case we store only these three columns in the dataset, which is then used to visualise the subgraph as a graph.
Clusters is an operation that finds clusters in a graph. In this case, a cluster is defined as a set of connected nodes. Parts of a graph that have no connection with other parts of the graph are considered separate clusters.
Clusters as a graph operation is relatively easy to define.
As you can see, there are only two parameters, Source Column and Destination Column. In our case, these are DOLPHIN1 and DOLPHIN2.
The output of the operator consists of ClusterId (a randomly created id) and Node_Vertex, the node that belongs to that specific cluster.
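Clusters in this sense are what graph theory calls connected components. A minimal sketch of finding them, assuming undirected edges, could look like the following. The function name and the toy edges are hypothetical, and the ids here are simple sequential numbers rather than the randomly created ids produced by the tool.

```python
def connected_components(edges):
    """Return (cluster_id, node) rows, one row per node."""
    # Undirected adjacency map built from (node_a, node_b) pairs (assumption).
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    rows = []
    seen = set()
    cluster_id = 0
    for start in sorted(adjacency):
        if start in seen:
            continue  # node already assigned to an earlier cluster
        cluster_id += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            rows.append((cluster_id, node))
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return rows

# Hypothetical toy graph with two disconnected parts.
edges = [(1, 2), (2, 3), (4, 5)]
rows = connected_components(edges)
```

Nodes 1, 2 and 3 end up in one cluster and nodes 4 and 5 in another, which mirrors the two-column shape of the operator output.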
The result of cluster identification is best shown visually:
The shortest path is one of the best-known graph analytics problems. The idea is to optimise travel from point A to point B, taking into account the cost of travel, usually defined as a weight.
In our case, the weight is always equal to one, so we are focused only on finding the shortest connection between two dolphins. In other cases the weight could be the distance between two towns, the cost of travel between two airports, and so on.
The Shortest Path operation requires three parameters:
- starting node, Source Vertex, which is in source column DOLPHIN1 and has value 61;
- end node, Destination Vertex, which is in source column DOLPHIN2 and has value 14;
- Weight Column, which in this case is always equal to 1.
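To illustrate how the weight influences the result, here is a minimal sketch of a weighted shortest-path search using Dijkstra's algorithm. I am not claiming this is the algorithm Oracle Analytics uses internally; the function name and the toy edges are hypothetical, edges are assumed to be undirected, and weights are assumed to be non-negative.

```python
import heapq

def shortest_path(edges, source, destination):
    """edges: iterable of (node_a, node_b, weight). Returns (cost, path) or None."""
    # Undirected adjacency list (assumption).
    adjacency = {}
    for a, b, weight in edges:
        adjacency.setdefault(a, []).append((b, weight))
        adjacency.setdefault(b, []).append((a, weight))

    best = {source: 0}       # cheapest known cost to reach each node
    previous = {}            # predecessor of each node on its cheapest route
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == destination:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry, a cheaper route was already found
        for neighbour, weight in adjacency.get(node, ()):
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(heap, (new_cost, neighbour))

    if destination not in best:
        return None  # no connection between the two nodes
    path = [destination]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return best[destination], path

# Hypothetical toy graph: the direct-looking route via C is expensive.
weighted = [("A", "B", 1), ("B", "D", 1), ("D", "E", 1), ("A", "C", 1), ("C", "E", 5)]
unit = [(a, b, 1) for a, b, _ in weighted]

weighted_result = shortest_path(weighted, "A", "E")
unit_result = shortest_path(unit, "A", "E")
```

With the weights taken into account the route goes through B and D at a total cost of 3. With every weight set to 1, as in the dolphins example, the route with the fewest hops (via C) wins.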
Once the operation is executed, the path between the starting and end nodes is stored in a new dataset. However, each path can only be stored as a series of steps, and these steps are stored as separate rows in the dataset. Each step has a source and a destination, so the path between A and E forms the series A->B, B->D, D->E. In this case three rows define the path, with sequence numbers 1 through 3 for the respective steps.
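The row-per-step layout can be sketched in a few lines. The helper name `path_to_steps` is hypothetical and the tuple layout (step, source, destination) is only my shorthand for the rows described above.

```python
def path_to_steps(path):
    """Turn a node sequence such as [A, B, D, E] into (step, source, destination) rows."""
    return [
        (step, source, destination)
        for step, (source, destination) in enumerate(zip(path, path[1:]), start=1)
    ]

steps = path_to_steps(["A", "B", "D", "E"])
# steps == [(1, "A", "B"), (2, "B", "D"), (3, "D", "E")]
```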
The output of this operation consists of the following attributes:
- Path_Sequence has value of "Y" if the specific step in the path is part of the shortest path.
- Source is a source column for the specific step in the path.
- Destination is a destination column of the specific step in the path.
- Step is a step number in the path.
If the graph is filtered on the value Y in the Path_Sequence attribute, only the Shortest Path is displayed:
In this short series on Working with Graphs in Oracle Analytics, I think I was able to present some of the graph analytics functionality built in Oracle Analytics.
It is not much, one might say, especially if we compare this feature set with the Oracle Graph analytics available with Oracle Database. That is probably true; however, it is better than nothing, and we have been told this is going to evolve.
What is important is that all these analytics are now available to business users and analysts through an end-user tool, Oracle Analytics, which is very similar to what we can observe with Machine Learning.
The bottom line is that we, as business users, can now try to resolve some business problems by ourselves, without waiting for IT developers and data scientists to become available to work on our problems. I think there is a lot of value in that.
Previous posts in the series: