Introduction to Oracle Data Science … and how to set it up?

Foreword

At the beginning of February, Oracle announced the availability of the Oracle Cloud Infrastructure (OCI) Data Science service.

Put very simply, one could say that Oracle Data Science is yet another Jupyter notebook environment for writing machine learning scripts in Python. And that would be right. However, it is also more than that.

Oracle has stated that its mission is to bring together the right infrastructure, data management, and data science tools to make data science more collaborative, scalable, and powerful for every enterprise.

And indeed, with OCI Data Science you get a fully managed platform that has been built to meet the needs of modern enterprise teams. It gives users and development teams a project-driven, collaborative environment in which they can work together on an end-to-end modelling workflow with self-service resources and data access. OCI Data Science uses Jupyter notebooks to support the latest open source tools such as Python, TensorFlow, Keras, scikit-learn, MXNet and others.

The key OCI Data Science features that set it apart from other notebook environments are:

Projects are the main “containers” that organise your work. Every project can contain multiple notebook sessions.
JupyterLab notebook sessions provide users with preinstalled Python libraries for data analysis, preprocessing, modelling, etc.
The Model Catalog lets users store all their machine learning models in a catalog, which makes these models auditable and reproducible.
The Accelerated Data Science (ADS) SDK is Oracle’s library that makes common data science tasks such as preprocessing, exploratory analysis, model creation and testing, and model deployment much faster, easier and less error-prone.

Setting up OCI Data Science

You need to have access to Oracle public cloud (cloud.oracle.com). You can find very good documentation here, and if you follow it carefully you should be able to configure the service in an hour or so. But if you are interested in seeing how I did it (a bit of a cookbook), you are more than welcome to continue reading.

In general, there are four key steps you have to follow to configure the instance. Once the instance is configured, you need to create your first project and notebook session (and there is another “getting started” step), which I will describe in my next blog post.

The four steps to configure OCI Data Science are:

Set up a Data Scientists group and assign users to the group
Create a new compartment to own network and data science resources
Create a virtual cloud network (VCN) and subnets
Create OCI identity and access management policies

Set up a Data Scientists group and assign users to the group

This is pretty straightforward. If you are already using Oracle Cloud, you just need to create a new group and assign your users, the data scientists, to it. For those who are not regular users or administrators of Oracle Cloud, here it is in a bit more detail.

From the main menu, navigate to Identity > Groups.
Click Create Group and enter a group name. In my case I named the new group QubixDataScientists.

Now you can create new users and add them to the new group. In my case we already use federated users, so I mapped a group of users called QBX_DataScientist (IdP Mapped Groups) to the new group.

Create a new compartment to own network and data science resources

The next step is to create the following resources:

A compartment to own network resources: a Virtual Cloud Network (VCN), a public or private subnet, and other resources such as an internet gateway or service gateway, a route table, and security lists.
A compartment to own Data Science resources: projects, notebook sessions, models, and work requests.

Let’s create a new compartment:

Navigate to Identity > Compartments.
Click Create Compartment and give the new compartment a name. In my case the new compartment is called QubixSlovenia-DataScience. We will create all OCI Data Science resources in this compartment.
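
I did both of these steps in the console, but for those who prefer scripting, here is a minimal sketch of how the same group and compartment could be created with the OCI Python SDK. It assumes the oci package is installed and a working ~/.oci/config file exists, and that both resources live directly under the root compartment (the tenancy). The helper names and the description strings are made up for illustration.

```python
def identity_specs(tenancy_ocid, group_name, compartment_name):
    """Build the keyword arguments for the group and compartment create calls.

    Assumption: both are created directly under the root compartment (tenancy).
    The description strings are illustrative only.
    """
    group = {
        "compartment_id": tenancy_ocid,
        "name": group_name,
        "description": "Users of OCI Data Science",
    }
    compartment = {
        "compartment_id": tenancy_ocid,
        "name": compartment_name,
        "description": "Network and Data Science resources",
    }
    return group, compartment


def create_group_and_compartment(group_name, compartment_name):
    """Create both resources with the OCI Python SDK and return their OCIDs."""
    import oci  # third-party package: pip install oci

    config = oci.config.from_file()  # reads ~/.oci/config, DEFAULT profile
    identity = oci.identity.IdentityClient(config)
    group_spec, compartment_spec = identity_specs(
        config["tenancy"], group_name, compartment_name
    )

    group = identity.create_group(
        oci.identity.models.CreateGroupDetails(**group_spec)
    ).data
    compartment = identity.create_compartment(
        oci.identity.models.CreateCompartmentDetails(**compartment_spec)
    ).data
    return group.id, compartment.id
```

Calling create_group_and_compartment("QubixDataScientists", "QubixSlovenia-DataScience") would then mirror the two console steps. Mapping federated (IdP) groups is not covered by this sketch.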

Create a virtual cloud network (VCN) and subnets

To create a notebook session, a VCN that contains a subnet is required. A notebook session is always created within that subnet, and all egress from the notebook session is routed through it. To access data and install additional packages in the notebook session, you must configure the subnet with appropriate access.

Each subnet in the VCN must have a CIDR block that provides at least one IP address for each concurrent notebook session that users can run, so the size you need depends on the number of users and the load they create. Oracle recommends a minimum of 12 free IP addresses for AD-specific subnets and a minimum of 32 free IP addresses for regional subnets.

Oracle strongly recommends that each subnet has a CIDR block that provides more than the minimum number of free IP addresses.
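
To get a feeling for what these minimums mean for a given CIDR block, here is a small sketch that uses Python's standard ipaddress module. It assumes that OCI reserves three addresses in every subnet (please check this against the current networking documentation). The function names and the in_use parameter are my own.

```python
import ipaddress

# Assumption: OCI reserves three addresses per subnet
# (network address, default gateway, broadcast address).
OCI_RESERVED_PER_SUBNET = 3

# Minimum free addresses quoted in the text above.
MIN_FREE_AD_SPECIFIC = 12
MIN_FREE_REGIONAL = 32


def usable_addresses(cidr):
    """Number of addresses in the CIDR block left after the reserved ones."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - OCI_RESERVED_PER_SUBNET, 0)


def meets_minimum(cidr, regional, in_use=0):
    """True if the subnet still has the recommended number of free addresses."""
    minimum = MIN_FREE_REGIONAL if regional else MIN_FREE_AD_SPECIFIC
    return usable_addresses(cidr) - in_use >= minimum


for cidr in ("10.0.0.0/28", "10.0.0.0/27", "10.0.0.0/24"):
    print(cidr, usable_addresses(cidr),
          meets_minimum(cidr, regional=False),
          meets_minimum(cidr, regional=True))
```

Under that assumption a /28 is just enough for an AD-specific subnet, while a regional subnet needs at least a /26. The 10.0.0.0 ranges are only examples.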

Navigate to Networking (in the Core Infrastructure section of the main menu).
Create a VCN. In my case I used Network Quickstart and then VCN with Internet Connectivity. This option is much more convenient because the workflow creates everything you need automatically. I am not very good at this type of thing, so it was an ideal solution in this situation; I assume system administrators might do a much better job here. But in the end it worked very well for me.

In the Create a VCN with Internet Connectivity window, the following information needs to be provided:

VCN name
Compartment - pick your compartment from the list of available compartments
VCN CIDR block (I left the defaults unchanged)
Public Subnet CIDR block (I left the defaults unchanged)
Private Subnet CIDR block (I left the defaults unchanged)
Use DNS Hostnames in this VCN - this has to be checked!

Create OCI identity and access management policies

As the last step, before you can start using Oracle Data Science, you have to create a number of Oracle Cloud Infrastructure Identity and Access Management policies to grant access to Data Science-related and network resources:

Give users access to Data Science-related resources
Give users access to network resources
Give the Data Science service access to network resources

For the first two policies, navigate to Identity > Policies and click Create Policy. Give each policy a name. In my case I created two new policies:

QubixDataScientists-manage-access
QubixDataScientists-manage-network-access

For each of the two policies, I used the following policy statements respectively:

allow group QubixDataScientists to manage data-science-family in compartment QubixSlovenia-DataScience (for QubixDataScientists-manage-access)
allow group QubixDataScientists to use virtual-network-family in compartment QubixSlovenia-DataScience (for QubixDataScientists-manage-network-access)

Additionally, the Data Science service needs to be granted access to network resources. Therefore another policy has to be created in the root compartment (!) as follows:

Create a new policy called QubixDataScientists-service-network-access.
Enter allow service datascience to use virtual-network-family in compartment QubixSlovenia-DataScience as the policy statement.
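
If you need to repeat this for other groups or compartments, the three statements are easy to generate. The sketch below only builds the policy names and statement strings used in this post. It does not call any Oracle API, and the function name is my own.

```python
def data_science_policies(group, compartment):
    """Return (policy name, statement) pairs for the three policies above.

    The naming pattern follows the policies created in this post. The third
    policy is the one that has to be created in the root compartment.
    """
    return [
        (f"{group}-manage-access",
         f"allow group {group} to manage data-science-family "
         f"in compartment {compartment}"),
        (f"{group}-manage-network-access",
         f"allow group {group} to use virtual-network-family "
         f"in compartment {compartment}"),
        (f"{group}-service-network-access",
         f"allow service datascience to use virtual-network-family "
         f"in compartment {compartment}"),
    ]


policies = data_science_policies("QubixDataScientists", "QubixSlovenia-DataScience")
for name, statement in policies:
    print(f"{name}: {statement}")
```

The printed statements can be pasted into the Create Policy form one by one.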

Getting started with OCI Data Science

OCI Data Science has now been configured. What still needs to be done before we can really start using it is a topic for another blog post.

Žiga Vaupot's Blog
