Using OCI Object Storage based Raw Data in Oracle Analytics

What is Object Storage?

Let’s start this blog with a definition of Object Storage. Object Storage is an internet-scale, high-performance storage platform that offers reliable and cost-efficient data durability. It can store an unlimited amount of unstructured data of any content type, including analytic data.

Object Storage can also be used as a cold storage layer for the data warehouse by storing data that is used infrequently and then joining it seamlessly with the most recent data by using hybrid tables in Oracle Autonomous Data Warehouse.
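As a rough illustration of that pattern, a hybrid partitioned table keeps recent partitions inside the database while older partitions point to files in Object Storage. The sketch below uses DBMS_CLOUD.CREATE_HYBRID_PART_TABLE with made-up table, column, credential and URL names, so treat it as an outline rather than a ready-to-run statement:

BEGIN
  DBMS_CLOUD.CREATE_HYBRID_PART_TABLE(
    table_name      => 'SALES_HYBRID',                 -- hypothetical table name
    credential_name => 'OBJ_STORE_CRED',               -- hypothetical credential
    format          => json_object('type' value 'csv', 'skipheaders' value '1'),
    column_list     => 'SALE_DATE DATE, AMOUNT NUMBER',
    partitioning_clause =>
      'PARTITION BY RANGE (SALE_DATE)
       (
         -- older, rarely used data stays in Object Storage
         PARTITION p_2006 VALUES LESS THAN (DATE ''2007-01-01'')
           EXTERNAL LOCATION (''https://objectstorage.../sales_2006.csv''),
         -- recent data lives in a regular, internal partition
         PARTITION p_current VALUES LESS THAN (MAXVALUE)
       )'
  );
END;
/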

More on object storage can be found here.

Object Storage can therefore be considered a key element of the Oracle Data Lakehouse architecture.

In this blog post we set up an Object Storage bucket with several data files and use that data for analysis in Oracle Analytics. One might argue there are better and more reasonable approaches; however, we are focusing on particular building blocks here rather than on the best possible end-to-end solution.

Setting up Object Storage

Buckets are containers used to store objects, which can be anything from text files to images and audio. Besides its content, each object also carries metadata describing it. Every object is stored in a bucket.

In our example, we create a new bucket called retail_data:

The bucket is created with the Standard storage tier and, for now, its visibility is still Private:

We are almost ready to continue with the data upload; however, there are a thing or two to do before we start the load process.

First, let’s edit the visibility. By default, the bucket is set to Private. We will not define any particular access rules or networks in this example, so let’s simply change the visibility to Public.

Before we start loading the data files, let’s create a folder for these files. Later, we will use a reference to all the files in this folder.

The new folder is called transactions. To start loading the data, navigate to the folder first. In our example, we load the data files manually; in other scenarios, this could be done using the REST API or some streaming method, which we will also see in this series.

Now we can start the load. Our example (Dunnhumby’s Let’s Get Sort-of-Real dataset) intentionally uses a large number of not-so-small data files (.CSV), with one file for (almost) each month in 2006 and 2007.

The load runs pretty fast: approximately 100 files are uploaded within a couple of minutes.

Once uploaded, you can review the uploaded files.

The last step of the Object Storage setup is optional but recommended: creating a Pre-Authenticated Request (PAR) to access the objects in the bucket. As you can see in the picture below, a PAR can be scoped to the whole bucket, to a specific object, or to a group of objects sharing a prefix. In our case, the prefix is the name of the folder we created earlier, transactions/.

Using a PAR is convenient because the administrator can set an expiration date and thereby retain strong control over access to the objects in the bucket.

Having done that, we have successfully created a bucket, uploaded files and given access to them. Any other transaction data files added later will inherit these settings automatically.
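As a side note: if the bucket had to stay private, an alternative to a public bucket with a PAR would be to authenticate from the database with a credential object and use the native object URIs. A minimal sketch, with a placeholder user name and auth token:

BEGIN
  DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'OBJ_STORE_CRED',                        -- hypothetical name
    username        => 'oci_user@example.com',                  -- placeholder OCI user
    password        => '<auth token generated in the OCI console>'
  );
END;
/

The credential name would then be passed as credential_name to DBMS_CLOUD.CREATE_EXTERNAL_TABLE instead of relying on a PAR URL.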

Creating an external table in ADB to access Object Storage files

One way to access and analyse Object Storage data is through Oracle Autonomous Database (ADB).

What we need to do is create an external table in Oracle ADB that reads data from the files stored in Object Storage. The data can then be queried with SQL like any other database table.

In the OCI console, navigate to the Oracle Autonomous Data Warehouse instance and from there open Database Actions.

In the Development section, open SQL (the web-based SQL Developer):

The following statement creates the external table:

BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name    => 'RETAIL_TRANSACTIONS',
    file_uri_list => 'https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/.../b/retail_data/o/*.csv',
    format        => json_object('type' value 'csv', 'skipheaders' value '1', 'delimiter' value ','),
    column_list   => 'SHOP_WEEK                VARCHAR2(6),
                      SHOP_DATE                VARCHAR2(8),
                      SHOP_WEEKDAY             NUMBER,
                      SHOP_HOUR                NUMBER,
                      QUANTITY                 NUMBER,
                      SPEND                    NUMBER,
                      PROD_CODE                VARCHAR2(10),
                      PROD_CODE_10             VARCHAR2(7),
                      PROD_CODE_20             VARCHAR2(8),
                      PROD_CODE_30             VARCHAR2(6),
                      PROD_CODE_40             VARCHAR2(6),
                      CUST_CODE                VARCHAR2(14),
                      CUST_PRICE_SENSITIVITY   VARCHAR2(2),
                      CUST_LIFESTAGE           VARCHAR2(2),
                      BASKET_ID                NUMBER,
                      BASKET_SIZE              VARCHAR2(1),
                      BASKET_PRICE_SENSITIVITY VARCHAR2(2),
                      BASKET_TYPE              VARCHAR2(20),
                      BASKET_DOMINANT_MISSION  VARCHAR2(20),
                      STORE_CODE               VARCHAR2(10),
                      STORE_FORMAT             VARCHAR2(2),
                      STORE_REGION             VARCHAR2(3)');
END;
/

Let’s take a closer look at the SQL statement above:

The DBMS_CLOUD.CREATE_EXTERNAL_TABLE procedure is used to create an external table. A new table RETAIL_TRANSACTIONS is created based on the specification in file_uri_list, which contains the URL of the files in Object Storage. The URL consists of two parts:

  • https://objectstorage.eu-frankfurt-1.oraclecloud.com/p/.../b/retail_data/o is obtained when the PAR is created. This link points to the transactions/ folder.
  • /*.csv is appended to indicate that the reference covers all .CSV files in that folder.

The format part defines the format of the files stored in Object Storage, which in this case is CSV. It also defines the delimiter and how many lines to skip at the top of each file.

The last part is column_list, which lists all table columns and their data types.

Once created, the external table can be queried like any other database table. Data retrieval might take longer compared to “normal” database tables because of where the data resides: Object Storage is usually slower to read from than the database itself.
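For example, a simple aggregation over the external table (total spend and basket count per week, assuming the column semantics of the dataset) could look like this:

SELECT SHOP_WEEK,
       COUNT(DISTINCT BASKET_ID) AS BASKETS,
       SUM(SPEND)                AS TOTAL_SPEND
FROM   RETAIL_TRANSACTIONS
GROUP  BY SHOP_WEEK
ORDER  BY SHOP_WEEK;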

Using Object Storage data for analyses in Oracle Analytics Cloud

Bringing a new (external) database table into Oracle Analytics is relatively straightforward. Unexpectedly, I ran into some issues, which in the end (it seems to me) resolved themselves.

As mentioned, the process is the usual one and starts with creating a new dataset:

It then continues with data import and profiling (done automatically). That didn’t go very well at first: I was able to choose the database table and the import started.

Then, just before the operation finished, I got the following error message:

After a while, and after I finally read the message properly, I noticed that the “reject limit” parameter was not set high enough. Admittedly, I didn’t verify the cause (possibly rows with NULL values), but there were many errors in the dataset and, consequently, rows were rejected; once the limit is reached, a general error is thrown and the operation is cancelled.

To avoid this kind of issue, consider adding the parameter 'rejectlimit' value '1000' to the format clause of the CREATE_EXTERNAL_TABLE command. The format clause should then look like this:

format => json_object('type' value 'csv', 'skipheaders' value '1', 'delimiter' value ',',
                      'rejectlimit' value '1000')

With that in place, profiling works and the dataset can be used in Oracle Analytics. Of course, a detailed investigation of the rejected rows should still be performed.
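One way to do that investigation, rather than guessing, is DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE, which checks the source files against the table definition and logs the rejected rows. A minimal sketch:

BEGIN
  DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE(
    table_name    => 'RETAIL_TRANSACTIONS',
    rowcount      => 10000,      -- validate only a sample of rows
    stop_on_error => FALSE       -- keep going so all problem rows get logged
  );
END;
/

The names of the generated log and bad-record tables can then be looked up in the USER_LOAD_OPERATIONS view in Autonomous Database.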

All dataset operations can also be performed without any issues. For example, recommendations can be used for the SHOP_DATE column.

And of course, once the dataset is prepared, creating a workbook also works without any issues:


Conclusion

This short demo shows how easy it is to use data sources other than just standard database tables and views in analyses. The Oracle Database package DBMS_CLOUD really opens the database up to other sources. And once the database can connect to external data sources, these can be used in tools like Oracle Analytics as well. At the moment, Oracle Analytics cannot connect to Object Storage directly.

This blog post is meant to be only the first in a series of posts that expand into the Oracle Data Lakehouse. In the forthcoming posts we will look at products such as Oracle Streaming, Data Flows, Data Catalog and others to cover the flow of data from different data sources to Oracle Analytics and Oracle Machine Learning. Stay tuned!