Andries Engelbrecht created DRILL-2265:
------------------------------------------

             Summary: Drill data exploration function for complex data types
                 Key: DRILL-2265
                 URL: https://issues.apache.org/jira/browse/DRILL-2265
             Project: Apache Drill
          Issue Type: Improvement
          Components: Functions - Drill
            Reporter: Andries Engelbrecht
            Assignee: Daniel Barclay (Drill/MapR)


Drill data exploration function for complex data types

When dealing with complex data in large volumes, it would be extremely useful
to have a function that collects metadata and provides a better view of the
whole data set.

Taking JSON as an example, a data set can contain an extremely large number of
JSON objects. Each object can have multiple schemas and subschemas, with
further nested subschemas as well as arrays, and not every object has all of
them. When exploring this data in Drill, SQL dot notation is used to navigate
the nested structure, and it becomes very cumbersome to get a complete picture
of all the data.

The proposed function would explore the JSON objects in a data set (whether a
single file with multiple objects or a single-level or multilevel directory
structure) and report the combined structure of all the JSON objects, showing
every schema, subschema and array that appears in any of them. This way a data
analyst can see all the schema data that is available within the data set.
Additionally, if the function can provide statistics showing how many of the
objects actually contain each of the schemas, subschemas and arrays (and data
in each), this may indicate to an analyst how valuable or important it may be
to explore any subschema or array.
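
As a rough illustration of the kind of output such a function could produce,
the Python sketch below walks a few JSON objects and counts how many objects
contain each nested path. The names collect_paths and schema_profile, the "[]"
marker for array elements, and the sample records are assumptions made for
this example only; none of this is an existing Drill function.

```python
import json
from collections import Counter


def collect_paths(value, prefix=""):
    """Return the set of dotted paths present in one JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= collect_paths(child, path)
    elif isinstance(value, list):
        # Mark array elements with "[]" (a convention assumed for this sketch).
        array_path = f"{prefix}[]"
        for item in value:
            paths |= collect_paths(item, array_path)
    return paths


def schema_profile(json_lines):
    """Count, for each path, how many objects contain it."""
    counts = Counter()
    total = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        total += 1
        # A set is used per object, so each path counts once per object.
        counts.update(collect_paths(json.loads(line)))
    return total, counts


# Hypothetical sample records, one JSON object per line.
records = [
    '{"id": 1, "user": {"name": "a", "geo": {"lat": 1.0}}, "tags": ["x"]}',
    '{"id": 2, "user": {"name": "b"}}',
    '{"id": 3, "items": [{"sku": "k1"}, {"sku": "k2"}]}',
]

total, counts = schema_profile(records)
for path in sorted(counts):
    print(f"{path}: {counts[path]} of {total} objects")
```

Counting each path once per object (rather than once per occurrence) matches
the "how many of the objects contain it" statistic described above.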

To speed up the collection of this data, the function may offer an option to
set a sample size, so that only a portion of the total volume is sampled and
the results are projected onto the total data set. This is a very common
operation in prominent RDBMS systems today. Additionally, for data that
changes or grows, the metadata collection function will need to be run
periodically to update the statistics.
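
The sampling idea can be sketched as follows, again in Python and only under
stated assumptions: the function name project_counts, the simple linear
scaling, and the synthetic data set are all hypothetical and chosen for
illustration, not taken from Drill.

```python
import random
from collections import Counter


def project_counts(sample_counts, sample_size, total_objects):
    """Scale per-path counts from a sample up to the full data set."""
    if sample_size == 0:
        return {}
    scale = total_objects / sample_size
    return {path: round(n * scale) for path, n in sample_counts.items()}


# Hypothetical data set: each object is represented by the set of paths it
# contains. Every fourth object also has a "user.geo" subschema.
objects = [
    {"id", "user"} if i % 4 else {"id", "user", "user.geo"}
    for i in range(1000)
]

rng = random.Random(42)            # fixed seed so the sketch is repeatable
sample = rng.sample(objects, 100)  # sample 10% of the objects

sample_counts = Counter()
for paths in sample:
    sample_counts.update(paths)

estimate = project_counts(sample_counts, len(sample), len(objects))
for path in sorted(estimate):
    print(f"{path}: about {estimate[path]} of {len(objects)} objects")
```

Paths present in every object are projected exactly, while rarer paths such as
"user.geo" are only estimated, so the sample size trades collection speed
against accuracy.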

To make the metadata more useful, consider placing the results in a Drill
metadata structure similar to INFORMATION_SCHEMA, but holding only statistics
metadata for analysts to use in data exploration. Security should also be
designed so that only users with access to the base data can access this
metadata.

In addition to its use by data analysts for data exploration, the metadata and
statistics could also be used by Drill internal functions in the future, such
as query optimization and creation of views.

This example focuses specifically on JSON data, but the same approach can be
applied to other complex data types that require a very detailed understanding
of the data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
