Hi All,

One more follow-up. Perhaps there are two cases here: 1) "normal" columns that 
have evolved over time from one type to another, and 2) data that is 
inherently variable and non-relational.

One can make the case that the power and convenience of SQL come from having a 
known set of types for "normal" columns, such as the product version example we 
discussed. The "data lake guys" use ETL (usually into Parquet) and HMS schemas 
to define and enforce the types. This clearly works for the data lake use case: 
Impala and Hive were successful for many years with this approach, and even 
Amazon does the same with Lambdas, Glue, etc. 

For Drill, we're talking about data integration outside of a data lake. One can 
argue that known types are still useful. So perhaps we need a lightweight way, 
possibly as part of the storage plugin configuration, to resolve ambiguities. 
That is, tell Drill: "infer the type of most columns, because they are obvious; 
but for product version, treat it as a VARCHAR."
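
For example (just a sketch; the exact mechanism is open, and the table and 
column names here are invented), the provided-schema work could be one vehicle 
for such a hint:

    -- Hypothetical hint: let Drill infer every column except version,
    -- which should always be read as VARCHAR.
    CREATE OR REPLACE SCHEMA (`version` VARCHAR)
    FOR TABLE dfs.sales.`products`;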

Then, there is the separate problem of types that are inherently 
non-relational: arbitrary JSON structures, say. Here, it is unclear what it 
means to use SQL to process data that contains a Customer in one row, an Order 
in the next, and a Blog Post in the third. The key is to understand real use 
cases and see what is done in actual practice, since the naive conclusion would 
be that SQL is not a good tool for such data.

For the users out there, can you explain how you'd like this to work for your 
application?

Thanks,
- Paul

 

    On Thursday, January 16, 2020, 07:00:20 PM PST, Paul Rogers 
<[email protected]> wrote:  
 
As a follow-up, and to avoid rehashing the same solutions we've discussed for 
some time, here are just a few:

* Provide a very lightweight schema "hint" only for those columns that need 
the hint. In the example, only provide a schema for the ambiguous version 
column. If a column tends to have lots of nulls, tell Drill what type it will 
be when data actually appears. Arina has a good start on this approach.


* Populate the above automagically as queries run. (Ted has long advocated this 
one.)

* Populate the schema by converting existing specs, such as a Swagger spec.

* Make the existing UNION and LIST (repeated UNION) types work fully, whatever 
that means; we'd need type conversion rules for all type combinations. (See the 
sketch after this list.)

* Add true dynamic types. Drill halfway supports vectors that hold Java 
objects. Make this a first-class feature. Auto-convert conflicting fields. This 
also needs type conversion rules for all combinations. (Might be handy for 
special cases, but too heavyweight for the typical schema-evolution case. Might 
be an easy way to handle complex structures such as images, complex JSON 
structures, etc.)

* (Insert your ideas here.)
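
To make the UNION idea concrete, here is roughly what exists today (the option 
and function names below are from memory, so treat this as a sketch):

    -- Drill's experimental union type can be switched on per session.
    ALTER SESSION SET `exec.enable_union_type` = true;

    -- With it, a column whose values are sometimes INT and sometimes VARCHAR
    -- is carried as a UNION, and typeof() reveals the per-row type.
    SELECT version, typeof(version) AS version_type
    FROM dfs.`/data/products.json`;

"Work fully" would mean defining what comparisons, sorts and the various 
functions do when handed such a column.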


Point is, we're not short on ideas. What we need to know is what folks want; 
then we can pick the idea that works best.


Thanks,
- Paul

 

    On Thursday, January 16, 2020, 6:48:22 PM PST, Paul Rogers 
<[email protected]> wrote:  
 
Hi Jiang,

Thanks for taking the time to explain your use case. In my experience, the 
scenario you describe is not unusual: the desire to integrate application data 
behind a common SQL interface. IMHO, this is where Drill differs from the "data 
lake guys" such as Impala and Presto. It would be helpful if you could work 
with us to get the solution right.


You have hit upon one of the key challenges in making this model work: the 
question of how to use SQL to work with a column that has a varying data type. 
Your product version use case is a classic example.


In the non-relational world, types are messy. That's why we have Python: we can 
write code that forces the values to one type or another, we can do "if" 
statements based on types, or we can use "duck typing" to do dynamic method 
dispatch.


But, you want to use the convenience (and speed) of SQL. We want to hide these 
details somehow. So we have to be clever.

Hive (and later Presto, Impala, and even Drill when used with HMS) solved the 
problem by requiring a schema stored in HMS. Simple idea, but HMS has become 
heavyweight. With HMS, we can declare that our version field is a VARCHAR and 
conversions can be done at read time (schema-on-read). Problem solved. But HMS 
works best for files; it does not work for the app integration use case: 
reading data from ad-hoc files, accessing APIs and so on.
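
For example, in the HMS world the decision is a one-time declaration in the DDL 
(the table below is invented, just to illustrate):

    -- Hive DDL: declare version as a string once, in the metastore.
    -- Every engine that reads this table then agrees on the type,
    -- even though some stored values look numeric.
    CREATE EXTERNAL TABLE products (
      id      BIGINT,
      name    STRING,
      version STRING
    )
    STORED AS PARQUET
    LOCATION '/data/products';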

Schemas really help with distributed systems. Suppose I say SELECT a + b; SQL 
can work out that a is an INT, b is a BIGINT, and the result should be a 
BIGINT. Work can be distributed across many independent nodes and all will make 
the same type decisions. Data can then be merged and the types will agree. 
Basic stuff. In fact, types are part of the very definition of the relational 
theory on which SQL is based. 
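
A trivial sketch of what the planner does with declared types (this is just the 
standard SQL numeric promotion):

    -- a is INT, b is BIGINT (known from the schema).
    -- Every node resolves a + b the same way: a is widened to BIGINT,
    -- and the result column is BIGINT.
    SELECT a + b AS total
    FROM some_table;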


But if data types vary (no schema, as in Drill without HMS), things get messy. 
Drill can distribute filters. Suppose I say WHERE version = 10. Drillbit 1 
reads the INT-valued version fields, Drillbit 2 reads the VARCHAR-valued 
fields. How do we make sure that both nodes make the same decisions?

Later, when data is merged, should there be a common type? What if I say ORDER 
BY version? What is the intended result? Sort the INT values before (or after) 
the VARCHAR values? Convert VARCHAR to INT (or vice versa)?

Another factor is speed. Python is slow because it does dynamic type handling 
on every operation. Java and C are fast because they are statically typed. 
Similarly, Impala is fast because of static types. Drill tries to be fast by 
having a fixed vector type for each column. But Drill also tries to be 
flexible, which is when things start to get "interesting."

The crude-but-effective solution, without a schema, is to require users to 
include explicit CASTs and other conversions in every query. It seems this was 
standard Hadoop practice before HMS came along. But since you want to use 
SQL, you presumably want to shield users from this complexity.
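
Here is a sketch of what that looks like in practice (file and column names 
borrowed loosely from your example; note that even this can trip over a 
mid-query schema change in Drill today):

    -- Without a schema, the user normalizes the type in every query.
    SELECT name,
           CAST(version AS VARCHAR) AS version
    FROM dfs.`/data/products.json`
    WHERE CAST(version AS VARCHAR) = '10'
    ORDER BY CAST(version AS VARCHAR);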

See? You have nailed the key challenge we must solve to make the app 
integration idea work.


We can make up lots of things we *could* do. They pretty much fall into three 
buckets:

* Give the user or app developer a way to declare the type so all nodes do the 
same thing. Declare the type either per-query (with a CAST) or per-table (using 
some kind of schema or view; see the sketch after this list).

* Discover or impose a uniform set of conversion rules that work in all cases. 
(Always convert to VARCHAR, say.)

* Embrace dynamic types. Add dynamic types to Drill to make it more 
Python-like, with "duck typing" to decide, say, that "+" means different things 
for different type combinations. Accept the resulting performance hit.
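
To make the first bucket concrete, here is roughly what the per-table flavor 
could look like using nothing more than a view (the names are invented):

    -- A view that pins the ambiguous column to one type;
    -- users query the view and never see the CASTs.
    CREATE VIEW dfs.tmp.products_v AS
    SELECT id,
           name,
           CAST(version AS VARCHAR) AS version
    FROM dfs.`/data/products.json`;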


Before we brainstorm specific alternatives, it would be very helpful to 
understand what you actually *want* to do. What was the intent when the type 
changed? How do you want users to work with such data in a SQL context? What 
would your ideal user experience look like?


Thanks,
- Paul

 

    On Thursday, January 16, 2020, 4:47:01 PM PST, Jiang Wu 
<[email protected]> wrote:  
 
Moving the topic of non-relational data to this dedicated thread.  First, a
bit of context based on our use case:

* We want to do ad-hoc analysis of data coming from diverse sources like APIs,
document stores, and relational stores.
* Data are not limited to relational structures, e.g. APIs returning complex
object collections.
* Data may change structure over time, e.g. due to implementation upgrades.
* We want to use high-level declarative query languages such as SQL.

Various techniques exist to tackle non-relational data analysis, such as
mapping to a relational schema or running custom code in a distributed compute
cluster (map-reduce, Spark jobs, etc.) on blob data.  These have drawbacks:
data latency and the effort of structure transformation in the first case, and
query latency and the cost of computing on blob data in the second.

We built a columnar data store for non-relational data without a pre-defined
schema.  For querying this data, technologies like Drill made it almost
possible to work directly with non-relational data using array and map data
types.  However, we feel more can be done to truly make non-relational data
a first-class citizen:

1) functions on array and map -- e.g. sizeOf(person.addresses) where
person.addresses is an array.  Using FLATTEN is not the same as working
with complex objects directly (see the sketch after this list).
2) heterogeneous types -- better handling of heterogeneous data types within
the same column, e.g. product.version started as numbers, but some values are
strings.  Treating every value as a String is a workaround.
3) better storage plugin support for complex types -- we had to regenerate
objects from our columnar vectors to hand to Drill, rather than feeding our
vectors to Drill directly.
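
To illustrate point 1, here is a sketch of the difference (the persons table is
made up; Drill's repeated_count comes close for this simple case, but it is not
general):

    -- What we would like: operate on the array as a value.
    SELECT p.name, repeated_count(p.addresses) AS num_addresses
    FROM persons p;

    -- The FLATTEN workaround: explode the array, then re-aggregate.
    SELECT t.name, COUNT(*) AS num_addresses
    FROM (SELECT name, FLATTEN(addresses) AS addr FROM persons) t
    GROUP BY t.name;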

I don't think any of these are easy to do.  Much research and thinking will
be needed for a cohesive solution.

-- Jiang