Hi Paul. Thank you for all the analysis. Let me digest the information and I will come back with my views.
— Jiang

On Fri, Jan 17, 2020 at 11:02 AM Paul Rogers <[email protected]> wrote:

Hi All,

One more follow up. Perhaps there are two cases here: 1) "normal" columns which have evolved over time from one type to another, and 2) data which is inherently variable and non-relational.

One can make the case that the power/convenience of SQL comes from having a known set of types for "normal" columns, such as the product version example we discussed. The "data lake guys" use ETL (usually into Parquet) and HMS schemas to enforce and define the types. Clearly this works for the data lake use case: Impala & Hive were successful for many years using this approach. Even Amazon does the same with Lambdas, Glue, etc.

For Drill, we're talking about data integration outside of a data lake. One can argue that known types are still useful. So, perhaps we need a light-weight way, perhaps part of the storage plugin configuration, to resolve ambiguities. That is, tell Drill, "infer the type of most columns, because they are obvious. But, for product version, treat it as a VARCHAR."

Then, there is the separate problem of types which are inherently non-relational: arbitrary JSON structures, say. Here, it is unclear what it means to use SQL to process data which contains a Customer in one row, an Order in the next, and a Blog Post in the third. The key is to understand real use cases to see what is done in actual practice, since the naive conclusion would be that SQL is not a good tool for such data.

For the users out there, can you explain how you'd like this to work for your application?

Thanks,
- Paul


On Thursday, January 16, 2020, 07:00:20 PM PST, Paul Rogers <[email protected]> wrote:

As a follow-up, to avoid us rehashing the same solutions we've discussed for some time, here are just a few:

* Provide a very light-weight schema "hint" only for those columns that need the hint. In the example, only provide schema for the ambiguous version column. If a column tends to have lots of nulls, tell Drill what type it will be when data actually appears. Arina has a good start on this approach.

* Populate the above automagically as queries run. (Ted has long advocated this one.)

* Populate the schema by converting existing specs, such as a Swagger spec.

* Make the existing UNION and LIST (repeated UNION) types work fully (whatever that means; we'd need type conversion rules for all type combinations.)

* Add true dynamic types. Drill half-way supports vectors that hold Java objects. Make this a first-class feature. Auto-convert conflicting fields. Also needs type conversion rules for all combinations. (Might be more handy for special cases, too heavy-weight for the typical schema-evolution case. Might be an easy way to handle complex structures such as images, complex JSON structures, etc.)

* (Insert your ideas here.)

Point is, we're not short on ideas. What we need to know is what folks want; then we can pick the idea that works best.

Thanks,
- Paul


On Thursday, January 16, 2020, 6:48:22 PM PST, Paul Rogers <[email protected]> wrote:

Hi Jiang,

Thanks for taking the time to explain your use case. In my experience, the scenario you describe is not unusual: the desire to integrate application data behind a common SQL interface. IMHO, this is where Drill differs from the "data lake guys" such as Impala and Presto.
It would be helpful if you can help us get the solution right.

You have hit upon one of the key challenges in making this model work: the question of how to use SQL to work with a column that has a varying data type. Your product version use case is a classic example.

In the non-relational world, types are messy. That's why we have Python: we can write code that forces the values to one type or another, we can do "if" statements based on types, or we can use "duck typing" to do dynamic method dispatch.

But, you want to use the convenience (and speed) of SQL. We want to hide these details somehow. So we have to be clever.

Hive (then Presto, Impala and even Drill, when used with HMS) solved the problem by requiring a schema stored in HMS. Simple idea, but HMS has become heavyweight. With HMS, we can declare that our version field is a VARCHAR and conversions can be done at read time (schema-on-read). Problem solved. HMS works best for files. It does not work for the app integration use case: reading data from ad-hoc files, accessing APIs and so on.

Schemas really help with distributed systems. Suppose I say SELECT a + b; SQL can work out that a is an INT, b is a BIGINT and the result should be a BIGINT. Work can be distributed across many independent nodes and all will make the same type decisions. Data can then be merged and types will agree. Basic stuff. In fact, types are part of the very definition of the relational theory on which SQL is based.

But, if data types vary (no schema, as in Drill without HMS), things get messy. Drill can distribute filters. Suppose I say WHERE version = 10. Drillbit 1 reads the INT-valued version fields, Drillbit 2 reads the VARCHAR-valued fields. How do we make sure that both nodes make the same decisions?

Later, when data is merged, should there be a common type? What if I say ORDER BY version? What is the intended result? Sort the INT values before (or after) the VARCHAR values? Convert VARCHAR to INT (or vice versa)?

Another factor is speed. Python is slow because it does dynamic type handling on every operation. Java and C are fast because they are statically typed. Similarly, Impala is fast because of static types. Drill tries to be fast by having a fixed vector type for each column. But, Drill also tries to be flexible, which is when things start to get "interesting."

The crude-but-effective solution, without a schema, is to require users to explicitly include the CAST and other statements in every query. It seems this was standard Hadoop practice before HMS came along. But, since you want to use SQL, you presumably want to shield users from this complexity.

See? You have nailed the key challenge we must solve to make the app integration idea work.

We can make up lots of things we *could* do. They pretty much fall into three buckets:

* Give the user or app developer a way to declare the type so all nodes do the same thing. Declare the type either per-query (with a CAST) or per-table (using some kind of schema or view). (A rough sketch of both follows this list.)

* Discover or impose a uniform set of conversion rules that work in all cases. (Always convert to VARCHAR, say.)

* Embrace dynamic types. Add dynamic types to Drill to make it more Python-like, with "duck typing" to decide, say, that "+" means different things for different type combinations. Accept the resulting performance hit.
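To make the first bucket concrete, here is a minimal sketch of what the per-query and per-table declarations could look like in Drill SQL. The file `products.json`, the workspace dfs.tmp, and the `version` column are illustrative assumptions, and the provided-schema syntax (CREATE SCHEMA and the schema table-function parameter) should be read as an approximation of Drill's feature, not a definitive reference:

    -- Per-query: force a single type with an explicit CAST so every Drillbit
    -- makes the same decision for filters, merges, and sorts.
    SELECT product_id,
           CAST(`version` AS VARCHAR(20)) AS `version`
    FROM dfs.`/data/products.json`
    WHERE CAST(`version` AS VARCHAR(20)) = '10'
    ORDER BY CAST(`version` AS VARCHAR(20));

    -- Per-table: declare a lightweight schema "hint" once for the ambiguous
    -- column, so plain queries work without repeating the CAST.
    CREATE OR REPLACE SCHEMA (`version` VARCHAR) FOR TABLE dfs.tmp.`products`;

    -- Or attach the hint inline for a single query via a table function.
    SELECT `version`
    FROM table(dfs.`/data/products.json`(schema => 'inline=(`version` VARCHAR)'))
    WHERE `version` = '10';

Either way, the point is the same: the type decision is made once, up front, instead of being re-derived (possibly differently) on each node.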
Before we brainstorm specific alternatives, it would be very helpful to understand what you actually *want* to do. What was the intent when the type changed? How do you want users to work with such data in a SQL context? How would your ideal user experience look?

Thanks,
- Paul


On Thursday, January 16, 2020, 4:47:01 PM PST, Jiang Wu <[email protected]> wrote:

Moving the topic on non-relational data to this dedicated thread. First a bit of context based on our use case:

* We want to do ad-hoc analysis of data coming from diverse sources like APIs, document stores, and relational stores.
* Data are not limited to relational structures, e.g. an API returning complex object collections.
* Data may change its structure over time, e.g. due to implementation upgrades.
* We want to use high-level declarative query languages such as SQL.

Various techniques exist to tackle non-relational data analysis, such as mapping to a relational schema or running custom code in a distributed compute cluster (map-reduce, Spark jobs, etc.) on blob data. These have their drawbacks: data latency and the effort of structure transformation for the former, and query latency and the cost of computing on blob data for the latter.

We built a columnar data store for non-relational data without a pre-defined schema. For querying this data, technologies like Drill made it almost possible to directly work with non-relational data using array and map data types. However, we feel more can be done to truly make non-relational data a first-class citizen:

1) functions on array and map -- e.g. sizeOf(person.addresses) where person.addresses is an array. Using FLATTEN is not the same as working with complex objects directly.

2) heterogeneous types -- better handling of heterogeneous data types within the same column, e.g. product.version started as numbers, but some are strings. Treating every value as a String is a workaround.

3) better storage plugin support for complex types -- we had to re-generate objects from our columnar vectors to give to Drill, rather than feeding the vectors directly.

I don't think any of these are easy to do. Much research and thinking will be needed for a cohesive solution. (A rough sketch of items 1 and 2 follows below.)

-- Jiang
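As an illustration of items 1 and 2 above, here is a rough sketch of what the current workarounds look like in Drill SQL. The files `people.json` and `products.json` and their columns are hypothetical; REPEATED_COUNT, FLATTEN, and the all_text_mode option are existing Drill features, while a general sizeOf() that also covers maps and nested paths is the ask:

    -- Item 1: array size today, either via REPEATED_COUNT on a repeated field
    -- or by FLATTENing and re-aggregating; neither generalizes to maps or
    -- arbitrary nested paths the way a sizeOf() function would.
    SELECT p.`name`, REPEATED_COUNT(p.addresses) AS address_count
    FROM dfs.`/data/people.json` p;

    SELECT `name`, COUNT(*) AS address_count
    FROM (SELECT p.`name` AS `name`, FLATTEN(p.addresses) AS addr
          FROM dfs.`/data/people.json` p)
    GROUP BY `name`;

    -- Item 2: the "treat every value as a String" workaround for a column
    -- that drifted from numbers to strings: read all JSON values as text,
    -- then CAST back wherever a numeric interpretation is still wanted.
    ALTER SESSION SET `store.json.all_text_mode` = true;

    SELECT `version`, COUNT(*) AS cnt
    FROM dfs.`/data/products.json`
    GROUP BY `version`;

Both workarounds push the type and shape handling into every query, which is exactly the burden the three requests above aim to remove.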
