Thanks for the explanation, very helpful. There are two parts to the problem.

On the one hand, you want to read an ever-changing set of JSON files. Your example with "c5" is exactly the kind of "cannot predict the future" issue that can trip up Drill (or, I would argue, any tool that tries to convert JSON to records).
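To make the trap concrete, here is a minimal sketch of first-row type inference in plain Python. The names `infer_type` and `scan` are illustrative, not Drill internals; the point is only that a schema-on-read engine must commit to a type before it has seen all the data.

```python
import json

def infer_type(value):
    """First-row guess, the way a schema-on-read engine must make it.
    Check bool before int: in Python, bool is a subclass of int."""
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "VARCHAR"

def scan(lines):
    """Lock in each column's type from the first record that shows it,
    then fail when a later record contradicts the guess."""
    schema = {}
    for line in lines:
        for col, val in json.loads(line).items():
            guessed = infer_type(val)
            if col not in schema:
                schema[col] = guessed
            elif schema[col] != guessed:
                raise TypeError(
                    f"column {col!r}: inferred {schema[col]}, then saw {guessed}")
    return schema

# A single consistent file is fine; a BIGINT followed by a DOUBLE is not.
try:
    scan(['{"a": 10}', '{"a": 10.1}'])
except TypeError as e:
    print(e)  # column 'a': inferred BIGINT, then saw DOUBLE
```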
One thing we have discussed (but not implemented) in Drill is the ability to read a JSON object as a Java map; that is, to store the non-relational JSON structure as a more JSON-like Java data structure. The value of your "c" column would then be a Java Map which sometimes contains a column "c5" which is itself a Java Map. This would allow Drill to handle any arbitrary JSON without the need for a schema.

So, we have a solution, right? Not so fast. We now get to the other part of the problem. Parquet is columnar, but it assumes that a file has a unified schema. Even if Drill could say, "here is your ragged collection of JSON objects", you'd still need to unify the schema to write the Parquet file. You are presumably creating Parquet to be consumed by other query tools. So, whatever tool consumes the Parquet now has the same issue as we had with JSON, but now across Parquet files: if some files have one schema and some files another, most tools (other than Drill) won't even get started; they need a consistent schema.

Hive solves this by allowing you to specify an (evolving) schema. You tell Hive that c.c5.d and c.c5.e are part of your schema, and Hive-compliant tools know to fill in nulls when the columns don't exist in some particular Parquet file.

Taking this whole-system view, we see that, in order to use Parquet as a relational data source, you will need to know the schema of the data in a form that your desired query tool understands. Thus, you need Drill to help you build Parquet files that satisfy that schema. This brings us to the other solution we've discussed: asking you to provide a schema (or allowing Drill to infer it). That way, even when Drill reads your "horses" record, Drill will know to create a c.c5 column as a map. That is, Drill will have the information to map your complex JSON into a relational-like structure. Thus, when Drill writes the Parquet file, it will write a consistent schema: the same one you must provide to non-Drill query tools.
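What a declared schema buys you can be sketched outside Drill in a few lines of Python. Everything here (`SCHEMA`, `get_path`, `unify`) is hypothetical and only illustrates the Hive-style contract described above: every row comes out with the same columns, with nulls where a record lacks a field.

```python
import json

# Hypothetical declared schema: dotted column path -> converter.
SCHEMA = {"a": str, "b": int, "c.c1": str, "c.c5.d": str, "c.c5.e": str}

def get_path(record, dotted):
    """Walk a dotted path into nested dicts, returning None when absent."""
    node = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

def unify(line):
    """Map one ragged JSON object onto the fixed, declared schema."""
    rec = json.loads(line)
    return {col: (conv(val) if (val := get_path(rec, col)) is not None else None)
            for col, conv in SCHEMA.items()}

lines = [
    '{"a":"horses","b":"28","c":{"c1":"black","c2":"blue"}}',
    '{"a":"cow","b":"28","c":{"c1":"blue","c5":{"d":"2","e":"3"}}}',
]
rows = [unify(l) for l in lines]
# Every row now has the same columns; the missing ones are filled with
# nulls, which is exactly what an evolving Hive schema promises consumers.
```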
Then, Drill won't write Parquet that depends on which columns did or did not show up in a single ETL run.

So, two possible (future) solutions: 1) Java Maps (but wild and crazy Parquet schemas), or 2) a declared schema to unify JSON files. Will one or the other work for your use case? Or is there some better solution you might suggest? And what is the target tool for the Parquet files you are creating? How will you handle schema evolution with those tools?

Thanks,
- Paul

On Friday, February 14, 2020, 7:15:15 AM PST, userdrill.mail...@laposte.net.INVALID <userdrill.mail...@laposte.net.invalid> wrote:

Hi,

Thanks for all the details. Coming back to one use case: the context is the transformation into Parquet of JSONs containing billions of records, where each record has the same global schema but can have some specificities. Simplified example:

{"a":"horses","b":"28","c":{"c1":"black","c2":"blue"}}
{"a":"rabbit","b":"14","c":{"c1":"green","c4":"vanilla"}}
{"a":"cow","b":"28","c":{"c1":"blue","c3":"black","c5":{"d":"2","e":"3"}}}
...

We need to transform the JSON into Parquet. That is fine for columns a and b (in this example), but for c we don't/can't know all the possibilities, and the set keeps growing. So the solution is to read "c" as TEXT and defer the processing of its content. In this example, the destination Parquet will have 3 columns:

a : VARCHAR (example: 'horses')
b : INT (example: 14)
c : VARCHAR (example: '{"c1":"blue","c3":"black","c5":{"d":"2","e":"3"}}')

We can't do that with Drill because the "discover/alignment" of the "c" part of the JSON is too heavy in terms of resources, and the request crashes. So we currently use a Spark solution, as Spark allows us to specify a schema when reading a file.

Hope that can help or give ideas,

Regards,

> Hi,
>
> Welcome to the Drill mailing list.
>
> You are right. Drill is a SQL engine. It works best when the JSON input files represent rows and columns.
> Of course, JSON itself can represent arbitrary data structures: you can use it to serialize any Java structure you want. Relational tables and columns represent a small subset of what JSON can do. Drill's goal is to read relational data encoded in JSON, not to somehow magically convert any arbitrary data structure into tables and columns.
>
> As described in our book, Learning Apache Drill, even seemingly trivial JSON can violate relational rules. For example:
>
> {a: 10}
> {a: 10.1}
>
> Since Drill infers types, and must guess the type on the first row, Drill will guess BIGINT. Then, the very next row shows that that was a poor guess, and Drill will raise an error. I like to call this the "Drill cannot predict the future" problem.
>
> The team is working on a long-term project to allow you to specify a schema to resolve ambiguities. For example, you can tell Drill that column a above is a DOUBLE despite what the data might say. You can deal with schema evolution to say that column b (which does not appear above, but might appear in newer files) is an array of BIGINT. And so on.
>
> Drill also supports LATERAL JOIN and FLATTEN to handle nested tables:
>
> {name: "fred", orders: [ {date: "Jan 1", amount: 12.34}, {date: "Jan 12", amount: 23.45}]}
>
> The schema, however, will not transform arbitrary JSON into tables and columns. Some things are better done in an ETL step where you can use the full power of a declarative language (Java or Scala in Spark, say) to convert wild & crazy JSON into relational form.
>
> We are actively designing the schema feature. May I ask your use case? It would be super-helpful to understand the shape of your input data and how you want to map that into SQL.
>
> One final note: it is possible to write a custom format plugin. If you will query the same wild & crazy JSON shape multiple times, you can write a plugin to do the mapping as Drill reads the data.
> Not the simplest path, but possible.
>
> Thanks,
> - Paul
>
>> On Wednesday, February 5, 2020, 1:30:14 PM PST, userdrill.mail...@laposte.net.INVALID <userdrill.mail...@laposte.net.invalid> wrote:
>>
>> Hi,
>>
>> Some JSON files are complex and contain different "tree structs". If these files are big, it will take too much time for Drill to align the structures (and, even worse, it sometimes fails). In Spark it is possible to force a schema when reading a file, to avoid a long or useless alignment treatment, and even to dismiss fields or force types (such as forcing to string, to avoid going down into the structure).
>>
>> Is there any possibility in Drill to specify an explicit schema at read time?
>>
>> Thanks for any information
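The workaround described in the February 14 message (type columns a and b, but keep "c" as raw TEXT so its ever-growing structure never has to be unified) can be sketched in a few lines of plain Python. The helper `to_row` is hypothetical, not a Drill or Spark API; it only shows the shape of the three-column output.

```python
import json

def to_row(line):
    """Keep a and b typed, but leave c as serialized JSON text, so the
    ever-changing structure under c never needs schema alignment."""
    rec = json.loads(line)
    return {
        "a": rec["a"],
        "b": int(rec["b"]),
        # Re-serialize c rather than slicing the input line, so the column
        # holds well-formed JSON regardless of the source formatting.
        # A record with no c at all comes out as the JSON text "null".
        "c": json.dumps(rec.get("c"), separators=(",", ":")),
    }

row = to_row(
    '{"a":"cow","b":"28","c":{"c1":"blue","c3":"black","c5":{"d":"2","e":"3"}}}')
# row["c"] is an ordinary string; downstream tools parse it only if needed.
```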