Thanks for all the details.

Coming back to one use case: the context is the transformation into Parquet of 
JSON files containing billions of records, where every record shares the same 
global schema but can have some specificities.
Simplified example:
{"a":"cow","b":"28","c":{"c1":"blue","c3":"black"}}

We need to transform the JSON into Parquet.
That's fine for columns a and b (in this example), but for c we don't/can't 
know all the possibilities, and they keep growing continuously. So the solution 
is to read "c" as TEXT and defer the use/treatment of its content.
So in this example, the destination Parquet will have 3 columns:
a : VARCHAR (example: 'horses')
b : INT     (example: 14)
c : VARCHAR (example: '{"c1":"blue","c3":"black","c5":{"d":"2","e":"3"}}')
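To make the target transformation concrete, here is a minimal pure-Python sketch 
(standard library only; the function name to_row and the sample values are 
illustrative, not part of any real pipeline) of the per-record mapping we need: 
cast a and b, but keep "c" as opaque JSON text without ever descending into it.

```python
import json

def to_row(line: str) -> dict:
    """Map one JSON record to the three target Parquet columns:
    a -> VARCHAR, b -> INT, c -> VARCHAR (raw JSON text)."""
    rec = json.loads(line)
    return {
        "a": rec["a"],               # keep as string
        "b": int(rec["b"]),          # cast "28" -> 28
        # Re-serialize "c" untouched: we never inspect its inner schema.
        "c": json.dumps(rec.get("c"), separators=(",", ":")),
    }

row = to_row('{"a":"cow","b":"28","c":{"c1":"blue","c3":"black"}}')
print(row["c"])  # -> {"c1":"blue","c3":"black"}
```

Whatever new keys appear inside "c" in future files, this mapping never has to 
know about them, which is exactly what avoids the schema-discovery cost.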

We can't do that with Drill, because the discovery/alignment of the "c" part of 
the JSON is too heavy in terms of resources, and the query crashes.

So we currently use a Spark solution, as Spark allows specifying a schema when 
reading a file.
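In the same spirit, forcing a schema at read time can be sketched as follows. 
This is a hedged pure-Python stand-in, not the actual Spark API: the SCHEMA 
dict below plays the role of Spark's explicit read schema (a StructType in 
real Spark), with "text" marking columns to keep as serialized JSON.

```python
import json

# Hypothetical stand-in for an explicit read schema.
# "text" means: do not descend into the value, keep it as raw JSON text.
SCHEMA = {"a": str, "b": int, "c": "text"}

def read_with_schema(line: str, schema: dict) -> dict:
    """Apply a declared schema to one JSON record: cast known columns,
    serialize 'text' columns back to JSON without schema discovery."""
    rec = json.loads(line)
    out = {}
    for col, typ in schema.items():
        val = rec.get(col)
        if typ == "text":
            out[col] = json.dumps(val, separators=(",", ":"))
        else:
            out[col] = typ(val)  # cast to the declared type
    return out

row = read_with_schema('{"a":"horses","b":"14","c":{"c1":"blue"}}', SCHEMA)
```

The point is that the reader only ever consults the declared schema, never the 
data, so no alignment pass over billions of variants of "c" is needed.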

Hope that can help or give ideas,


> Hi,
> Welcome to the Drill mailing list.
> You are right. Drill is a SQL engine. It works best when the JSON input files 
> represent rows
> and columns.
> Of course, JSON itself can represent arbitrary data structures: you can use 
> it to serialize
> any Java structure you want. Relational tables and columns represent a small 
> subset of what
> JSON can do. Drill's goal is to read relational data encoded in JSON, not to 
> somehow magically
> convert any arbitrary data structure into tables and columns.
> As described in our book, Learning Apache Drill, even seemingly trivial JSON 
> can violate relational
> rules. For example:
> {a: 10} {a: 10.1}
> Since Drill infers types, and must guess the type on the first row, Drill 
> will guess BIGINT.
> Then, the very next row shows that that was a poor guess and Drill will raise 
> an error. I
> like to call this the "Drill cannot predict the future" problem.
> The team is working on a long-term project to allow you to specify a schema 
> to resolve ambiguities.
> For example, you can tell Drill that column a above is a DOUBLE despite what 
> the data might
> say. You can deal with schema evolution to say that column b (which does not 
> appear above,
> but might appear in newer files) is an array of BIGINT. And so on.
> Drill also supports LATERAL JOIN and FLATTEN to handle nested tables:
> {name: "fred", orders: [ {date: "Jan 1", amount: 12.34}, {date: "Jan 12", 
> amount: 23.45}]}
> The schema, however, will not transform arbitrary JSON into tables and 
> columns. Some things
> are better done in an ETL step where you can use the full power of a 
> general-purpose language
> (Java or Scala in Spark, say) to convert wild & crazy JSON into relational 
> form.
> We are actively designing the schema feature. May I ask your use case? Would 
> be super-helpful
> to understand the shape of your input data and how you want to map that into 
> SQL.
> One final note: it is possible to write a custom format plugin. If you will 
> query the same
> wild & crazy JSON shape multiple times, you can write a plugin to do the 
> mapping as Drill
> reads the data. Not the simplest path, but possible.
> Thanks,
> - Paul
>>     On Wednesday, February 5, 2020, 1:30:14 PM PST, 
>> userdrill.mail...@laposte.net.INVALID
>> <userdrill.mail...@laposte.net.invalid> wrote:  
>>  Hi,
>> Some JSON files are complex and contain different "tree structures".
>> If these files are big, it takes too much time for Drill to align the 
>> structures (and even
>> worse, it sometimes fails).
>> In Spark it's possible to force a schema when reading a file, to avoid long 
>> or useless alignment
>> work, and even to dismiss fields or force types (e.g. to string, to 
>> avoid descending
>> into the structure).
>> Is there any possibility in Drill to specify an explicit schema at read time?
>> Thanks for any information
