Thanks for all the details.

Coming back to one use case: the context is the transformation into Parquet of 
JSON files containing billions of records, where every record shares the same 
global schema but can have some specificities.
Simplified example:
{"a":"cow","b":"28","c":{"c1":"blue","c3":"black"}}

We need to transform the JSON into Parquet.
That's fine for columns a and b (in this example), but for c we don't/can't 
know all the possibilities, and they keep growing continuously. So the solution 
is to read "c" as TEXT and defer the use/treatment of its content.
So in this example, the destination Parquet will have 3 columns:
a : VARCHAR (example: 'horses')
b : INT     (example: 14)
c : VARCHAR (example: '{"c1":"blue","c3":"black","c5":{"d":"2","e":"3"}}')
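To make the target transformation concrete, here is a minimal pure-Python sketch 
(standard library only; the function name to_row and the sample values are 
illustrative, not part of any real pipeline) of the per-record mapping we need: 
cast a and b, but keep "c" as opaque JSON text without ever descending into it.

```python
import json

def to_row(line: str) -> dict:
    """Map one JSON record to the three target Parquet columns:
    a -> VARCHAR, b -> INT, c -> VARCHAR (raw JSON text)."""
    rec = json.loads(line)
    return {
        "a": rec["a"],               # keep as string
        "b": int(rec["b"]),          # cast "28" -> 28
        # Re-serialize "c" untouched: we never inspect its inner schema.
        "c": json.dumps(rec.get("c"), separators=(",", ":")),
    }

row = to_row('{"a":"cow","b":"28","c":{"c1":"blue","c3":"black"}}')
print(row["c"])  # -> {"c1":"blue","c3":"black"}
```

Whatever new keys appear inside "c" in future files, this mapping never has to 
know about them, which is exactly what avoids the schema-discovery cost.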

We can't do that with Drill, because the discovery/alignment of the "c" part of 
the JSON is too heavy in terms of resources, and the query crashes.

So we currently use a Spark solution, as Spark allows specifying a schema when 
reading a file.
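In the same spirit, forcing a schema at read time can be sketched as follows. 
This is a hedged pure-Python stand-in, not the actual Spark API: the SCHEMA 
dict below plays the role of Spark's explicit read schema (a StructType in 
real Spark), with "text" marking columns to keep as serialized JSON.

```python
import json

# Hypothetical stand-in for an explicit read schema.
# "text" means: do not descend into the value, keep it as raw JSON text.
SCHEMA = {"a": str, "b": int, "c": "text"}

def read_with_schema(line: str, schema: dict) -> dict:
    """Apply a declared schema to one JSON record: cast known columns,
    serialize 'text' columns back to JSON without schema discovery."""
    rec = json.loads(line)
    out = {}
    for col, typ in schema.items():
        val = rec.get(col)
        if typ == "text":
            out[col] = json.dumps(val, separators=(",", ":"))
        else:
            out[col] = typ(val)  # cast to the declared type
    return out

row = read_with_schema('{"a":"horses","b":"14","c":{"c1":"blue"}}', SCHEMA)
```

The point is that the reader only ever consults the declared schema, never the 
data, so no alignment pass over billions of variants of "c" is needed.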

Hope that can help or give ideas,


> Hi,
> Welcome to the Drill mailing list.
> You are right. Drill is a SQL engine. It works best when the JSON input files 
> represent rows
> and columns.
> Of course, JSON itself can represent arbitrary data structures: you can use 
> it to serialize
> any Java structure you want. Relational tables and columns represent a small 
> subset of what
> JSON can do. Drill's goal is to read relational data encoded in JSON, not to 
> somehow magically
> convert any arbitrary data structure into tables and columns.
> As described in our book, Learning Apache Drill, even seemingly trivial JSON 
> can violate relational
> rules. For example:
> {a: 10} {a: 10.1}
> Since Drill infers types, and must guess the type on the first row, Drill 
> will guess BIGINT.
> Then, the very next row shows that that was a poor guess and Drill will raise 
> an error. I
> like to call this the "Drill cannot predict the future" problem.
> The team is working on a long-term project to allow you to specify a schema 
> to resolve ambiguities.
> For example, you can tell Drill that column a above is a DOUBLE despite what 
> the data might
> say. You can deal with schema evolution to say that column b (which does not 
> appear above,
> but might appear in newer files) is an array of BIGINT. And so on.
> Drill also supports LATERAL JOIN and FLATTEN to handle nested tables:
> {name: "fred", orders: [ {date: "Jan 1", amount: 12.34}, {date: "Jan 12", 
> amount: 23.45}]}
> The schema, however, will not transform arbitrary JSON into tables and 
> columns. Some things
> are better done in an ETL step where you can use the full power of a 
> general-purpose language
> (Java or Scala in Spark, say) to convert wild & crazy JSON into relational 
> form.
> We are actively designing the schema feature. May I ask your use case? Would 
> be super-helpful
> to understand the shape of your input data and how you want to map that into 
> SQL.
> One final note: it is possible to write a custom format plugin. If you will 
> query the same
> wild & crazy JSON shape multiple times, you can write a plugin to do the 
> mapping as Drill
> reads the data. Not the simplest path, but possible.
> Thanks,
> - Paul
>>     On Wednesday, February 5, 2020, 1:30:14 PM PST, 
>> userdrill.mail...@laposte.net.INVALID
>> <userdrill.mail...@laposte.net.invalid> wrote:  
>>  Hi,
>> Some JSON files are complex and contain different "tree structures".
>> If these files are big, it takes too much time for Drill to align the 
>> structures (and even
>> worse, it sometimes fails).
>> In Spark it's possible to force a schema when reading a file, to avoid long 
>> or useless alignment
>> work, and even to dismiss fields or force types (e.g. to string, to 
>> avoid descending
>> into the structure).
>> Is there any possibility in Drill to specify an explicit schema at read time?
>> Thanks for any information
