Thanks for the explanation, very helpful.

There are two parts to the problem. On the one hand, you want to read an 
ever-changing set of JSON files. Your example with "c5" is exactly the kind of 
"cannot predict the future" issue that can trip up Drill (or, I would argue, 
any tool that tries to convert JSON to records).

One thing we have discussed (but not implemented) in Drill is the ability to 
read a JSON object as a Java map. That is, store the non-relational JSON 
structure as a more JSON-like Java data structure. The value of your "c" column 
would then be a Java Map that sometimes contains a key "c5" whose value is, in 
turn, another Java Map.
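
To illustrate, here is a rough sketch of what that representation might look 
like (purely hypothetical, since none of this is implemented; the class name 
is made up):

    import java.util.HashMap;
    import java.util.Map;

    public class JsonAsMaps {
        public static void main(String[] args) {
            // The nested "c5" object from the "cow" record.
            Map<String, Object> c5 = new HashMap<>();
            c5.put("d", "2");
            c5.put("e", "3");

            // The "c" column: a map that only sometimes has a "c5" key.
            Map<String, Object> c = new HashMap<>();
            c.put("c1", "blue");
            c.put("c3", "black");
            c.put("c5", c5);

            // The whole record, held as generic Java structures rather
            // than as relational columns.
            Map<String, Object> record = new HashMap<>();
            record.put("a", "cow");
            record.put("b", "28");
            record.put("c", c);

            System.out.println(record);
        }
    }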

This would allow Drill to handle any arbitrary JSON without the need for a 
schema. So, we have a solution, right?


Not so fast. We now get to the other part of the problem. Parquet is columnar, 
but it assumes that a file has a unified schema. Even if Drill could say, "here 
is your ragged collection of JSON objects," you'd still need to unify the 
schema to write the Parquet file.

You are presumably creating Parquet to be consumed by other query tools. So, 
whatever tool consumes the Parquet now has the same issue we had with JSON, 
but across Parquet files: if some files have one schema and some files 
another, most tools (other than Drill) won't even get started; they need a 
consistent schema.

Hive solves this by allowing you to specify an (evolving) schema. You tell Hive 
that c.c5.d and c.c5.e are part of your schema. Hive-compliant tools know to 
fill in nulls when the columns don't exist in some particular Parquet file.
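
As an aside, Spark exposes a similar null-filling merge for Parquet; here is a 
minimal sketch (the paths and class name are made up):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class MergeParquetSchemas {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("merge-parquet-schemas")
                .getOrCreate();

            // Parquet files written by different ETL runs may carry
            // different schemas; "mergeSchema" unions them, and rows from
            // files that lack a column come back as null.
            Dataset<Row> df = spark.read()
                .option("mergeSchema", "true")
                .parquet("warehouse/run1", "warehouse/run2");

            df.printSchema();
        }
    }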

So, taking this whole-system view, we see that, in order to use Parquet as a 
relational data source, you will need to know the schema of the data in a form 
that your desired query tool understands. Thus, you need Drill to help you 
build Parquet files that satisfy that schema.


This brings us to the other solution we've discussed: asking you to provide a 
schema (or allowing Drill to infer it). That way, even when Drill reads your 
"horses" record (which has no "c5"), Drill will know to create a c.c5 column 
as a map. That is, Drill will have the information to map your complex JSON 
into a relational-like structure.

Thus, when Drill writes the Parquet file, it will write a consistent schema: 
the same one you must provide to non-Drill query tools. Then, Drill won't 
write Parquet whose schema depends on which columns did or did not show up in 
a single ETL run.

So, two possible (future) solutions: 1) Java Maps (but wild and crazy Parquet 
schemas), or 2) a declared schema to unify the JSON files. Will one or the 
other work for your use case?

Or, is there some better solution you might suggest?

And, what is the target tool for the Parquet files you are creating? How will 
you handle schema evolution with those tools?


Thanks,
- Paul

 

    On Friday, February 14, 2020, 7:15:15 AM PST, userdrill.mail...@laposte.net.INVALID <userdrill.mail...@laposte.net.invalid> wrote:

Hi,

Thanks for all the details.

Coming back to one use case: the context is the transformation into Parquet of 
JSON files containing billions of records, where each record has broadly the 
same overall schema but can have some specificities.
Simplified example:
{"a":"horses","b":"28","c":{"c1":"black","c2":"blue"}}
{"a":"rabbit","b":"14","c":{"c1":"green"                        
,"c4":"vanilla"}}
{"a":"cow"  ,"b":"28","c":{"c1":"blue"            ,"c3":"black"              
,"c5":{"d":"2","e":"3"}}}
...

We need to transform the JSON into Parquet.
That is straightforward for columns a and b (in this example), but for c we 
don't/can't know all the possibilities, and the set grows continuously. So the 
solution is to read "c" as TEXT and defer the handling of its content.
So in this example, the destination Parquet will have 3 columns:
a : VARCHAR (example: 'horses')
b : INT     (example: 14)
c : VARCHAR (example: '{"c1":"blue","c3":"black","c5":{"d":"2","e":"3"}}')

We can't do that with Drill, because the discovery/alignment of the "c" part 
of the JSON is too heavy in terms of resources, and the query crashes.

So we currently use a Spark solution, as Spark allows specifying a schema when 
reading a file.
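
For reference, a minimal sketch of that approach in Java (the input/output 
paths and class name are placeholders; it relies on Spark keeping the raw JSON 
text of a nested object when the field is declared as a string):

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class JsonToParquet {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("json-to-parquet")
                .getOrCreate();

            // Explicit schema: declaring "c" as a string keeps the ragged
            // sub-object as its raw JSON text instead of descending into it.
            StructType schema = new StructType()
                .add("a", DataTypes.StringType)
                .add("b", DataTypes.StringType) // quoted in the source; cast below
                .add("c", DataTypes.StringType);

            Dataset<Row> df = spark.read().schema(schema).json("input/*.json")
                .withColumn("b", col("b").cast("int"));

            df.write().parquet("output");
        }
    }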

Hope that helps or gives you some ideas,

Regards,

> Hi,
> 
> Welcome to the Drill mailing list.
> 
> You are right. Drill is a SQL engine. It works best when the JSON input files 
> represent rows
> and columns.
> 
> Of course, JSON itself can represent arbitrary data structures: you can use 
> it to serialize
> any Java structure you want. Relational tables and columns represent a small 
> subset of what
> JSON can do. Drill's goal is to read relational data encoded in JSON, not to 
> somehow magically
> convert any arbitrary data structure into tables and columns.
> 
> As described in our book, Learning Apache Drill, even seemingly trivial JSON 
> can violate relational
> rules. For example:
> 
> {a: 10}
> {a: 10.1}
> 
> Since Drill infers types, and must guess the type on the first row, Drill 
> will guess BIGINT.
> Then, the very next row shows that that was a poor guess and Drill will raise 
> an error. I
> like to call this the "Drill cannot predict the future" problem.
> 
> 
> The team is working on a long-term project to allow you to specify a schema 
> to resolve ambiguities.
> For example, you can tell Drill that column a above is a DOUBLE despite what 
> the data might
> say. You can deal with schema evolution to say that column b (which does not 
> appear above,
> but might appear in newer files) is an array of BIGINT. And so on.
> 
> Drill also supports LATERAL JOIN and FLATTEN to handle nested tables:
> 
> {name: "fred", orders: [ {date: "Jan 1", amount: 12.34}, {date: "Jan 12", 
> amount: 23.45}]}
> 
> The schema, however, will not transform arbitrary JSON into tables and 
> columns. Some things are better done in an ETL step, where you can use the 
> full power of a general-purpose language (Java or Scala in Spark, say) to 
> convert wild & crazy JSON into relational form.
> 
> We are actively designing the schema feature. May I ask about your use case? 
> It would be super-helpful to understand the shape of your input data and how 
> you want to map it into SQL.
> 
> One final note: it is possible to write a custom format plugin. If you will 
> query the same
> wild & crazy JSON shape multiple times, you can write a plugin to do the 
> mapping as Drill
> reads the data. Not the simplest path, but possible.
> 
> 
> Thanks,
> - Paul
> 
>  
> 
>>    On Wednesday, February 5, 2020, 1:30:14 PM PST, userdrill.mail...@laposte.net.INVALID <userdrill.mail...@laposte.net.invalid> wrote:
>>
>> Hi,
>> 
>> Some JSON files are complex and contain different tree structures.
>> If these files are big, it takes too much time for Drill to align the 
>> structures (and, even worse, it sometimes fails).
>> In Spark it's possible to force a schema when reading a file, to avoid long 
>> or useless alignment work, and possibly to drop fields or force types (such 
>> as forcing a nested structure to a string to avoid descending into it).
>> 
>> Is there any way in Drill to specify an explicit schema at read time?
>> 
>> Thanks for any information.
  
