Hi,

Welcome to the Drill mailing list.

You are right. Drill is a SQL engine. It works best when the JSON input files 
represent rows and columns.

Of course, JSON itself can represent arbitrary data structures: you can use it 
to serialize any Java structure you want. Relational tables and columns 
represent a small subset of what JSON can do. Drill's goal is to read 
relational data encoded in JSON, not to somehow magically convert any arbitrary 
data structure into tables and columns.

As described in our book, Learning Apache Drill, even seemingly trivial JSON 
can violate relational rules. For example:

{"a": 10} {"a": 10.1}

Since Drill infers types, and must guess the type on the first row, Drill will 
guess BIGINT. Then, the very next row shows that that was a poor guess and 
Drill will raise an error. I like to call this the "Drill cannot predict the 
future" problem.
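
To make the problem concrete, here is a toy sketch in Python. It is not Drill's reader: the function names and the type-name mapping are made up for illustration. It only mimics the idea of fixing a column's type the first time the column is seen and then failing when a later row disagrees.

```python
import json

def infer_type(value):
    # Toy mapping from a JSON value to a SQL-like type name (illustrative only).
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "VARCHAR"
    return "OTHER"

def read_rows(lines):
    """Fix each column's type at first sight; fail on a later conflict."""
    schema = {}
    rows = []
    for line in lines:
        row = json.loads(line)
        for col, value in row.items():
            seen = infer_type(value)
            # The first row that mentions a column decides its type.
            expected = schema.setdefault(col, seen)
            if seen != expected:
                raise TypeError(
                    f"column {col!r}: expected {expected}, got {seen}")
        rows.append(row)
    return schema, rows

try:
    read_rows(['{"a": 10}', '{"a": 10.1}'])
except TypeError as err:
    print(err)
```

The first row commits column a to BIGINT, so the second row is rejected. If the reader were told up front that a is a DOUBLE, the conflict would never arise.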


The team is working on a long-term project to allow you to specify a schema to 
resolve ambiguities. For example, you can tell Drill that column a above is a 
DOUBLE despite what the data might say. You can deal with schema evolution to 
say that column b (which does not appear above, but might appear in newer 
files) is an array of BIGINT. And so on.

Drill also supports LATERAL JOIN and FLATTEN to handle nested tables:

{"name": "fred", "orders": [ {"date": "Jan 1", "amount": 12.34}, {"date": "Jan 12", 
"amount": 23.45}]}
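
As a rough picture of what flattening aims for, here is a plain Python sketch (not Drill SQL, and not how Drill implements it) that turns the nested record above into one flat row per order. The helper name flatten_orders is hypothetical.

```python
def flatten_orders(customer):
    """Yield one flat row per element of the nested 'orders' array."""
    for order in customer.get("orders", []):
        yield {
            "name": customer["name"],
            "date": order["date"],
            "amount": order["amount"],
        }

customer = {
    "name": "fred",
    "orders": [
        {"date": "Jan 1", "amount": 12.34},
        {"date": "Jan 12", "amount": 23.45},
    ],
}

rows = list(flatten_orders(customer))
for row in rows:
    print(row)
```

Each output row repeats the parent's name next to one order, which is the relational shape SQL can then filter, join, and aggregate.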

The schema, however, will not transform arbitrary JSON into tables and columns. 
Some things are better done in an ETL step where you can use the full power of 
a general-purpose language (Java or Scala in Spark, say) to convert wild & crazy 
JSON into relational form.
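
As a minimal sketch of such an ETL step, written in plain Python for brevity rather than Spark: the SCHEMA table, the "details" column, and the to_relational helper are all hypothetical, not a Drill or Spark API. It forces types, drops fields that are not listed, and keeps a nested subtree as a string instead of descending into it.

```python
import json

# Hypothetical target schema: column name -> converter function.
SCHEMA = {
    "a": float,                                      # force DOUBLE-like values
    "b": lambda v: [int(x) for x in v],              # array of integers
    "details": lambda v: json.dumps(v, sort_keys=True),  # keep subtree as text
}

def to_relational(record, schema=SCHEMA):
    """Return a row with exactly the schema's columns.

    Fields not named in the schema are dropped; missing fields become None.
    """
    row = {}
    for col, convert in schema.items():
        value = record.get(col)
        row[col] = None if value is None else convert(value)
    return row

records = [
    {"a": 10, "extra": "ignored"},
    {"a": 10.1, "b": [1, 2], "details": {"x": {"y": 1}}},
]
for record in records:
    print(to_relational(record))
```

Every output row has the same columns with the same types, whatever the input looked like, so the result can be written out as clean JSON (or Parquet) for Drill to query.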

We are actively designing the schema feature. May I ask your use case? Would be 
super-helpful to understand the shape of your input data and how you want to 
map that into SQL.

One final note: it is possible to write a custom format plugin. If you will 
query the same wild & crazy JSON shape multiple times, you can write a plugin 
to do the mapping as Drill reads the data. Not the simplest path, but possible.


Thanks,
- Paul

 

    On Wednesday, February 5, 2020, 1:30:14 PM PST, 
[email protected] <[email protected]> 
wrote:  
 
 Hi,

Some JSON files are complex and contain different "tree structures".
If these files are big, it takes Drill too much time to align the 
structures (and, even worse, it sometimes fails).
In Spark it's possible to force a schema when reading a file, to avoid long or 
useless alignment work, and possibly to drop fields and force types (such as 
reading a subtree as a string to avoid descending into the structure).

Is there any way in Drill to specify an explicit schema at read time?

Thanks for any information
  
