Re: Requesting json file with schema

2020-02-23 Thread Paul Rogers
Hi,

Sorry for the delay in responding, and thank you for the background 
information - it is very helpful indeed. Here are some thoughts about how we 
could extend Drill to help with your use case.


Your challenge seems rather open-ended: you don't know the format of the 
incoming data and don't know how the data will be used. I would guess that this 
makes ETL quite a challenge, since you are missing the two most important 
pieces of information. Real life is messy. 


Thanks much for the PostgreSQL link. We've recently discussed introducing a 
similar feature in Drill which one could, with some humor, call "let JSON be 
JSON." The idea would be, as in PostgreSQL, to simply represent JSON as text and 
allow the user to work with JSON using JSON-oriented functions. The PostgreSQL 
link suggests that this is, in fact, a workable approach (though, as you note, 
doing so is slower than converting JSON to a relational structure). I took the 
liberty of filing DRILL-7598 [1] to request this feature.


Today, however, Drill attempts to map JSON into a relational model so that the 
user can use SQL operations to work on the data. [2] The Drill approach works 
well when the JSON is the output of a relational model (a dump of a relational 
table or query, say.) The approach does not work for "native" JSON in all its 
complexity. JSON is a superset of the relational model and so not all JSON 
files map to tables and columns.

To solve your use case, Drill would need to adopt a solution similar to 
PostgreSQL's. In fact, Drill already has some of the pieces (such as the 
CONVERT_TO/CONVERT_FROM operations [3]), but even these attempt to convert JSON 
to or from the relational model. What we need, to solve the general use case, 
are the kind of native JSON functions which PostgreSQL provides.

Fortunately, since Drill would store JSON as a VARCHAR, no work would be needed 
in the Drill "core". All that is needed is for someone to provide a set of Drill 
functions (UDFs) that call out to some JSON library to perform the desired 
operations. Anyone looking for a starter project? This would be a great one.
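
To make the idea concrete, here is a rough sketch. The first query is the kind 
of thing PostgreSQL supports today (the ->> operator is real PostgreSQL); the 
second shows how an equivalent Drill query might look once such UDFs exist. The 
json_value() function name and its path syntax are purely hypothetical, and the 
table and file names are made up from your earlier example:

  -- PostgreSQL today: c is a json column
  SELECT *
  FROM   animals
  WHERE  c ->> 'c1' = 'black';

  -- Hypothetical Drill equivalent: c is a VARCHAR holding JSON text;
  -- json_value() would be a new UDF wrapping some JSON library
  SELECT *
  FROM   dfs.`data/animals.parquet`
  WHERE  json_value(c, '$.c1') = 'black';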

All that said, I don't think any of this would solve your ETL case. If you 
convert the data to Parquet, it is probably because your consumers want to read 
the data directly from Parquet. Storing JSON in Parquet is probably not what 
consumers expect as it simply pushes ETL complexity onto them. So, you do need 
to solve the problem of converting JSON to a relational model, which can then 
be stored in "plain" Parquet files.
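
For completeness, the mechanical part of that ETL is already easy to express in 
Drill with CTAS (CREATE TABLE AS and the store.format option are existing 
features; the table and file paths below are made up). The hard part, as you 
describe, is deciding what to do with column "c":

  ALTER SESSION SET `store.format` = 'parquet';

  CREATE TABLE dfs.tmp.`animals_parquet` AS
  SELECT a, b, c
  FROM dfs.`data/animals.json`;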

Your e-mail (and several other ongoing discussions) prompted me to write up a 
description of the challenges in Drill's dynamic schema model and how we might 
improve the solution, starting with the scan operator [4]. The idea is to 
provide a way for you to give Drill just enough hints to overcome any 
ambiguities in the JSON. Drill has a "provided schema" feature [5], which, at 
present, is used only for text files (and recently with limited support in 
Avro.) We are working on a project to add such support for JSON. 


You offer a good suggestion: allow the JSON reader to read chunks of JSON as 
text which can be manipulated by those future JSON functions. In your example, 
column "c" would be read as JSON text; Drill would not attempt to parse it into 
a relational structure. As it turns out, the "new" JSON reader we're working on 
originally had a feature to do just that, but we took it out because we were 
not sure it was needed. Sounds like we should restore it as part of our 
"provided schema" support. It could work this way: if you CREATE SCHEMA with 
column "c" as VARCHAR (maybe with a hint to read as JSON), the JSON parser 
would read the entire nested structure as JSON without trying to parse it into 
a relational structure. I filed DRILL-7597 [6] to add the "parse as JSON" 
feature.
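
As a sketch, reusing your a/b/c example, the provided schema might look 
something like the statement below. CREATE OR REPLACE SCHEMA ... FOR TABLE is 
the existing syntax [5] (the table name here is made up); the per-column "keep 
this as JSON text" hint does not exist yet, so treat that part as hypothetical - 
it is exactly what DRILL-7597 would define:

  CREATE OR REPLACE SCHEMA (
      a VARCHAR,
      b INT,
      c VARCHAR   -- plus a to-be-defined hint meaning "read as JSON text"
  ) FOR TABLE dfs.data.`animals`;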

If Drill were to include PostgreSQL-like JSON functions, and the ability to 
read selected JSON fields as JSON text, would this address your use case? What 
else might be required?

Also, your e-mail mentions "json file with schema." How would you want to 
provide the schema? Would the provided schema feature, described above, work 
for you or is there a better solution?


Thanks,
- Paul

[1] https://issues.apache.org/jira/browse/DRILL-7598

[2] https://drill.apache.org/docs/json-data-model/

[3] 
https://drill.apache.org/docs/data-type-conversion/#convert_to-and-convert_from


[4] 
https://github.com/paul-rogers/drill/wiki/Toward-a-Workable-Dynamic-Schema-Model

[5] https://drill.apache.org/docs/create-or-replace-schema/

[6] https://issues.apache.org/jira/browse/DRILL-7597


Re: Re: Requesting json file with schema

2020-02-17 Thread userdrill.mail...@laposte.net.INVALID
Hi, 

For our particular case here, we just want to prepare a dataset (in Parquet 
Format) which will be reused by different users.
And we don't know what they will do with it.
So we prefer to define the subparts that have differing definitions simply as 
text, and let the final user do what he wants with the data.
He can drop this data, search it with a regexp, re-parse it as JSON, or anything else.
The target tool is not known at our stage, and there may be several (querying it 
with Drill, ingesting it into a database or another system such as Elasticsearch, 
computing on it with Spark or with tools written in different languages...).

PostgreSQL has a practical way to manipulate JSON data.
You can read: https://www.postgresql.org/docs/12/datatype-json.html
OK, it is not necessarily optimal in terms of processing speed (because of 
re-parsing on demand), but it is an approach that always works.

Here, our problem is to keep Drill from trying to understand and unify the "c" 
part of the example, and to simply read it as TEXT.
And afterwards, when using the dataset, to have the possibility to do something like:
~ ... WHEN convert_fromJSON(mytable.c).c3 = 'black'... ~

To illustrate, to build the Parquet we currently do (in Spark) something like:

  import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

  val partialSchema = StructType(Array(
      StructField("a", StringType, true)
    , StructField("b", IntegerType, true)
    // The nested JSON struct is read as a plain string to avoid problems caused
    // by too much variation between records in this part
    , StructField("c", StringType, true)
  ))
  val djson = spark.read.format("json").schema(partialSchema).load(input)
  djson.createOrReplaceTempView("input_table")
  spark.sql("SELECT a, b, c FROM input_table").write.format("parquet").save(output)

In our case the unification is not possible, and even if it were possible at time 
t, it would not be the same at t+1.
The idea here is to have the possibility to limit the scope of the "unified" 
JSON part to a chosen depth level.

We don't know if this makes sense compared to the usual uses of Apache Drill, but 
it is part of our problem.

Regards,

> Thanks for the explanation, very helpful.
> 
> There are two parts to the problem. On the one hand, you want to read an 
> ever-changing set
> of JSON files. Your example with "c5" is exactly the kind of "cannot predict 
> the future" issues
> that can trip up Drill (or, I would argue, any tool that tries to convert 
> JSON to records.)
> 
> One thing we have discussed (but not implemented) in Drill is the ability to 
> read a JSON object
> as a Java map. That is, store the non-relational JSON structure as a more 
> JSON-like Java data
> structure. The value of your "c" column would then be a Java Map which 
> sometimes contains
> a column "c5" which is also a Java Map.
> 
> This would allow Drill to handle any arbitrary JSON without the need of a 
> schema. So, we
> have a solution, right?
> 
> 
> Not so fast. We now get to the other part of the problem. Parquet is 
> columnar; but it assumes
> that a file has a unified schema. Even if Drill could say, "here is your 
> ragged collection
> of JSON objects", you'd still need to unify the schema to write the Parquet 
> file.
> 
> You are presumably creating Parquet to be consumed by other query tools. So, 
> whatever tool
> consumes the Parquet now has the same issue as we had with JSON, but now 
> across Parquet files:
> if some files have one schema, some files another schema, most tools (other 
> than Drill) won't
> even get started; they need a consistent schema.
> 
> Hive solves this by allowing you to specify an (evolving) schema. You tell 
> Hive that c.c5.d
> and c.c5.e are part of your schema. Hive-compliant tools know to fill in 
> nulls when the columns
> don't exist in some particular Parquet file.
> 
> So, taking this whole-system view we see that, in order to use Parquet as a 
> relational data
> source, you will need to know the schema of the data in a form that your 
> desired query tool
> understands. Thus, you need Drill to help you build Parquet files that 
> satisfy that schema.
> 
> 
> This brings us to the other solution we've discussed: asking you to provide a 
> schema (or allow
> Drill to infer it). That way, even when Drill reads your "horses" record, 
> Drill will know
> to create a c.c5 column as a map. That is, Drill will have the information to 
> map your complex
> JSON into a relational-like structure.
> 
> Thus, when Drill writes the Parquet file, it will write a consistent schema: 
> the same one
> you must provide to non-Drill query tools. Then, Drill won't write Parquet 
> that depends on
> the columns that did or did not show up in a single ETL run.
> 
> So, two possible (future) solutions: 1) Java Maps (but wild and crazy Parquet 
> schemas), or
> 2) declared schema to unify JSON files. Will one or the other work for your 
> use case?
> 
> Or, is there some better solution you might suggest?
> 
> And, what is the target tool for the Parquet files you are creating? How will 
> you handle schema
> evolution with those tools?
> 
> 
> Thanks,
> - Paul

Re: Requesting json file with schema

2020-02-14 Thread Paul Rogers
Thanks for the explanation, very helpful.

There are two parts to the problem. On the one hand, you want to read an 
ever-changing set of JSON files. Your example with "c5" is exactly the kind of 
"cannot predict the future" issues that can trip up Drill (or, I would argue, 
any tool that tries to convert JSON to records.)

One thing we have discussed (but not implemented) in Drill is the ability to 
read a JSON object as a Java map. That is, store the non-relational JSON 
structure as a more JSON-like Java data structure. The value of your "c" column 
would then be a Java Map which sometimes contains a column "c5" which is also a 
Java Map.

This would allow Drill to handle any arbitrary JSON without the need of a 
schema. So, we have a solution, right?


Not so fast. We now get to the other part of the problem. Parquet is columnar; 
but it assumes that a file has a unified schema. Even if Drill could say, "here 
is your ragged collection of JSON objects", you'd still need to unify the 
schema to write the Parquet file.

You are presumably creating Parquet to be consumed by other query tools. So, 
whatever tool consumes the Parquet now has the same issue as we had with JSON, 
but now across Parquet files: if some files have one schema, some files another 
schema, most tools (other than Drill) won't even get started; they need a 
consistent schema.

Hive solves this by allowing you to specify an (evolving) schema. You tell Hive 
that c.c5.d and c.c5.e are part of your schema. Hive-compliant tools know to 
fill in nulls when the columns don't exist in some particular Parquet file.
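
Roughly, in Hive DDL, declaring the evolved schema up front might look like the 
sketch below (the table name and location are made up, and the details of 
struct-level schema evolution depend on your Hive version and on how the Parquet 
files were written):

  CREATE EXTERNAL TABLE animals (
    a STRING,
    b INT,
    c STRUCT<c1:STRING, c2:STRING, c3:STRING, c4:STRING,
             c5:STRUCT<d:STRING, e:STRING>>
  )
  STORED AS PARQUET
  LOCATION '/data/animals_parquet';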

So, taking this whole-system view we see that, in order to use Parquet as a 
relational data source, you will need to know the schema of the data in a form 
that your desired query tool understands. Thus, you need Drill to help you 
build Parquet files that satisfy that schema.


This brings us to the other solution we've discussed: asking you to provide a 
schema (or allow Drill to infer it). That way, even when Drill reads your 
"horses" record, Drill will know to create a c.c5 column as a map. That is, 
Drill will have the information to map your complex JSON into a relational-like 
structure.

Thus, when Drill writes the Parquet file, it will write a consistent schema: 
the same one you must provide to non-Drill query tools. Then, Drill won't write 
Parquet that depends on the columns that did or did not show up in a single ETL 
run.

So, two possible (future) solutions: 1) Java Maps (but wild and crazy Parquet 
schemas), or 2) declared schema to unify JSON files. Will one or the other work 
for your use case?

Or, is there some better solution you might suggest?

And, what is the target tool for the Parquet files you are creating? How will 
you handle schema evolution with those tools?


Thanks,
- Paul

 


Re: Requesting json file with schema

2020-02-14 Thread userdrill.mail...@laposte.net.INVALID
Hi,

Thanks for all the details.

Coming back to one use case: the context is the transformation into Parquet of 
JSON files containing billions of records, where each record has globally the 
same schema but can have some specificities.
Simplified example:
{"a":"horses","b":"28","c":{"c1":"black","c2":"blue"}}
{"a":"rabbit","b":"14","c":{"c1":"green" 
,"c4":"vanilla"}}
{"a":"cow"   ,"b":"28","c":{"c1":"blue" ,"c3":"black"   
,"c5":{"d":"2","e":"3"}}}
...

We need to transform the JSON into Parquet.
That is OK for columns a and b (in this example), but for c we don't/can't know 
all the possibilities, and the set keeps growing. So the solution is to read "c" 
as TEXT and defer the use/treatment of its content.
So in this example, the destination Parquet will have 3 columns:
a : VARCHAR (example: 'horses')
b : INT     (example: 14)
c : VARCHAR (example: '{"c1":"blue","c3":"black","c5":{"d":"2","e":"3"}}')

We can't do that with Drill because the discovery/alignment of the "c" part of 
the JSON is too heavy in terms of resources, and the query crashes.

So we currently use a Spark solution, as Spark allows specifying a schema when 
reading a file.

Hope that can help or give ideas,

Regards,

> Hi,
> 
> Welcome to the Drill mailing list.
> 
> You are right. Drill is a SQL engine. It works best when the JSON input files 
> represent rows
> and columns.
> 
> Of course, JSON itself can represent arbitrary data structures: you can use 
> it to serialize
> any Java structure you want. Relational tables and columns represent a small 
> subset of what
> JSON can do. Drill's goal is to read relational data encoded in JSON, not to 
> somehow magically
> convert any arbitrary data structure into tables and columns.
> 
> As described in our book, Learning Apache Drill, even seemingly trivial JSON 
> can violate relational
> rules. For example:
> 
> {a: 10} {a: 10.1}
> 
> Since Drill infers types, and must guess the type on the first row, Drill 
> will guess BIGINT.
> Then, the very next row shows that that was a poor guess and Drill will raise 
> an error. I
> like to call this the "Drill cannot predict the future" problem.
> 
> 
> The team is working on a long-term project to allow you to specify a schema 
> to resolve ambiguities.
> For example, you can tell Drill that column a above is a DOUBLE despite what 
> the data might
> say. You can deal with schema evolution to say that column b (which does not 
> appear above,
> but might appear in newer files) is an array of BIGINT. And so on.
> 
> Drill also supports LATERAL JOIN and FLATTEN to handle nested tables:
> 
> {name: "fred", orders: [ {date: "Jan 1", amount: 12.34}, {date: "Jan 12", 
> amount: 23.45}]}
> 
> The schema, however, will not transform arbitrary JSON into tables and 
> columns. Some things
> are better done in an ETL step where you can use the full power of a 
> declarative language
> (Java or Scala in Spark, say) to convert wild & crazy JSON into relational 
> form.
> 
> We are actively designing the schema feature. May I ask your use case? Would 
> be super-helpful
> to understand the shape of your input data and how you want to map that into 
> SQL.
> 
> One final note: it is possible to write a custom format plugin. If you will 
> query the same
> wild & crazy JSON shape multiple times, you can write a plugin to do the 
> mapping as Drill
> reads the data. Not the simplest path, but possible.
> 
> 
> Thanks,
> - Paul
> 
>  
> 
>> On Wednesday, February 5, 2020, 1:30:14 PM PST, 
>> userdrill.mail...@laposte.net.INVALID
>>  wrote:  
>>  
>>  Hi,
>> 
>> Some JSON files are complex and contain different "tree structs".
>> If these files are big, it takes too much time for Drill to align the
>> structures (and, even worse, it sometimes fails).
>> In Spark it is possible to force a schema when reading a file, to avoid long
>> or useless alignment work, and possibly to drop fields or force types (for
>> example to string, to avoid going down into the structure).
>> 
>> Is there any way in Drill to specify an explicit schema at read time?
>> 
>> Thanks for any information


Re: Requesting json file with schema

2020-02-05 Thread Paul Rogers
Hi,

Welcome to the Drill mailing list.

You are right. Drill is a SQL engine. It works best when the JSON input files 
represent rows and columns.

Of course, JSON itself can represent arbitrary data structures: you can use it 
to serialize any Java structure you want. Relational tables and columns 
represent a small subset of what JSON can do. Drill's goal is to read 
relational data encoded in JSON, not to somehow magically convert any arbitrary 
data structure into tables and columns.

As described in our book, Learning Apache Drill, even seemingly trivial JSON 
can violate relational rules. For example:

{a: 10} {a: 10.1}

Since Drill infers types, and must guess the type on the first row, Drill will 
guess BIGINT. Then, the very next row shows that that was a poor guess and 
Drill will raise an error. I like to call this the "Drill cannot predict the 
future" problem.


The team is working on a long-term project to allow you to specify a schema to 
resolve ambiguities. For example, you can tell Drill that column a above is a 
DOUBLE despite what the data might say. You can deal with schema evolution to 
say that column b (which does not appear above, but might appear in newer 
files) is an array of BIGINT. And so on.

Drill also supports LATERAL JOIN and FLATTEN to handle nested tables:

{name: "fred", orders: [ {date: "Jan 1", amount: 12.34}, {date: "Jan 12", 
amount: 23.45}]}
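
For example, against that record, a query along these lines flattens the nested 
orders array into one row per order (the file path is made up):

  SELECT t.name, t.ord.`date` AS order_date, t.ord.amount
  FROM (
    SELECT name, FLATTEN(orders) AS ord
    FROM dfs.`data/customers.json`
  ) t;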

The schema, however, will not transform arbitrary JSON into tables and columns. 
Some things are better done in an ETL step where you can use the full power of 
a declarative language (Java or Scala in Spark, say) to convert wild & crazy 
JSON into relational form.

We are actively designing the schema feature. May I ask your use case? Would be 
super-helpful to understand the shape of your input data and how you want to 
map that into SQL.

One final note: it is possible to write a custom format plugin. If you will 
query the same wild & crazy JSON shape multiple times, you can write a plugin 
to do the mapping as Drill reads the data. Not the simplest path, but possible.


Thanks,
- Paul

 

On Wednesday, February 5, 2020, 1:30:14 PM PST, 
userdrill.mail...@laposte.net.INVALID  
wrote:  
 
 Hi,

Some JSON files are complex and contain different "tree structs".
If these files are big, it takes too much time for Drill to align the 
structures (and, even worse, it sometimes fails).
In Spark it is possible to force a schema when reading a file, to avoid long or 
useless alignment work, and possibly to drop fields or force types (for example 
to string, to avoid going down into the structure).

Is there any way in Drill to specify an explicit schema at read time?

Thanks for any information