Hi, 

In our particular case, we just want to prepare a dataset (in Parquet format) 
that will be reused by different users, and we don't know what they will do 
with it.
So we prefer to define the subparts whose definitions vary simply as text, and 
let the final user do whatever they want with the data.
They can drop this data, search it with a regexp, reparse it as JSON, or 
anything else.
The target tool is not known at our stage and there can be several (querying 
it with Drill, ingesting it into a database or another system such as 
Elasticsearch, processing it with Spark or with tools written in different 
languages...).

PostgreSQL has a practical way to manipulate JSON data; see 
https://www.postgresql.org/docs/12/datatype-json.html
OK, it is not necessarily optimal in terms of processing speed (because the 
JSON is reparsed on each request), but it is an approach that always works.
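
For illustration only (table and column names are made up): if the variable 
"c" part were kept as a jsonb column in PostgreSQL, fields could be extracted 
only when needed, with something like:

  -- "payload" is a jsonb column holding the variable part of each record.
  -- ->> returns a field as text, -> returns it as jsonb.
  SELECT payload ->> 'c1'        AS c1,
         payload -> 'c5' ->> 'd' AS d
  FROM   events
  WHERE  payload ->> 'c3' = 'black';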

Here, our problem is to prevent Drill from trying to understand and unify the 
"c" part of the example, and to have it simply read as TEXT.
Then, when using the dataset, we would like to be able to do something like:
  ... WHERE convert_fromJSON(mytable.c).c3 = 'black' ...
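
Written out more completely, such a query could look like this (untested; the 
path is made up, and I assume CONVERT_FROM(c, 'JSON') or a similar function 
does the reparsing):

  -- "c" is stored as a plain VARCHAR column and only reparsed at query time.
  SELECT t.a, t.b
  FROM (
    SELECT a, b, CONVERT_FROM(c, 'JSON') AS cj
    FROM dfs.`/data/mydataset`
  ) t
  WHERE t.cj.c3 = 'black';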

To illustrate, to build the Parquet we currently do something like this in 
Spark:

  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  // "c" is a JSON struct read as a string, to avoid problems caused by too
  // much variation between records in that part.
  val partialSchema = StructType(Array(
      StructField("a", StringType, true)
    , StructField("b", IntegerType, true)
    , StructField("c", StringType, true)
  ))
  val djson = spark.read.format("json").schema(partialSchema).load(input)
  djson.createOrReplaceTempView("input_table")
  spark.sql("SELECT a, b, c FROM input_table").write.format("parquet").save(output)

In our case the unification is not possible, and even if it were possible at 
time t, it would not be the same at t+1.
The idea here is to be able to limit the scope of the "unified" JSON part to a 
chosen depth level.

I don't know whether this makes sense compared to the usual uses of Apache 
Drill, but it is part of our problem.

Regards,

> Thanks for the explanation, very helpful.
> 
> There are two parts to the problem. On the one hand, you want to read an 
> ever-changing set
> of JSON files. Your example with "c5" is exactly the kind of "cannot predict 
> the future" issues
> that can trip up Drill (or, I would argue, any tool that tries to convert 
> JSON to records.)
> 
> One thing we have discussed (but not implemented) in Drill is the ability to 
> read a JSON object
> as a Java map. That is, store the non-relational JSON structure as a more 
> JSON-like Java data
> structure. The value of your "c" column would then be a Java Map which 
> sometimes contains
> a column "c5" which is also a Java Map.
> 
> This would allow Drill to handle any arbitrary JSON without the need of a 
> schema. So, we
> have a solution, right?
> 
> 
> Not so fast. We now get to the other part of the problem. Parquet is 
> columnar; but it assumes
> that a file has a unified schema. Even if Drill could say, "here is your 
> ragged collection
> of JSON objects", you'd still need to unify the schema to write the Parquet 
> file.
> 
> You are presumably creating Parquet to be consumed by other query tools. So, 
> whatever tool
> consumes the Parquet now has the same issue as we had with JSON, but now 
> across Parquet files:
> if some files have one schema, some files another schema, most tools (other 
> than Drill) won't
> even get started; they need a consistent schema.
> 
> Hive solves this by allowing you to specify an (evolving) schema. You tell 
> Hive that c.c5.d
> and c.c5.e are part of your schema. Hive-compliant tools know to fill in 
> nulls when the columns
> don't exist in some particular Parquet file.
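> 
> For example, a rough sketch of such a declaration (untested; the table name 
> and location are made up). Parquet files whose "c" struct lacks some of 
> these fields should simply return NULL for them:
> 
>   CREATE EXTERNAL TABLE my_table (
>     a STRING,
>     b INT,
>     c STRUCT<
>         c1:STRING,
>         c2:STRING,
>         c3:STRING,
>         c4:STRING,
>         c5:STRUCT<d:STRING, e:STRING>
>     >
>   )
>   STORED AS PARQUET
>   LOCATION '/data/my_table';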
> 
> So, taking this whole-system view we see that, in order to use Parquet as a 
> relational data
> source, you will need to know the schema of the data in a form that your 
> desired query tool
> understands. Thus, you need Drill to help you build Parquet files that 
> satisfy that schema.
> 
> 
> This brings us to the other solution we've discussed: asking you to provide a 
> schema (or allowing
> Drill to infer it). That way, even when Drill reads your "horses" record, 
> Drill will know
> to create a c.c5 column as a map. That is, Drill will have the information to 
> map your complex
> JSON into a relational-like structure.
> 
> Thus, when Drill writes the Parquet file, it will write a consistent schema: 
> the same one
> you must provide to non-Drill query tools. Then, Drill won't write Parquet 
> that depends on
> the columns that did or did not show up in a single ETL run.
> 
> So, two possible (future) solutions: 1) Java Maps (but wild and crazy Parquet 
> schemas), or
> 2) declared schema to unify JSON files. Will one or the other work for your 
> use case?
> 
> Or, is there some better solution you might suggest?
> 
> And, what is the target tool for the Parquet files you are creating? How will 
> you handle schema
> evolution with those tools?
> 
> 
> Thanks,
> - Paul
> 
>  
> 
>>     On Friday, February 14, 2020, 7:15:15 AM PST, 
>> [email protected] <[email protected]>
>> wrote:  
>>  
>>  Hi,
>> 
>> Thanks for all the details.
>> 
>> Coming back to one use case: the context is the transformation into Parquet 
>> of JSON files containing billions of records, where each record has globally 
>> the same schema but can have some specificities.
>> Simplified example:
>> {"a":"horses","b":"28","c":{"c1":"black","c2":"blue"}}
>> {"a":"rabbit","b":"14","c":{"c1":"green"                        
>> ,"c4":"vanilla"}}
>> {"a":"cow"  ,"b":"28","c":{"c1":"blue"            ,"c3":"black"             
>> ,"c5":{"d":"2","e":"3"}}}
>> ...
>> 
>> We need to transform the JSON into Parquet.
>> That is fine for columns a and b (in this example), but for c we don't/can't 
>> know all the possibilities, and they keep growing continuously. So the 
>> solution is to read "c" as TEXT and defer the use/treatment of its content.
>> So in this example, the destination Parquet will have 3 columns:
>> a : VARCHAR (example: 'horses')
>> b : INT     (example: 14)
>> c : VARCHAR (example: '{"c1":"blue","c3":"black","c5":{"d":"2","e":"3"}}')
>> 
>> We can't do that with Drill because the "discovery/alignment" of the "c" 
>> part of the JSON is too heavy in terms of resources, and the query crashes.
>> 
>> So we currently use a Spark solution, as Spark allows specifying a schema 
>> when reading a file.
>> 
>> Hope that helps or gives some ideas,
>> 
>> Regards,
>> 
>>> Hi,
>>> 
>>> Welcome to the Drill mailing list.
>>> 
>>> You are right. Drill is a SQL engine. It works best when the JSON input 
>>> files represent rows
>>> and columns.
>>> 
>>> Of course, JSON itself can represent arbitrary data structures: you can use 
>>> it to serialize
>>> any Java structure you want. Relational tables and columns represent a 
>>> small subset of what
>>> JSON can do. Drill's goal is to read relational data encoded in JSON, not 
>>> to somehow magically
>>> convert any arbitrary data structure into tables and columns.
>>> 
>>> As described in our book, Learning Apache Drill, even seemingly trivial 
>>> JSON can violate relational
>>> rules. For example:
>>> 
>>> {a: 10} {a: 10.1}
>>> 
>>> Since Drill infers types, and must guess the type on the first row, Drill 
>>> will guess BIGINT.
>>> Then, the very next row shows that that was a poor guess and Drill will 
>>> raise an error. I
>>> like to call this the "Drill cannot predict the future" problem.
>>> 
>>> 
>>> The team is working on a long-term project to allow you to specify a schema 
>>> to resolve ambiguities.
>>> For example, you can tell Drill that column a above is a DOUBLE despite 
>>> what the data might
>>> say. You can deal with schema evolution to say that column b (which does 
>>> not appear above,
>>> but might appear in newer files) is an array of BIGINT. And so on.
>>> 
>>> Drill also supports LATERAL JOIN and FLATTEN to handle nested tables:
>>> 
>>> {name: "fred", orders: [ {date: "Jan 1", amount: 12.34}, {date: "Jan 12", 
>>> amount: 23.45}]}
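>>> 
>>> For instance, something like this (untested; the file path is made up) turns 
>>> each order into its own row:
>>> 
>>>   SELECT t.name, t.o.`date` AS order_date, t.o.amount AS amount
>>>   FROM (
>>>     SELECT name, FLATTEN(orders) AS o
>>>     FROM dfs.`/data/customers.json`
>>>   ) t;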
>>> 
>>> The schema, however, will not transform arbitrary JSON into tables and 
>>> columns. Some things
>>> are better done in an ETL step where you can use the full power of a 
>>> declarative language
>>> (Java or Scala in Spark, say) to convert wild & crazy JSON into relational 
>>> form.
>>> 
>>> We are actively designing the schema feature. May I ask your use case? 
>>> Would be super-helpful
>>> to understand the shape of your input data and how you want to map that 
>>> into SQL.
>>> 
>>> One final note: it is possible to write a custom format plugin. If you will 
>>> query the same
>>> wild & crazy JSON shape multiple times, you can write a plugin to do the 
>>> mapping as Drill
>>> reads the data. Not the simplest path, but possible.
>>> 
>>> 
>>> Thanks,
>>> - Paul
>>> 
>>>  
>>> 
>>>>    On Wednesday, February 5, 2020, 1:30:14 PM PST, 
>>>> [email protected]
>>>> <[email protected]> wrote:  
>>>>  
>>>>  Hi,
>>>> 
>>>> Some JSON files are complex and contain different "tree structures".
>>>> If these files are big, it takes too much time for Drill to align the 
>>>> structures (and, even worse, it sometimes fails).
>>>> In Spark it is possible to force a schema when reading a file, to avoid 
>>>> long or useless alignment work, and possibly to drop fields or force types 
>>>> (for example to string, to avoid going down into the structure).
>>>> 
>>>> Is there any way in Drill to specify an explicit schema at read time?
>>>> 
>>>> Thanks for any information
