Re: Requesting json file with schema
Hi,

Sorry for the delay in responding. Thank you for the background information - very helpful indeed. Here are some thoughts about how we could extend Drill to help with your use case.

Your challenge seems rather open-ended: you don't know the format of the incoming data, and you don't know how the data will be used. I would guess this presents a large challenge when doing ETL, as you lack the two most important pieces of information. Real life is messy.

Thanks much for the PostgreSQL link. We've recently discussed introducing a similar feature in Drill, which one could, with some humor, call "let JSON be JSON." The idea would be, as in PostgreSQL, to simply represent JSON as text and allow the user to work with JSON using JSON-oriented functions. The PostgreSQL link suggests that this is, in fact, a workable approach (though, as you note, doing so is slower than converting JSON to a relational structure.) I took the liberty of filing DRILL-7598 [1] to request this feature.

Today, however, Drill attempts to map JSON into a relational model so that the user can use SQL operations to work on the data. [2] The Drill approach works well when the JSON is the output of a relational model (a dump of a relational table or query, say.) The approach does not work for "native" JSON in all its complexity. JSON is a superset of the relational model, so not all JSON files map to tables and columns.

To solve your use case, Drill would need to adopt a solution similar to PostgreSQL's. In fact, Drill already has some of the pieces (such as the CONVERT_TO/CONVERT_FROM operations [3]), but even these attempt to convert JSON to or from the relational model. What we need, to solve the general use case, are the kind of native JSON functions which PostgreSQL provides. Fortunately, since Drill would store JSON as a VARCHAR, no work would be needed in the Drill "core". All that is needed is someone to provide a set of Drill functions (UDFs) that call out to some JSON library to perform the desired operations. Anyone looking for a starter project? This would be a great one.

All that said, I don't think any of this would solve your ETL case. If you convert the data to Parquet, it is probably because your consumers want to read the data directly from Parquet. Storing JSON in Parquet is probably not what consumers expect, as it simply pushes the ETL complexity onto them. So, you do need to solve the problem of converting JSON to a relational model, which can then be stored in "plain" Parquet files.

Your e-mail (and several other ongoing discussions) prompted me to write up a description of the challenges in Drill's dynamic schema model and how we might improve the solution, starting with the scan operator [4]. The idea is to provide a way for you to give Drill just enough hints to overcome any ambiguities in the JSON.

Drill has a "provided schema" feature [5] which, at present, is used only for text files (and recently, with limited support, in Avro.) We are working on a project to add such support for JSON. You offer a good suggestion: allow the JSON reader to read chunks of JSON as text which can be manipulated by those future JSON functions. In your example, column "c" would be read as JSON text; Drill would not attempt to parse it into a relational structure. As it turns out, the "new" JSON reader we're working on originally had a feature to do just that, but we took it out because we were not sure it was needed. Sounds like we should restore it as part of our "provided schema" support.

It could work this way: if you CREATE SCHEMA with column "c" as VARCHAR (maybe with a hint to read it as JSON), the JSON parser would read the entire nested structure as JSON text without trying to parse it into a relational structure. I filed DRILL-7597 [6] to add this "parse as JSON" feature.
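To sketch how that might look (CREATE OR REPLACE SCHEMA exists today [5], but the column-level "read as JSON" property is purely hypothetical here, since DRILL-7597 only proposes the feature; the table name is also invented):

    CREATE OR REPLACE SCHEMA (
        a VARCHAR,
        b INT,
        -- hypothetical column property: keep "c" as raw JSON text
        c VARCHAR PROPERTIES ('drill.json-mode' = 'json')
    ) FOR TABLE dfs.tmp.`animals.json`;

With such a schema in place, queries would see "a" and "b" as ordinary columns, while "c" would arrive as an opaque JSON string for those future JSON functions to operate on.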
If Drill were to include PostgreSQL-like JSON functions, and the ability to read selected JSON fields as JSON text, would this address your use case? What else might be required?

Also, your e-mail mentions "json file with schema." How would you want to provide the schema? Would the provided schema feature, described above, work for you, or is there a better solution?

Thanks,
- Paul

[1] https://issues.apache.org/jira/browse/DRILL-7598
[2] https://drill.apache.org/docs/json-data-model/
[3] https://drill.apache.org/docs/data-type-conversion/#convert_to-and-convert_from
[4] https://github.com/paul-rogers/drill/wiki/Toward-a-Workable-Dynamic-Schema-Model
[5] https://drill.apache.org/docs/create-or-replace-schema/
[6] https://issues.apache.org/jira/browse/DRILL-7597

On Monday, February 17, 2020, 4:35:31 AM PST, userdrill.mail...@laposte.net.INVALID wrote:
Re: Re: Requesting json file with schema
Hi,

For our particular case here, we just want to prepare a dataset (in Parquet format) which will be reused by different users, and we don't know what they will do with it. So we prefer to define the subparts that have differing definitions just as text, and let the final user do what he wants with the data. He can drop this data, search it with a regexp, reparse it as JSON, or anything else. The target tool is not known at our step and can be multiple (querying it with Drill, ingestion into a database or another system like Elasticsearch, computing on it with Spark or with tools written in different languages...).

PostgreSQL has a practical way to manipulate JSON data. You can read: https://www.postgresql.org/docs/12/datatype-json.html It is not necessarily optimized in terms of processing speed (because of reparsing on demand), but it is a possibility that always works.
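For example, with the jsonb type described on that page, the JSON text can be queried in place (a minimal sketch; the table and data are invented):

    -- PostgreSQL 12: store JSON as-is, query it with JSON operators
    CREATE TABLE animals (id serial PRIMARY KEY, c jsonb);
    INSERT INTO animals (c) VALUES
        ('{"c1":"blue","c3":"black","c5":{"d":"2","e":"3"}}');

    -- ->> extracts a field as text; -> keeps it as jsonb for further descent
    SELECT c->'c5'->>'d' AS d
    FROM animals
    WHERE c->>'c3' = 'black';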
Here, our problem is to prevent Drill from trying to understand and unify the "c" part of the example, and to simply read it as TEXT. Then, when using the dataset, we would have the possibility to do something like:

    ... WHEN convert_fromJSON(mytable.c).c3 = 'black' ...

To illustrate, to build the Parquet we currently do (in Spark) something like:

    import org.apache.spark.sql.types._

    val partialSchema = StructType(Array(
        StructField("a", StringType, true)
      , StructField("b", IntegerType, true)
      , StructField("c", StringType, true)  // JSON struct read as string, to avoid problems
                                            // due to too much difference between records
    ))
    val djson = spark.read.format("json").schema(partialSchema).load(input)
    djson.createOrReplaceTempView("input_table")
    spark.sql("SELECT a, b, c FROM input_table").write.format("parquet").save(output)

In our case the unification is not possible, and even if it is possible at time t, it will not be the same at t+1. The idea here is to have the possibility to limit the scope of the "unified" JSON part at a chosen depth level. I don't know if it makes sense compared to the usual uses of Apache Drill, but it's part of our problem.

Regards,
Re: Requesting json file with schema
Thanks for the explanation, very helpful.

There are two parts to the problem. On the one hand, you want to read an ever-changing set of JSON files. Your example with "c5" is exactly the kind of "cannot predict the future" issue that can trip up Drill (or, I would argue, any tool that tries to convert JSON to records.)

One thing we have discussed (but not implemented) in Drill is the ability to read a JSON object as a Java map. That is, store the non-relational JSON structure as a more JSON-like Java data structure. The value of your "c" column would then be a Java Map which sometimes contains a column "c5" which is also a Java Map. This would allow Drill to handle any arbitrary JSON without the need of a schema. So, we have a solution, right?

Not so fast. We now get to the other part of the problem. Parquet is columnar, but it assumes that a file has a unified schema. Even if Drill could say, "here is your ragged collection of JSON objects", you'd still need to unify the schema to write the Parquet file.

You are presumably creating Parquet to be consumed by other query tools. So, whatever tool consumes the Parquet now has the same issue as we had with JSON, but now across Parquet files: if some files have one schema and some files another schema, most tools (other than Drill) won't even get started; they need a consistent schema.

Hive solves this by allowing you to specify an (evolving) schema. You tell Hive that c.c5.d and c.c5.e are part of your schema. Hive-compliant tools know to fill in nulls when the columns don't exist in some particular Parquet file.
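In Hive DDL, that might look something like the following (a sketch; the table name and location are invented, and the struct lists only the fields from your example):

    -- Hive: the declared schema covers fields that only some files contain
    CREATE EXTERNAL TABLE animals (
        a STRING,
        b INT,
        c STRUCT<c1:STRING, c3:STRING, c5:STRUCT<d:STRING, e:STRING>>
    )
    STORED AS PARQUET
    LOCATION '/data/animals';

A Parquet file that lacks c.c5 simply yields NULL for c.c5.d and c.c5.e when queried through this table.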
So, taking this whole-system view, we see that, in order to use Parquet as a relational data source, you will need to know the schema of the data in a form that your desired query tool understands. Thus, you need Drill to help you build Parquet files that satisfy that schema.

This brings us to the other solution we've discussed: asking you to provide a schema (or allowing Drill to infer it). That way, even when Drill reads your "horses" record, Drill will know to create a c.c5 column as a map. That is, Drill will have the information to map your complex JSON into a relational-like structure. Thus, when Drill writes the Parquet file, it will write a consistent schema: the same one you must provide to non-Drill query tools. Then, Drill won't write Parquet that depends on the columns that did or did not show up in a single ETL run.

So, two possible (future) solutions: 1) Java Maps (but wild and crazy Parquet schemas), or 2) a declared schema to unify JSON files. Will one or the other work for your use case? Or, is there some better solution you might suggest?

And, what is the target tool for the Parquet files you are creating? How will you handle schema evolution with those tools?

Thanks,
- Paul

On Friday, February 14, 2020, 7:15:15 AM PST, userdrill.mail...@laposte.net.INVALID wrote:
Re: Requesting json file with schema
Hi,

Thanks for all the details. Coming back to one use case: the context is the transformation into Parquet of JSON files containing billions of records, where each record has globally the same schema but can have some specificities. Simplified example:

    {"a":"horses","b":"28","c":{"c1":"black","c2":"blue"}}
    {"a":"rabbit","b":"14","c":{"c1":"green" ,"c4":"vanilla"}}
    {"a":"cow"   ,"b":"28","c":{"c1":"blue"  ,"c3":"black"  ,"c5":{"d":"2","e":"3"}}}
    ...

We need to transform the JSON into Parquet. That is fine for columns a and b (in this example), but for c we don't/can't know all the possibilities, and they keep growing. So the solution is to read "c" as TEXT and postpone the use/treatment of the content. In this example, the destination Parquet will have 3 columns:

    a : VARCHAR (example: 'horses')
    b : INT     (example: 14)
    c : VARCHAR (example: '{"c1":"blue","c3":"black","c5":{"d":"2","e":"3"}}')

We can't do that with Drill, because the "discovery/alignment" of the "c" part of the JSON is too heavy in terms of resources, and the request crashes. So we currently use a Spark solution, as Spark allows specifying a schema when reading a file.

Hope that can help or give ideas,

Regards,
Re: Requesting json file with schema
Hi,

Welcome to the Drill mailing list.

You are right. Drill is a SQL engine. It works best when the JSON input files represent rows and columns. Of course, JSON itself can represent arbitrary data structures: you can use it to serialize any Java structure you want. Relational tables and columns represent a small subset of what JSON can do. Drill's goal is to read relational data encoded in JSON, not to somehow magically convert any arbitrary data structure into tables and columns.

As described in our book, Learning Apache Drill, even seemingly trivial JSON can violate relational rules. For example:

    {a: 10}
    {a: 10.1}

Since Drill infers types, and must guess the type on the first row, Drill will guess BIGINT. Then, the very next row shows that that was a poor guess, and Drill will raise an error. I like to call this the "Drill cannot predict the future" problem.

The team is working on a long-term project to allow you to specify a schema to resolve ambiguities. For example, you can tell Drill that column a above is a DOUBLE despite what the data might say. You can deal with schema evolution to say that column b (which does not appear above, but might appear in newer files) is an array of BIGINT. And so on.

Drill also supports LATERAL JOIN and FLATTEN to handle nested tables (see the sketch at the very end of this message):

    {name: "fred", orders: [ {date: "Jan 1", amount: 12.34}, {date: "Jan 12", amount: 23.45}]}

The schema, however, will not transform arbitrary JSON into tables and columns. Some things are better done in an ETL step, where you can use the full power of a general-purpose language (Java or Scala in Spark, say) to convert wild & crazy JSON into relational form.

We are actively designing the schema feature. May I ask your use case? It would be super-helpful to understand the shape of your input data and how you want to map it into SQL.

One final note: it is possible to write a custom format plugin. If you will query the same wild & crazy JSON shape multiple times, you can write a plugin to do the mapping as Drill reads the data. Not the simplest path, but possible.

Thanks,
- Paul

On Wednesday, February 5, 2020, 1:30:14 PM PST, userdrill.mail...@laposte.net.INVALID wrote:

Hi,

Some JSON files are complex and contain different "tree structures". If these files are big, it takes too much time for Drill to align the structures (and, even worse, it sometimes fails). In Spark it is possible to force a schema when reading a file, to avoid long or useless alignment work, and possibly to dismiss fields or force types (such as reading a structure as a string to avoid descending into it).

Is there any possibility in Drill to specify an explicit schema at read time?

Thanks for any information
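The FLATTEN approach Paul mentions above might be written like this in Drill SQL (a sketch; the file path is invented, and `date` is backticked because it is a reserved word):

    -- Drill: expand each element of the orders array into its own row
    SELECT t.name, t.o.`date` AS order_date, t.o.amount
    FROM (
        SELECT name, FLATTEN(orders) AS o
        FROM dfs.`/data/customers.json`
    ) t;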