Hi John,

Thank you for your support; we are trying to build the most useful tool for analytics across data sources, and we are always glad to hear we are on the right track.
I am a little confused about your question. If you point Drill at a file with that JSON in it, it will read it as a single record.

You mention wanting to flatten the data out and put it in Parquet files. Have you tried working with the FLATTEN function in Drill? [1] Drill does not currently support anything like a recursive flatten; each level of flattening requires an explicit call to the FLATTEN function. So I'm not sure you will be able to do exactly what you want if the documents really can have arbitrary nesting depths.

Parquet also lacks support for recursive data structure definitions; the metadata requires that a complete schema, explicitly listing each level of nesting, be provided when you start writing the file. (Drill will do this automatically for you during a CTAS statement, but it will just use whatever levels of nesting it read out of your JSON as the Parquet schema.)

How much you want to flatten is going to depend on the kind of analysis you need to do. There are a lot of different lists in this dataset at various levels of nesting. I think you are likely going to want to flatten out at least the `entry` array, although I'm not quite sure how useful analysis across these lists' full 'comment' fields would be in your case. It might make sense to store these as lists and flatten them in separate queries, each invoking analysis of only some of the lists. I've sketched a couple of rough example queries at the very bottom of this message, below your quoted note.

I actually just answered another question about flattening a complex JSON structure this morning; you may find my comments over there useful for learning about Drill. [2]

[1] - https://drill.apache.org/docs/flatten/
[2] - http://mail-archives.apache.org/mod_mbox/drill-user/201601.mbox/%3CCAMpYv7C3CqY6D8x5CC3H955n4CSDTuqY3a8PfZwT1m2dhEyN7w%40mail.gmail.com%3E

On Fri, Jan 8, 2016 at 11:47 AM, John Radin <[email protected]> wrote:
> Hello All-
>
> First off, I just wanted to thank you all for this great project. Given
> the scale and heterogeneity of modern data sources, Drill has killer use
> cases.
>
> I did want to inquire about a use case I have been researching where I
> think Drill could be very useful in my ETL pipeline. I just want to
> articulate it and get some opinions.
>
> I have an HDFS directory of the following JSON file format:
>
> https://www.hl7.org/fhir/bundle-transaction.json.html
>
> The issue is that I would like to treat each individual file as a record,
> since each one corresponds to one entity of interest (only one patient
> resource per bundle). I'm curious how Drill differs from Apache Spark
> (which I am currently using) on this. I've found Apache Spark's
> off-the-shelf methods ineffective in this respect, and my attempts to use
> sc.wholeTextFiles() and subsequent RDD mapping operations to be very
> inefficient/memory-intensive.
>
> Given that a bundle can contain an arbitrary # of resources AND arbitrary
> nesting depth of those resources, it is challenging to find a way to
> flatten them effectively and ideally save them in Parquet file(s).
>
> Any advice or pointers as to whether Drill might be a solution to my use
> case would be most appreciated!
>
> Cheers,
> John
>
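---

To make the FLATTEN point concrete, here is a rough, untested sketch against the bundle-transaction example you linked. The dfs path is a placeholder for wherever your bundles actually live in HDFS, and the column names just follow the sample document:

    -- One row per element of the top-level `entry` array.
    -- FLATTEN goes in a subquery so the outer SELECT can reach
    -- into the fields of each flattened map.
    SELECT f.ent.fullUrl                  AS full_url,
           f.ent.`resource`.resourceType  AS resource_type,
           f.ent.request.`method`         AS request_method
    FROM (
      SELECT FLATTEN(t.entry) AS ent
      FROM dfs.`/data/fhir/bundles/` t
    ) f;

Since each of your files holds a single top-level bundle object, pointing Drill at the directory like this should give you one record per file before the flatten, which sounds like exactly the behavior you were after.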
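And once the flattened shape looks right, a CTAS will write it straight out as Parquet. Again a sketch, not tested; the dfs.tmp workspace and table name here are hypothetical, and Parquet is already Drill's default output format, so the session option is just being explicit:

    -- Parquet is the default store.format; set it explicitly anyway.
    ALTER SESSION SET `store.format` = 'parquet';

    -- Writes Parquet files under the writable dfs.tmp workspace.
    CREATE TABLE dfs.tmp.fhir_entries AS
    SELECT f.ent.fullUrl                  AS full_url,
           f.ent.`resource`.resourceType  AS resource_type,
           f.ent.`resource`               AS `resource`
    FROM (
      SELECT FLATTEN(t.entry) AS ent
      FROM dfs.`/data/fhir/bundles/` t
    ) f;

Note that the `resource` column keeps its nested structure here; per the caveat above, Drill will record whatever levels of nesting it observed in the JSON as the Parquet schema.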
