Hi John,

Thank you for your support; we are trying to build the most useful tool for analytics across data sources, and we are always glad to hear we are on the right track.
I am a little confused about your question. If you point Drill at a file with that JSON in it, it will read it as a single record.

You mention wanting to flatten the data out and put it in Parquet files. Have you tried working with the FLATTEN function in Drill? [1] Drill does not currently support anything like a recursive flatten; each level of flattening requires an explicit call to the FLATTEN function. So I'm not sure you will be able to do exactly what you want if the documents really can have arbitrary nesting depths.

Parquet also lacks support for recursive data structure definitions; the metadata requires that a complete schema, explicitly listing each level of nesting, be provided when you start writing the file. (Drill will do this automatically for you during a CTAS statement, but it will just use whatever levels of nesting it read out of your JSON as the Parquet schema.)

How much you want to flatten is going to depend on the kind of analysis you need to do. There are a lot of different lists in this dataset at various levels of nesting. I think you are likely going to want to flatten out at least the `entry` array, although I'm not quite sure how useful analysis across these lists' full 'comment' fields would be in your case. It might make sense to store these as lists and flatten them in separate queries, each invoking analysis of only some of the lists. I've sketched a couple of rough example queries at the very bottom of this message, below your quoted note.

I actually just answered another question about flattening a complex JSON structure this morning; you may find my comments over there useful for learning about Drill. [2]

[1] - https://drill.apache.org/docs/flatten/
[2] - http://mail-archives.apache.org/mod_mbox/drill-user/201601.mbox/%3CCAMpYv7C3CqY6D8x5CC3H955n4CSDTuqY3a8PfZwT1m2dhEyN7w%40mail.gmail.com%3E

On Fri, Jan 8, 2016 at 11:47 AM, John Radin <[email protected]> wrote:
> Hello All-
>
> First off, I just wanted to thank you all for this great project. Given
> the scale and heterogeneity of modern data sources, Drill has killer use
> cases.
>
> I did want to inquire about a use case I have been researching where I
> think Drill could be very useful in my ETL pipeline. I just want to
> articulate it and get some opinions.
>
> I have an HDFS directory of the following JSON file format:
>
> https://www.hl7.org/fhir/bundle-transaction.json.html
>
> The issue is that I would like to treat each individual file as a record,
> since each one corresponds to one entity of interest (only one patient
> resource per bundle). I'm curious how Drill differs from Apache Spark
> (which I am currently using) on this. I've found Apache Spark's
> off-the-shelf methods ineffective in this respect, and my attempts to use
> sc.wholeTextFiles() and subsequent RDD mapping operations to be very
> inefficient/memory-intensive.
>
> Given that a bundle can contain an arbitrary # of resources AND arbitrary
> nesting depth of those resources, it is challenging to find a way to
> flatten them effectively and ideally save them in Parquet file(s).
>
> Any advice or pointers as to whether Drill might be a solution to my use
> case would be most appreciated!
>
> Cheers,
> John
>
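---

To make the FLATTEN point concrete, here is a rough, untested sketch against the bundle-transaction example you linked. The dfs path is a placeholder for wherever your bundles actually live in HDFS, and the column names just follow the sample document:

    -- One row per element of the top-level `entry` array.
    -- FLATTEN goes in a subquery so the outer SELECT can reach
    -- into the fields of each flattened map.
    SELECT f.ent.fullUrl                  AS full_url,
           f.ent.`resource`.resourceType  AS resource_type,
           f.ent.request.`method`         AS request_method
    FROM (
      SELECT FLATTEN(t.entry) AS ent
      FROM dfs.`/data/fhir/bundles/` t
    ) f;

Since each of your files holds a single top-level bundle object, pointing Drill at the directory like this should give you one record per file before the flatten, which sounds like exactly the behavior you were after.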
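And once the flattened shape looks right, a CTAS will write it straight out as Parquet. Again a sketch, not tested; the dfs.tmp workspace and table name here are hypothetical, and Parquet is already Drill's default output format, so the session option is just being explicit:

    -- Parquet is the default store.format; set it explicitly anyway.
    ALTER SESSION SET `store.format` = 'parquet';

    -- Writes Parquet files under the writable dfs.tmp workspace.
    CREATE TABLE dfs.tmp.fhir_entries AS
    SELECT f.ent.fullUrl                  AS full_url,
           f.ent.`resource`.resourceType  AS resource_type,
           f.ent.`resource`               AS `resource`
    FROM (
      SELECT FLATTEN(t.entry) AS ent
      FROM dfs.`/data/fhir/bundles/` t
    ) f;

Note that the `resource` column keeps its nested structure here; per the caveat above, Drill will record whatever levels of nesting it observed in the JSON as the Parquet schema.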
