Hello All- First off, I just wanted to thank you all for this great project. Given the scale and heterogeneity of modern data sources, Drill has killer use cases.
I did want to inquire about a use case I have been researching where I think Drill could be very useful in my ETL pipeline; I just want to articulate it and get some opinions. I have an HDFS directory of JSON files in the following format: https://www.hl7.org/fhir/bundle-transaction.json.html
The issue is that I would like to treat each individual file as a record, since each one corresponds to one entity of interest (only one patient resource per bundle).

I'm curious how Drill differs from Apache Spark (which I am currently using) in this respect. I've found Spark's off-the-shelf methods ineffective here, and my attempts to use sc.wholeTextFiles() with subsequent RDD mapping operations have been very inefficient and memory intensive. Given that a bundle can contain an arbitrary number of resources AND arbitrary nesting depth within those resources, it is challenging to flatten them effectively and, ideally, save the results to Parquet file(s). A rough sketch of what I am doing today is below.
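To make the problem concrete, here is a simplified sketch of my current Spark approach. The HDFS paths are placeholders, the flattening only pulls a couple of fields from the bundle's entry array, and I have swapped my hand-rolled RDD mapping for spark.read.json to keep the example short, but it shows the shape of it:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, explode}

    val spark = SparkSession.builder().appName("fhir-bundles").getOrCreate()
    import spark.implicits._

    // One file == one record: wholeTextFiles yields (path, fileContents) pairs,
    // so each bundle stays together instead of being split line by line.
    val bundles = spark.sparkContext.wholeTextFiles("hdfs:///data/fhir/bundles/*.json")

    // Parse each whole-file string as a single JSON document; Spark infers the
    // (deeply nested) bundle schema, which is where the memory pressure shows up.
    val bundleDF = spark.read.json(bundles.values.toDS())

    // Explode the entry array and pull out a couple of fields. A real bundle needs
    // far more of this because of the arbitrary number and nesting of resources.
    val flattened = bundleDF
      .select(explode(col("entry")).as("entry"))
      .select(
        col("entry.resource.resourceType").as("resource_type"),
        col("entry.resource.id").as("resource_id"))

    flattened.write.mode("overwrite").parquet("hdfs:///data/fhir/flattened")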
Any advice or pointers as to whether Drill might be a solution to my use case would be most appreciated!

Cheers,
John