Hi, I'm new to Drill and Parquet, and the following are questions/observations from my initial discovery phase.
I'm sharing them here for other newbies, but also to see whether some of these concerns are invalid or based on misunderstanding. I made no list of the things I like about what I have seen, but that list would be a lot longer than the following :).

*Misc. observations:*

- *Foreign key lookups (joins)*
  - Coming from the traditional RDBMS world, I have a hard time wrapping my head around how this can be efficient/fast
  - Broadcasting is something that I need to understand a lot better before committing :)
  - Will looking up a single value scan all files if not pruned?
- *Rows being loaded before filtering*
  - In some cases whole rows are loaded before filtering is done (user-defined functions indicate this)
  - This seems to sacrifice many of the columnar benefits of Parquet
  - Partially helped by pruning and pre-selection (automatic for Parquet files since the 1.1 release)
- *Count(*) can be expensive*
  - The documentation states: "slow on some formats that do not support row count..."
  - I'm using Parquet, and it seems to apply there
  - Hint: it seems like using "count(columns[0])" instead of "count(*)" may help
  - Slow count()ing seems like such a bad trade-off for an analytics solution
- *Updating Parquet files*
  - Adding individual rows seems inefficient
  - Update/Insert/Delete seems to be scheduled for Drill 1.2
  - What are best practices for dealing with streaming data?
- *Unique constraints*
  - Ensuring uniqueness seems to be defined outside the scope of Drill (a Parquet matter)
- *Views*
  - Are Parquet-based views materialized and automatically updated?
- *ISO date support*
  - It seems strange that ISO dates are not directly supported (the "T" between date and time, and a trailing timezone indicator)
  - Came across this and somewhat agreed: "For example, the new extended JSON support ($date) will parse a date such as '2015-01-01T00:22:00Z' and convert it to the local time."
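One caveat worth noting about the count() hint above: in standard SQL, COUNT(column) counts only non-NULL values, while COUNT(*) counts rows, so the two are only interchangeable when the chosen column has no NULLs. A minimal sketch of the difference, using in-memory SQLite as a stand-in for Drill (the table, column names, and data are made up for illustration; the COUNT semantics themselves are standard SQL):

```python
import sqlite3

# In-memory database standing in for a Drill/Parquet data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, user_id INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10), (2, None), (3, 11)],  # one row has a NULL user_id
)

# COUNT(*) counts rows; COUNT(user_id) skips the NULL.
count_star = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
count_col = conn.execute("SELECT COUNT(user_id) FROM events").fetchone()[0]

print(count_star)  # 3
print(count_col)   # 2
```

So the workaround is only a drop-in replacement when counting over a column that is guaranteed non-NULL.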
- *Histograms / HyperLogLog*
  - Some analytics stores, like Druid, support histograms and HyperLogLog for fast counting and cardinality estimation
  - Why is this missing in Drill? Is it planned?
  - Can it be achieved on top of Parquet?
- *Some stories of SQL idiosyncrasies*
  - Found this in the mailing-list archives and it made me smile: "Finally it worked. And the only thing I had to do was writing t2 join t1 instead of t1 join t2. I've changed nothing else. And this really seems weird."
  - SQL support will surely mature over time (like not being able to use aliases in a group by clause)
- *Using S3... really?*
  - Is it efficient, or in line with best practices, to use S3 as a "data source"?
  - How efficient is column scanning over S3 (Parquet)?
- *Roadmap*
  - I only found the Drill roadmap in one presentation on Slideshare (failed to save the link, sorry)
  - The issue tracker in Jira provides roadmap indications :)
  - Is the roadmap available anywhere?
- *Mailgroups (and the standard Apache interface)*
  - Any plans to use Google Groups or something a tiny bit more friendly?
- *Data types*
  - Is there an effective way to store UUIDs in Parquet? (A Parquet question really, and the answer seems to be no... not directly)
- *Nested flatten*
  - There are currently some limitations to working with multiple nested structures
  - Issue: https://issues.apache.org/jira/browse/DRILL-2783

I look forward to working with Drill and hope it will be a suitable match for our project. (Sorry for not mentioning all the really great things I feel I came across.)

Thank you all for the effort.

Regards,
-Stefan
