Hi,

I'm new to Drill and Parquet and the following are questions/observations I
made during my initial discovery phase.

I'm sharing them here for other newbies but also to see if some of these
concerns are invalid or based on misunderstanding.

I didn't make a list of the things I like about what I've seen so far,
but that list would be a lot longer than the following :).

*Misc. observations:*

   - *Foreign key lookups (joins)*
   - Coming from the traditional RDBMS world I have a hard time wrapping
   my head around how this can be efficient/fast
   - Broadcasting is something that I need to understand a lot better
   before committing :) (see the session-option sketch below)
   - Will looking up a single value scan all files if not pruned?
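
   For reference, broadcast-join behavior appears to be controllable per
   session; a minimal sketch, assuming the option names from the Drill
   docs:

      -- toggle broadcast joins for the current session
      ALTER SESSION SET `planner.enable_broadcast_join` = true;
      -- tables estimated below this row count may be broadcast to all nodes
      ALTER SESSION SET `planner.broadcast_threshold` = 1000000;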

   - *Rows being loaded before filtering* - In some cases whole rows are
   loaded before filtering is done (user-defined functions indicate this)
   - This seems to sacrifice many of the columnar benefits of Parquet
   - Partially helped by pruning and pre-selection (automatic for Parquet
   files since the 1.1 release; see the pruning sketch below)
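
   A minimal pruning sketch, assuming data laid out in year/month
   subdirectories (the path and values are my own placeholders); dir0 and
   dir1 are Drill's implicit directory columns:

      -- only the 2015/07 subdirectory should be scanned
      SELECT *
      FROM dfs.`/data/events`
      WHERE dir0 = '2015' AND dir1 = '07';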

   - *Count(*) can be expensive*
   - The documentation states: "slow on some formats that do not support
   row count..." - I'm using Parquet and it seems to apply there
   - Hint: it seems like using "count(columns[0])" instead of "count(*)"
   may help (see the first sketch below)
   - Slow count()ing seems like such a bad trade-off for an analytics
   solution
   - *Updating Parquet files*
   - Seems like adding individual rows is inefficient
   - Update / Insert / Delete seems to be scheduled for Drill 1.2
   - What are best practices for dealing with streaming data? (see the
   CTAS sketch below)
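
   A minimal sketch of the count hint above (the path and the id column
   are hypothetical; note that COUNT(col) skips NULLs, so it only matches
   COUNT(*) on a column with no nulls):

      -- counting one concrete column instead of *
      SELECT COUNT(id) FROM dfs.`/data/events`;
      -- for schema-less text files, the columns array gives the same trick
      SELECT COUNT(columns[0]) FROM dfs.`/data/events.csv`;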
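
   And a sketch of the append-by-batch pattern I've seen suggested for
   streaming-ish data, assuming CTAS into per-batch subdirectories under a
   writable dfs.tmp workspace (all names are my placeholders):

      -- write each micro-batch as a new Parquet directory next to the others
      ALTER SESSION SET `store.format` = 'parquet';
      CREATE TABLE dfs.tmp.`events/batch_00042` AS
        SELECT * FROM dfs.`/staging/batch_00042.json`;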

   - *Unique constraints*
   - Ensuring uniqueness seems to be outside the scope of Drill (and of
   Parquet)

   - *Views*
   - Are Parquet-based views materialized and automatically updated? (see
   the view sketch below)
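
   For context, the view DDL I'm asking about; a minimal sketch assuming a
   writable dfs.tmp workspace (all names are placeholders):

      CREATE VIEW dfs.tmp.recent_events AS
        SELECT id, ts FROM dfs.`/data/events` WHERE ts > '2015-07-01';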

   - *ISO date support*
   - Seems strange that ISO dates are not directly supported (the "T"
   between date and time, and a trailing timezone indicator)
   - Came across this and somewhat agreed: "For example, the new extended
   JSON support ($date) will parse a date such as '2015-01-01T00:22:00Z'
   and convert it to the local time." (see the TO_TIMESTAMP sketch below)
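
   A workaround sketch using TO_TIMESTAMP with an explicit Joda-style
   format (note this treats the trailing Z as a literal rather than a real
   offset; sys.version is just a convenient one-row table):

      SELECT TO_TIMESTAMP('2015-01-01T00:22:00Z',
                          'yyyy-MM-dd''T''HH:mm:ss''Z''') AS ts
      FROM sys.version;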

   - *Histograms / HyperLogLog*
   - Some analytics stores, like Druid, support histograms and HyperLogLog
   for fast counting and cardinality estimation
   - Why is this missing in Drill? Is it planned?
   - Can it be achieved on top of Parquet? (see the exact-count sketch
   below)
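
   For now the only built-in route I can see is an exact distinct count,
   which is what an HLL sketch would approximate (the column and path are
   placeholders):

      -- exact cardinality; potentially expensive on large data
      SELECT COUNT(DISTINCT user_id) FROM dfs.`/data/events`;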

   - *Some stories of SQL idiosyncrasies* - Found this in the mailing list
   archives and it made me smile: "Finally it worked. And the only thing I
   had to do was writing t2 join t1 instead of t1 join t2. I've changed
   nothing else. And this really seems weird." - SQL support will surely
   mature over time (like not being able to include aliases in the GROUP
   BY clause; see the first sketch after this list)
   - *Using S3... really?* - Is it efficient, or according to best
   practices, to use S3 as a "data source"? - How efficient is column
   scanning over S3? (Parquet)
   - *Roadmap* - I only found the Drill roadmap in one presentation on
   Slideshare (failed to save the link, sorry) - The issue tracker in Jira
   provides roadmap indications :) - Is the roadmap available anywhere?
   - *Mailgroups (and the standard Apache interface)* - Any plans to use
   Google Groups or something a tiny bit more friendly?
   - *Data types* - Is there an effective way to store UUIDs in Parquet?
   (A Parquet question really, and the answer seems to be no... not
   directly)
   - *Nested flatten* - There are currently some limitations to working
   with multiple nested structures (see the second sketch after this
   list) - issue: https://issues.apache.org/jira/browse/DRILL-2783
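
   First sketch: the alias-in-GROUP-BY limitation mentioned above (table
   and column names are made up):

      -- reportedly rejected: GROUP BY cannot reference the alias
      SELECT EXTRACT(year FROM ts) AS yr, COUNT(*) FROM t GROUP BY yr;
      -- workaround: repeat the full expression
      SELECT EXTRACT(year FROM ts) AS yr, COUNT(*)
      FROM t
      GROUP BY EXTRACT(year FROM ts);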
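
   Second sketch: a single FLATTEN works fine; it is combining several of
   them on one row that runs into DRILL-2783 (the file and field names are
   placeholders):

      -- one row per element of the nested tags array
      SELECT e.id, FLATTEN(e.tags) AS tag
      FROM dfs.`/data/events.json` e;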

I look forward to working with Drill and hope it will be a suitable match
for our project. (Sorry for not mentioning all the really great things I
feel I came across)

Thank you all for the effort.

Regards,
-Stefan
