Hi,

Thanks for this list. I think it would be helpful if you could create some JIRA tickets for these bugs at https://issues.apache.org/jira/browse/DRILL/. It's easier to track issues that way than by email.
- A

> On Jul 11, 2015, at 11:40 AM, Stefán Baxter <[email protected]> wrote:
>
> Hi,
>
> I'm new to Drill and Parquet and the following are questions/observations I
> made during my initial discovery phase.
>
> I'm sharing them here for other newbies, but also to see whether some of
> these concerns are invalid or based on a misunderstanding.
>
> I made no list of the things I like about what I have seen, but that list
> would be a lot longer than the following :).
>
> *Misc. observations:*
>
> - *Foreign key lookups (joins)*
>   - Coming from the traditional RDBMS world, I have a hard time wrapping
>     my head around how this can be efficient/fast
>   - Broadcasting is something that I need to understand a lot better
>     before committing :)
>   - Will looking up a single value scan all files if nothing is pruned?
>
> - *Rows being loaded before filtering*
>   - In some cases whole rows are loaded before filtering is done
>     (user-defined functions indicate this)
>   - This seems to sacrifice many of the columnar benefits of Parquet
>   - Partially helped by pruning and pre-selection (automatic for Parquet
>     files since the 1.1 release)
>
> - *Count(*) can be expensive*
>   - The documentation states it is "slow on some formats that do not
>     support row count..." - I'm using Parquet and it seems to apply there
>   - Hint: using "count(columns[0])" instead of "count(*)" may help
>   - Slow count()ing seems like such a bad trade-off for an analytics
>     solution
>
> - *Updating Parquet files*
>   - Adding individual rows seems inefficient
>   - Update/Insert/Delete support seems to be scheduled for Drill 1.2
>   - What are the best practices for dealing with streaming data?
>
> - *Unique constraints*
>   - Ensuring uniqueness seems to be defined as outside the scope of Drill
>     (and of Parquet)
>
> - *Views*
>   - Are Parquet-based views materialized and automatically updated?
> - *ISO date support*
>   - It seems strange that ISO dates are not directly supported (a "T"
>     between date and time, and a trailing timezone indicator)
>   - Came across this and somewhat agreed: "For example, the new extended
>     JSON support ($date) will parse a date such as '2015-01-01T00:22:00Z'
>     and convert it to the local time."
>
> - *Histograms / HyperLogLog*
>   - Some analytics stores, like Druid, support histograms and HyperLogLog
>     for fast counting and cardinality estimation
>   - Why is this missing in Drill? Is it planned?
>   - Can it be achieved on top of Parquet?
>
> - *Some stories of SQL idiosyncrasies*
>   - Found this in the mailing archives and it made me smile: "Finally it
>     worked. And the only thing I had to do was writing t2 join t1 instead
>     of t1 join t2. I've changed nothing else. And this really seems weird."
>   - SQL support will surely mature over time (e.g. not being able to use
>     aliases in a GROUP BY clause)
>
> - *Using S3... really?*
>   - Is it efficient, or in line with best practices, to use S3 as a
>     "data source"?
>   - How efficient is column scanning (Parquet) over S3?
>
> - *Roadmap*
>   - I only found the Drill roadmap in one presentation on Slideshare
>     (failed to save the link, sorry)
>   - The issue tracker in JIRA provides roadmap indications :)
>   - Is the roadmap available somewhere?
>
> - *Mailing lists (and the standard Apache interface)*
>   - Any plans to use Google Groups or something a tiny bit more friendly?
>
> - *Data types*
>   - Is there an effective way to store UUIDs in Parquet? (A Parquet
>     question, really, and the answer seems to be no... not directly)
>
> - *Nested flatten*
>   - There are currently some limitations to working with multiple nested
>     structures - issue: https://issues.apache.org/jira/browse/DRILL-2783
>
> I look forward to working with Drill and hope it will be a suitable match
> for our project. (Sorry for not mentioning all the really great things I
> feel I came across.)
>
> Thank you all for the effort.
>
> Regards,
> -Stefan
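On the streaming-data question in the quoted list: one common workaround, independent of any Drill or Parquet API, is to buffer incoming rows and write them out as sizable batches, since columnar formats are written a row group at a time and pay a fixed cost per write. A minimal generic sketch (the `sink` callable is a hypothetical placeholder for whatever actually writes a batch, e.g. a Parquet writer):

```python
class BatchBuffer:
    """Buffer streaming rows and emit them in batches.

    Columnar formats (e.g. Parquet) write a row group at a time, so
    appending single rows is costly; batching amortizes the
    per-file/per-row-group overhead. `sink` is a placeholder for
    whatever actually persists a batch of rows.
    """

    def __init__(self, sink, batch_size=10_000):
        self.sink = sink
        self.batch_size = batch_size
        self.rows = []

    def add(self, row):
        """Queue one row; flush automatically when the batch is full."""
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write any buffered rows to the sink and reset the buffer."""
        if self.rows:
            self.sink(self.rows)
            self.rows = []


# Usage: collect batches in a list to stand in for a real writer.
batches = []
buf = BatchBuffer(batches.append, batch_size=3)
for i in range(7):
    buf.add(i)
buf.flush()  # don't forget the final partial batch
```

In practice the batch size is tuned so each flush produces a reasonably large file, which also keeps the number of small files Drill has to scan under control.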
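On the HyperLogLog question: the sketch itself is format-agnostic, so in principle it can be precomputed and stored alongside Parquet data even without engine support, and sketches for separate files can be merged by taking per-register maxima. A toy illustration of the idea (not Drill or Druid code, and not production-grade):

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog cardinality sketch (illustrative only)."""

    def __init__(self, p=12):
        self.p = p                 # use 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Hash to 64 bits; the top p bits pick a register, the rest
        # contribute their leading-zero count (the "rank").
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Harmonic-mean estimator with the standard small-range
        # (linear counting) correction.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return int(est)
```

With p=12 (4096 registers, ~2 KiB of state) the standard error is roughly 1.6%, which is why stores like Druid can answer approximate distinct counts without rescanning raw data.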
