Hi,

Thanks for this list. I think it would be helpful if you could create some JIRA tickets for these bugs at https://issues.apache.org/jira/browse/DRILL/. It's easier to track issues that way than by email.
- A

> On Jul 11, 2015, at 11:40 AM, Stefán Baxter <[email protected]> wrote:
>
> Hi,
>
> I'm new to Drill and Parquet and the following are questions/observations I
> made during my initial discovery phase.
>
> I'm sharing them here for other newbies, but also to see whether some of
> these concerns are invalid or based on a misunderstanding.
>
> I made no list of the things I like about what I have seen, but that list
> would be a lot longer than the following :).
>
> *Misc. observations:*
>
> - *Foreign key lookups (joins)*
>   - Coming from the traditional RDBMS world, I have a hard time wrapping
>     my head around how this can be efficient/fast
>   - Broadcasting is something that I need to understand a lot better
>     before committing :)
>   - Will looking up a single value scan all files if nothing is pruned?
>
> - *Rows being loaded before filtering*
>   - In some cases whole rows are loaded before filtering is done
>     (user-defined functions indicate this)
>   - This seems to sacrifice many of the columnar benefits of Parquet
>   - Partially helped by pruning and pre-selection (automatic for Parquet
>     files since the 1.1 release)
>
> - *Count(*) can be expensive*
>   - The documentation states it is "slow on some formats that do not
>     support row count..." - I'm using Parquet and it seems to apply there
>   - Hint: using "count(columns[0])" instead of "count(*)" may help
>   - Slow count()ing seems like such a bad trade-off for an analytics
>     solution
>
> - *Updating Parquet files*
>   - Adding individual rows seems inefficient
>   - Update/Insert/Delete support seems to be scheduled for Drill 1.2
>   - What are the best practices for dealing with streaming data?
>
> - *Unique constraints*
>   - Ensuring uniqueness seems to be defined as outside the scope of Drill
>     (and of Parquet)
>
> - *Views*
>   - Are Parquet-based views materialized and automatically updated?
> - *ISO date support*
>   - It seems strange that ISO dates are not directly supported (a "T"
>     between date and time, and a trailing timezone indicator)
>   - Came across this and somewhat agreed: "For example, the new extended
>     JSON support ($date) will parse a date such as '2015-01-01T00:22:00Z'
>     and convert it to the local time."
>
> - *Histograms / HyperLogLog*
>   - Some analytics stores, like Druid, support histograms and HyperLogLog
>     for fast counting and cardinality estimation
>   - Why is this missing in Drill? Is it planned?
>   - Can it be achieved on top of Parquet?
>
> - *Some stories of SQL idiosyncrasies*
>   - Found this in the mailing archives and it made me smile: "Finally it
>     worked. And the only thing I had to do was writing t2 join t1 instead
>     of t1 join t2. I've changed nothing else. And this really seems weird."
>   - SQL support will surely mature over time (e.g. not being able to use
>     aliases in a GROUP BY clause)
>
> - *Using S3... really?*
>   - Is it efficient, or in line with best practices, to use S3 as a
>     "data source"?
>   - How efficient is column scanning (Parquet) over S3?
>
> - *Roadmap*
>   - I only found the Drill roadmap in one presentation on Slideshare
>     (failed to save the link, sorry)
>   - The issue tracker in JIRA provides roadmap indications :)
>   - Is the roadmap available somewhere?
>
> - *Mailing lists (and the standard Apache interface)*
>   - Any plans to use Google Groups or something a tiny bit more friendly?
>
> - *Data types*
>   - Is there an effective way to store UUIDs in Parquet? (A Parquet
>     question, really, and the answer seems to be no... not directly)
>
> - *Nested flatten*
>   - There are currently some limitations to working with multiple nested
>     structures - issue: https://issues.apache.org/jira/browse/DRILL-2783
>
> I look forward to working with Drill and hope it will be a suitable match
> for our project. (Sorry for not mentioning all the really great things I
> feel I came across.)
>
> Thank you all for the effort.
>
> Regards,
> -Stefan
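On the streaming-data question in the quoted list: one common workaround, independent of any Drill or Parquet API, is to buffer incoming rows and write them out as sizable batches, since columnar formats are written a row group at a time and pay a fixed cost per write. A minimal generic sketch (the `sink` callable is a hypothetical placeholder for whatever actually writes a batch, e.g. a Parquet writer):

```python
class BatchBuffer:
    """Buffer streaming rows and emit them in batches.

    Columnar formats (e.g. Parquet) write a row group at a time, so
    appending single rows is costly; batching amortizes the
    per-file/per-row-group overhead. `sink` is a placeholder for
    whatever actually persists a batch of rows.
    """

    def __init__(self, sink, batch_size=10_000):
        self.sink = sink
        self.batch_size = batch_size
        self.rows = []

    def add(self, row):
        """Queue one row; flush automatically when the batch is full."""
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write any buffered rows to the sink and reset the buffer."""
        if self.rows:
            self.sink(self.rows)
            self.rows = []


# Usage: collect batches in a list to stand in for a real writer.
batches = []
buf = BatchBuffer(batches.append, batch_size=3)
for i in range(7):
    buf.add(i)
buf.flush()  # don't forget the final partial batch
```

In practice the batch size is tuned so each flush produces a reasonably large file, which also keeps the number of small files Drill has to scan under control.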
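On the HyperLogLog question: the sketch itself is format-agnostic, so in principle it can be precomputed and stored alongside Parquet data even without engine support, and sketches for separate files can be merged by taking per-register maxima. A toy illustration of the idea (not Drill or Druid code, and not production-grade):

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog cardinality sketch (illustrative only)."""

    def __init__(self, p=12):
        self.p = p                 # use 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Hash to 64 bits; the top p bits pick a register, the rest
        # contribute their leading-zero count (the "rank").
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Harmonic-mean estimator with the standard small-range
        # (linear counting) correction.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return int(est)
```

With p=12 (4096 registers, ~2 KiB of state) the standard error is roughly 1.6%, which is why stores like Druid can answer approximate distinct counts without rescanning raw data.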
