Hi Jacques, and thank you for answering swiftly and clearly :).
Some additional questions did arise (see inline):

> > - *Foreign key lookups (joins)*

I'm guessing my fk_lookup scenario would/could benefit from using other
storage options for that. Currently most of this is in Postgres, and I think
I saw some mention of supporting traditional data sources soon :)

> > - Partially helped by pruning and pre-selection (automatic for Parquet
> > files since latest 1.1 release)
>
> We do some of this. More could be done. There are a number of open JIRAs
> on this topic.

Yeah, I saw the one involving metadata caching. That seems quite important.

> > - *Count(*) can be expensive*
> > - *Rows being loaded before filtering* - In some cases whole rows are
> > loaded before filtering is done (user defined functions indicate this)
> > - This seems to sacrifice many of the "columnar" benefits of Parquet
>
> Yes, this is a cost that can be optimized. (We have to leave some room to
> optimize Drill after 1.0, right :D ) That being said, we've built a custom
> Parquet reader that transforms directly from the columnar disk
> representation into our in-memory columnar representation. This is several
> times faster than the traditional Parquet reader. In most cases, this
> isn't a big issue for workloads.

Is this custom Parquet reader enabled/available? Would it work with remote
storage, like S3? (I'm guessing not)

> If you generate your Parquet files using Drill, Drill should be quick to
> return count(*). However, we've seen some systems generate Parquet files
> without setting the metadata of the number of records for each file. This
> would degrade performance as it would require a full scan. If you provide
> the output of the parquet-tools head, we should be able to diagnose why
> this is a problem for your files.

Thank you, I will take you up on that if the problem persists after I have
stopped making so many novice mistakes :)

> > - What are best practices for dealing with streaming data?
>
> Can you expound on your use case? It really depends on what you mean by
> streaming data.

We are using Druid and ingest data into it from Kafka/RabbitMQ. It handles
segment creation (the Parquet equivalent) and mixes together new/fresh data,
which is not yet stored that way, with historical data that is stored in
segments at regular intervals. I do realize that Drill is not the
workflow/ingestion tool, but I wonder if there are any guidelines for mixing
json/other files with Parquet, and especially for handling the transition
period from file->parquet, to avoid duplicate results or missing portions.
This may all become clear as I examine other tools that are suited for the
ingestion, but it seems like Drill should have something, since it has
directory-based queries and seems to cater to these kinds of things (I
sketch what I mean below).

> > - *Views*
> > - Are parquet based views materialized and automatically updated?
>
> Views are logical only and are executed each time a query above them is
> run. The good news is that the view and the query utilizing it are
> optimized as a single relational plan so that we do only the work that is
> necessary for the actual output of the final query.

CTAS could also be used, I guess, to create daily/monthly aggregations for
historical (non-changing) data. Can it be used to append to an existing
table, or does it require creating the whole table every time? I'm guessing
that I'm asking the wrong question and that with a directory-based approach
I would just add the new roll-up/aggregation table/file to its proper place.
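Going back to the streaming question for a moment, here is a minimal sketch
of the kind of mixed query I have in mind (the workspace, directory, and
column names are all invented for illustration):

    -- Assumed layout:
    --   /data/events/incoming    <- fresh JSON as it lands from Kafka/RabbitMQ
    --   /data/events/historical  <- compacted Parquet, one directory per month
    SELECT event_ts, user_id
    FROM dfs.`/data/events/historical`
    WHERE dir0 >= '2015-06'   -- dir0 = first directory level below the root,
                              -- so pruning should be able to skip old months
    UNION ALL
    SELECT event_ts, user_id
    FROM dfs.`/data/events/incoming`;

The part that worries me is the window where a batch of JSON has been
rewritten to Parquet but not yet removed from incoming/ (duplicates), or
removed before the Parquet file lands (missing rows).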
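And for the roll-ups, this is roughly what I was picturing with CTAS (again,
the names are made up, and I'm going from the 1.1 docs for the PARTITION BY
syntax, so correct me if I have it wrong):

    -- One CTAS per closed month, each writing a fresh table/directory;
    -- the partition column has to appear in the select list:
    CREATE TABLE dfs.tmp.`rollup_2015_06`
    PARTITION BY (event_date) AS
    SELECT event_date, user_id, COUNT(*) AS events
    FROM dfs.`/data/events/historical/2015-06`
    GROUP BY event_date, user_id;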
(If manual) Will the PARTITION BY clause prevent the table creation from
deleting other files in the table if there is no overlap in the PARTITION BY
columns?

> > - *Histograms / HyperLogLog*
> > - Some analytics stores, like Druid, support histograms and HyperLogLog
> > for fast counting and cardinality estimations
> > - Why is this missing in Drill, is it planned?
>
> Just haven't gotten to it yet. We will.

I saw something on this in the Parquet community and think this must be an
"in tandem" kind of thing.

> Not right now. Parquet does support fixed width binary fields so you could
> store a 16 byte field that held the UUID. That would be extremely
> efficient. Drill doesn't yet support generating a fixed width field for
> Parquet but it is something that will be added in the future. Drill should
> read the field no problem (as opaque VARBINARY)

Can you please detail the difference and the potential gain once fixed-width
is supported?

> > - *Nested flatten* - There are currently some limitations to working
> > with multiple nested structures - issue:
> > https://issues.apache.org/jira/browse/DRILL-2783
>
> This is an enhancement that no one has gotten to yet. Make sure to vote
> for it (and get your friends to vote for it) and we'll probably get to it
> sooner.
>
> Jacques

Yeah, this must be a hot topic (I'm rooting for this one!). To be sure we're
talking about the same limitation, the shape I run into looks roughly like
the sketch below.
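A made-up record with two sibling arrays, as I understand the DRILL-2783
case:

    -- sample.json: {"id": 1, "tags": ["a", "b"], "clicks": [{"x": 1}, {"x": 2}]}
    SELECT t.id,
           FLATTEN(t.tags)   AS tag,
           FLATTEN(t.clicks) AS click
    FROM dfs.`/data/sample.json` t;

Flattening either array on its own is fine; it's combining FLATTENs over
multiple nested structures in one query where I understand the current
limitations to bite.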
Thank you again for the prompt and clear answers. I'm quite impressed with
both Drill and Parquet and look forward to digging deeper :).

Regards,
-Stefan