Re: Various ramblings of a newbie

Jacques Nadeau Mon, 13 Jul 2015 08:33:22 -0700

> I'm guessing my fk_lookup scenario would/could benefit from using other
> storage options for that.
> Currently most of this is in Postgres and I think I saw some mention of
> supporting traditional data sources soon :)



Agreed.



> Yeah, I saw the one involving metadata caching. That seems quite important.
>

Yep.

> Is this custom Parquet reader enabled/available?
>

Yes, Drill automatically uses it when it can.

> We are using Druid and ingest data into it from Kafka/RabbitMQ. It handles
> segment creation (parquet equivalent) and mixes together new/fresh data,
> not stored that way, and historical data that is stored in segments at
> regular intervals.
> I do realize that Drill is not the workflow/ingestion tool but I wonder if
> there are any guidelines to mixing json/other files with parquet and
> especially the transition period from file->parquet to avoid duplicate
> results or missing portions.
> This may all become clear as I examine other tools that are suited for the
> ingestion but it seems like Drill should have something since it has
> directory based queries and seems to cater to these kinds of things.
>

I don't think there is a well-defined recipe yet.  It would be great if
others could chime in on what has worked well for them.

> CTAS could also be used, I guess, to create daily, monthly aggregations for
> historical (non changing) data.
> Can it be used to add to a table, or does it require creating the whole
> table every time?
> I'm guessing that I'm asking the wrong question and that with a directory
> based approach I would just add the new roll-up/aggregation table/file to
> its proper place. (If manual)
>
> Will the PARTITION_BY clause prevent the table creation from deleting other
> files-in-the-table if there is no overlap in the partition_by fields?


We don't currently have append-to-table functionality.  Luckily, the
directory based table referencing makes this very easy anyway.  You can do
CTAS into `mytable/part1` today, then do it again tomorrow into
`mytable/part2`.  If you query `mytable`, you'll get all the data.  This
works even when using PARTITION_BY since PARTITION_BY does not rely on
directory structure.
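
As a rough illustration of the pattern above, here is a minimal Python
sketch that only builds the SQL text for two daily CTAS runs plus the query
over the parent directory. The `dfs.tmp` workspace, the source file paths,
and the helper name `ctas_statement` are assumptions made up for the
example; they are not from this thread, and the statements are not sent to
Drill here.

```python
def ctas_statement(workspace, table, part, select_sql):
    """Build CTAS text that writes into the subdirectory <table>/<part>.

    Hypothetical helper: it only formats a string and does not talk to Drill.
    """
    return f"CREATE TABLE {workspace}.`{table}/{part}` AS\n{select_sql}"


# Assumed names: a writable workspace `dfs.tmp` and made-up source files.
day1 = ctas_statement(
    "dfs.tmp", "mytable", "part1",
    "SELECT * FROM dfs.`/data/incoming/day1.json`",
)
day2 = ctas_statement(
    "dfs.tmp", "mytable", "part2",
    "SELECT * FROM dfs.`/data/incoming/day2.json`",
)

# Querying the parent directory reads every part written so far.
query_all = "SELECT COUNT(*) FROM dfs.tmp.`mytable`"

for statement in (day1, day2, query_all):
    print(statement + ";\n")
```

Each run writes to a new subdirectory, so nothing already under `mytable`
is rewritten; the "append" is just one more directory.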




> Can you please detail the difference and the potential gain once
> fixed-width is supported?
>

I wouldn't worry about it.  We're talking about a performance and memory
impact, but unless you have data that is entirely composed of UUIDs, I can't
imagine it would be a noticeable impact.  (The difference is Drill's
internal representation.  Right now, Drill will hold an extra four-byte
value in memory for the length of each value and constantly need to read
that value when manipulating the data.  If we support a fixed-length
internal representation, we wouldn't need to maintain this length lookup.)
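
To put a rough number on that, here is a back-of-the-envelope Python sketch
of the memory difference. It assumes each UUID is held as 16 raw bytes and
counts only the value bytes plus the four-byte length described above; the
row count is arbitrary and real Drill memory use will include other
overheads.

```python
LENGTH_BYTES = 4  # extra per-value length bookkeeping described above


def variable_width_bytes(n_values, value_bytes):
    """Bytes used when every value also carries a four-byte length."""
    return n_values * (value_bytes + LENGTH_BYTES)


def fixed_width_bytes(n_values, value_bytes):
    """Bytes used when the width is known up front, so no length is stored."""
    return n_values * value_bytes


n_values = 1_000_000  # arbitrary row count for illustration
uuid_bytes = 16       # assumption: a UUID stored as 16 raw bytes

var_total = variable_width_bytes(n_values, uuid_bytes)
fixed_total = fixed_width_bytes(n_values, uuid_bytes)
overhead = (var_total - fixed_total) / fixed_total

print(f"variable-width: {var_total} bytes")
print(f"fixed-width:    {fixed_total} bytes")
print(f"overhead:       {overhead:.0%}")
```

Under these assumptions the length bookkeeping adds a quarter on top of the
UUID bytes themselves, which only matters if such columns dominate the data.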



> Yeah, this must be a hot topic (I'm rooting for this one!)
>

I still see zero votes on the JIRA issue.  You should make sure to vote for
this issue.
