Hi, and thanks.

Regarding "/part2": I think an append table would allow for a "cleaner" setup; adding data once a day would lead to a fairly messy directory structure (though that is perhaps irrelevant).

We are dealing with multi-tenancy, and PARTITION BY sounds like a good way to handle that. I'm guessing that partitioning by a) tenant, series, year, month or b) series, tenant, year, month will be the way to go (series = "datasource"). Are there any best practices for dealing with that?
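To make option a) concrete, here is a rough sketch of what I have in mind (the workspace, table, and column names are made-up placeholders for our schema, not anything Drill-specific):

    CREATE TABLE dfs.tmp.`metrics_by_tenant`
    PARTITION BY (tenant, series, yr, mnth) AS
    SELECT tenant,
           series,
           EXTRACT(YEAR FROM event_time)  AS yr,
           EXTRACT(MONTH FROM event_time) AS mnth,
           metric_value
    FROM dfs.tmp.`metrics_raw`;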
Regards,
-Stefan

On Mon, Jul 13, 2015 at 3:32 PM, Jacques Nadeau <[email protected]> wrote:

> > I'm guessing my fk_lookup scenario would/could benefit from using other
> > storage options for that.
> > Currently most of this is in Postgres, and I think I saw some mention of
> > supporting traditional data sources soon :)
>
> Agreed.
>
> > Yeah, I saw the one involving metadata caching. That seems quite
> > important.
>
> Yep.
>
> > Is this custom Parquet reader enabled/available?
>
> Yes, Drill automatically uses it when it can.
>
> > We are using Druid and ingest data into it from Kafka/RabbitMQ. It
> > handles segment creation (the Parquet equivalent) and mixes together
> > new/fresh data, which is not yet stored that way, and historical data
> > that is stored in segments at regular intervals.
> > I do realize that Drill is not the workflow/ingestion tool, but I wonder
> > if there are any guidelines for mixing JSON/other files with Parquet,
> > especially during the transition period from file to Parquet, to avoid
> > duplicate results or missing portions.
> > This may all become clear as I examine other tools that are suited for
> > the ingestion, but it seems like Drill should have something, since it
> > has directory-based queries and seems to cater to these kinds of things.
>
> I don't think there is a well-defined recipe yet. It would be great if
> others could chime in on what has worked well for them.
>
> > CTAS could also be used, I guess, to create daily/monthly aggregations
> > for historical (non-changing) data.
> > Can it be used to add to a table, or does it require creating the whole
> > table every time?
> > I'm guessing that I'm asking the wrong question and that with a
> > directory-based approach I would just add the new roll-up/aggregation
> > table/file to its proper place (if manual).
> >
> > Will the PARTITION_BY clause prevent the table creation from deleting
> > other files in the table if there is no overlap in the partition_by
> > fields?
>
> We don't currently have append-table functionality. Luckily, the
> directory-based table referencing makes this very easy anyway. You can do
> CTAS into `mytable/part1` today, then do it again tomorrow into
> `mytable/part2`. If you query `mytable`, you'll get all the data. This
> works even when using PARTITION_BY, since PARTITION_BY does not rely on
> directory structure.
>
> > Can you please detail the difference and the potential gain once
> > fixed-width is supported?
>
> I wouldn't worry about it. We're talking about a performance and memory
> impact, but unless you have data that is entirely composed of UUIDs, I
> can't imagine it would be a noticeable impact. (The difference is Drill's
> internal representation. Right now, Drill holds an extra four-byte value
> in memory for the length of each value and constantly needs to read that
> value when manipulating the data. If we supported a fixed-length internal
> representation, we wouldn't need to maintain this length lookup.)
>
> > Yeah, this must be a hot topic (I'm rooting for this one!)
>
> I still see zero votes on the JIRA issue. You should make sure to vote for
> it.
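P.S. Just to check that I read the directory-based append idea above correctly, I picture it roughly like this (the dfs.data and dfs.staging workspaces and the source paths are made up for illustration):

    -- today's load (dfs.data and dfs.staging are made-up workspaces)
    CREATE TABLE dfs.data.`mytable/part1` AS
    SELECT * FROM dfs.staging.`day1.json`;

    -- tomorrow's load goes into a sibling directory
    CREATE TABLE dfs.data.`mytable/part2` AS
    SELECT * FROM dfs.staging.`day2.json`;

    -- querying the parent directory returns both parts
    SELECT COUNT(*) FROM dfs.data.`mytable`;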
