Hi, and thanks.

Regarding "/part2": I think an append table would allow for a "cleaner" setup; adding data once a day would lead to a fairly messy directory structure (though that is perhaps irrelevant).

We are dealing with multi-tenancy, and PARTITION BY sounds like a good way to handle that. I'm guessing that partitioning by a) tenant, series, year, month or b) series, tenant, year, month will be the way to go (series = "datasource"). Are there any best practices for dealing with that?
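To make option a) concrete, here is a rough sketch of what I have in mind (the workspace, table, and column names are made-up placeholders for our schema, not anything Drill-specific):

    CREATE TABLE dfs.tmp.`metrics_by_tenant`
    PARTITION BY (tenant, series, yr, mnth) AS
    SELECT tenant,
           series,
           EXTRACT(YEAR FROM event_time)  AS yr,
           EXTRACT(MONTH FROM event_time) AS mnth,
           metric_value
    FROM dfs.tmp.`metrics_raw`;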
Regards,
-Stefan

On Mon, Jul 13, 2015 at 3:32 PM, Jacques Nadeau <[email protected]> wrote:

> > I'm guessing my fk_lookup scenario would/could benefit from using other
> > storage options for that.
> > Currently most of this is in Postgres, and I think I saw some mention of
> > supporting traditional data sources soon :)
>
> Agreed.
>
> > Yeah, I saw the one involving metadata caching. That seems quite
> > important.
>
> Yep.
>
> > Is this custom Parquet reader enabled/available?
>
> Yes, Drill automatically uses it when it can.
>
> > We are using Druid and ingest data into it from Kafka/RabbitMQ. It
> > handles segment creation (the Parquet equivalent) and mixes together
> > new/fresh data, which is not yet stored that way, and historical data
> > that is stored in segments at regular intervals.
> > I do realize that Drill is not the workflow/ingestion tool, but I wonder
> > if there are any guidelines for mixing JSON/other files with Parquet,
> > especially during the transition period from file to Parquet, to avoid
> > duplicate results or missing portions.
> > This may all become clear as I examine other tools that are suited for
> > the ingestion, but it seems like Drill should have something, since it
> > has directory-based queries and seems to cater to these kinds of things.
>
> I don't think there is a well-defined recipe yet. It would be great if
> others could chime in on what has worked well for them.
>
> > CTAS could also be used, I guess, to create daily/monthly aggregations
> > for historical (non-changing) data.
> > Can it be used to add to a table, or does it require creating the whole
> > table every time?
> > I'm guessing that I'm asking the wrong question and that with a
> > directory-based approach I would just add the new roll-up/aggregation
> > table/file to its proper place (if manual).
> >
> > Will the PARTITION_BY clause prevent the table creation from deleting
> > other files in the table if there is no overlap in the partition_by
> > fields?
>
> We don't currently have append-table functionality. Luckily, the
> directory-based table referencing makes this very easy anyway. You can do
> CTAS into `mytable/part1` today, then do it again tomorrow into
> `mytable/part2`. If you query `mytable`, you'll get all the data. This
> works even when using PARTITION_BY, since PARTITION_BY does not rely on
> directory structure.
>
> > Can you please detail the difference and the potential gain once
> > fixed-width is supported?
>
> I wouldn't worry about it. We're talking about a performance and memory
> impact, but unless you have data that is entirely composed of UUIDs, I
> can't imagine it would be a noticeable impact. (The difference is Drill's
> internal representation. Right now, Drill holds an extra four-byte value
> in memory for the length of each value and constantly needs to read that
> value when manipulating the data. If we supported a fixed-length internal
> representation, we wouldn't need to maintain this length lookup.)
>
> > Yeah, this must be a hot topic (I'm rooting for this one!)
>
> I still see zero votes on the JIRA issue. You should make sure to vote for
> it.
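P.S. Just to check that I read the directory-based append idea above correctly, I picture it roughly like this (the dfs.data and dfs.staging workspaces and the source paths are made up for illustration):

    -- today's load (dfs.data and dfs.staging are made-up workspaces)
    CREATE TABLE dfs.data.`mytable/part1` AS
    SELECT * FROM dfs.staging.`day1.json`;

    -- tomorrow's load goes into a sibling directory
    CREATE TABLE dfs.data.`mytable/part2` AS
    SELECT * FROM dfs.staging.`day2.json`;

    -- querying the parent directory returns both parts
    SELECT COUNT(*) FROM dfs.data.`mytable`;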
