Hi Jacques, and thank you for answering swiftly and clearly :).
Some additional questions did arise (see inline):

> > - *Foreign key lookups (joins)*

I'm guessing my fk_lookup scenario would/could benefit from using other
storage options for that. Currently most of this is in Postgres, and I think
I saw some mention of supporting traditional data sources soon :)

> > - Partially helped by pruning and pre-selection (automatic for Parquet
> > files since latest 1.1 release)
>
> We do some of this. More could be done. There are a number of open JIRAs
> on this topic.

Yeah, I saw the one involving metadata caching. That seems quite important.

> > - *Count(*) can be expensive*
> > - *Rows being loaded before filtering* - In some cases whole rows are
> > loaded before filtering is done (user defined functions indicate this)
> > - This seems to sacrifice many of the "columnar" benefits of Parquet
>
> Yes, this is a cost that can be optimized. (We have to leave some room to
> optimize Drill after 1.0, right :D ) That being said, we've built a custom
> Parquet reader that transforms directly from the columnar disk
> representation into our in-memory columnar representation. This is several
> times faster than the traditional Parquet reader. In most cases, this
> isn't a big issue for workloads.

Is this custom Parquet reader enabled/available? Would it work with remote
storage, like S3? (I'm guessing not)

> If you generate your Parquet files using Drill, Drill should be quick to
> return count(*). However, we've seen some systems generate Parquet files
> without setting the metadata of the number of records for each file. This
> would degrade performance as it would require a full scan. If you provide
> the output of the parquet-tools head, we should be able to diagnose why
> this is a problem for your files.

Thank you, I will take you up on that if the problem persists after I have
stopped making so many novice mistakes :)

> > - What are best practices for dealing with streaming data?
>
> Can you expound on your use case? It really depends on what you mean by
> streaming data.

We are using Druid and ingest data into it from Kafka/RabbitMQ. It handles
segment creation (the Parquet equivalent) and mixes together new/fresh data,
which is not yet stored that way, with historical data that is stored in
segments at regular intervals. I do realize that Drill is not the
workflow/ingestion tool, but I wonder if there are any guidelines for mixing
json/other files with Parquet, and especially for handling the transition
period from file->parquet, to avoid duplicate results or missing portions.
This may all become clear as I examine other tools that are suited for the
ingestion, but it seems like Drill should have something, since it has
directory-based queries and seems to cater to these kinds of things (I
sketch what I mean below).

> > - *Views*
> > - Are parquet based views materialized and automatically updated?
>
> Views are logical only and are executed each time a query above them is
> run. The good news is that the view and the query utilizing it are
> optimized as a single relational plan so that we do only the work that is
> necessary for the actual output of the final query.

CTAS could also be used, I guess, to create daily/monthly aggregations for
historical (non-changing) data. Can it be used to append to an existing
table, or does it require creating the whole table every time? I'm guessing
that I'm asking the wrong question and that with a directory-based approach
I would just add the new roll-up/aggregation table/file to its proper place.
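Going back to the streaming question for a moment, here is a minimal sketch
of the kind of mixed query I have in mind (the workspace, directory, and
column names are all invented for illustration):

    -- Assumed layout:
    --   /data/events/incoming    <- fresh JSON as it lands from Kafka/RabbitMQ
    --   /data/events/historical  <- compacted Parquet, one directory per month
    SELECT event_ts, user_id
    FROM dfs.`/data/events/historical`
    WHERE dir0 >= '2015-06'   -- dir0 = first directory level below the root,
                              -- so pruning should be able to skip old months
    UNION ALL
    SELECT event_ts, user_id
    FROM dfs.`/data/events/incoming`;

The part that worries me is the window where a batch of JSON has been
rewritten to Parquet but not yet removed from incoming/ (duplicates), or
removed before the Parquet file lands (missing rows).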
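And for the roll-ups, this is roughly what I was picturing with CTAS (again,
the names are made up, and I'm going from the 1.1 docs for the PARTITION BY
syntax, so correct me if I have it wrong):

    -- One CTAS per closed month, each writing a fresh table/directory;
    -- the partition column has to appear in the select list:
    CREATE TABLE dfs.tmp.`rollup_2015_06`
    PARTITION BY (event_date) AS
    SELECT event_date, user_id, COUNT(*) AS events
    FROM dfs.`/data/events/historical/2015-06`
    GROUP BY event_date, user_id;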
(If manual) Will the PARTITION BY clause prevent the table creation from
deleting other files in the table if there is no overlap in the PARTITION BY
columns?

> > - *Histograms / HyperLogLog*
> > - Some analytics stores, like Druid, support histograms and HyperLogLog
> > for fast counting and cardinality estimations
> > - Why is this missing in Drill, is it planned?
>
> Just haven't gotten to it yet. We will.

I saw something on this in the Parquet community and think this must be an
"in tandem" kind of thing.

> Not right now. Parquet does support fixed width binary fields so you could
> store a 16 byte field that held the UUID. That would be extremely
> efficient. Drill doesn't yet support generating a fixed width field for
> Parquet but it is something that will be added in the future. Drill should
> read the field no problem (as opaque VARBINARY)

Can you please detail the difference and the potential gain once fixed-width
is supported?

> > - *Nested flatten* - There are currently some limitations to working
> > with multiple nested structures - issue:
> > https://issues.apache.org/jira/browse/DRILL-2783
>
> This is an enhancement that no one has gotten to yet. Make sure to vote
> for it (and get your friends to vote for it) and we'll probably get to it
> sooner.
>
> Jacques

Yeah, this must be a hot topic (I'm rooting for this one!). To be sure we're
talking about the same limitation, the shape I run into looks roughly like
the sketch below.
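A made-up record with two sibling arrays, as I understand the DRILL-2783
case:

    -- sample.json: {"id": 1, "tags": ["a", "b"], "clicks": [{"x": 1}, {"x": 2}]}
    SELECT t.id,
           FLATTEN(t.tags)   AS tag,
           FLATTEN(t.clicks) AS click
    FROM dfs.`/data/sample.json` t;

Flattening either array on its own is fine; it's combining FLATTENs over
multiple nested structures in one query where I understand the current
limitations to bite.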
Thank you again for the prompt and clear answers. I'm quite impressed with
both Drill and Parquet and look forward to digging deeper :).

Regards,
-Stefan