Hi,

We are slowly gaining some Drill/Parquet familiarity as we research it as a potential replacement for, or addition to, Druid (which we also like a lot).
We have, as stated earlier, come across many things that we like about Drill/Parquet, and the "speed to value" is a killer aspect when dealing with file-based data. There are several things we need to understand better before we continue, and all assistance/feedback on the following items is appreciated. To make a few of the questions concrete, I have added some sketches below (all workspace names, paths, and columns in them are made up).

*Combining fresh (JSON/etc.) and historical (Parquet) data*
- Is there a way to mix file types with directory queries - Parquet for processed data and JSON for fresh (badge) data waiting to be turned into Parquet files? (See sketch 1 below.)
- Is there a recommended way to deal with streaming/fresh data? I know that there are other tools available in this domain, but I wonder what would be suitable for a "pure Drill" approach.

*Performance and setup:*
- Under what circumstances does Drill distribute the workload to different Drillbits?
- HDFS vs. S3 - what are the benefits of each approach? (We were going the HDFS route, but S3 seems to be less operational "hassle". See sketch 2 below.)
- What is the ideal segment size when using S3? I have seen the HDFS config discussion and wonder what the S3 equivalent is.
- Recommended setup or basic guidelines - are there any basic "rules" when it comes to machine count/configuration vs. volume and load?
- Any "gotchas" regarding performance that we should be aware of?

*Drill & Parquet:*
- What version of Parquet are you using?
- What big-ish changes are required in Parquet to make Drill perform better?
- How much effect are bloom filters expected to have on performance?
- Are you using the page indexes?
- Are histograms and HyperLogLog scheduled? (I do not find them in their Jira.)
- When will Drill-specific changes be merged upstream into Parquet?
- Are there any new features (that matter) in Parquet that you have not started using?

*Drill Features (and yes, we will surely vote for these):*
- Update table vs. Create table - add new data to an existing Parquet structure (a CTAS variant to add data to existing files with the same PARTITION BY structure). (See sketch 3 below.)
- JDBC/ODBC data sources - for dimension information from legacy systems. We would be using Parquet+Cassandra (for now) unless you recommend something else.
- Survive unexpected EOL (incomplete files) - disregard the last incomplete JSON/CSV entry to allow querying of open log files that are being appended to by another process. (Perhaps a better way exists, but I have been running this on live log files with good success :) See sketch 4 below.)
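Sketch 1 - what I imagine for mixing formats: a view that unions the historical Parquet directory with the fresh JSON directory, so our queries hit one logical table. This is only my guess at how it could look; the workspaces, paths, and columns are invented.

-- Sketch 1: one logical table over Parquet (history) and JSON (fresh).
-- Workspace names, paths, and columns are invented for illustration.
CREATE OR REPLACE VIEW dfs.tmp.`all_events` AS
SELECT event_ts, badge_id, event_type
FROM dfs.history.`/data/events_parquet`    -- processed data, Parquet files
UNION ALL
SELECT event_ts, badge_id, event_type
FROM dfs.fresh.`/data/events_json`;        -- fresh data, JSON files

Listing the columns explicitly (rather than SELECT *) seems safer here, since the JSON and Parquet schemas have to line up for the UNION ALL.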
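Sketch 2 - for S3, this is roughly the storage plugin configuration I have pieced together from the docs (bucket name, paths, and credentials are placeholders; I understand the fs.s3a.* keys can alternatively live in core-site.xml):

{
  "type": "file",
  "connection": "s3a://our-bucket",
  "config": {
    "fs.s3a.access.key": "<access key>",
    "fs.s3a.secret.key": "<secret key>"
  },
  "workspaces": {
    "root": { "location": "/", "writable": false, "defaultInputFormat": null },
    "history": { "location": "/data/events_parquet", "writable": true, "defaultInputFormat": "parquet" }
  },
  "formats": {
    "parquet": { "type": "parquet" },
    "json": { "type": "json", "extensions": ["json"] }
  }
}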
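Sketch 3 - the CTAS we can run today vs. the "append" variant we wish for. The first statement is, as far as I know, valid Drill (with invented names again); the second is purely hypothetical syntax to show the intent:

-- What we can do today: write a brand-new partitioned Parquet table.
-- Note the partition column has to appear in the select list.
CREATE TABLE dfs.history.`events_2015_11`
PARTITION BY (event_date)
AS SELECT CAST(event_ts AS DATE) AS event_date, event_ts, badge_id, event_type
   FROM dfs.fresh.`/data/events_json`;

-- What we wish for (hypothetical, not valid Drill today): add new rows
-- into the existing PARTITION BY layout instead of creating a new table.
-- INSERT INTO dfs.history.`events`
-- SELECT CAST(event_ts AS DATE) AS event_date, event_ts, badge_id, event_type
-- FROM dfs.fresh.`/data/events_json`;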
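Sketch 4 - for the unexpected-EOL case, the closest thing I have found so far is the JSON reader option below. If I read the docs right, it skips records that fail to parse, which would seem to cover a half-written last entry:

-- Session option that looks related (assuming I read the docs right):
ALTER SESSION SET `store.json.reader.skip_invalid_records` = true;
SELECT * FROM dfs.fresh.`/logs/live.json`;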
I guess this is it for now :)

All the best,
-Stefan