Hi again,

I overlooked the handy UNION operator when I wrote the part about combining data in my previous email. (Feel free to ignore that part.)

Regards,
 -Stefan

On Wed, Jul 15, 2015 at 3:56 PM, Stefán Baxter <[email protected]> wrote:

> Hi,
>
> We are slowly gaining some Drill/Parquet familiarity as we research it as
> a potential replacement/addition for/to Druid (which we also like a lot).
>
> We have, as stated earlier, come across many things that we like regarding
> Drill/Parquet and the "speed to value" is a killer aspect when dealing with
> file based data.
>
> There are several things we need to understand better before we continue
> and all assistance/feedback is appreciated for the following items.
>
> *Combining fresh (JSON/etc.) and historical (Parquet) data*
>
>    - Is there a way to mix file types with directory queries?
>       - Parquet for processed data and JSON for fresh (batch) data waiting
>       to be turned into Parquet files
>
>    - Is there a recommended way to deal with streaming/fresh data?
>       - I know that there are other tools available in this domain but I
>       wonder what would be suitable for a "pure Drill" approach
>
> *Performance and setup:*
>
>    - Under what circumstances does Drill distribute the workload to
>    different Drill-bits?
>
>    - HDFS vs. S3
>       - Benefits of each approach (We were going the HDFS route but S3
>       seems to be less operational "hassle")
>
>    - What is the ideal segment size when using S3?
>       - I have seen the HDFS config discussion and wonder what the S3
>       equivalent is
>
>    - Recommended setup or basic guidelines
>       - Are there any basic "rules" when it comes to machine
>       count/configuration vs. volume and load?
>
>    - Any "gotchas" regarding performance that we should be aware of?
>
> *Drill & Parquet:*
>
>    - What version of Parquet are you using?
>
>    - What big-ish changes are required in Parquet to make Drill perform
>    better?
>       - How much effect are bloom filters expected to have on performance?
>       - Are you using page indexing?
>       - Are histograms and HyperLogLog scheduled? (I do not find them in
>       their Jira)
>
>    - When will Drill specific changes be merged upstream into Parquet?
>
>    - Are there any new features (that matter) in Parquet that you have
>    not started using?
>
> *Drill Features (And yes, we will surely vote for these):*
>
>    - Update table vs. Create table
>       - Add new data to an existing Parquet structure (CTAS variant to add
>       data to existing files with the same Partition by structure)
>
>    - JDBC/ODBC datasources
>       - For dimension information from legacy systems
>       - We would be using Parquet+Cassandra (for now) unless you
>       recommend something else
>
>    - Survive unexpected EOL (incomplete files)
>       - Disregard the last incomplete JSON/CSV entry to allow querying of
>       open log files that are being appended to by another process
>       - (Perhaps a better way exists but I have been running this on
>       live-log files with good success :) )
>
> I guess this is it for now :).
>
> All the best,
>  -Stefan
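
To make the "Survive unexpected EOL" item in the quoted message concrete, here is a minimal sketch of the behaviour it asks for, written in plain Python rather than as anything Drill does today. It assumes newline-delimited JSON (one object per line) and an assumed policy of silently dropping a half-written final line; the function name `read_complete_records` is hypothetical and only for illustration.

```python
import json
from typing import Iterator


def read_complete_records(text: str) -> Iterator[dict]:
    """Yield parsed records from newline-delimited JSON text.

    A final line that is not terminated by a newline and does not parse
    is treated as an entry still being written, and is skipped.
    """
    # Split off the last element: it is "" when the text ends with a newline,
    # otherwise it is the (possibly incomplete) entry being appended.
    *complete, tail = text.split("\n")

    for line in complete:
        if line.strip():
            # Lines followed by a newline are assumed complete; a parse
            # error here is a real data problem and is allowed to raise.
            yield json.loads(line)

    if tail.strip():
        try:
            yield json.loads(tail)
        except json.JSONDecodeError:
            # Assumed policy: ignore the truncated last entry.
            pass
```

Called on the contents of a log file that another process is still appending to, this returns every finished record and leaves out the partial one, so a later read picks it up once it has been completed.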
