Hi,

We are slowly gaining some Drill/Parquet familiarity as we research it as a
potential replacement for, or addition to, Druid (which we also like a lot).

We have, as stated earlier, come across many things that we like about
Drill/Parquet, and the "speed to value" is a killer aspect when dealing with
file-based data.

There are several things we need to understand better before we continue,
and all assistance/feedback on the following items is appreciated.

*Combining fresh (JSON/etc.) and historical (Parquet) data*

   - Is there a way to mix file types in directory queries (a rough sketch
   of what we mean follows below)?
   - Parquet for processed data and JSON for fresh (badge) data waiting to
   be turned into Parquet files


   - Is there a recommended way to deal with streaming/fresh data?
   - I know that there are other tools available in this domain, but I
   wonder what would be suitable for a "pure Drill" approach
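
To make the first bullet concrete, this is roughly what we have in mind: a
view that unions the processed Parquet directory with the directory of JSON
files that have not been converted yet. The workspace names, paths and
columns below are placeholders, and I am not sure this is the idiomatic way
to do it:

    CREATE OR REPLACE VIEW dfs.tmp.events_all AS
    SELECT event_time, badge_id, site
      FROM dfs.archive.`events/parquet`   -- historical data, already Parquet
    UNION ALL
    SELECT event_time, badge_id, site
      FROM dfs.landing.`events/json`;     -- fresh data, still JSON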

*Performance and setup:*

   - Under what circumstances does Drill distribute the workload across
   different Drillbits?

   - HDFS vs. S3
   - Benefits of each approach (we were going the HDFS route, but S3 seems
   to involve less operational "hassle")

   - What is the ideal segment size when using S3?
   - I have seen the HDFS config discussion and wonder what the S3
   equivalent is (see the sketch after this list)

   - Recommended setup or basic guidelines
   - Are there any basic "rules" when it comes to machine
   count/configuration vs. volume and load?

   - Any "gotchas" regarding performance that we should be aware of?


*Drill & Parquet:*

   - What version of Parquet are you using?

   - What big-ish changes are required in Parquet to make Drill perform
   better?
   - How much effect are Bloom filters expected to have on performance?
   - Are you using the page indexes?
   - Are histograms and HyperLogLog scheduled? (I cannot find them in
   their Jira)

   - When will Drill specific changes be merged upstream into Parquet?

   - Are there any new features (that matter) in Parquet that you have not
   started using?


*Drill Features (And yes, We will surely vote for these):*

   - Update table vs. create table
   - add new data to an existing Parquet structure (a CTAS variant that
   appends data to existing files with the same PARTITION BY structure)
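
For the initial load we can do a plain partitioned CTAS, roughly as below
(paths and the partition column are placeholders); what we are missing is a
supported way to append the next batch of data into the same partitioned
layout without rewriting the whole table:

    CREATE TABLE dfs.archive.`events/parquet`
    PARTITION BY (event_date) AS
    SELECT event_date, badge_id, site, event_time
    FROM dfs.landing.`events/json`;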

   - JDBC/ODBC data sources
   - for dimension information from legacy systems
   - We would be using Parquet + Cassandra (for now) unless you recommend
   something else
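
To illustrate the JDBC/ODBC item, the kind of query we would like to run
joins the Parquet facts against dimension rows that still live in a legacy
RDBMS. The "legacy" plugin name, schemas and columns below are made up:

    SELECT f.badge_id, d.department, COUNT(*) AS visits
    FROM dfs.archive.`events/parquet` f
    JOIN legacy.hr.employees d ON f.badge_id = d.badge_id
    GROUP BY f.badge_id, d.department;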

   - Survive an unexpected EOF (incomplete files)
   - disregard the last incomplete JSON/CSV entry to allow querying of open
   log files that are being appended to by another process
   - (Perhaps a better way exists, but I have been running this on live log
   files with good success :) )
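
To make that last request concrete: the behaviour we are after is roughly
what the JSON reader's skip-invalid-records option gives today (if I have
the option name right), but also covering a truncated final record in CSV as
well as JSON. The path below is a placeholder:

    ALTER SESSION SET `store.json.reader.skip_invalid_records` = true;

    SELECT * FROM dfs.logs.`app/current.json` LIMIT 100;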


I guess this is it for now :).

All the best,
 -Stefan
