Re: Digging deeper

Stefán Baxter Wed, 15 Jul 2015 09:36:37 -0700

Hi again,

I overlooked the handy UNION operator when I wrote the part about
combining fresh and historical data in my previous email.
(Feel free to ignore that part.)


Regards,
 -Stefan

On Wed, Jul 15, 2015 at 3:56 PM, Stefán Baxter <[email protected]>
wrote:

> Hi,
>
> We are slowly gaining some Drill/Parquet familiarity as we research it as
> a potential replacement/addition for/to Druid (which we also like a lot).
>
> We have, as stated earlier, come across many things that we like regarding
> Drill/Parquet and the "speed to value" is a killer aspect when dealing with
> file based data.
>
> There are several things we need to understand better before we continue
> and all assistance/feedback is appreciated for the following items.
>
> *Combining fresh (JSON/etc.) and historical (Parquet) data*
>
>    - Is there a way to mix file types in directory queries?
>       - Parquet for processed data and JSON for fresh (batch) data waiting
>       to be turned into Parquet files
>
>
>    - Is there a recommended way to deal with streaming/fresh data?
>       - I know that there are other tools available in this domain, but I
>       wonder what would be suitable for a "pure Drill" approach
>
> *Performance and setup:*
>
>    - Under what circumstances does Drill distribute the workload to
>    different Drillbits?
>
>    - HDFS vs. S3
>       - Benefits of each approach (we were going the HDFS route, but S3
>       seems to be less operational "hassle")
>
>    - What is the ideal segment size when using S3?
>       - I have seen the HDFS config discussion and wonder what the S3
>       equivalent is
>
>    - Recommended setup or basic guidelines
>       - Are there any basic "rules" when it comes to machine
>       count/configuration vs. volume and load?
>
>    - Any "gotchas" regarding performance that we should be aware of?
>
>
> *Drill & Parquet:*
>
>    - What version of Parquet are you using?
>
>    - What big-ish changes are required in Parquet to make Drill perform
>    better?
>       - How much effect are bloom filters expected to have on performance?
>       - Are you using page indexing?
>       - Are histograms and HyperLogLog scheduled? (I cannot find them in
>       their Jira.)
>
>    - When will Drill specific changes be merged upstream into Parquet?
>
>    - Are there any new features (that matter) in Parquet that you have
>    not started using?
>
>
> *Drill features (and yes, we will surely vote for these):*
>
>    - Update table vs. create table
>       - Add new data to an existing Parquet structure (a CTAS variant that
>       adds data to existing files with the same PARTITION BY structure)
>
>    - JDBC/ODBC data sources
>       - For dimension information from legacy systems
>       - We would be using Parquet+Cassandra (for now) unless you
>       recommend something else
>
>    - Survive unexpected EOL (incomplete files)
>       - Disregard the last incomplete JSON/CSV entry to allow querying of
>       open log files that are being appended to by another process
>       - (Perhaps a better way exists, but I have been running this on
>       live log files with good success :) )
>
>
> I guess this is it for now :).
>
> All the best,
>  -Stefan
>
>
