As Jinfeng mentioned, directory-based partition pruning should work. You
might also be interested in DRILL-3333
<https://issues.apache.org/jira/browse/DRILL-3333>, which allows you to
auto-partition data when using CTAS.
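For example, with Drill 1.1.0 or later, something like the following should
work (a rough sketch only; the workspace, paths, and column names here are
made up):

   -- Auto-partitioning via CTAS (DRILL-3333): writes a separate set of
   -- Parquet files per distinct value of sale_year
   CREATE TABLE dfs.tmp.`sales_partitioned`
   PARTITION BY (sale_year)
   AS SELECT sale_year, product_id, amount
   FROM dfs.`/data/raw/sales`;

   -- Directory-based pruning: with files laid out as /data/logs/2015/08/...,
   -- the implicit dir0/dir1 columns let the planner skip directories that
   -- cannot match
   SELECT COUNT(*)
   FROM dfs.`/data/logs`
   WHERE dir0 = '2015' AND dir1 = '08';

If pruning isn't kicking in for a query shaped like the second one, running
EXPLAIN PLAN FOR <query> and sharing the output would help narrow it down.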
- Rahul

On Mon, Aug 3, 2015 at 5:03 PM, Jinfeng Ni <[email protected]> wrote:

> For the issues with partition pruning over directories, could you please
> provide more detailed information? Drill should do partition pruning
> based on directory. If it does not work the way you want, there is
> probably a bug in the code. We would appreciate it if you could provide
> more detail, so that we can reproduce the problem and get it fixed asap.
>
> Regards,
>
> Jinfeng
>
>
> On Mon, Aug 3, 2015 at 4:44 PM, Stefán Baxter <[email protected]>
> wrote:
>
> > Hi,
> >
> > I have been meaning to write a blog post about our Drill experiments,
> > but I thought I might share some thoughts here first.
> > Hopefully some newbie can benefit from this, and perhaps it sheds some
> > light on what drives newcomers in this community (or at least some
> > part of them).
> >
> > A bit of a back story: I work for a small startup that has been
> > developing its application for *too long*, but we hope to launch our
> > product in Q1/2016.
> > We have been using Druid as a real-time analytics store and, even
> > though Druid does what it does extremely well, we are now forced to
> > replace it to allow for functionality that we cannot be without.
> >
> > We started to evaluate several solutions and are now left with a
> > handful that we like. We have not had the time to evaluate all of them
> > in depth, but we feel comfortable with the general approach we have
> > taken.
> >
> > There were many things for us to consider, and I want to run through
> > them here to seek advice from those who know better. (I apologize for
> > the long read.)
> >
> > *Initial questions we had:*
> >
> >    - Do we want to operate our own (distributed) file system or rely
> >    on something like S3?
> >       - Are there other options?
> >
> >    - Do we favor "bringing the computation to the data" or "bringing
> >    the data to the computation"?
> >       - For us, Drill+S3 is a "bring the data to the computation"
> >       solution, while Hadoop+Hive+MapReduce is a "bring the
> >       computation to the data" solution
> >
> >    - How can we effectively merge streaming data with historical data
> >    and process both at the same time?
> >       - Something that Druid does very well
> >
> >    - What is the best way to merge data from different sources in a
> >    polyglot data/format/source environment?
> >
> >    - How can we minimize the effect of a large number of small
> >    tenants?
> >       - Warming up tenant-specific indexes in Elasticsearch can, for
> >       example, be costly (what a great solution otherwise, though)
> >
> >    - How can we evict long-tail data from fast storage and gradually
> >    move it into deep storage over time?
> >       - ... while avoiding a lengthy warm-up period
> >       - ... while using deep storage also as backup, even for fresh
> >       data
> >
> >    - Can we cache aggregates, or would we just be building a cache
> >    that next to nothing would ever hit?
> >       - aggregations for days, weeks, and months, with all the most
> >       commonly used spins
> >       - historical data does not change here, so one level of
> >       complexity is absent
> >
> >    - Can we achieve sub-second results dealing with our
> >    tenant-specific data sets?
> >       - We will never be FB/Twitter, but we still have some sizable
> >       chunks of data :)
> >
> >    - Is there a stack available that already works like this?
> >       - MapR / BDAS, etc.
> >
> >    - Can we build this using "pure" Apache components?
> >       - We would like to have the option of going without
> >       support/license fees, but we also like the option of
> >       professional services with SLAs and paid subscriptions
> >
> >    - Is this all a premature optimization?
> >       - aka "let's just use Elasticsearch for now"
> >
> >
> > *Storage format*
> >
> > We decided early on that we liked Parquet
> > <https://parquet.apache.org/> enough to commit to it for our
> > processed/historical data.
> > We know that ORC <https://orc.apache.org/> is also great, but we
> > believe that Parquet will, with its support for nested structures,
> > broad appeal, and interoperability, become a de facto standard before
> > long.
> > (We really like ORC as well :) )
> >
> > Parquet has so many things going for it, so I'll jump to what we see
> > as its current cons:
> >
> >    - Parquet does not compress as well as ORC
> >       - It seems to make up for it in retrieval speed (
> >       https://developer.ibm.com/hadoop/blog/2014/09/19/big-sql-3-0-file-formats-usage-performance/
> >       )
> >
> >    - Parquet is still missing some things that could make it even
> >    better for analytics
> >       - Like: Bloom filters
> >       <https://issues.apache.org/jira/browse/PARQUET-41>, HyperLogLog
> >       <https://issues.apache.org/jira/browse/PARQUET-42>, and
> >       indexing/sorting
> >
> >    - Hive has some problems dealing with data in nested structures in
> >    Parquet files (but that may have changed)
> >
> >    - It's ineffective to gradually add data to Parquet files (or so we
> >    believe), so streaming data should be batched into Parquet files
> >
> > Our streaming data has a mixed schema (partially schemed and partially
> > schema-less), so using Avro <https://avro.apache.org/> or Protobuf
> > <https://developers.google.com/protocol-buffers/> is doable but
> > tricky, while JSON is very doable but freakishly verbose.
> > Drill will, for example, allow us to join JSON data with Parquet data,
> > and in our tests we have been merging log information, in JSON format,
> > with historical/older data stored in Parquet.
> > Querying log files like this is sub-optimal, even though they are
> > stored on PCIe flash, so we still look for viable alternatives.
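> >
> > To make that concrete, the kind of join we run looks roughly like this
> > (a sketch only; the paths and field names are made up):
> >
> >    SELECT h.account_id, h.total_views, l.event_time, l.event_type
> >    FROM dfs.`/data/historical/pageviews` h
> >    JOIN dfs.`/data/logs/events.json` l
> >      ON h.account_id = l.account_id
> >    WHERE l.event_type = 'pageview';
> >
> > Drill resolves the JSON schema at read time, so a query like this
> > keeps working as new fields show up in the log files.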
> >
> > The worst-case scenario for us is to continue to work with this log
> > approach but using Avro instead.
> >
> > Relevant Drill points:
> >
> >    - + Drill knows all the file formats we would like to use (and then
> >    some)
> >    - - Drill still misses RDBMS access (via JDBC), but that will show
> >    up soon enough (for us)
> >
> >
> > *Query engine*
> >
> > We like SQL.
> >
> > It's the key to so many systems, tools, and resources that selecting a
> > proprietary query language is not an option (any more).
> > The query engines we know (of) in this space are:
> >
> >    - Presto <https://prestodb.io/>
> >    - Apache Drill <https://drill.apache.org/>
> >    - Apache Spark <http://spark.apache.org/>
> >
> > They all share the ability to run distributed queries and merge data
> > from different sources in different formats, which is great!
> > Spark does, in addition to this, offer a range of capabilities
> > (ML/graph/etc.) that are appealing to us.
> >
> > We have not decided yet what to do, but Drill is very appealing to us,
> > and we have it running in the "production" system used for our dev
> > partners.
> >
> > Relevant Drill points:
> >
> >    - + Drill does SQL well and is getting better at dealing with
> >    evolving/dirty schemas (aka JSON curve balls)
> >    - + Drill is getting better and better at window functions and
> >    various aggregations (see the sketch below)
> >    - + Drill can be accessed via JDBC/ODBC, and it has a Spark
> >    connection as well
> >    - + Drill does user-defined functions (they could be a bit more
> >    accessible, but hey, they are there!)
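> >
> > A window function example of the sort we have been testing (again a
> > sketch; the table and columns are made up):
> >
> >    SELECT account_id, event_date, daily_views,
> >           SUM(daily_views) OVER (PARTITION BY account_id
> >                                  ORDER BY event_date) AS running_views
> >    FROM dfs.`/data/historical/daily_views`;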
> >
> > *Storage:*
> >
> > It's very appealing to use Hive <https://hive.apache.org/> and
> > Hadoop/HDFS <https://hadoop.apache.org/> to store our data, and Hive
> > can point to data in S3 as well as in other storage locations.
> > Many solutions know how to use the Hive catalog, and some optimization
> > that utilizes it seems to be available.
> >
> > Still, we are not convinced that we want to build a Hadoop
> > <https://hadoop.apache.org/> cluster (yeah, it's mostly prejudice).
> > We don't want to use S3 directly or be forced to live in the EC2
> > ecosystem for it to make any operational sense. (yeah, mostly we are
> > cheap)
> >
> > Ideally we would like something like this:
> >
> >    - Use S3 (or any other cost-effective alternative) as deep storage
> >
> >    - Cache new and in-demand data locally to minimize S3 traffic as
> >    much as possible (a large tiered cache: t1: flash, t2: SSD, and
> >    t3: HDD)
> >
> >    - We would like to use a file system that plays nice with the
> >    Parquet file format and does the scans/reads required to work
> >    effectively with its columnar nature.
> >    We are still not sure if HDFS is the ideal match for Parquet from
> >    this perspective, but we think that both Gluster
> >    <http://www.gluster.org/> and Ceph <http://ceph.com/> could be
> >    (even though Ceph is a blob/object store like S3 underneath)
> >
> > Relevant Drill points (for file-system-based operations):
> >
> >    - - All Drillbits (nodes) require access to the same data for
> >    distribution/planning to work ("bring the data to the
> >    computation"); a planner that knows about "single-node data" and
> >    can plan for it would be ideal for alternatives like the one we
> >    have in mind
> >
> >    - - Partition pruning does not work with directories.
> >    Partitioning data by time span into a sub-directory structure has
> >    no benefits (Drill goes through them all, always)
> >
> > We were quite optimistic when we came across Tachyon
> > <http://tachyon-project.org/> and got it to work with Drill (see
> > earlier email).
> > It seemed like the ideal bridge between Drill and S3 that would allow
> > us to do multi-tiered caching and smart sharing of content between
> > Drillbits (nodes) without fetching it multiple times from S3.
> > But Tachyon currently has a single drawback that we cannot accept: it
> > stores all files in S3 in a proprietary way that does not reflect the
> > "file system" that it fronts.
> > Also, accessing Tachyon seems to be an API-only exercise; NFS access,
> > for example, does not exist. (In that way Tachyon is wicked fast and
> > slow (as in dim) at the same time.)
> >
> > *To oversimplify - the choice is now between the beaten path
> > (MapR/BDAS/etc.) or ...*
> >
> > *A)* the 12" mix - feat. file system only:
> >
> >    - Query engine: *Drill*
> >    - Streaming/log data format: *Avro* / *JSON*
> >    - Historical data format: *Parquet*
> >    - Extra dimensions / enrichment information: *RDBMS* / *Cassandra*
> >    / *Solr* / *more*
> >    - File system: *S3 + Tachyon* / *Gluster* / *Ceph*
> >
> > *B)* the 6" mix - feat. Johnny Cash:
> >
> >    - Query engine: *Drill / Presto / Spark / Spark + Drill*
> >    - Streaming/log data format: *Avro* / *JSON*
> >    - Historical data format: *Parquet*
> >    - Extra dimensions / enrichment information: *RDBMS* / *Cassandra*
> >    / *Solr* / *more*
> >    - File system: *Hadoop + Hive*
> >
> >
> > *General Drill observations that are relevant to us (partly
> > redundant):*
> >
> >    - The REST API only returns JSON
> >       - Uncompressed and incorrectly typed (count(*) returns a number
> >       in quotes)
> >       - The result set is always known, so returning Avro/Protobuf
> >       could be beneficial
> >
> >    - Schema discovery is still a bit immature / fragile /
> >    temperamental
> >       - This is understandable and will change
> >
> >    - Partition pruning does not work for directories
> >       - See the point above (Storage)
> >
> >    - Schema information is not available for data in the file system
> >       - This will likely change (at least a schema is created at query
> >       time)
> >
> >    - Metadata queries are not available (data discovery queries)
> >       - Discovering fields, cardinality, and composition for data that
> >       meets certain criteria
> >       - This is essential for building a good UI for tenant-specific
> >       data (data that varies between tenants)
> >
> > I will, in the near future, try to evolve this into a more informative
> > blog post and share it for those starting down this path.
> >
> > There is so much more going on in/around this space (Tez
> > <https://tez.apache.org/>, Flink <https://flink.apache.org/> ...)
> >
> > All input is welcome.
> >
> > Best regards,
> > -Stefan Baxter
> >
> > PS. I'm a bit dyslexic and sometimes the little bastards sneak
> > through. Just laugh and shake your head when you come across them :)
> >
