Re: A incomplete map to a maze - or - the jurney down the rabbit hole

Jinfeng Ni Tue, 04 Aug 2015 09:27:38 -0700

The second query, which has an expression involving UDF in the where
clause, is a bit more complex than what we tested.


I agree that in general partition pruning should work for expression
involving functions.

I posted your second query to DRILL-3560, which you opened days ago for the
partition pruning issue.

Thanks for providing this helpful information!

Regards,

Jinfeng


On Tue, Aug 4, 2015 at 4:32 AM, Stefán Baxter <[email protected]>
wrote:

> Hi,
>
> You are right, partition pruning works using simple queries (I will adjust
> our queries accordingly):
>
>    - select count(*) from dfs.tmp.`/analytics/processed/test/events` as t
>    where dir0 = '2014';
>
> I tried to run the following query:
>
>    - select count(*) from dfs.tmp.`/analytics/processed/test/events` as t
>    where dir0 = dirInRange(cast('2015-04-10' as
> timestamp),cast('2015-07-11'
>    as timestamp),COALESCE(dir0,'-')) and dir1 =
> dirInRange(cast('2015-04-10'
>    as timestamp),cast('2015-07-11' as
>    timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'));
>
> This later query uses a UDF (dirInRange) that operates solely on
> static+directory information (see my earlier post on the same subject) but
> that does not work.
>
> Using a non-static-value for directory evaluation/planning does not work
> and can be quite limiting since periods can, for example, overlap years and
> months and then filters on dir0/dir1/dir2 combinations can become quite
> verbose (but works :).
>
> I understand the limitation and apologize for making a false claim but it
> would be great if partition pruning would work with functions like these.
>
> Best regards,
>  -Stefan
>
>
> On Tue, Aug 4, 2015 at 4:56 AM, rahul challapalli <
> [email protected]> wrote:
>
> > As Jinfeng mentioned, directory based partition pruning should work. You
> > might also be interested in DRILL-3333
> > <https://issues.apache.org/jira/browse/DRILL-3333> which allows you to
> > auto
> > partition data when using CTAS.
> >
> > - Rahul
> >
> > On Mon, Aug 3, 2015 at 5:03 PM, Jinfeng Ni <[email protected]>
> wrote:
> >
> > > For the issues about partition pruning over directories, could you
> please
> > > provide more detail information? Drill should do partition pruning
> based
> > on
> > > directory. If it does not work the way you want,  probably there is a
> bug
> > > in the code.  We would appreciate if you can provide more detail, so
> that
> > > we could re-produce the problem and get it fix asap.
> > >
> > > Regards,
> > >
> > > Jinfeng
> > >
> > >
> > > On Mon, Aug 3, 2015 at 4:44 PM, Stefán Baxter <
> [email protected]
> > >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I have been meaning to write a blog post regarding our Drill
> > experiments
> > > > but I though I might share some thoughts here first.
> > > > Hopefully some newbie can benefit from this and perhaps it sheds some
> > > light
> > > > on what drives newcomers in this community (or at least some part of
> > > them).
> > > >
> > > > A bit of a back story. I work for a small startup that has been
> > > developing
> > > > it's application for *two long* but we hope to lunch our product in
> > > > Q1/2016.
> > > > We have been using Druid as a real-time analytic store and, even
> though
> > > > Druid does extremely well what it does, we are now forced to replace
> it
> > > to
> > > > allow for functionality that we can not be without.
> > > >
> > > > We started to evaluate several solutions and we are now left with a
> > > handful
> > > > that we like. We have not had the time to evaluate all of them to
> > > extremes
> > > > but we feel comfortable with the general approach we have taken.
> > > >
> > > > There were many things for us to consider and I want to run through
> > them
> > > > here to seek advice from those here that know better. (I apologize
> for
> > > the
> > > > long read)
> > > >
> > > > *Initial questions we had:*
> > > >
> > > >    - Do we want to operate our own (distributed) file system or rely
> on
> > > >    something like S3?
> > > >    - Are there other options?
> > > >
> > > >    - Do we favor "bringing the computation to the data" or "bring the
> > > data
> > > >    to the computation"?
> > > >    - For us Drill+S3 is a "bring data to the computation" solution
> > while
> > > >    Hadoop+Hive+MapReduce is a "bringing the computation to the data"
> > > > solution
> > > >
> > > >    - How can we effectively merge streaming data with historical data
> > and
> > > >    process both at the same time?
> > > >    - Something that Druid does very well
> > > >
> > > >    - What is the best way to merge data from different sources in a
> > > >    polyglot data / format / sources environment?
> > > >
> > > >    - How can we minimize the effect of large number of small tenants?
> > > >    - Warming up tenant specific indexes in Elastic Search can, for
> > > example,
> > > >    be costly (What a great solution though)
> > > >
> > > >    - How can we evict long-tail data from fast storage and gradually
> > move
> > > >    it into deep storage over time ?
> > > >    - ... while avoiding a lengthy warm-up period
> > > >    - ... while using deep storage also as backup even for fresh data
> > > >
> > > >    - Can we cache aggregates or would we just be building a cache
> that
> > > >    next-to-nothing would ever hit?
> > > >    - aggregation for days, weeks, months with all the mostly used
> spins
> > > >    - historical data does not change here so one level of complexity
> is
> > > >    absent
> > > >
> > > >    - Can we achieve sub-second results dealing with our tenant
> specific
> > > >    data sets
> > > >    - We will never be FB/Twitter but we still got some sizable chunks
> > of
> > > >    data :)
> > > >
> > > >    - Is there a stack available that already works like this?
> > > >    - MapR / BDAS etc. etc.
> > > >
> > > >    - Can we build this using "pure" Apache components?
> > > >    - We would like to have the option of going without
> > > support/license-fees
> > > >    but we also like the option of professional services with SLAs and
> > > paid
> > > >    subscriptions
> > > >
> > > >    - Is this all a premature optimization?
> > > >    - aka. "let's just use Elastic Search for now"
> > > >
> > > >
> > > > *Storage format*
> > > >
> > > > We decided early on that the we liked Parquet <
> > > https://parquet.apache.org/
> > > > >
> > > > enough to commit to it for our processed/historical data.
> > > > We know that ORC <https://orc.apache.org/> is also great but the we
> > > > believe
> > > > that Parquet will, with its  support for nested structures, broad
> > appeal
> > > > and interoperability become a de-facto standard before long.
> > > > (We really like ORC as well :) )
> > > >
> > > > Parquet has so many things going for it so I jump to what we see as
> > > current
> > > > cons:
> > > >
> > > >    - Parquet is does not compress as well as ORC
> > > >    - Seems to makes up for it in retrieval speed (
> > > >
> > > >
> > >
> >
> https://developer.ibm.com/hadoop/blog/2014/09/19/big-sql-3-0-file-formats-usage-performance/
> > > >    )
> > > >
> > > >    - Parquet is still missing some things that can make it even
> better
> > > for
> > > >    analytics
> > > >    - Like: Bloom filters <
> > > https://issues.apache.org/jira/browse/PARQUET-41
> > > > >,
> > > >    HyperLogLog <https://issues.apache.org/jira/browse/PARQUET-42>
> > > >     and indexing/sorting
> > > >
> > > >    - Hive has some problems dealing with data in nested structures in
> > > >    Parquet files (but that may have changed)
> > > >
> > > >    - It's ineffective to gradually add data to parquet files (or so
> we
> > > >    believe) so streaming data should be batched into Parquet files
> > > >
> > > > Our streaming data is mixed schema (partially schemed and partially
> > > schema
> > > > less) so using Avro <https://avro.apache.org/> or Protobuff
> > > > <https://developers.google.com/protocol-buffers/> is doable but a
> > tricky
> > > > while JSON is very doable but freakishly verbose.
> > > > Drill will, for example, allow us to join JSON data with Parquet data
> > and
> > > > in our tests we have been merging log information, in JSON format,
> with
> > > > historical/older data stored in Parqet.
> > > > Querying log files like this is sub-optimal, even though they are
> > stored
> > > on
> > > > PCIE-flash, so we still look for viable alternatives.
> > > >
> > > > Worst case scenario for us is to continue to work with this log
> > approach
> > > > but then using Avro.
> > > >
> > > > Relevant Drill points:
> > > >
> > > >    - +  Drill knows all the file formats we like to use (and then
> some)
> > > >    - -   Drill still misses RDBMs access (via JDBC) but that will
> show
> > up
> > > >    soon enough (for us)
> > > >
> > > >
> > > > *Query engine*
> > > >
> > > > We like SQL.
> > > >
> > > > It's the key to so many systems, tools and resources that selecting a
> > > > proprietary query language is not an option (any more).
> > > > The query engines we know (of) in this space are:
> > > >
> > > >    - Presto <https://prestodb.io/>
> > > >    - Apache Drill <https://drill.apache.org/>
> > > >    - Apache Spark <http://spark.apache.org/>
> > > >
> > > > They all have in common the ability to run distributed queries and
> > merge
> > > > data from different sources using different format, which is great!
> > > > Spark does, in addition to this, offer a range of capabilities
> > > > (ML/Graph/etc.) that are appealing to us.
> > > >
> > > > We have not decided yet what to do but Drill is very appealing to us
> > and
> > > we
> > > > have it running in the "production" system used for our dev.
> partners.
> > > >
> > > > Relevant Drill points:
> > > >
> > > >    - + Drill does SQL well and is getting better at dealing with
> > > >    evolving/dirty schema (aka. JSON curve balls)
> > > >    - + Drill is getting better and better in window functions and
> > various
> > > >    aggregations
> > > >    - + Drill can be accessed via JDBC/ODBC and it has a Spark
> > connection
> > > as
> > > >    well
> > > >    - + Drill does User Defined Functions (Could be a bit more
> > accessible
> > > >    but hey, they are there!)
> > > >
> > > > *Storage:*
> > > >
> > > > It's very appealing to use Hive <https://hive.apache.org/> and
> > > Hadoop/HDFS
> > > > <https://hadoop.apache.org/> to store our data and Hive can point to
> > > data
> > > > in S3 as well as in other storage locations.
> > > > Many solutions know how to use the Hive catalog and some optimization
> > > seems
> > > > to be available the utilizes it.
> > > >
> > > > Still we are not convinced that we want to build a hadoop
> > > > <https://hadoop.apache.org/> cluster (yeah, it's mostly prejudice)
> > > > We don't want to use S3 directly or be forced to live in the EC2
> > > ecosystem
> > > > for it to make any operation sense. (yeah, mostly we are cheep)
> > > >
> > > > Ideally we would like something like this:
> > > >
> > > >    - Use S3 (or any other cost effective alternative) as deep storage
> > > >
> > > >    - Cache new and in-demand-data locally to minimize the S3 traffic
> as
> > > >    much as possible (Large tired cache: t1:Flash, t2:SSD and t3:HDD)
> > > >
> > > >    - We would like to use a file system that plays nice with the
> > Parquet
> > > >    file format and dos the scans/reads required to effectively work
> > with
> > > > the
> > > >    columnar nature of it.
> > > >    we are still not sure if HDFS is the ideal match for parquet from
> > this
> > > >    perspective but we think that both Gluster <
> http://www.gluster.org/
> > >
> > > > and
> > > >    Ceph <http://ceph.com/> could be (even though Ceph is a
> Blob/Object
> > > >    store like S3 underneath)
> > > >
> > > > Relevant Drill Points (for file system based operations):
> > > >
> > > >    - - All Drillbits (nodes) require access to the same data for
> > > >    distribution/planning to work ("bring the data to the
> computation")
> > > >    planner that knows about "single-node-data" and can plan for it
> > would
> > > be
> > > >    ideal for alternatives like the one we have in mind
> > > >
> > > >    - - Partition pruning does not work with directories.
> > > >    Partitioning data based on time span into a sub-directory
> structure
> > > has
> > > >    no benefits (Drill goes through them all, always)
> > > >
> > > > We were quite optimistic when we came across Tachyon
> > > > <http://tachyon-project.org/> and got it to work with Drill (see
> > earlier
> > > > email).
> > > > It seemed like the ideal bridge between Drill and S3 that would allow
> > us
> > > to
> > > > do multi-tiered caching and smart sharing of content between
> Drillbits
> > > > (nodes) without fetching it multiple times from S3.
> > > > But Tachyon currently has a single drawback that we can not accept.
> It
> > > > stored all files in S3 in a proprietary way that does not reflect the
> > > "file
> > > > system" that it fronts.
> > > > Also; Accessing Tachyon seems to be a API only exercise and NFS
> access,
> > > for
> > > > example, does not exist.  (In that way Tachyon is wicket fast and
> slow
> > > (as
> > > > in dim) at the same time)
> > > >
> > > > *To oversimplify - the choice is now between the beaten path
> > > > (MapR/BDAS/etc.) or ... *
> > > >
> > > > *A)* the 12" mix - feat. file system only:
> > > >
> > > >    - Query engine: *Drill*
> > > >    - Streaming/log data format: *Avro* / *JSON*
> > > >    - Historical Data Format: *Parquet*
> > > >    - Extra dimensions / enrichment information: *RDBM* / *Cassandra*
> /
> > > >    *Solr* / *More*
> > > >    - File System: *S3 + Tachyon* / *Gluster* / *Ceph*
> > > >
> > > > *B)* the 6" mix - feat. Johnny Cash:
> > > >
> > > >    - Query engine: *Drill / Presto / Spark / Spark + Drill*
> > > >    - Streaming/log data format: *Avro* / *JSON*
> > > >    - Historical Data Format: *Parquet*
> > > >    - Extra dimensions / enrichment information: *RDBM* / *Cassandra*
> /
> > > >    *Solr* / *More*
> > > >    - File System:
> > > > *Hadoop + Hive *
> > > >
> > > >
> > > > *General Drill observations that are relevant to us (partly
> > redundant):*
> > > >
> > > >    - REST API does only return JSON
> > > >    - Uncompressed and incorrectly typed (count(*) returns a number in
> > > >    quotes)
> > > >    - the result set is always known and returning Avro/Protobuf could
> > be
> > > >    beneficial
> > > >
> > > >    - Schema discovery is still a bit immature / fragile /
> temperamental
> > > >    - This is understandable and will change
> > > >
> > > >    - Partition pruning does not work for directories
> > > >    - See point above (Storage)
> > > >
> > > >    - Schema information is not available for data in the file system
> > > >    - This will likely change (at least the schema is created on query
> > > time)
> > > >
> > > >    - Metadata queries are not available (data discovery queries)
> > > >    - Discovering fields, cardinality and composition for data that
> > meets
> > > >    certain criteria
> > > >    - This is essential in building good UI for tenant specific data
> > (data
> > > >    that varies between tenants)
> > > >
> > > > I will, in the near future, try to involve this into a more
> informative
> > > > blog post and share it for those starting down this path.
> > > >
> > > > There are is so much more going on in/around this space (Tez
> > > > <https://tez.apache.org/>, Flink <https://flink.apache.org/> ...)
> > > >
> > > > All input is welcomed.
> > > >
> > > > Best regards,
> > > >  -Stefan Baxter
> > > >
> > > > PS. I'm a bit dyslexic and sometimes the little bastards sneak
> though.
> > > Just
> > > > laugh and shake your head when you come across them :)
> > > >
> > >
> >
>

Re: A incomplete map to a maze - or - the jurney down the rabbit hole

Reply via email to