The second query, which has an expression involving UDF in the where clause, is a bit more complex than what we tested.
I agree that in general partition pruning should work for expression involving functions. I posted your second query to DRILL-3560, which you opened days ago for the partition pruning issue. Thanks for providing this helpful information! Regards, Jinfeng On Tue, Aug 4, 2015 at 4:32 AM, Stefán Baxter <[email protected]> wrote: > Hi, > > You are right, partition pruning works using simple queries (I will adjust > our queries accordingly): > > - select count(*) from dfs.tmp.`/analytics/processed/test/events` as t > where dir0 = '2014'; > > I tried to run the following query: > > - select count(*) from dfs.tmp.`/analytics/processed/test/events` as t > where dir0 = dirInRange(cast('2015-04-10' as > timestamp),cast('2015-07-11' > as timestamp),COALESCE(dir0,'-')) and dir1 = > dirInRange(cast('2015-04-10' > as timestamp),cast('2015-07-11' as > timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-')); > > This later query uses a UDF (dirInRange) that operates solely on > static+directory information (see my earlier post on the same subject) but > that does not work. > > Using a non-static-value for directory evaluation/planning does not work > and can be quite limiting since periods can, for example, overlap years and > months and then filters on dir0/dir1/dir2 combinations can become quite > verbose (but works :). > > I understand the limitation and apologize for making a false claim but it > would be great if partition pruning would work with functions like these. > > Best regards, > -Stefan > > > On Tue, Aug 4, 2015 at 4:56 AM, rahul challapalli < > [email protected]> wrote: > > > As Jinfeng mentioned, directory based partition pruning should work. You > > might also be interested in DRILL-3333 > > <https://issues.apache.org/jira/browse/DRILL-3333> which allows you to > > auto > > partition data when using CTAS. > > > > - Rahul > > > > On Mon, Aug 3, 2015 at 5:03 PM, Jinfeng Ni <[email protected]> > wrote: > > > > > For the issues about partition pruning over directories, could you > please > > > provide more detail information? Drill should do partition pruning > based > > on > > > directory. If it does not work the way you want, probably there is a > bug > > > in the code. We would appreciate if you can provide more detail, so > that > > > we could re-produce the problem and get it fix asap. > > > > > > Regards, > > > > > > Jinfeng > > > > > > > > > On Mon, Aug 3, 2015 at 4:44 PM, Stefán Baxter < > [email protected] > > > > > > wrote: > > > > > > > Hi, > > > > > > > > I have been meaning to write a blog post regarding our Drill > > experiments > > > > but I though I might share some thoughts here first. > > > > Hopefully some newbie can benefit from this and perhaps it sheds some > > > light > > > > on what drives newcomers in this community (or at least some part of > > > them). > > > > > > > > A bit of a back story. I work for a small startup that has been > > > developing > > > > it's application for *two long* but we hope to lunch our product in > > > > Q1/2016. > > > > We have been using Druid as a real-time analytic store and, even > though > > > > Druid does extremely well what it does, we are now forced to replace > it > > > to > > > > allow for functionality that we can not be without. > > > > > > > > We started to evaluate several solutions and we are now left with a > > > handful > > > > that we like. We have not had the time to evaluate all of them to > > > extremes > > > > but we feel comfortable with the general approach we have taken. > > > > > > > > There were many things for us to consider and I want to run through > > them > > > > here to seek advice from those here that know better. (I apologize > for > > > the > > > > long read) > > > > > > > > *Initial questions we had:* > > > > > > > > - Do we want to operate our own (distributed) file system or rely > on > > > > something like S3? > > > > - Are there other options? > > > > > > > > - Do we favor "bringing the computation to the data" or "bring the > > > data > > > > to the computation"? > > > > - For us Drill+S3 is a "bring data to the computation" solution > > while > > > > Hadoop+Hive+MapReduce is a "bringing the computation to the data" > > > > solution > > > > > > > > - How can we effectively merge streaming data with historical data > > and > > > > process both at the same time? > > > > - Something that Druid does very well > > > > > > > > - What is the best way to merge data from different sources in a > > > > polyglot data / format / sources environment? > > > > > > > > - How can we minimize the effect of large number of small tenants? > > > > - Warming up tenant specific indexes in Elastic Search can, for > > > example, > > > > be costly (What a great solution though) > > > > > > > > - How can we evict long-tail data from fast storage and gradually > > move > > > > it into deep storage over time ? > > > > - ... while avoiding a lengthy warm-up period > > > > - ... while using deep storage also as backup even for fresh data > > > > > > > > - Can we cache aggregates or would we just be building a cache > that > > > > next-to-nothing would ever hit? > > > > - aggregation for days, weeks, months with all the mostly used > spins > > > > - historical data does not change here so one level of complexity > is > > > > absent > > > > > > > > - Can we achieve sub-second results dealing with our tenant > specific > > > > data sets > > > > - We will never be FB/Twitter but we still got some sizable chunks > > of > > > > data :) > > > > > > > > - Is there a stack available that already works like this? > > > > - MapR / BDAS etc. etc. > > > > > > > > - Can we build this using "pure" Apache components? > > > > - We would like to have the option of going without > > > support/license-fees > > > > but we also like the option of professional services with SLAs and > > > paid > > > > subscriptions > > > > > > > > - Is this all a premature optimization? > > > > - aka. "let's just use Elastic Search for now" > > > > > > > > > > > > *Storage format* > > > > > > > > We decided early on that the we liked Parquet < > > > https://parquet.apache.org/ > > > > > > > > > enough to commit to it for our processed/historical data. > > > > We know that ORC <https://orc.apache.org/> is also great but the we > > > > believe > > > > that Parquet will, with its support for nested structures, broad > > appeal > > > > and interoperability become a de-facto standard before long. > > > > (We really like ORC as well :) ) > > > > > > > > Parquet has so many things going for it so I jump to what we see as > > > current > > > > cons: > > > > > > > > - Parquet is does not compress as well as ORC > > > > - Seems to makes up for it in retrieval speed ( > > > > > > > > > > > > > > https://developer.ibm.com/hadoop/blog/2014/09/19/big-sql-3-0-file-formats-usage-performance/ > > > > ) > > > > > > > > - Parquet is still missing some things that can make it even > better > > > for > > > > analytics > > > > - Like: Bloom filters < > > > https://issues.apache.org/jira/browse/PARQUET-41 > > > > >, > > > > HyperLogLog <https://issues.apache.org/jira/browse/PARQUET-42> > > > > and indexing/sorting > > > > > > > > - Hive has some problems dealing with data in nested structures in > > > > Parquet files (but that may have changed) > > > > > > > > - It's ineffective to gradually add data to parquet files (or so > we > > > > believe) so streaming data should be batched into Parquet files > > > > > > > > Our streaming data is mixed schema (partially schemed and partially > > > schema > > > > less) so using Avro <https://avro.apache.org/> or Protobuff > > > > <https://developers.google.com/protocol-buffers/> is doable but a > > tricky > > > > while JSON is very doable but freakishly verbose. > > > > Drill will, for example, allow us to join JSON data with Parquet data > > and > > > > in our tests we have been merging log information, in JSON format, > with > > > > historical/older data stored in Parqet. > > > > Querying log files like this is sub-optimal, even though they are > > stored > > > on > > > > PCIE-flash, so we still look for viable alternatives. > > > > > > > > Worst case scenario for us is to continue to work with this log > > approach > > > > but then using Avro. > > > > > > > > Relevant Drill points: > > > > > > > > - + Drill knows all the file formats we like to use (and then > some) > > > > - - Drill still misses RDBMs access (via JDBC) but that will > show > > up > > > > soon enough (for us) > > > > > > > > > > > > *Query engine* > > > > > > > > We like SQL. > > > > > > > > It's the key to so many systems, tools and resources that selecting a > > > > proprietary query language is not an option (any more). > > > > The query engines we know (of) in this space are: > > > > > > > > - Presto <https://prestodb.io/> > > > > - Apache Drill <https://drill.apache.org/> > > > > - Apache Spark <http://spark.apache.org/> > > > > > > > > They all have in common the ability to run distributed queries and > > merge > > > > data from different sources using different format, which is great! > > > > Spark does, in addition to this, offer a range of capabilities > > > > (ML/Graph/etc.) that are appealing to us. > > > > > > > > We have not decided yet what to do but Drill is very appealing to us > > and > > > we > > > > have it running in the "production" system used for our dev. > partners. > > > > > > > > Relevant Drill points: > > > > > > > > - + Drill does SQL well and is getting better at dealing with > > > > evolving/dirty schema (aka. JSON curve balls) > > > > - + Drill is getting better and better in window functions and > > various > > > > aggregations > > > > - + Drill can be accessed via JDBC/ODBC and it has a Spark > > connection > > > as > > > > well > > > > - + Drill does User Defined Functions (Could be a bit more > > accessible > > > > but hey, they are there!) > > > > > > > > *Storage:* > > > > > > > > It's very appealing to use Hive <https://hive.apache.org/> and > > > Hadoop/HDFS > > > > <https://hadoop.apache.org/> to store our data and Hive can point to > > > data > > > > in S3 as well as in other storage locations. > > > > Many solutions know how to use the Hive catalog and some optimization > > > seems > > > > to be available the utilizes it. > > > > > > > > Still we are not convinced that we want to build a hadoop > > > > <https://hadoop.apache.org/> cluster (yeah, it's mostly prejudice) > > > > We don't want to use S3 directly or be forced to live in the EC2 > > > ecosystem > > > > for it to make any operation sense. (yeah, mostly we are cheep) > > > > > > > > Ideally we would like something like this: > > > > > > > > - Use S3 (or any other cost effective alternative) as deep storage > > > > > > > > - Cache new and in-demand-data locally to minimize the S3 traffic > as > > > > much as possible (Large tired cache: t1:Flash, t2:SSD and t3:HDD) > > > > > > > > - We would like to use a file system that plays nice with the > > Parquet > > > > file format and dos the scans/reads required to effectively work > > with > > > > the > > > > columnar nature of it. > > > > we are still not sure if HDFS is the ideal match for parquet from > > this > > > > perspective but we think that both Gluster < > http://www.gluster.org/ > > > > > > > and > > > > Ceph <http://ceph.com/> could be (even though Ceph is a > Blob/Object > > > > store like S3 underneath) > > > > > > > > Relevant Drill Points (for file system based operations): > > > > > > > > - - All Drillbits (nodes) require access to the same data for > > > > distribution/planning to work ("bring the data to the > computation") > > > > planner that knows about "single-node-data" and can plan for it > > would > > > be > > > > ideal for alternatives like the one we have in mind > > > > > > > > - - Partition pruning does not work with directories. > > > > Partitioning data based on time span into a sub-directory > structure > > > has > > > > no benefits (Drill goes through them all, always) > > > > > > > > We were quite optimistic when we came across Tachyon > > > > <http://tachyon-project.org/> and got it to work with Drill (see > > earlier > > > > email). > > > > It seemed like the ideal bridge between Drill and S3 that would allow > > us > > > to > > > > do multi-tiered caching and smart sharing of content between > Drillbits > > > > (nodes) without fetching it multiple times from S3. > > > > But Tachyon currently has a single drawback that we can not accept. > It > > > > stored all files in S3 in a proprietary way that does not reflect the > > > "file > > > > system" that it fronts. > > > > Also; Accessing Tachyon seems to be a API only exercise and NFS > access, > > > for > > > > example, does not exist. (In that way Tachyon is wicket fast and > slow > > > (as > > > > in dim) at the same time) > > > > > > > > *To oversimplify - the choice is now between the beaten path > > > > (MapR/BDAS/etc.) or ... * > > > > > > > > *A)* the 12" mix - feat. file system only: > > > > > > > > - Query engine: *Drill* > > > > - Streaming/log data format: *Avro* / *JSON* > > > > - Historical Data Format: *Parquet* > > > > - Extra dimensions / enrichment information: *RDBM* / *Cassandra* > / > > > > *Solr* / *More* > > > > - File System: *S3 + Tachyon* / *Gluster* / *Ceph* > > > > > > > > *B)* the 6" mix - feat. Johnny Cash: > > > > > > > > - Query engine: *Drill / Presto / Spark / Spark + Drill* > > > > - Streaming/log data format: *Avro* / *JSON* > > > > - Historical Data Format: *Parquet* > > > > - Extra dimensions / enrichment information: *RDBM* / *Cassandra* > / > > > > *Solr* / *More* > > > > - File System: > > > > *Hadoop + Hive * > > > > > > > > > > > > *General Drill observations that are relevant to us (partly > > redundant):* > > > > > > > > - REST API does only return JSON > > > > - Uncompressed and incorrectly typed (count(*) returns a number in > > > > quotes) > > > > - the result set is always known and returning Avro/Protobuf could > > be > > > > beneficial > > > > > > > > - Schema discovery is still a bit immature / fragile / > temperamental > > > > - This is understandable and will change > > > > > > > > - Partition pruning does not work for directories > > > > - See point above (Storage) > > > > > > > > - Schema information is not available for data in the file system > > > > - This will likely change (at least the schema is created on query > > > time) > > > > > > > > - Metadata queries are not available (data discovery queries) > > > > - Discovering fields, cardinality and composition for data that > > meets > > > > certain criteria > > > > - This is essential in building good UI for tenant specific data > > (data > > > > that varies between tenants) > > > > > > > > I will, in the near future, try to involve this into a more > informative > > > > blog post and share it for those starting down this path. > > > > > > > > There are is so much more going on in/around this space (Tez > > > > <https://tez.apache.org/>, Flink <https://flink.apache.org/> ...) > > > > > > > > All input is welcomed. > > > > > > > > Best regards, > > > > -Stefan Baxter > > > > > > > > PS. I'm a bit dyslexic and sometimes the little bastards sneak > though. > > > Just > > > > laugh and shake your head when you come across them :) > > > > > > > > > >
