As Jinfeng mentioned, directory-based partition pruning should work. You
might also be interested in DRILL-3333
<https://issues.apache.org/jira/browse/DRILL-3333>, which allows you to
auto-partition data when using CTAS.
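For example, with Drill 1.1.0 or later, something like the following should
work (a rough sketch only; the workspace, paths, and column names here are
made up):

   -- Auto-partitioning via CTAS (DRILL-3333): writes a separate set of
   -- Parquet files per distinct value of sale_year
   CREATE TABLE dfs.tmp.`sales_partitioned`
   PARTITION BY (sale_year)
   AS SELECT sale_year, product_id, amount
   FROM dfs.`/data/raw/sales`;

   -- Directory-based pruning: with files laid out as /data/logs/2015/08/...,
   -- the implicit dir0/dir1 columns let the planner skip directories that
   -- cannot match
   SELECT COUNT(*)
   FROM dfs.`/data/logs`
   WHERE dir0 = '2015' AND dir1 = '08';

If pruning isn't kicking in for a query shaped like the second one, running
EXPLAIN PLAN FOR <query> and sharing the output would help narrow it down.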
- Rahul

On Mon, Aug 3, 2015 at 5:03 PM, Jinfeng Ni <[email protected]> wrote:

> For the issues with partition pruning over directories, could you please
> provide more detailed information? Drill should do partition pruning
> based on directory. If it does not work the way you want, there is
> probably a bug in the code. We would appreciate it if you could provide
> more detail, so that we can reproduce the problem and get it fixed asap.
>
> Regards,
>
> Jinfeng
>
>
> On Mon, Aug 3, 2015 at 4:44 PM, Stefán Baxter <[email protected]>
> wrote:
>
> > Hi,
> >
> > I have been meaning to write a blog post about our Drill experiments,
> > but I thought I might share some thoughts here first.
> > Hopefully some newbie can benefit from this, and perhaps it sheds some
> > light on what drives newcomers in this community (or at least some
> > part of them).
> >
> > A bit of a back story: I work for a small startup that has been
> > developing its application for *too long*, but we hope to launch our
> > product in Q1/2016.
> > We have been using Druid as a real-time analytics store and, even
> > though Druid does what it does extremely well, we are now forced to
> > replace it to allow for functionality that we cannot be without.
> >
> > We started to evaluate several solutions and are now left with a
> > handful that we like. We have not had the time to evaluate all of them
> > in depth, but we feel comfortable with the general approach we have
> > taken.
> >
> > There were many things for us to consider, and I want to run through
> > them here to seek advice from those who know better. (I apologize for
> > the long read.)
> >
> > *Initial questions we had:*
> >
> >    - Do we want to operate our own (distributed) file system or rely
> >    on something like S3?
> >       - Are there other options?
> >
> >    - Do we favor "bringing the computation to the data" or "bringing
> >    the data to the computation"?
> >       - For us, Drill+S3 is a "bring the data to the computation"
> >       solution, while Hadoop+Hive+MapReduce is a "bring the
> >       computation to the data" solution
> >
> >    - How can we effectively merge streaming data with historical data
> >    and process both at the same time?
> >       - Something that Druid does very well
> >
> >    - What is the best way to merge data from different sources in a
> >    polyglot data/format/source environment?
> >
> >    - How can we minimize the effect of a large number of small
> >    tenants?
> >       - Warming up tenant-specific indexes in Elasticsearch can, for
> >       example, be costly (what a great solution otherwise, though)
> >
> >    - How can we evict long-tail data from fast storage and gradually
> >    move it into deep storage over time?
> >       - ... while avoiding a lengthy warm-up period
> >       - ... while using deep storage also as backup, even for fresh
> >       data
> >
> >    - Can we cache aggregates, or would we just be building a cache
> >    that next to nothing would ever hit?
> >       - aggregations for days, weeks, and months, with all the most
> >       commonly used spins
> >       - historical data does not change here, so one level of
> >       complexity is absent
> >
> >    - Can we achieve sub-second results dealing with our
> >    tenant-specific data sets?
> >       - We will never be FB/Twitter, but we still have some sizable
> >       chunks of data :)
> >
> >    - Is there a stack available that already works like this?
> >       - MapR / BDAS, etc.
> >
> >    - Can we build this using "pure" Apache components?
> >       - We would like to have the option of going without
> >       support/license fees, but we also like the option of
> >       professional services with SLAs and paid subscriptions
> >
> >    - Is this all a premature optimization?
> >       - aka "let's just use Elasticsearch for now"
> >
> >
> > *Storage format*
> >
> > We decided early on that we liked Parquet
> > <https://parquet.apache.org/> enough to commit to it for our
> > processed/historical data.
> > We know that ORC <https://orc.apache.org/> is also great, but we
> > believe that Parquet will, with its support for nested structures,
> > broad appeal, and interoperability, become a de facto standard before
> > long.
> > (We really like ORC as well :) )
> >
> > Parquet has so many things going for it, so I'll jump to what we see
> > as its current cons:
> >
> >    - Parquet does not compress as well as ORC
> >       - It seems to make up for it in retrieval speed (
> >       https://developer.ibm.com/hadoop/blog/2014/09/19/big-sql-3-0-file-formats-usage-performance/
> >       )
> >
> >    - Parquet is still missing some things that could make it even
> >    better for analytics
> >       - Like: Bloom filters
> >       <https://issues.apache.org/jira/browse/PARQUET-41>, HyperLogLog
> >       <https://issues.apache.org/jira/browse/PARQUET-42>, and
> >       indexing/sorting
> >
> >    - Hive has some problems dealing with data in nested structures in
> >    Parquet files (but that may have changed)
> >
> >    - It's ineffective to gradually add data to Parquet files (or so we
> >    believe), so streaming data should be batched into Parquet files
> >
> > Our streaming data has a mixed schema (partially schemed and partially
> > schema-less), so using Avro <https://avro.apache.org/> or Protobuf
> > <https://developers.google.com/protocol-buffers/> is doable but
> > tricky, while JSON is very doable but freakishly verbose.
> > Drill will, for example, allow us to join JSON data with Parquet data,
> > and in our tests we have been merging log information, in JSON format,
> > with historical/older data stored in Parquet.
> > Querying log files like this is sub-optimal, even though they are
> > stored on PCIe flash, so we still look for viable alternatives.
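> >
> > To make that concrete, the kind of join we run looks roughly like this
> > (a sketch only; the paths and field names are made up):
> >
> >    SELECT h.account_id, h.total_views, l.event_time, l.event_type
> >    FROM dfs.`/data/historical/pageviews` h
> >    JOIN dfs.`/data/logs/events.json` l
> >      ON h.account_id = l.account_id
> >    WHERE l.event_type = 'pageview';
> >
> > Drill resolves the JSON schema at read time, so a query like this
> > keeps working as new fields show up in the log files.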
> >
> > The worst-case scenario for us is to continue to work with this log
> > approach but using Avro instead.
> >
> > Relevant Drill points:
> >
> >    - + Drill knows all the file formats we would like to use (and then
> >    some)
> >    - - Drill still misses RDBMS access (via JDBC), but that will show
> >    up soon enough (for us)
> >
> >
> > *Query engine*
> >
> > We like SQL.
> >
> > It's the key to so many systems, tools, and resources that selecting a
> > proprietary query language is not an option (any more).
> > The query engines we know (of) in this space are:
> >
> >    - Presto <https://prestodb.io/>
> >    - Apache Drill <https://drill.apache.org/>
> >    - Apache Spark <http://spark.apache.org/>
> >
> > They all share the ability to run distributed queries and merge data
> > from different sources in different formats, which is great!
> > Spark does, in addition to this, offer a range of capabilities
> > (ML/graph/etc.) that are appealing to us.
> >
> > We have not decided yet what to do, but Drill is very appealing to us,
> > and we have it running in the "production" system used for our dev
> > partners.
> >
> > Relevant Drill points:
> >
> >    - + Drill does SQL well and is getting better at dealing with
> >    evolving/dirty schemas (aka JSON curve balls)
> >    - + Drill is getting better and better at window functions and
> >    various aggregations (see the sketch below)
> >    - + Drill can be accessed via JDBC/ODBC, and it has a Spark
> >    connection as well
> >    - + Drill does user-defined functions (they could be a bit more
> >    accessible, but hey, they are there!)
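> >
> > A window function example of the sort we have been testing (again a
> > sketch; the table and columns are made up):
> >
> >    SELECT account_id, event_date, daily_views,
> >           SUM(daily_views) OVER (PARTITION BY account_id
> >                                  ORDER BY event_date) AS running_views
> >    FROM dfs.`/data/historical/daily_views`;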
> >
> > *Storage:*
> >
> > It's very appealing to use Hive <https://hive.apache.org/> and
> > Hadoop/HDFS <https://hadoop.apache.org/> to store our data, and Hive
> > can point to data in S3 as well as in other storage locations.
> > Many solutions know how to use the Hive catalog, and some optimization
> > that utilizes it seems to be available.
> >
> > Still, we are not convinced that we want to build a Hadoop
> > <https://hadoop.apache.org/> cluster (yeah, it's mostly prejudice).
> > We don't want to use S3 directly or be forced to live in the EC2
> > ecosystem for it to make any operational sense. (yeah, mostly we are
> > cheap)
> >
> > Ideally we would like something like this:
> >
> >    - Use S3 (or any other cost-effective alternative) as deep storage
> >
> >    - Cache new and in-demand data locally to minimize S3 traffic as
> >    much as possible (a large tiered cache: t1: flash, t2: SSD, and
> >    t3: HDD)
> >
> >    - We would like to use a file system that plays nice with the
> >    Parquet file format and does the scans/reads required to work
> >    effectively with its columnar nature.
> >    We are still not sure if HDFS is the ideal match for Parquet from
> >    this perspective, but we think that both Gluster
> >    <http://www.gluster.org/> and Ceph <http://ceph.com/> could be
> >    (even though Ceph is a blob/object store like S3 underneath)
> >
> > Relevant Drill points (for file-system-based operations):
> >
> >    - - All Drillbits (nodes) require access to the same data for
> >    distribution/planning to work ("bring the data to the
> >    computation"); a planner that knows about "single-node data" and
> >    can plan for it would be ideal for alternatives like the one we
> >    have in mind
> >
> >    - - Partition pruning does not work with directories.
> >    Partitioning data by time span into a sub-directory structure has
> >    no benefits (Drill goes through them all, always)
> >
> > We were quite optimistic when we came across Tachyon
> > <http://tachyon-project.org/> and got it to work with Drill (see
> > earlier email).
> > It seemed like the ideal bridge between Drill and S3 that would allow
> > us to do multi-tiered caching and smart sharing of content between
> > Drillbits (nodes) without fetching it multiple times from S3.
> > But Tachyon currently has a single drawback that we cannot accept: it
> > stores all files in S3 in a proprietary way that does not reflect the
> > "file system" that it fronts.
> > Also, accessing Tachyon seems to be an API-only exercise; NFS access,
> > for example, does not exist. (In that way Tachyon is wicked fast and
> > slow (as in dim) at the same time.)
> >
> > *To oversimplify - the choice is now between the beaten path
> > (MapR/BDAS/etc.) or ...*
> >
> > *A)* the 12" mix - feat. file system only:
> >
> >    - Query engine: *Drill*
> >    - Streaming/log data format: *Avro* / *JSON*
> >    - Historical data format: *Parquet*
> >    - Extra dimensions / enrichment information: *RDBMS* / *Cassandra*
> >    / *Solr* / *more*
> >    - File system: *S3 + Tachyon* / *Gluster* / *Ceph*
> >
> > *B)* the 6" mix - feat. Johnny Cash:
> >
> >    - Query engine: *Drill / Presto / Spark / Spark + Drill*
> >    - Streaming/log data format: *Avro* / *JSON*
> >    - Historical data format: *Parquet*
> >    - Extra dimensions / enrichment information: *RDBMS* / *Cassandra*
> >    / *Solr* / *more*
> >    - File system: *Hadoop + Hive*
> >
> >
> > *General Drill observations that are relevant to us (partly
> > redundant):*
> >
> >    - The REST API only returns JSON
> >       - Uncompressed and incorrectly typed (count(*) returns a number
> >       in quotes)
> >       - The result set is always known, so returning Avro/Protobuf
> >       could be beneficial
> >
> >    - Schema discovery is still a bit immature / fragile /
> >    temperamental
> >       - This is understandable and will change
> >
> >    - Partition pruning does not work for directories
> >       - See the point above (Storage)
> >
> >    - Schema information is not available for data in the file system
> >       - This will likely change (at least a schema is created at query
> >       time)
> >
> >    - Metadata queries are not available (data discovery queries)
> >       - Discovering fields, cardinality, and composition for data that
> >       meets certain criteria
> >       - This is essential for building a good UI for tenant-specific
> >       data (data that varies between tenants)
> >
> > I will, in the near future, try to evolve this into a more informative
> > blog post and share it for those starting down this path.
> >
> > There is so much more going on in/around this space (Tez
> > <https://tez.apache.org/>, Flink <https://flink.apache.org/> ...)
> >
> > All input is welcome.
> >
> > Best regards,
> > -Stefan Baxter
> >
> > PS. I'm a bit dyslexic and sometimes the little bastards sneak
> > through. Just laugh and shake your head when you come across them :)
> >
