Hi,

The short back story is this:

- We are serving multiple tenants with vastly different data volumes and needs - there is no such thing as a fixed-period segment size (to reach an approximate volume per segment)
- We run queries that combine historical and fresh (streaming) data (Parquet and JSON/Avro respectively) using joins - currently we use loggers to emit the streaming data, but this will be replaced
- The "fresh" (JSON/Avro) files live in a single directory - one file per day
- Fresh data is occasionally transformed from JSON/Avro to Parquet - the frequency of this is set on a per-tenant/volume basis

This is why we need/would like to*:

- Use the directory structure and file names as flexible chronological partitions (via UDFs)
- Use Parquet partitions for "logical data separation" based on attributes other than time

* Please remember that being able to append new data to Parquet files would eliminate the need for much of this.
** The same is true if we moved this whole thing to some metadata-driven environment like Hive.

The historical (Parquet) directory structure might look something like this:

1. /<tenant>/<source>/streaming/2015/09/10 - high volume :: data transformed daily
2. /<tenant>/<source>/streaming/2015/W10 - medium volume :: data transformed weekly
3. /<tenant>/<source>/streaming/2015/09 - low(er) volume :: data transformed monthly

So yes, we think that having the ability to evaluate full paths and file names, where we can affect the pruning/scanning with appropriate exceptions, would help us regain some sanity :). I realize that pruning should preferably be done in the planning phase, but this would allow for a not-too-messy interception of the scanning process.

Best regards,
-Stefan

On Fri, Sep 18, 2015 at 6:01 AM, Jacques Nadeau <[email protected]> wrote:

> Can you also provide some examples of what you are trying to accomplish?
>
> It seems like you might be saying that you want a virtual attribute for the
> entire path rather than individual pieces?
> Also remember that partition pruning can also be done if you're using
> Parquet files without all the dirN syntax.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Thu, Sep 17, 2015 at 10:42 AM, Stefán Baxter <[email protected]>
> wrote:
>
> > Hi,
> >
> > I have been writing a few simple utility functions for Drill and staring
> > at the cumbersome dirN conditions required to take advantage of directory
> > pruning.
> >
> > Would it be possible to allow UDFs to throw fileOutOfScope and
> > directoryOutOfScope exceptions? That would a) let me write a fairly
> > clever inRange(from, to, dirN...) function and b) allow for
> > additional pruning during execution.
> >
> > Maybe I'm seeing this all wrong, but the process of complicating all
> > queries with a, sometimes quite complicated, dirN tail just seems like
> > too much redundancy.
> >
> > Regards,
> > -Stefan
> >
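
P.S. For concreteness, here is roughly the simplification I'm after. This is only a sketch: inRange is the proposed UDF (it does not exist yet), and the tenant/source path, alias, and column names are made up. The dir0/dir1/dir2 columns are Drill's implicit directory columns.

```sql
-- Today: every query drags an explicit dirN tail to benefit from pruning
SELECT t.event, t.`value`
FROM dfs.`/acme/clickstream/streaming` AS t
WHERE t.dir0 = '2015'
  AND t.dir1 = '09'
  AND t.dir2 BETWEEN '01' AND '10';

-- Proposed: a single UDF call expresses the same chronological range,
-- and could signal out-of-scope files/directories to the scanner
SELECT t.event, t.`value`
FROM dfs.`/acme/clickstream/streaming` AS t
WHERE inRange('2015-09-01', '2015-09-10', t.dir0, t.dir1, t.dir2);
```

The first form also has to change shape per tenant (daily vs. weekly vs. monthly layouts), which is exactly the redundancy I'd like to avoid.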
