Hi,

The short back story is this:

- We are serving multiple tenants with vastly different data volumes and needs - there is no such thing as a fixed-period segment size (to reach an approximate volume per segment)
- We run queries that combine historical and fresh (streaming) data (Parquet and JSON/Avro respectively) using joins - currently we use loggers to emit the streaming data, but this will be replaced
- The "fresh" (JSON/Avro) files live in a single directory - one file per day
- Fresh data is occasionally transformed from JSON/Avro to Parquet - the frequency of this is set on a per-tenant/volume basis

This is why we need/would like to*:

- Use the directory structure and file names as flexible chronological partitions (via UDFs)
- Use Parquet partitions for "logical data separation" based on attributes other than time

* Please remember that being able to append new data to Parquet files would eliminate the need for much of this.
** The same is true if we moved this whole thing to some metadata-driven environment like Hive.

The historical (Parquet) directory structure might look something like this:

1. /<tenant>/<source>/streaming/2015/09/10 - high volume :: data transformed daily
2. /<tenant>/<source>/streaming/2015/W10 - medium volume :: data transformed weekly
3. /<tenant>/<source>/streaming/2015/09 - low(er) volume :: data transformed monthly

So yes, we think that having the ability to evaluate full paths and file names, where we can affect the pruning/scanning with appropriate exceptions, would help us regain some sanity :). I realize that pruning should preferably be done in the planning phase, but this would allow for a not-too-messy interception of the scanning process.

Best regards,
-Stefan

On Fri, Sep 18, 2015 at 6:01 AM, Jacques Nadeau <[email protected]> wrote:

> Can you also provide some examples of what you are trying to accomplish?
>
> It seems like you might be saying that you want a virtual attribute for the
> entire path rather than individual pieces?
> Also remember that partition pruning can also be done if you're using
> Parquet files without all the dirN syntax.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Thu, Sep 17, 2015 at 10:42 AM, Stefán Baxter <[email protected]>
> wrote:
>
> > Hi,
> >
> > I have been writing a few simple utility functions for Drill and staring
> > at the cumbersome dirN conditions required to take advantage of directory
> > pruning.
> >
> > Would it be possible to allow UDFs to throw fileOutOfScope and
> > directoryOutOfScope exceptions? That would a) let me write a fairly
> > clever inRange(from, to, dirN...) function and b) allow for
> > additional pruning during execution.
> >
> > Maybe I'm seeing this all wrong, but the process of complicating all
> > queries with a, sometimes quite complicated, dirN tail just seems like
> > too much redundancy.
> >
> > Regards,
> > -Stefan
> >
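
P.S. For concreteness, here is roughly the simplification I'm after. This is only a sketch: inRange is the proposed UDF (it does not exist yet), and the tenant/source path, alias, and column names are made up. The dir0/dir1/dir2 columns are Drill's implicit directory columns.

```sql
-- Today: every query drags an explicit dirN tail to benefit from pruning
SELECT t.event, t.`value`
FROM dfs.`/acme/clickstream/streaming` AS t
WHERE t.dir0 = '2015'
  AND t.dir1 = '09'
  AND t.dir2 BETWEEN '01' AND '10';

-- Proposed: a single UDF call expresses the same chronological range,
-- and could signal out-of-scope files/directories to the scanner
SELECT t.event, t.`value`
FROM dfs.`/acme/clickstream/streaming` AS t
WHERE inRange('2015-09-01', '2015-09-10', t.dir0, t.dir1, t.dir2);
```

The first form also has to change shape per tenant (daily vs. weekly vs. monthly layouts), which is exactly the redundancy I'd like to avoid.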
