Re: directory pruning and UDFs

Stefán Baxter Thu, 24 Sep 2015 15:24:07 -0700

Hi,

It's here: https://issues.apache.org/jira/browse/DRILL-3838


Hopefully this can be accommodated soon :).

Regards,
 -Stefan



On Wed, Sep 23, 2015 at 5:21 PM, Jacques Nadeau <[email protected]> wrote:

> Hey Stefan,
>
> Yes, this makes a lot of sense and seems reasonable. We've talked about
> providing the simple filename as a virtual attribute. It seems like we
> should also provide a full path attribute (from the root of the workspace).
> Can you open a JIRA for this? It isn't something that is supported now but
> should be fairly trivial to do while we are adding the filename virtual
> attribute.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Sep 22, 2015 at 1:51 PM, Stefán Baxter <[email protected]>
> wrote:
>
> > Jacques,
> >
> > Is this something you think makes sense and could be accommodated?
> >
> > Regards,
> >  -Stefan
> >
> > On Fri, Sep 18, 2015 at 12:13 PM, Stefán Baxter <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > The short back story is this:
> > >
> > >    - We are serving multiple tenants with vastly different data volumes
> > >    and needs
> > >    - there is no such thing as fixed-period segment sizes (to get to
> > >    approx. volume per segment)
> > >
> > >    - We do queries that combine information from historical and fresh
> > >    (streaming) data (parquet and json/avro respectively) using joins
> > >    - currently we are using loggers to emit the streaming data but this
> > >    will be replaced
> > >
> > >    - The "fresh" data (json/avro) files live in a single directory
> > >    - 1 file per day
> > >
> > >    - Fresh data is occasionally transformed from json/avro to parquet
> > >    - the frequency of this is set on tenant/volume basis
> > >
> > > This is why we need/like to*:
> > >
> > >    - Use directory structure and file names as flexible chronological
> > >    partitions (via UDFs)
> > >    - Use parquet partitions for "logical data separation" based on
> > >    attributes other than time
> > >
> > >    * Please remember that adding new data to parquet files would
> > >    eliminate the need for much of this
> > >    ** The same is true if we moved this whole thing to some
> > >    metadata-driven environment like Hive
> > >
> > > The historical (parquet) directory structure might look something like
> > > this:
> > >
> > >    1. /<tenant>/<source>/streaming/2015/09/10
> > >    - high volume :: data transformed daily
> > >
> > >    2. /<tenant>/<source>/streaming/2015/W10
> > >    - medium volume :: data transformed weekly
> > >
> > >    3. /<tenant>/<source>/streaming/2015/09
> > >    - low(er) volume :: data transformed monthly
> > >
> > > So yes, we think that having the ability to evaluate full paths and
> > > file names, where we can affect the pruning/scanning with appropriate
> > > exceptions, would help us gain some sanity :).
> > >
> > > I realize that pruning should preferably be done in the planning phase
> > > but this would allow for a not-too-messy interception of the scanning
> > > process.
> > >
> > > Best regards,
> > >  -Stefan
> > >
> > >
> > > On Fri, Sep 18, 2015 at 6:01 AM, Jacques Nadeau <[email protected]>
> > > wrote:
> > >
> > >> Can you also provide some examples of what you are trying to
> > >> accomplish?
> > >>
> > >> It seems like you might be saying that you want a virtual attribute
> > >> for the entire path rather than individual pieces? Also remember that
> > >> partition pruning can also be done if you're using Parquet files
> > >> without all the dirN syntax.
> > >>
> > >> --
> > >> Jacques Nadeau
> > >> CTO and Co-Founder, Dremio
> > >>
> > >> On Thu, Sep 17, 2015 at 10:42 AM, Stefán Baxter <[email protected]>
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > I have been writing a few simple utility functions for Drill and
> > >> > staring at the cumbersome dirN conditions required to take advantage
> > >> > of directory pruning.
> > >> >
> > >> > Would it be possible to allow UDFs to throw fileOutOfScope and
> > >> > directoryOutOfScope exceptions that would a) allow me to write a
> > >> > fairly clever inRange(from, to, dirN...) function and b) allow for
> > >> > additional pruning during execution?
> > >> >
> > >> > Maybe I'm seeing this all wrong but the process of complicating all
> > >> > queries with a, sometimes quite complicated, dirN tail just seems
> > >> > like too much redundancy.
> > >> >
> > >> > Regards,
> > >> >  -Stefan
> > >> >
> > >>
> > >
> > >
> >
>
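
For readers following the thread, here is a minimal sketch of the kind of
range check the proposed inRange(from, to, dirN...) function would perform
over the three directory layouts listed above (YYYY/MM/DD, YYYY/Wnn, YYYY/MM).
It is plain Python, not Drill UDF code. The names segment_range and in_range
are hypothetical, the week folders are assumed to use ISO-8601 week numbering,
and missing trailing dirN values are assumed to arrive as None.

```python
import calendar
import datetime as dt


def segment_range(parts):
    """Return (first_day, last_day) covered by one directory segment.

    `parts` holds the dir0, dir1, ... values as strings. The three layouts
    are taken from the thread: [YYYY, MM, DD], [YYYY, 'Wnn'] and [YYYY, MM].
    """
    year = int(parts[0])
    if len(parts) == 3:
        # Daily segment: covers exactly one day.
        day = dt.date(year, int(parts[1]), int(parts[2]))
        return day, day
    if len(parts) == 2 and parts[1].upper().startswith("W"):
        # Weekly segment. Assumption: ISO-8601 weeks, Monday through Sunday.
        start = dt.date.fromisocalendar(year, int(parts[1][1:]), 1)
        return start, start + dt.timedelta(days=6)
    if len(parts) == 2:
        # Monthly segment: first to last calendar day of the month.
        month = int(parts[1])
        last_day = calendar.monthrange(year, month)[1]
        return dt.date(year, month, 1), dt.date(year, month, last_day)
    raise ValueError(f"unrecognised directory layout: {parts!r}")


def in_range(start, end, *dirs):
    """True if the segment named by `dirs` overlaps [start, end] (inclusive)."""
    parts = [d for d in dirs if d is not None]  # drop missing trailing levels
    first, last = segment_range(parts)
    return first <= end and last >= start


# Example: which segments overlap a query for 5 Sep to 12 Sep 2015?
query = (dt.date(2015, 9, 5), dt.date(2015, 9, 12))
for dirs in (("2015", "09", "10"), ("2015", "W10", None), ("2015", "09", None)):
    print(dirs, in_range(*query, *dirs))
```

A segment for which in_range returns False is one that the proposed
fileOutOfScope / directoryOutOfScope signal would let the scan skip. Whether
that happens at planning time or during execution is the open question in the
thread.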
