I'm not sure. It is possible that it is being evaluated during planning to
prune the scan, but that the filter above the scan is not being removed as it
should be. I'll try to re-create the case and take a look.

Stefán, earlier you had mentioned that it was not only inefficient, but that
it was also giving you back data when none matched the filter condition. Is
this still an issue, or did the comments from Hakim and Jacques allow you to
fix an issue with the UDF that was causing this?

On Fri, Jul 24, 2015 at 11:03 AM, Jacques Nadeau <[email protected]> wrote:

> - This is being called for *every record for every file in every directory*
>
> Are you sure? Constant reduction should take care of this. @Jason, any
> ideas why it might be failing?
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Fri, Jul 24, 2015 at 10:45 AM, Stefán Baxter <[email protected]> wrote:
>
> > Hi,
> >
> > Thanks for the tips.
> >
> > Observation:
> >
> > - This is being called for *every record for every file in every
> > directory*
> >
> > Can you please tell me what needs to be done to make sure this is only
> > called once for each directory, preferably before the files in that
> > directory are opened/scanned?
> >
> > Regards,
> > -Stefán
> >
> > On Fri, Jul 24, 2015 at 3:28 PM, Jacques Nadeau <[email protected]> wrote:
> >
> > > Two quick notes:
> > >
> > > - If you switch to internal null handling, you have to define separate
> > > UDFs for each possible combination of nullable and non-nullable values.
> > > - isSet is an integer, so your if clause would actually be:
> > >
> > > if (!(yearDir.isSet == 1)) {
> > >   // yearDir is NULL, handle this here
> > > }
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Fri, Jul 24, 2015 at 8:10 AM, Abdel Hakim Deneche <[email protected]> wrote:
> > >
> > > > Hi Stefán,
> > > >
> > > > I think when you specify your UDF as NULL_IF_NULL it means Drill will
> > > > handle null values automatically: if any argument passed to your UDF
> > > > is NULL, the UDF won't be evaluated and Drill will return NULL
> > > > instead.
> > > >
> > > > In your case your UDF needs to handle NULL values itself, by setting:
> > > >
> > > > nulls = NullHandling.INTERNAL
> > > >
> > > > Then inside your UDF you can check whether a passed argument is null
> > > > like this:
> > > >
> > > > if (!yearDir.isSet) {
> > > >   // yearDir is NULL, handle this here
> > > > }
> > > >
> > > > Let me know if this works for you; I haven't worked on UDFs in quite
> > > > some time and they may have changed slightly since then.
> > > >
> > > > Thanks
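
Putting the two corrections above together (Hakim's NullHandling.INTERNAL plus
Jacques' integer isSet check), a minimal sketch under the Drill 1.x UDF API
might look like the following. The class name, the single parameter, and the
0-for-NULL return convention are illustrative only, not the actual
implementation from the thread:

    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.holders.IntHolder;
    import org.apache.drill.exec.expr.holders.NullableVarCharHolder;

    // Nested inside a UDF container class, as in the original code.
    @FunctionTemplate(name = "dirInRange",
        scope = FunctionTemplate.FunctionScope.SIMPLE,
        nulls = FunctionTemplate.NullHandling.INTERNAL)   // instead of NULL_IF_NULL
    public static class DirInRange implements DrillSimpleFunc {

      @Param  NullableVarCharHolder yearDir;
      @Output IntHolder out;

      public void setup() { }

      public void eval() {
        // isSet is an int, not a boolean: 0 means NULL, 1 means a value is set.
        if (yearDir.isSet == 0) {
          out.value = 0;          // illustrative "no match" result for NULL input
          return;
        }
        // ... normal range evaluation goes here ...
        out.value = 1;
      }
    }

Note Jacques' caveat still applies: with internal null handling you would need
a separate version of the function for each nullable/non-nullable combination
of arguments you expect to be called with.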
> > > > On Fri, Jul 24, 2015 at 7:37 AM, Stefán Baxter <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I have this running now:
> > > > >
> > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
> > > > > dirInRange(cast('2015-04-10' as timestamp), cast('2015-07-11' as
> > > > > timestamp), COALESCE(dir0,'-'), COALESCE(dir1,'-'), COALESCE(dir2,'-'))
> > > > > order by occurred_at;"
> > > > >
> > > > > Observations:
> > > > >
> > > > > - Being able to do this without using COALESCE would be more
> > > > > readable.
> > > > > - This is being called for every record of every file in every
> > > > > directory.
> > > > > - I would love for this expression to be evaluated only at "the
> > > > > start of a new directory".
> > > > > - This function always returns 1 and so should never match dir0.
> > > > > - For some strange reason this query sometimes returns results even
> > > > > though it never should (dir0 != 1).
> > > > >
> > > > > Regards,
> > > > > -Stefán
> > > > >
> > > > > On Fri, Jul 24, 2015 at 12:12 PM, Stefán Baxter <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I would like to share our intentions for organizing our data and
> > > > > > how we plan to construct queries for it.
> > > > > >
> > > > > > There are four main reasons for sharing this:
> > > > > >
> > > > > > a) I would like to sanity-check the approach.
> > > > > > b) I'm having a hard time writing a UDF to optimize this and need
> > > > > > a bit of help.
> > > > > > c) This can perhaps benefit someone else.
> > > > > > d) I want to check whether a UDF like the one I'm writing already
> > > > > > exists.
> > > > > >
> > > > > > Relevant facts:
> > > > > >
> > > > > > - We have multiple tenants and they collect variable amounts of
> > > > > > data (from almost none to quite a lot).
> > > > > > - Each tenant has multiple data-sources (aka time-series / tables).
> > > > > > - Data is processed to Parquet at variable intervals, depending on
> > > > > > the tenant and his volume.
> > > > > > - Parquet files cannot be updated/appended to in Drill (CREATE
> > > > > > TABLE AS creates a new directory, aka table).
> > > > > >
> > > > > > The structure we intend to use:
> > > > > >
> > > > > > /<root-directory>/[processed|streaming]/<tenant>/<data-source>/<year>/<month>/<date>
> > > > > > example: "/analytics/processed/some-tenant-name/events/2015/07/24"
> > > > > >
> > > > > > - where *processed* is the sub-directory used for Parquet files
> > > > > > and the *streaming* sub-directory is used for "raw materials"
> > > > > > (json/csv/logs etc.). This structure also allows us to have
> > > > > > dev-nodes with a local streaming directory and a remote (hdfs/s3)
> > > > > > processed directory.
> > > > > >
> > > > > > It will vary between tenants how often data is processed and how
> > > > > > data is stored:
> > > > > >
> > > > > > - small tenant:
> > > > > > /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
> > > > > > - medium tenant:
> > > > > > /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
> > > > > > - ... (you get the picture)
> > > > > >
> > > > > > *Important:* this means that in some cases we will have empty or
> > > > > > missing sub-directories, and in some cases we will have all the
> > > > > > data in the root directory, but the queries we run will always be
> > > > > > the same ones (regardless of the tenant-specific storage "layout").
> > > > > >
> > > > > > When we do selects we use *union* to query both the *processed*
> > > > > > and *streaming* directory structures and then merge the data (if
> > > > > > needed, for grouping etc.).
> > > > > >
> > > > > > This is all working just fine (so far, so good).
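
For concreteness, the union described above could look roughly like the sketch
below. The thread does not show the actual query; the tenant path is the
example one from earlier in the message, and union all (rather than union) is
an assumption to avoid an unnecessary distinct pass when merging the two
branches:

    select occurred_at, dir0, dir1, dir2
    from dfs.tmp.`/analytics/processed/some-tenant-name/events`
    union all
    select occurred_at, dir0, dir1, dir2
    from dfs.tmp.`/analytics/streaming/some-tenant-name/events`;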
> > > > > > Now I'm creating a small UDF that is supposed to help with the
> > > > > > scanning of these directories.
> > > > > >
> > > > > > Requirements:
> > > > > >
> > > > > > - Avoid scanning/opening files in directories that are irrelevant.
> > > > > > - Include files in every directory of full/partial relevance.
> > > > > > - Minimize the additional overhead of including this UDF in the
> > > > > > SQL command.
> > > > > >
> > > > > > The UDF is called dirInRange and it takes these parameters:
> > > > > >
> > > > > > @Param TimeStampHolder from;
> > > > > > @Param NullableTimeStampHolder to;
> > > > > > @Param NullableVarCharHolder yearDir;
> > > > > > @Param NullableVarCharHolder monthDir;
> > > > > > @Param NullableVarCharHolder dayDir;
> > > > > >
> > > > > > It currently returns an int (1 if in range, and other values
> > > > > > depending on what failed).
> > > > > >
> > > > > > The function is defined like this:
> > > > > >
> > > > > > @FunctionTemplate(name = "dirInRange",
> > > > > >     scope = FunctionTemplate.FunctionScope.SIMPLE,
> > > > > >     nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
> > > > > > public static class dirInRange implements DrillSimpleFunc
> > > > > >
> > > > > > ... and it's called like this:
> > > > > >
> > > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > > dfs.tmp.`/analytics/processed/test/events` as t where
> > > > > > dirInRange(cast('2015-04-10' as timestamp), cast('2015-07-11' as
> > > > > > timestamp), dir0, dir1, dir2) = 1 order by occurred_at;"
> > > > > >
> > > > > > This is not working and I have several questions:
> > > > > >
> > > > > > - Is it possible that this function is only called if/when dir0,
> > > > > > dir1, and dir2 all exist?
> > > > > >   - If that is the case, would it not be better for these
> > > > > >   utility variables to exist with null values until needed/used?
> > > > > >   - Having a Nullable*Holder seems to play no role in this at all.
> > > > > > - Is there anything I can do to have this expression evaluated
> > > > > > only once per directory?
> > > > > > - Is there anything I can do to have this evaluated before the
> > > > > > files in the directory are processed?
> > > > > > - How do I pass a null value into this?
> > > > > >   - This fails: dirInRange(cast('2015-04-10' as timestamp), null,
> > > > > >   dir0, dir1, dir2)
> > > > > >   Error: PARSE ERROR: From line 1, column 82 to line 1, column 85:
> > > > > >   Illegal use of 'NULL'
> > > > > >
> > > > > > This will be a part of my mini UDF library
> > > > > > (https://github.com/acmeguy/asdrill), feel free to use/abuse it
> > > > > > if it helps.
> > > > > >
> > > > > > Regards,
> > > > > > -Stefán
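
For reference, here is one way the whole dirInRange could be put together with
internal null handling, treating each directory level as a time interval and
keeping the directory only if that interval overlaps [from, to]. This is a
sketch, not Stefán's actual implementation: it assumes Drill's bundled
Joda-Time and the StringFunctionHelpers utility, and the interval logic and
return values are illustrative:

    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.holders.IntHolder;
    import org.apache.drill.exec.expr.holders.NullableTimeStampHolder;
    import org.apache.drill.exec.expr.holders.NullableVarCharHolder;
    import org.apache.drill.exec.expr.holders.TimeStampHolder;

    @FunctionTemplate(name = "dirInRange",
        scope = FunctionTemplate.FunctionScope.SIMPLE,
        nulls = FunctionTemplate.NullHandling.INTERNAL)
    public static class DirInRange implements DrillSimpleFunc {

      @Param  TimeStampHolder         from;      // millis since epoch
      @Param  NullableTimeStampHolder to;
      @Param  NullableVarCharHolder   yearDir;
      @Param  NullableVarCharHolder   monthDir;
      @Param  NullableVarCharHolder   dayDir;
      @Output IntHolder               out;

      public void setup() { }

      public void eval() {
        // No year directory: the data sits directly in the table root,
        // so nothing can be excluded. Keep it.
        if (yearDir.isSet == 0) { out.value = 1; return; }

        int y = Integer.parseInt(
            org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
                .toStringFromUTF8(yearDir.start, yearDir.end, yearDir.buffer));

        // Build the directory's time interval [dirStart, dirEnd) at the
        // finest granularity actually present in the path.
        org.joda.time.DateTime dirStart;
        org.joda.time.DateTime dirEnd;
        if (monthDir.isSet == 1) {
          int m = Integer.parseInt(
              org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
                  .toStringFromUTF8(monthDir.start, monthDir.end, monthDir.buffer));
          if (dayDir.isSet == 1) {
            int d = Integer.parseInt(
                org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
                    .toStringFromUTF8(dayDir.start, dayDir.end, dayDir.buffer));
            dirStart = new org.joda.time.DateTime(y, m, d, 0, 0,
                org.joda.time.DateTimeZone.UTC);
            dirEnd = dirStart.plusDays(1);
          } else {
            dirStart = new org.joda.time.DateTime(y, m, 1, 0, 0,
                org.joda.time.DateTimeZone.UTC);
            dirEnd = dirStart.plusMonths(1);
          }
        } else {
          dirStart = new org.joda.time.DateTime(y, 1, 1, 0, 0,
              org.joda.time.DateTimeZone.UTC);
          dirEnd = dirStart.plusYears(1);
        }

        // A NULL "to" is treated as an open-ended range.
        long toMillis = (to.isSet == 1) ? to.value : Long.MAX_VALUE;

        // In range iff the directory interval overlaps [from, to].
        out.value = (dirStart.getMillis() <= toMillis
                  && dirEnd.getMillis() > from.value) ? 1 : 0;
      }
    }

Per Jacques' note, separate variants would be needed for other
nullable/non-nullable argument combinations. As for the PARSE ERROR on the
bare null argument, a typed null such as cast(null as timestamp) may get past
the parser where a plain NULL does not; that is untested here.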
> > > >
> > > > --
> > > > Abdelhakim Deneche
> > > > Software Engineer
> > > > <http://www.mapr.com/>
> > > >
> > > > Now Available - Free Hadoop On-Demand Training
> > > > <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
