Hi,

thanks for the tips.

Observation:

   - This is being called for *every record for every file in every
   directory*

Can you please tell me what needs to be done to make sure this is only
called once for each directory, preferably before the files in that
directory are opened/scanned?

Regards,
  -Stefán

On Fri, Jul 24, 2015 at 3:28 PM, Jacques Nadeau <[email protected]> wrote:

> Two quick notes:
>
> - If you switch to internal null handling, you have to define separate UDFs
> for each possible combination of nullable and non-nullable argument types
> (sketched below).
> - isSet is an integer, so your if clause would actually be:
>
> if (! (yearDir.isSet == 1) ) {
>   // yearDir is NULL, handle this here
> }
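>
> To illustrate the first point: the same function name ends up registered
> once per combination, roughly like this (trimmed to a single directory
> argument to keep it short; the class names are made up for the example):
>
> @FunctionTemplate(name = "dirInRange",
>     scope = FunctionTemplate.FunctionScope.SIMPLE,
>     nulls = FunctionTemplate.NullHandling.INTERNAL)
> public static class DirInRangeNullableDir implements DrillSimpleFunc {
>   @Param TimeStampHolder from;
>   @Param NullableVarCharHolder yearDir;     // nullable variant
>   @Output IntHolder out;
>   public void setup() { }
>   public void eval() {
>     out.value = (yearDir.isSet == 1) ? 1 : 0;   // placeholder logic only
>   }
> }
>
> @FunctionTemplate(name = "dirInRange",
>     scope = FunctionTemplate.FunctionScope.SIMPLE,
>     nulls = FunctionTemplate.NullHandling.INTERNAL)
> public static class DirInRangeRequiredDir implements DrillSimpleFunc {
>   @Param TimeStampHolder from;
>   @Param VarCharHolder yearDir;             // required (non-null) variant
>   @Output IntHolder out;
>   public void setup() { }
>   public void eval() {
>     out.value = 1;                          // placeholder logic only
>   }
> }
>
> Drill should then resolve to whichever variant matches the nullability of
> the columns you pass in.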
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Fri, Jul 24, 2015 at 8:10 AM, Abdel Hakim Deneche <
> [email protected]>
> wrote:
>
> > Hi Stefán,
> >
> > I think when you specify your UDF as NULL_IF_NULL it means Drill will
> > handle null values automatically: if any argument passed to your UDF is
> > NULL, the UDF won't be evaluated and Drill will return NULL instead.
> >
> > In your case, your UDF needs to handle NULL values itself by setting:
> >
> > nulls = NullHandling.INTERNAL
> >
> > Then inside your UDF you can check whether a passed argument is null like
> > this:
> >
> > if (!yearDir.isSet) {
> >   // yearDir is NULL, handle this here
> > }
> >
> > Let me know if this works for you. I haven't worked on UDFs for quite
> > some time now and they may have changed slightly since then.
> >
> > Thanks
> >
> >
> > On Fri, Jul 24, 2015 at 7:37 AM, Stefán Baxter <
> [email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I have this running now:
> > >
> > > "select occurred_at, dir0, dir1, dir2 from
> > > dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
> > > dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
> > > timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-'))
> > order
> > > by occurred_at;"
> > >
> > >
> > > Observations:
> > >
> > >    - Being able to do this without using COALESCE would be more
> > >    readable
> > >
> > >    - This is being called for every record for every file in every
> > >    directory
> > >    - I would love for this expression to be evaluated only at "the
> > >    start of a new directory"
> > >
> > >    - This function always returns 1, which should never match dir0
> > >    - For some strange reason this query sometimes returns results even
> > >    though it should never do that (dir0 != 1)
> > >
> > > Regards,
> > >  -Stefan
> > >
> > >
> > > On Fri, Jul 24, 2015 at 12:12 PM, Stefán Baxter <
> > [email protected]
> > > >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to share our intentions for organizing our data and how
> > > > we plan to construct queries for it.
> > > >
> > > > There are four main reasons for sharing this:
> > > >
> > > > a) I would like to sanity check the approach
> > > > b) I'm having a hard time writing a UDF to optimize this and need a
> > > > bit of help.
> > > >
> > > > c) This can perhaps benefit someone else
> > > >
> > > > d) Check if a UDF like the one I'm writing already exists
> > > >
> > > >
> > > > Relevant facts:
> > > >
> > > >    - We have multiple tenants and they collect variable amounts of
> > > >    data (from almost none to quite a lot)
> > > >    - Each tenant has multiple data-sources (aka time-series / tables)
> > > >    - Data is processed to Parquet at variable intervals depending on
> > > >    the tenant and their volume
> > > >    - Parquet files cannot be updated/appended to in Drill (CREATE
> > > >    TABLE AS creates a new directory (aka table))
> > > >
> > > > The structure we intend to use:
> > > >
> > > >    - / <root-directory> / [processed|streaming] / <tenant> /
> > > >    <data-source> / <year> / <month> / <date>
> > > >    example: "/analytics/processed/some-tenant-name/events/2015/07/24"
> > > >
> > > >    - where the *processed* sub-directory is used for Parquet files
> > > >    and the *streaming* sub-directory is used for "raw materials"
> > > >    (json/csv/logs etc.)
> > > >    this structure also allows us to have dev-nodes with a local
> > > >    streaming directory and a remote (hdfs/s3) processed directory
> > > >
> > > > How often data is processed, and how it is stored, will vary between
> > > > tenants:
> > > >
> > > >    - small tenant:
> > > >    /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
> > > >    - medium tenant:
> > > >    /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
> > > >    - ... (you get the picture)
> > > >
> > > >    *Important:* this means that in some cases we will have empty or
> > > >    missing sub-directories, and in some cases all the data will sit
> > > >    in the root directory, but the queries we run will always be the
> > > >    same ones (regardless of the tenant-specific storage "layout");
> > > >    see the dir0/dir1/dir2 example below.
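> > > >
> > > >    Concretely (this is my understanding of how Drill's dirN partition
> > > >    columns resolve, so treat it as an assumption to verify), with the
> > > >    data-source directory as the table root:
> > > >
> > > >    .../pageviews/2014/on-year-partitioned.parquet
> > > >        -> dir0 = '2014', dir1 = NULL, dir2 = NULL
> > > >    .../pageviews/2015/06/on-month-partitioned.parquet
> > > >        -> dir0 = '2015', dir1 = '06', dir2 = NULL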
> > > >
> > > > When we do selects we use *union* to query both the *processed* and
> > > > *streaming* directory structures and then merge the data (if needed,
> > > > for grouping etc.).
> > > >
> > > > This is all working just fine (so far, so good).
> > > >
> > > >
> > > > Now I'm creating a small UDF that is supposed to help with the
> > > > scanning of these directories.
> > > >
> > > > Requirements:
> > > >
> > > >    - Avoid scanning/opening files in directories that are irrelevant
> > > >    - Include files in every directory of full/partial relevance
> > > >    - Minimize the additional overhead of including this UDF in the
> > > >    SQL command
> > > >
> > > >
> > > > The UDF is called dirInRange and it takes these parameters:
> > > >
> > > >    - @Param TimeStampHolder from;
> > > >    - @Param NullableTimeStampHolder to;
> > > >    - @Param NullableVarCharHolder yearDir;
> > > >    - @Param NullableVarCharHolder monthDir;
> > > >    - @Param NullableVarCharHolder dayDir;
> > > >
> > > > It currently returns an int (1 if in range, and other values
> > > > depending on what failed)
> > > >
> > > > The function is defined like this:
> > > >
> > > > @FunctionTemplate(name = "dirInRange",
> > > >     scope = FunctionTemplate.FunctionScope.SIMPLE,
> > > >     nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
> > > > public static class dirInRange implements DrillSimpleFunc
> > > >
> > > > ... and it's called like this:
> > > >
> > > > "select occurred_at, dir0, dir1, dir2 from
> > > > dfs.tmp.`/analytics/processed/test/events` as t where
> > > *dirInRange*(cast('2015-04-10'
> > > > as timestamp),cast('2015-07-11' as timestamp),dir0,dir1,dir2) = 1
> order
> > > by
> > > > occurred_at;"
> > > >
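> > > > The range check itself is meant to work roughly like this (a
> > > > simplified, year-only sketch of the intent, not the exact code; the
> > > > full version will be in the repo linked at the end of this mail. Note
> > > > that in a Drill UDF everything in eval() has to use fully qualified
> > > > names because the body is inlined by code generation):
> > > >
> > > > public void eval() {
> > > >   // 'out' is the @Output IntHolder; 1 means "directory is in range"
> > > >   out.value = 1;
> > > >
> > > >   java.util.Calendar cal =
> > > >       java.util.Calendar.getInstance(java.util.TimeZone.getTimeZone("UTC"));
> > > >   cal.setTimeInMillis(from.value);        // from: TimeStampHolder, epoch millis
> > > >   int fromYear = cal.get(java.util.Calendar.YEAR);
> > > >
> > > >   int toYear = Integer.MAX_VALUE;         // open-ended when 'to' is NULL
> > > >   if (to.isSet == 1) {
> > > >     cal.setTimeInMillis(to.value);
> > > >     toYear = cal.get(java.util.Calendar.YEAR);
> > > >   }
> > > >
> > > >   if (yearDir.isSet == 1) {
> > > >     // the directory name arrives as UTF-8 bytes in the holder's buffer
> > > >     byte[] bytes = new byte[yearDir.end - yearDir.start];
> > > >     yearDir.buffer.getBytes(yearDir.start, bytes);
> > > >     int year = Integer.parseInt(
> > > >         new String(bytes, java.nio.charset.StandardCharsets.UTF_8));
> > > >     if (year < fromYear || year > toYear) {
> > > >       out.value = 0;                      // directory is outside the range
> > > >     }
> > > >   }
> > > >   // monthDir / dayDir would be checked the same way, narrowing further
> > > > }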
> > > >
> > > >
> > > >
> > > > This is not working and I have several questions:
> > > >
> > > >    - Is it possible that this function is only called if/when dir0,
> > > >    dir1, and dir2 all exist?
> > > >    - If that is the case, would it not be better if these
> > > >    utility-variables existed with a null value until needed/used?
> > > >    - having a Nullable*Holder seems to play no role in this at all
> > > >
> > > >    - Is there anything I can do to have this expression evaluated
> > > >    only once per directory?
> > > >
> > > >    - Is there anything I can do to have this evaluated before the
> > > >    files in the directory are processed?
> > > >
> > > >    - How do I pass a null value into this?
> > > >    - this fails:  *dirInRange*(cast('2015-04-10' as
> > > >    timestamp),null,dir0,dir1,dir2)
> > > >    Error: PARSE ERROR: From line 1, column 82 to line 1, column 85:
> > > >    Illegal use of 'NULL'
> > > >
> > > > This will be a part of my mini UDF library (
> > > > https://github.com/acmeguy/asdrill), feel free to use/abuse it if it
> > > > helps.
> > > >
> > > > Regards,
> > > >  -Stefán
> > > >
> > > >
> > >
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
>
