Two quick notes:

- If you switch to internal null handling, you have to define separate UDFs
  for each possible combination of nullable and non-nullable values.
- isSet is an integer, so your if clause would actually be:
if (!(yearDir.isSet == 1)) {
  // yearDir is NULL, handle this here
}

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Jul 24, 2015 at 8:10 AM, Abdel Hakim Deneche <adene...@maprtech.com>
wrote:

> Hi Stefan,
>
> I think when you specify your UDF as NULL_IF_NULL it means Drill will
> handle null values automatically: if any argument passed to your UDF is
> NULL, the UDF won't be evaluated and Drill will return NULL instead.
>
> In your case your UDF needs to handle NULL values itself by setting:
>
> nulls = NullHandling.INTERNAL
>
> Then inside your UDF you can check whether a passed argument is null like
> this:
>
> if (!yearDir.isSet) {
>   // yearDir is NULL, handle this here
> }
>
> Let me know if this works for you; I haven't worked on UDFs for quite some
> time and they may have changed slightly since then.
>
> Thanks
>
>
> On Fri, Jul 24, 2015 at 7:37 AM, Stefán Baxter <ste...@activitystream.com>
> wrote:
>
> > Hi,
> >
> > I have this running now:
> >
> > "select occurred_at, dir0, dir1, dir2 from
> > dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
> > dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
> > timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-'))
> > order by occurred_at;"
> >
> > Observations:
> >
> >    - Being able to do this without using COALESCE would be more readable
> >    - This is being called for every record in every file in every
> >    directory
> >       - I would love for this expression to be evaluated only at "the
> >       start of a new directory"
> >    - This function always returns 1 and so should never match dir0
> >       - For some strange reason this query sometimes returns results even
> >       though it should never do so (dir0 != 1)
> >
> > Regards,
> > -Stefan
> >
> >
> > On Fri, Jul 24, 2015 at 12:12 PM, Stefán Baxter <ste...@activitystream.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I would like to share our intentions for organizing our data and how we
> > > plan to construct queries for it.
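The isSet convention discussed above can be shown with a small, self-contained sketch. Note the holder class here is a simplified stand-in I wrote for illustration, not Drill's actual NullableVarCharHolder (which exposes the value as a DrillBuf, not a String); only the isSet-as-int semantics are the point.

```java
// Simplified stand-in for Drill's NullableVarCharHolder: a value plus an
// isSet flag that is an int (1 = value present, 0 = NULL), not a boolean.
class NullableVarCharStub {
    int isSet;     // 1 if a value is present, 0 if the value is NULL
    String value;  // only meaningful when isSet == 1
}

public class IsSetDemo {
    // With NullHandling.INTERNAL the UDF body must test isSet itself
    // before touching the value, as in the thread above.
    static String describe(NullableVarCharStub yearDir) {
        if (!(yearDir.isSet == 1)) {
            return "NULL";  // yearDir is NULL, handle this here
        }
        return yearDir.value;
    }

    public static void main(String[] args) {
        NullableVarCharStub present = new NullableVarCharStub();
        present.isSet = 1;
        present.value = "2015";

        NullableVarCharStub missing = new NullableVarCharStub();
        missing.isSet = 0;

        System.out.println(describe(present)); // prints "2015"
        System.out.println(describe(missing)); // prints "NULL"
    }
}
```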
> > >
> > > There are four main reasons for sharing this:
> > >
> > > a) I would like to sanity check the approach
> > > b) I'm having a hard time writing a UDF to optimize this and need a bit
> > > of help
> > > c) This can perhaps benefit someone else
> > > d) Check whether a UDF like the one I'm writing already exists
> > >
> > > Relevant facts:
> > >
> > >    - We have multiple tenants and they collect variable amounts of data
> > >    (from almost none to quite a lot)
> > >    - Each tenant has multiple data-sources (aka time-series / tables)
> > >    - Data is processed to Parquet at variable intervals depending on the
> > >    tenant and his volume
> > >    - Parquet files cannot be updated/appended to in Drill (a create
> > >    produces a new directory (aka table))
> > >
> > > The structure we intend to use:
> > >
> > >    - /<root-directory>/[processed|streaming]/<tenant>/<data-source>/<year>/<month>/<date>
> > >    example: "/analytics/processed/some-tenant-name/events/2015/07/24"
> > >    - where *processed* is the sub-directory used for Parquet files and
> > >    the *streaming* sub-directory is used for "raw materials"
> > >    (json/csv/logs etc.)
> > >    This structure also allows us to have dev-nodes with a local streaming
> > >    directory and a remote (hdfs/s3) processed directory
> > >
> > > It will vary between tenants how often data is processed and how data is
> > > stored:
> > >
> > >    - small tenant:
> > >    /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
> > >    - medium tenant:
> > >    /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
> > >    - ... (you get the picture)
> > >
> > >    *Important:* this means that in some cases we will have empty or
> > >    missing sub-directories, and in some cases we will have all the data
> > >    in the root directory, but the queries we run will always be the same
> > >    ones (regardless of tenant-specific storage "layout").
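To make the variable partition depth above concrete, here is a plain-Java sketch (the helper name and depth parameter are mine, not from the thread) that maps an event date to the directory a given tenant's layout would place it in, truncating at the tenant's partition depth:

```java
import java.time.LocalDate;

public class PartitionPathDemo {
    // depth 1 = year only, 2 = year/month, 3 = year/month/day
    static String partitionDir(String root, String tenant, String source,
                               LocalDate date, int depth) {
        StringBuilder sb = new StringBuilder();
        sb.append(root).append('/').append(tenant).append('/').append(source);
        if (depth >= 1) sb.append('/').append(date.getYear());
        if (depth >= 2) sb.append('/').append(String.format("%02d", date.getMonthValue()));
        if (depth >= 3) sb.append('/').append(String.format("%02d", date.getDayOfMonth()));
        return sb.toString();
    }

    public static void main(String[] args) {
        LocalDate d = LocalDate.of(2015, 7, 24);
        // small tenant: partitioned by year only
        System.out.println(partitionDir("/analytics/processed", "tenantA", "pageviews", d, 1));
        // larger tenant: partitioned down to the day
        System.out.println(partitionDir("/analytics/processed", "some-tenant-name", "events", d, 3));
    }
}
```

The same query must tolerate all three depths, which is why the missing levels surface as NULL dir0/dir1/dir2 values downstream.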
> > >
> > > When we do selects we use *union* to query both the *processed* and
> > > *streaming* directory structures and then merge the data (if needed,
> > > for grouping etc.).
> > >
> > > This is all working just fine (so far, so good).
> > >
> > > Now I'm creating a small UDF that is supposed to help with the scanning
> > > of these directories.
> > >
> > > Requirements:
> > >
> > >    - Avoid scanning/opening files in directories that are irrelevant
> > >    - Include files in every directory of full/partial relevance
> > >    - Minimize the additional overhead of including this UDF in the SQL
> > >    command
> > >
> > > The UDF is called dirInRange and it takes these parameters:
> > >
> > >    - @Param TimeStampHolder from;
> > >    - @Param NullableTimeStampHolder to;
> > >    - @Param NullableVarCharHolder yearDir;
> > >    - @Param NullableVarCharHolder monthDir;
> > >    - @Param NullableVarCharHolder dayDir;
> > >
> > > It currently returns an int (1 if in range, and other values depending
> > > on what failed).
> > >
> > > The function is defined like this:
> > >
> > > @FunctionTemplate(name = "dirInRange", scope =
> > > FunctionTemplate.FunctionScope.SIMPLE, nulls =
> > > FunctionTemplate.NullHandling.NULL_IF_NULL)
> > >
> > > public static class dirInRange implements DrillSimpleFunc
> > >
> > > ...
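Stripped of the Drill holder machinery, the pruning test such a UDF needs can be sketched in plain Java. This is my reading of the intended semantics, not the actual UDF from the library: a missing (NULL) directory level widens the range the directory could cover, and the directory is relevant when that range overlaps [from, to].

```java
import java.time.LocalDate;

public class DirInRangeDemo {
    // Returns 1 when the directory identified by yearDir/monthDir/dayDir
    // could contain data in [from, to]; 0 otherwise. A null level means the
    // files sit higher up the tree, so the covered range widens accordingly.
    static int dirInRange(LocalDate from, LocalDate to,
                          String yearDir, String monthDir, String dayDir) {
        if (yearDir == null) return 1; // data lives in the root: cannot prune
        int year = Integer.parseInt(yearDir);
        LocalDate start, end;
        if (monthDir == null) {          // year-level directory
            start = LocalDate.of(year, 1, 1);
            end = LocalDate.of(year, 12, 31);
        } else if (dayDir == null) {     // month-level directory
            start = LocalDate.of(year, Integer.parseInt(monthDir), 1);
            end = start.withDayOfMonth(start.lengthOfMonth());
        } else {                         // day-level directory
            start = LocalDate.of(year, Integer.parseInt(monthDir),
                                 Integer.parseInt(dayDir));
            end = start;
        }
        // Overlap test: irrelevant only if the directory ends before `from`
        // or starts after `to`.
        return (end.isBefore(from) || start.isAfter(to)) ? 0 : 1;
    }

    public static void main(String[] args) {
        LocalDate from = LocalDate.of(2015, 4, 10);
        LocalDate to = LocalDate.of(2015, 7, 11);
        System.out.println(dirInRange(from, to, "2015", "06", null)); // prints 1
        System.out.println(dirInRange(from, to, "2014", null, null)); // prints 0
        System.out.println(dirInRange(from, to, null, null, null));   // prints 1
    }
}
```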
> > > and it's called like this:
> > >
> > > "select occurred_at, dir0, dir1, dir2 from
> > > dfs.tmp.`/analytics/processed/test/events` as t where
> > > *dirInRange*(cast('2015-04-10' as timestamp),cast('2015-07-11' as
> > > timestamp),dir0,dir1,dir2) = 1 order by occurred_at;"
> > >
> > > This is not working and I have several questions:
> > >
> > >    - Is it possible that this function is only called if/when dir0, dir1,
> > >    and dir2 all exist?
> > >       - If that is the case, would it not be better if these
> > >       utility-variables existed with a null value until needed/used?
> > >       - Having a Nullable*Holder seems to play no role in this at all
> > >    - Is there anything I can do to have this expression evaluated only
> > >    once per directory?
> > >    - Is there anything I can do to have this evaluated before the files
> > >    in the directory are processed?
> > >    - How do I pass a null value into this?
> > >       - this fails: *dirInRange*(cast('2015-04-10' as
> > >       timestamp),null,dir0,dir1,dir2)
> > >       Error: PARSE ERROR: From line 1, column 82 to line 1, column 85:
> > >       Illegal use of 'NULL'
> > >
> > > This will be a part of my mini UDF library (
> > > https://github.com/acmeguy/asdrill), feel free to use/abuse it if it
> > > helps.
> > >
> > > Regards,
> > > -Stefán
> > >
> >
> >
> > --
> > Abdelhakim Deneche
> > Software Engineer
> > <http://www.mapr.com/>
>
> Now Available - Free Hadoop On-Demand Training
> <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>