Hi Stefán,

I think when you declare your UDF with NULL_IF_NULL, Drill handles null
values automatically: if any argument passed to your UDF is NULL, the UDF
is not evaluated at all and Drill returns NULL instead.

In your case your UDF needs to handle NULL values itself, by setting:

nulls = FunctionTemplate.NullHandling.INTERNAL

Then inside your UDF you can check if a passed argument is null like this:

if (yearDir.isSet == 0) {
  // yearDir is NULL, handle this here
}

(Note that isSet is an int field on the nullable holders, not a boolean,
so you compare it against 0 rather than negating it.)
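To make the difference concrete, here is a minimal, self-contained sketch of the idea. It uses a stand-in holder class and a made-up return convention, not the real Drill API (in Drill the holder's isSet field is likewise an int set by the framework):

```java
// Sketch of INTERNAL null handling, using a stand-in for Drill's
// NullableVarCharHolder (NOT the real Drill API).
public class NullHandlingSketch {

    // In Drill's generated code, isSet is an int: 0 means SQL NULL.
    static class NullableHolder {
        final int isSet;
        final String value;
        NullableHolder(String v) {
            this.isSet = (v == null) ? 0 : 1;
            this.value = v;
        }
    }

    // With nulls = NullHandling.INTERNAL the function body runs even
    // when an argument is NULL, so it must branch on isSet explicitly.
    static int dirInRange(NullableHolder yearDir) {
        if (yearDir.isSet == 0) {
            return -1; // yearDir is NULL; decide what that means here
        }
        return 1; // pretend the directory is in range
    }

    public static void main(String[] args) {
        System.out.println(dirInRange(new NullableHolder(null)));   // prints -1
        System.out.println(dirInRange(new NullableHolder("2015"))); // prints 1
    }
}
```

With NULL_IF_NULL the first call would never reach your code at all; with INTERNAL it does, which is what you want when a missing directory level is meaningful.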

Let me know if this works for you; I haven't worked on UDFs for quite some
time and they may have changed slightly since then.

Thanks


On Fri, Jul 24, 2015 at 7:37 AM, Stefán Baxter <ste...@activitystream.com>
wrote:

> Hi,
>
> I have this running now:
>
> "select occurred_at, dir0, dir1, dir2 from
> dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
> dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
> timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-')) order
> by occurred_at;"
>
>
> Observations:
>
>    - Being able to do this without using COALESCE would be more readable
>
>    - This is being called for every record for every file in every
> directory
>    - I would love for this expression to be evaluated only once, at the
>    start of each new directory
>
>    - This function always returns 1 and should never match dir0
>    - For some strange reason this query sometimes returns results even
>    though it should never do that (dir0 != 1)
>
> Regards,
>  -Stefan
>
>
> On Fri, Jul 24, 2015 at 12:12 PM, Stefán Baxter <ste...@activitystream.com
> >
> wrote:
>
> > Hi,
> >
> > I would like to share our intentions for organizing our data and how we
> > plan to construct queries for it.
> >
> > There are four main reasons for sharing this:
> >
> > a) I would like to sanity check the approach
> > b) I'm having a hard time writing a UDF to optimize this and need a bit
> > of help.
> >
> > c) This can perhaps benefit someone else
> >
> > d) Check if a UDF like the one I'm writing already exists
> >
> >
> > Relevant facts:
> >
> >    - We have multiple tenants and they collect a variable amount of data
> >    (from almost none to quite a lot)
> >    - Each tenant has multiple data-sources (aka time-series / tables)
> >    - Data is processed to Parquet at variable intervals depending on the
> >    tenant and their volume
> >    - Parquet files cannot be updated/appended to in Drill (Create from
> >    creates a new directory (aka table))
> >
> > The structure we intend to use:
> >
> >    - / <root-directory> / [processed|streaming] / <tenant> /
> >    <data-source> / <year> / <month> / <date>
> >    example: "/analytics/processed/some-tenant-name/events/2015/07/24"
> >
> >    - where *processed* is the sub-directory used for Parquet files and
> >    the *streaming* sub-directory is used for "raw materials"
> >    (json/csv/logs etc.)
> >    this structure also allows us to have dev nodes with a local streaming
> >    directory and a remote (hdfs/s3) processed directory
> >
> > It will vary between tenants how often data is processed and how data is
> > stored:
> >
> >    - small tenant:
> >    /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
> >    - medium tenant:
> >    /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
> >    - ... (you get the picture)
> >
> >    *Important:* this means that in some cases we will have empty or
> >    missing sub-directories and in some cases we will have all the data in
> >    the root directory, but the queries we run will always be the same ones
> >    (regardless of tenant-specific storage "layout").
> >
> > When we do selects we use *union* to query both the *processed* and
> > *streaming* directory structures and then merge the data (if needed (for
> > grouping etc.)).
> >
> > This is all working just fine (so far, so good).
> >
> >
> > Now I'm creating a small UDF that is supposed to help with the scanning
> > of these directories.
> >
> > Requirements:
> >
> >    - Avoid scanning/opening files in directories that are irrelevant
> >    - Include files in every directory of full/partial relevance
> >    - Minimize the additional overhead of including this UDF in the SQL
> >    command.
> >
> >
> > The UDF is called dirInRange and it takes these parameters:
> >
> >    - @Param TimeStampHolder from;
> >    - @Param NullableTimeStampHolder to;
> >    - @Param NullableVarCharHolder yearDir;
> >    - @Param NullableVarCharHolder monthDir;
> >    - @Param NullableVarCharHolder dayDir;
> >
> > It currently returns an int (1 if in range and other values depending on
> > what failed)
> >
> > The function is defined like this:
> >
> > @FunctionTemplate(name = "dirInRange",
> >     scope = FunctionTemplate.FunctionScope.SIMPLE,
> >     nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
> >
> > public static class dirInRange implements DrillSimpleFunc
> >
> > ... and it's called like this:
> >
> > "select occurred_at, dir0, dir1, dir2 from
> > dfs.tmp.`/analytics/processed/test/events` as t where
> > *dirInRange*(cast('2015-04-10' as timestamp),cast('2015-07-11' as
> > timestamp),dir0,dir1,dir2) = 1 order by occurred_at;"
> >
> >
> >
> >
> > This is not working and I have several questions:
> >
> >    - Is it possible that this function is only called if/when dir0, dir1,
> >    and dir2 all exist?
> >    - If that is the case, would it not be better if these
> >    utility variables existed with a null value until needed/used?
> >    - having a Nullable*Holder seems to play no role in this at all
> >
> >    - Is there anything I can do to have this expression evaluated only
> >    once per directory?
> >
> >    - Is there anything I can do to have this evaluated before the files
> >    in the directory are processed?
> >
> >    - How do I pass a null value into this?
> >    - this fails:  *dirInRange*(cast('2015-04-10' as
> >    timestamp),null,dir0,dir1,dir2)
> >    Error: PARSE ERROR: From line 1, column 82 to line 1, column 85:
> >    Illegal use of 'NULL'
> >
> > This will be a part of my mini UDF library (
> > https://github.com/acmeguy/asdrill), feel free to use/abuse it if it
> > helps.
> >
> > Regards,
> >  -Stefán
> >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>

