Two quick notes:

- If you switch to internal null handling, you have to define separate UDFs
  for each possible combination of nullable and non-nullable values.
- isSet is an integer, so your if clause would actually be:
if (!(yearDir.isSet == 1)) {
  // yearDir is NULL, handle this here
}

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Jul 24, 2015 at 8:10 AM, Abdel Hakim Deneche <adene...@maprtech.com>
wrote:

> Hi Stefan,
>
> I think when you specify your UDF as NULL_IF_NULL it means Drill will
> handle null values automatically: if any argument passed to your UDF is
> NULL, the UDF won't be evaluated and Drill will return NULL instead.
>
> In your case your UDF needs to handle NULL values itself by setting:
>
> nulls = NullHandling.INTERNAL
>
> Then inside your UDF you can check whether a passed argument is null like
> this:
>
> if (!yearDir.isSet) {
>   // yearDir is NULL, handle this here
> }
>
> Let me know if this works for you; I haven't worked on UDFs for quite some
> time and they may have changed slightly since then.
>
> Thanks
>
>
> On Fri, Jul 24, 2015 at 7:37 AM, Stefán Baxter <ste...@activitystream.com>
> wrote:
>
> > Hi,
> >
> > I have this running now:
> >
> > "select occurred_at, dir0, dir1, dir2 from
> > dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
> > dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
> > timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-'))
> > order by occurred_at;"
> >
> > Observations:
> >
> >    - Being able to do this without using COALESCE would be more readable
> >    - This is being called for every record in every file in every
> >    directory
> >       - I would love for this expression to be evaluated only at "the
> >       start of a new directory"
> >    - This function always returns 1 and so should never match dir0
> >       - For some strange reason this query sometimes returns results even
> >       though it should never do so (dir0 != 1)
> >
> > Regards,
> > -Stefan
> >
> >
> > On Fri, Jul 24, 2015 at 12:12 PM, Stefán Baxter <ste...@activitystream.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I would like to share our intentions for organizing our data and how we
> > > plan to construct queries for it.
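The isSet convention discussed above can be shown with a small, self-contained sketch. Note the holder class here is a simplified stand-in I wrote for illustration, not Drill's actual NullableVarCharHolder (which exposes the value as a DrillBuf, not a String); only the isSet-as-int semantics are the point.

```java
// Simplified stand-in for Drill's NullableVarCharHolder: a value plus an
// isSet flag that is an int (1 = value present, 0 = NULL), not a boolean.
class NullableVarCharStub {
    int isSet;     // 1 if a value is present, 0 if the value is NULL
    String value;  // only meaningful when isSet == 1
}

public class IsSetDemo {
    // With NullHandling.INTERNAL the UDF body must test isSet itself
    // before touching the value, as in the thread above.
    static String describe(NullableVarCharStub yearDir) {
        if (!(yearDir.isSet == 1)) {
            return "NULL";  // yearDir is NULL, handle this here
        }
        return yearDir.value;
    }

    public static void main(String[] args) {
        NullableVarCharStub present = new NullableVarCharStub();
        present.isSet = 1;
        present.value = "2015";

        NullableVarCharStub missing = new NullableVarCharStub();
        missing.isSet = 0;

        System.out.println(describe(present)); // prints "2015"
        System.out.println(describe(missing)); // prints "NULL"
    }
}
```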
> > >
> > > There are four main reasons for sharing this:
> > >
> > > a) I would like to sanity check the approach
> > > b) I'm having a hard time writing a UDF to optimize this and need a bit
> > > of help
> > > c) This can perhaps benefit someone else
> > > d) Check whether a UDF like the one I'm writing already exists
> > >
> > > Relevant facts:
> > >
> > >    - We have multiple tenants and they collect variable amounts of data
> > >    (from almost none to quite a lot)
> > >    - Each tenant has multiple data-sources (aka time-series / tables)
> > >    - Data is processed to Parquet at variable intervals depending on the
> > >    tenant and his volume
> > >    - Parquet files cannot be updated/appended to in Drill (a create
> > >    produces a new directory (aka table))
> > >
> > > The structure we intend to use:
> > >
> > >    - /<root-directory>/[processed|streaming]/<tenant>/<data-source>/<year>/<month>/<date>
> > >    example: "/analytics/processed/some-tenant-name/events/2015/07/24"
> > >    - where *processed* is the sub-directory used for Parquet files and
> > >    the *streaming* sub-directory is used for "raw materials"
> > >    (json/csv/logs etc.)
> > >    This structure also allows us to have dev-nodes with a local streaming
> > >    directory and a remote (hdfs/s3) processed directory
> > >
> > > It will vary between tenants how often data is processed and how data is
> > > stored:
> > >
> > >    - small tenant:
> > >    /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
> > >    - medium tenant:
> > >    /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
> > >    - ... (you get the picture)
> > >
> > >    *Important:* this means that in some cases we will have empty or
> > >    missing sub-directories, and in some cases we will have all the data
> > >    in the root directory, but the queries we run will always be the same
> > >    ones (regardless of tenant-specific storage "layout").
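To make the variable partition depth above concrete, here is a plain-Java sketch (the helper name and depth parameter are mine, not from the thread) that maps an event date to the directory a given tenant's layout would place it in, truncating at the tenant's partition depth:

```java
import java.time.LocalDate;

public class PartitionPathDemo {
    // depth 1 = year only, 2 = year/month, 3 = year/month/day
    static String partitionDir(String root, String tenant, String source,
                               LocalDate date, int depth) {
        StringBuilder sb = new StringBuilder();
        sb.append(root).append('/').append(tenant).append('/').append(source);
        if (depth >= 1) sb.append('/').append(date.getYear());
        if (depth >= 2) sb.append('/').append(String.format("%02d", date.getMonthValue()));
        if (depth >= 3) sb.append('/').append(String.format("%02d", date.getDayOfMonth()));
        return sb.toString();
    }

    public static void main(String[] args) {
        LocalDate d = LocalDate.of(2015, 7, 24);
        // small tenant: partitioned by year only
        System.out.println(partitionDir("/analytics/processed", "tenantA", "pageviews", d, 1));
        // larger tenant: partitioned down to the day
        System.out.println(partitionDir("/analytics/processed", "some-tenant-name", "events", d, 3));
    }
}
```

The same query must tolerate all three depths, which is why the missing levels surface as NULL dir0/dir1/dir2 values downstream.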
> > >
> > > When we do selects we use *union* to query both the *processed* and
> > > *streaming* directory structures and then merge the data (if needed,
> > > for grouping etc.).
> > >
> > > This is all working just fine (so far, so good).
> > >
> > > Now I'm creating a small UDF that is supposed to help with the scanning
> > > of these directories.
> > >
> > > Requirements:
> > >
> > >    - Avoid scanning/opening files in directories that are irrelevant
> > >    - Include files in every directory of full/partial relevance
> > >    - Minimize the additional overhead of including this UDF in the SQL
> > >    command
> > >
> > > The UDF is called dirInRange and it takes these parameters:
> > >
> > >    - @Param TimeStampHolder from;
> > >    - @Param NullableTimeStampHolder to;
> > >    - @Param NullableVarCharHolder yearDir;
> > >    - @Param NullableVarCharHolder monthDir;
> > >    - @Param NullableVarCharHolder dayDir;
> > >
> > > It currently returns an int (1 if in range, and other values depending
> > > on what failed).
> > >
> > > The function is defined like this:
> > >
> > > @FunctionTemplate(name = "dirInRange", scope =
> > > FunctionTemplate.FunctionScope.SIMPLE, nulls =
> > > FunctionTemplate.NullHandling.NULL_IF_NULL)
> > >
> > > public static class dirInRange implements DrillSimpleFunc
> > >
> > > ...
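Stripped of the Drill holder machinery, the pruning test such a UDF needs can be sketched in plain Java. This is my reading of the intended semantics, not the actual UDF from the library: a missing (NULL) directory level widens the range the directory could cover, and the directory is relevant when that range overlaps [from, to].

```java
import java.time.LocalDate;

public class DirInRangeDemo {
    // Returns 1 when the directory identified by yearDir/monthDir/dayDir
    // could contain data in [from, to]; 0 otherwise. A null level means the
    // files sit higher up the tree, so the covered range widens accordingly.
    static int dirInRange(LocalDate from, LocalDate to,
                          String yearDir, String monthDir, String dayDir) {
        if (yearDir == null) return 1; // data lives in the root: cannot prune
        int year = Integer.parseInt(yearDir);
        LocalDate start, end;
        if (monthDir == null) {          // year-level directory
            start = LocalDate.of(year, 1, 1);
            end = LocalDate.of(year, 12, 31);
        } else if (dayDir == null) {     // month-level directory
            start = LocalDate.of(year, Integer.parseInt(monthDir), 1);
            end = start.withDayOfMonth(start.lengthOfMonth());
        } else {                         // day-level directory
            start = LocalDate.of(year, Integer.parseInt(monthDir),
                                 Integer.parseInt(dayDir));
            end = start;
        }
        // Overlap test: irrelevant only if the directory ends before `from`
        // or starts after `to`.
        return (end.isBefore(from) || start.isAfter(to)) ? 0 : 1;
    }

    public static void main(String[] args) {
        LocalDate from = LocalDate.of(2015, 4, 10);
        LocalDate to = LocalDate.of(2015, 7, 11);
        System.out.println(dirInRange(from, to, "2015", "06", null)); // prints 1
        System.out.println(dirInRange(from, to, "2014", null, null)); // prints 0
        System.out.println(dirInRange(from, to, null, null, null));   // prints 1
    }
}
```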
> > > and it's called like this:
> > >
> > > "select occurred_at, dir0, dir1, dir2 from
> > > dfs.tmp.`/analytics/processed/test/events` as t where
> > > *dirInRange*(cast('2015-04-10' as timestamp),cast('2015-07-11' as
> > > timestamp),dir0,dir1,dir2) = 1 order by occurred_at;"
> > >
> > > This is not working and I have several questions:
> > >
> > >    - Is it possible that this function is only called if/when dir0, dir1,
> > >    and dir2 all exist?
> > >       - If that is the case, would it not be better if these
> > >       utility-variables existed with a null value until needed/used?
> > >       - Having a Nullable*Holder seems to play no role in this at all
> > >    - Is there anything I can do to have this expression evaluated only
> > >    once per directory?
> > >    - Is there anything I can do to have this evaluated before the files
> > >    in the directory are processed?
> > >    - How do I pass a null value into this?
> > >       - this fails: *dirInRange*(cast('2015-04-10' as
> > >       timestamp),null,dir0,dir1,dir2)
> > >       Error: PARSE ERROR: From line 1, column 82 to line 1, column 85:
> > >       Illegal use of 'NULL'
> > >
> > > This will be a part of my mini UDF library (
> > > https://github.com/acmeguy/asdrill), feel free to use/abuse it if it
> > > helps.
> > >
> > > Regards,
> > > -Stefán
> > >
> >
> >
> > --
> > Abdelhakim Deneche
> > Software Engineer
> > <http://www.mapr.com/>
>
> Now Available - Free Hadoop On-Demand Training
> <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>