Hi,
I have this running now:
"select occurred_at, dir0, dir1, dir2 from
dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-')) order
by occurred_at;"
Observations:
- Being able to do this without using COALESCE would be more readable
- The function is being called for every record of every file in every directory
- I would love for this expression to be evaluated only once, at the start of
each new directory
- This function always returns 1 and should therefore never match dir0
- For some strange reason this query sometimes returns results even
though it never should (dir0 != 1)
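For reference, here is a minimal standalone sketch (plain Java, not the actual UDF code; the class and method names are illustrative) of the check I want dirInRange to perform: a directory is "in range" when the interval it covers (a whole year, month, or day, depending on how many partition levels are present) overlaps [from, to], with missing levels passed as "-" per the COALESCE trick above:

```java
import java.time.LocalDate;

// Illustrative sketch only: mirrors the intended dirInRange semantics,
// not the Drill holder-based UDF implementation.
public class DirRange {
    public static boolean dirInRange(LocalDate from, LocalDate to,
                                     String yearDir, String monthDir, String dayDir) {
        if ("-".equals(yearDir)) {
            return true; // no partition info at all: the directory must be scanned
        }
        int year = Integer.parseInt(yearDir);
        LocalDate lo, hi; // the date interval this directory covers
        if ("-".equals(monthDir)) {
            lo = LocalDate.of(year, 1, 1);          // whole year
            hi = lo.plusYears(1).minusDays(1);
        } else if ("-".equals(dayDir)) {
            lo = LocalDate.of(year, Integer.parseInt(monthDir), 1); // whole month
            hi = lo.plusMonths(1).minusDays(1);
        } else {
            lo = LocalDate.of(year, Integer.parseInt(monthDir),
                              Integer.parseInt(dayDir));            // single day
            hi = lo;
        }
        // overlap test: directory interval intersects [from, to]
        return !hi.isBefore(from) && !lo.isAfter(to);
    }
}
```

In the real UDF the same comparison would run inside eval() against the holder values and return an int instead of a boolean.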
Regards,
-Stefan
On Fri, Jul 24, 2015 at 12:12 PM, Stefán Baxter <[email protected]>
wrote:
> Hi,
>
> I would like to share our intentions for organizing our data and how we
> plan to construct queries for it.
>
> There are four main reasons for sharing this:
>
> a) I would like to sanity-check the approach
> b) I'm having a hard time writing a UDF to optimize this and need a bit
> of help
> c) This can perhaps benefit someone else
> d) Check whether a UDF like the one I'm writing already exists
>
>
> Relevant facts:
>
> - We have multiple tenants and they collect variable amounts of data
> (from almost none to quite a lot)
> - Each tenant has multiple data-sources (aka time-series / tables)
> - Data is processed to Parquet at variable intervals depending on the
> tenant and their volume
> - Parquet files cannot be updated/appended to in Drill (CREATE TABLE AS
> creates a new directory (aka table))
>
> The structure we intend to use:
>
> - / <root-directory> / [processed|streaming] / <tenant> /
> <data-source> / <year> / <month> / <date>
> example: "/analytics/processed/some-tenant-name/events/2015/07/24"
>
> - where *processed* is the sub-directory used for Parquet files and
> the *streaming* sub-directory is used for "raw materials"
> (json/csv/logs etc.)
> This structure also allows us to have dev nodes with a local streaming
> directory and a remote (hdfs/s3) processed directory
>
> It will vary between tenants how often data is processed and how it is
> stored:
>
> - small tenant:
> /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
> - medium tenant:
>
> /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
> - ... (you get the picture)
>
> *Important:* this means that in some cases we will have empty or
> missing sub-directories, and in some cases all the data will be in the
> root directory, but the queries we run will always be the same
> (regardless of a tenant's specific storage "layout").
>
> When we do selects we use *union* to query both the *processed* and
> *streaming* directory structures, and then merge the data (if needed,
> e.g. for grouping).
>
> This is all working just fine (so far, so good).
>
>
> Now I'm creating a small UDF that is supposed to help with the scanning of
> these directories.
>
> Requirements:
>
> - Avoid scanning/opening files in directories that are irrelevant
> - Include files in every directory of full/partial relevance
> - Minimize the additional overhead of including this UDF in the SQL
> command
>
>
> The UDF is called dirInRange and it takes these parameters:
>
> - @Param TimeStampHolder from;
> - @Param NullableTimeStampHolder to;
> - @Param NullableVarCharHolder yearDir;
> - @Param NullableVarCharHolder monthDir;
> - @Param NullableVarCharHolder dayDir;
>
> It currently returns an int (1 if the directory is in range, other
> values indicating what failed)
>
> The function is defined like this:
>
> @FunctionTemplate(name = "dirInRange",
>     scope = FunctionTemplate.FunctionScope.SIMPLE,
>     nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
> public static class dirInRange implements DrillSimpleFunc
>
> ... and it is called like this:
>
> "select occurred_at, dir0, dir1, dir2 from
> dfs.tmp.`/analytics/processed/test/events` as t where
> *dirInRange*(cast('2015-04-10'
> as timestamp),cast('2015-07-11' as timestamp),dir0,dir1,dir2) = 1 order by
> occurred_at;"
>
> This is not working and I have several questions:
>
> - Is it possible that this function is only called if/when dir0, dir1,
> and dir2 all exist?
> - If that is the case, would it not be better for these
> utility variables to exist with a null value until needed/used?
> - Having a Nullable*Holder seems to play no role in this at all
>
> - Is there anything I can do to have this expression evaluated only
> once per directory?
>
> - Is there anything I can do to have this evaluated before the files
> in the directory are processed?
>
> - How do I pass a null value into this?
> - this fails: *dirInRange*(cast('2015-04-10' as
> timestamp),null,dir0,dir1,dir2)
> Error: PARSE ERROR: From line 1, column 82 to line 1, column 85:
> Illegal use of 'NULL'
>
> This will be a part of my mini UDF library (
> https://github.com/acmeguy/asdrill), feel free to use/abuse it if it
> helps.
>
> Regards,
> -Stefán
>
>