storage structure - querying directories - sanity check and UDF assistance

Stefán Baxter Fri, 24 Jul 2015 05:14:16 -0700

Hi,

I would like to share our intentions for organizing our data and how we
plan to construct queries for it.


There are four main reasons for sharing this:

a) I would like to sanity check the approach
b) I'm having a hard time writing a UDF to optimize this and need a bit of
help.

c) This can perhaps benefit someone else

d) Check if UDF like the one I'm writing already exists


Relevant facts:

   - We have multiple tenants and they collect variable amount of data
   (from almost non to quite a lot)
   - Each tenant has multiple data-sources (aka time-series / tables)
   - Data is processed to Parquet at variable intervals depending on the
   tenant and his volume
   - Parquet files can not be updated/appended to in drill (Create from
   creates a new directory (aka table))

The structure we intent to use

   - / <root-directory> / [processed|streaming] / <tenant> / <data-source>
   / <year> / <month> / <date>
   example: "/analytics/processed/some-tenant-name/events/2015/07/24"

   - where *processed* is the sub-directory used for parquet files and
the *streaming
   sub-directory is used for "raw materials" (json/csv/logs etc.)*
   this structure also allows us to have dev-nodes with local streaming
   directory and remote (hdfs/s3) processed directory

It will vary between tenant how often data is processed and how data is
stored:

   - small tenant:
    /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
   - medium tenant:
    /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
   - ... (you get the picture)

   *Important:* this means that in some cases we will have empty or missing
   sub-directories and in some cases we will have all the data in the root
   directory but the queries we run will always be the same ones (regardless
   of tenant specific storage "layout").

When we do selects we use *union *to query both *processed* and *streaming
directory structured *and then merge the data (if needed (for grouping
etc)).

This is all working just fine (so far, so good).


Now I'm creating a small UDF that is supposed to help with the scanning of
these directories.

Requirements:

   - Avoid scanning/opening files in directories that are irrelevant
   - Include files in every directory of full/partial relevance
   - minimize the additional overhead of including this UDF in the SQL
   command.


The UDF is called dirInRange and it takes this parameters:

   - @Param TimeStampHolder from;
   - @Param NullableTimeStampHolder to;
   - @Param NullableVarCharHolder yearDir;
   - @Param NullableVarCharHolder monthDir;
   - @Param NullableVarCharHolder dayDir;

It currently returns a int (1 if in range and other values depending on
what failed)

The function is defined like this

@FunctionTemplate(name = "dirInRange", scope =
FunctionTemplate.FunctionScope.SIMPLE, nulls =
FunctionTemplate.NullHandling.NULL_IF_NULL)

public static class dirInRange implements DrillSimpleFunc

... and its called like this:

"select occurred_at, dir0, dir1, dir2 from
dfs.tmp.`/analytics/processed/test/events` as t where
*dirInRange*(cast('2015-04-10'
as timestamp),cast('2015-07-11' as timestamp),dir0,dir1,dir2) = 1 order by
occurred_at;"




This it not working and I have several questions:

   - Is it possible that this function is only called if/when dir0, dir1,
   and dir2 all exists
   - If that is the case would it not be better that these
   utility-variables existed with null value until needed/used
   - having a Nullable*Holder seems to play no role in this at all

   - Is there anything I can do to have this expression evaluated only once
   per directory?

   - Is there anything I can do to have this evaluated before the files in
   the directory are processed?

   - How do I pass null value into this?
   - this fails:  *dirInRange*(cast('2015-04-10' as
   timestamp),null,dir0,dir1,dir2)
   Error: PARSE ERROR: From line 1, column 82 to line 1, column 85: Illegal
   use of 'NULL'

This will be a part of my mini UDF library (
https://github.com/acmeguy/asdrill), feel free to use/abuse it if it helps.

Regards,
 -Stefán

storage structure - querying directories - sanity check and UDF assistance

Reply via email to