Hi,

I have this running now:

"select occurred_at, dir0, dir1, dir2 from
dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-')) order
by occurred_at;"


Observations:

   - It would be more readable to be able to do this without COALESCE

   - This is being called for every record in every file in every directory
   - I would love for this expression to be evaluated only once, at the
   start of each new directory

   - This function always returns 1, so it should never match dir0
   - For some strange reason this query sometimes returns results even
   though it never should (dir0 != 1)
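For context, the check I'm trying to express boils down to something like this plain-Java sketch (names and structure are hypothetical, with all the Drill holder plumbing stripped out; a missing directory level is treated as a match so partially-partitioned tenants still qualify):

```java
import java.time.LocalDate;
import java.time.YearMonth;

public class DirRange {
    // Hypothetical core of dirInRange, minus the Drill holder plumbing.
    // Returns true when the directory identified by yearDir/monthDir/dayDir
    // could contain data in the range [from, to]. A null level means the
    // tenant is not partitioned at that depth, so the directory's possible
    // span widens instead of being excluded.
    public static boolean dirInRange(LocalDate from, LocalDate to,
                                     String yearDir, String monthDir, String dayDir) {
        if (yearDir == null) {
            return true; // no year partitioning: nothing can be excluded
        }
        int year = Integer.parseInt(yearDir);
        if (monthDir == null) {
            // the whole year sits in one directory
            return !(LocalDate.of(year, 12, 31).isBefore(from)
                  || LocalDate.of(year, 1, 1).isAfter(to));
        }
        YearMonth ym = YearMonth.of(year, Integer.parseInt(monthDir));
        if (dayDir == null) {
            // the whole month sits in one directory
            return !(ym.atEndOfMonth().isBefore(from) || ym.atDay(1).isAfter(to));
        }
        LocalDate day = ym.atDay(Integer.parseInt(dayDir));
        return !day.isBefore(from) && !day.isAfter(to);
    }
}
```

If I understand Drill's null handling correctly, NullHandling.NULL_IF_NULL tells the engine to skip the UDF and return NULL whenever any argument is NULL, which would explain both why the Nullable*Holder parameters never see a null and why the COALESCE wrapping is needed at all; NullHandling.INTERNAL should let the function receive and handle nulls itself.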

Regards,
 -Stefan


On Fri, Jul 24, 2015 at 12:12 PM, Stefán Baxter <[email protected]>
wrote:

> Hi,
>
> I would like to share our intentions for organizing our data and how we
> plan to construct queries for it.
>
> There are four main reasons for sharing this:
>
> a) I would like to sanity check the approach
> b) I'm having a hard time writing a UDF to optimize this and need a bit of
> help.
>
> c) This can perhaps benefit someone else
>
> d) Check if UDF like the one I'm writing already exists
>
>
> Relevant facts:
>
>    - We have multiple tenants and they collect variable amounts of data
>    (from almost none to quite a lot)
>    - Each tenant has multiple data-sources (aka time-series / tables)
>    - Data is processed to Parquet at variable intervals depending on the
>    tenant and his volume
>    - Parquet files cannot be updated/appended to in Drill (CREATE TABLE
>    AS creates a new directory (aka table))
>
> The structure we intend to use:
>
>    - / <root-directory> / [processed|streaming] / <tenant> /
>    <data-source> / <year> / <month> / <date>
>    example: "/analytics/processed/some-tenant-name/events/2015/07/24"
>
>    - where *processed* is the sub-directory used for Parquet files and
>    the *streaming* sub-directory is used for "raw materials"
>    (json/csv/logs etc.)
>    this structure also allows us to have dev-nodes with local streaming
>    directory and remote (hdfs/s3) processed directory
>
> How often data is processed, and how it is stored, will vary between
> tenants:
>
>    - small tenant:
>     /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
>    - medium tenant:
>     
> /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
>    - ... (you get the picture)
>
>    *Important:* this means that in some cases we will have empty or
>    missing sub-directories and in some cases we will have all the data in the
>    root directory but the queries we run will always be the same ones
>    (regardless of tenant specific storage "layout").
>
> When we do selects we use *union* to query both the *processed* and
> *streaming* directory structures and then merge the data (if needed, for
> grouping etc.).
>
> This is all working just fine (so far, so good).
>
>
> Now I'm creating a small UDF that is supposed to help with the scanning of
> these directories.
>
> Requirements:
>
>    - Avoid scanning/opening files in directories that are irrelevant
>    - Include files in every directory of full/partial relevance
>    - minimize the additional overhead of including this UDF in the SQL
>    command.
>
>
> The UDF is called dirInRange and it takes these parameters:
>
>    - @Param TimeStampHolder from;
>    - @Param NullableTimeStampHolder to;
>    - @Param NullableVarCharHolder yearDir;
>    - @Param NullableVarCharHolder monthDir;
>    - @Param NullableVarCharHolder dayDir;
>
> It currently returns an int (1 if in range and other values depending on
> what failed).
>
> The function is defined like this:
>
> @FunctionTemplate(name = "dirInRange", scope = 
> FunctionTemplate.FunctionScope.SIMPLE, nulls = 
> FunctionTemplate.NullHandling.NULL_IF_NULL)
>
> public static class dirInRange implements DrillSimpleFunc
>
> ... and it's called like this:
>
> "select occurred_at, dir0, dir1, dir2 from
> dfs.tmp.`/analytics/processed/test/events` as t where 
> *dirInRange*(cast('2015-04-10'
> as timestamp),cast('2015-07-11' as timestamp),dir0,dir1,dir2) = 1 order by
> occurred_at;"
>
>
>
>
> This is not working and I have several questions:
>
>    - Is it possible that this function is only called if/when dir0, dir1,
>    and dir2 all exist?
>    - If that is the case, would it not be better for these
>    utility-variables to exist with a null value until needed/used?
>    - Having a Nullable*Holder seems to play no role in this at all
>
>    - Is there anything I can do to have this expression evaluated only
>    once per directory?
>
>    - Is there anything I can do to have this evaluated before the files
>    in the directory are processed?
>
>    - How do I pass null value into this?
>    - this fails:  *dirInRange*(cast('2015-04-10' as
>    timestamp),null,dir0,dir1,dir2)
>    Error: PARSE ERROR: From line 1, column 82 to line 1, column 85:
>    Illegal use of 'NULL'
>
> This will be a part of my mini UDF library (
> https://github.com/acmeguy/asdrill), feel free to use/abuse it if it
> helps.
>
> Regards,
>  -Stefán
>
>
