I think that constant reduction isn't entirely working in the presence of
joins.  To test this, I removed the isRandom annotation from my random
number generator.
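For context, Drill marks non-deterministic functions with the isRandom flag on the @FunctionTemplate annotation, which is what normally stops constant reduction from folding them. A minimal sketch of such a UDF follows; the class name, function name, and parameters here are illustrative, not my actual generator (note that the body uses fully-qualified names, since Drill inlines UDF source into generated code):

```java
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.annotations.Workspace;
import org.apache.drill.exec.expr.holders.Float8Holder;

// Sketch of a random(min, max) UDF. With isRandom = true the planner
// treats the function as non-deterministic and will not constant-fold
// it; removing the flag (as in the test below) lets constant reduction
// collapse random(1, 3) to a single value for all rows.
@FunctionTemplate(name = "random",
                  scope = FunctionTemplate.FunctionScope.SIMPLE,
                  nulls = FunctionTemplate.NullHandling.NULL_IF_NULL,
                  isRandom = true)
public class RandomRange implements DrillSimpleFunc {

  @Param Float8Holder min;
  @Param Float8Holder max;
  @Workspace java.util.Random rng;
  @Output Float8Holder out;

  public void setup() {
    rng = new java.util.Random();
  }

  public void eval() {
    out.value = min.value + rng.nextDouble() * (max.value - min.value);
  }
}
```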

You can see constant reduction working if I give a literal number:

0: jdbc:drill:zk=local> select b.x,a.y,random(1, 3) from (values
> (1,1),(1,2),(1,3)) a(x,y) join (values 1) b(x) on a.x = b.x;
> called with 1.0,3.0
> called with 1.0,3.0
> +----+----+---------------------+
> | x  | y  |       EXPR$2        |
> +----+----+---------------------+
> | 1  | 1  | 2.4448566647542935  |
> | 1  | 3  | 2.4448566647542935  |
> | 1  | 2  | 2.4448566647542935  |
> +----+----+---------------------+
> 3 rows selected (0.386 seconds)


The join here has no impact on the call to random, and you can see that the
value of EXPR$2 is the same for all output rows.
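What the planner is doing can be pictured with a toy model in plain Java (this is an illustration of the idea, not Drill's actual code): a deterministic expression over constants is evaluated once at plan time and the result reused for every row, while a non-deterministic one must be re-evaluated per row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.DoubleSupplier;

// Toy model of constant reduction on a projected expression.
public class ConstantReduction {

    static List<Double> project(int rows, DoubleSupplier expr, boolean deterministic) {
        List<Double> out = new ArrayList<>();
        if (deterministic) {
            double folded = expr.getAsDouble();   // evaluated once, "at plan time"
            for (int i = 0; i < rows; i++) {
                out.add(folded);                  // same value on every row
            }
        } else {
            for (int i = 0; i < rows; i++) {
                out.add(expr.getAsDouble());      // evaluated once per row
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        DoubleSupplier random13 = () -> 1 + rng.nextDouble() * 2;  // like random(1, 3)

        // Folded: one call, three identical outputs (as in the query above).
        System.out.println(project(3, random13, true));
        // Not folded: three calls, three distinct outputs.
        System.out.println(project(3, random13, false));
    }
}
```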

Note also, however, that while a has three rows, b has only a single row: a
literal constant. This query could be rewritten as something like this
if SQL understood literal tables:

select B.x, A.y, B.r from (values (1,1),(1,2),(1,3)) A(x,y)
> join (values (1, random(1))) B(x,r)
> on a.x = b.x


But I really don't think that SQL can do such a propagation.  In any case,
here is what Drill gets:

0: jdbc:drill:zk=local> select b.x,a.y,random(b.x, 3) from (values
> (1,1),(1,2),(1,3)) a(x,y) join (values 1) b(x) on a.x = b.x;
> called with 1.0,3.0
> called with 1.0,3.0
> called with 1.0,3.0
> +----+----+---------------------+
> | x  | y  |       EXPR$2        |
> +----+----+---------------------+
> | 1  | 1  | 1.3263680093229913  |
> | 1  | 3  | 1.3085219784739912  |
> | 1  | 2  | 1.795856354981078   |
> +----+----+---------------------+
> 3 rows selected (2.088 seconds)


Three rows and three values.

I suspect that filtering on directories is likely to have the same issues.


On Fri, Jul 24, 2015 at 11:03 AM, Jacques Nadeau <[email protected]> wrote:

>   - This is being called for *every record for every file in every
>    directory*
>
> Are you sure?  Constant reduction should take care of this.  @Jason, any
> ideas why it might be failing?
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Fri, Jul 24, 2015 at 10:45 AM, Stefán Baxter <[email protected]
> >
> wrote:
>
> > Hi,
> >
> > thanks for the tips.
> >
> > Observation:
> >
> >    - This is being called for *every record for every file in every
> >    directory*
> >
> > Can you please tell me what needs to be done to make sure this is only
> > called once for each directory, preferably before files in that directory
> > are opened/scanned?
> >
> > Regards,
> >   -Stefán
> >
> > On Fri, Jul 24, 2015 at 3:28 PM, Jacques Nadeau <[email protected]>
> > wrote:
> >
> > > Two quick notes:
> > >
> > > - If you switch to internal null handling, you have to define separate
> > udfs
> > > for each possible combination of nullable and non-nullable values.
> > > - isSet is an integer, so your if clause would actually be:
> > >
> > > if (! (yearDir.isSet == 1) ) {
> > >   // yearDir is NULL, handle this here
> > > }
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Fri, Jul 24, 2015 at 8:10 AM, Abdel Hakim Deneche <
> > > [email protected]>
> > > wrote:
> > >
> > > > Hi Stefán,
> > > >
> > > > I think when you specify your UDF as NULL_IF_NULL it means Drill will
> > > > handle null values automatically: if any argument passed to your UDF
> > > > is NULL, the UDF won't be evaluated and Drill will return NULL instead.
> > > >
> > > > In your case your UDF needs to handle NULL values by setting:
> > > >
> > > > nulls = NullHandling.INTERNAL
> > > >
> > > > Then inside your UDF you can check if a passed argument is null like
> > > this:
> > > >
> > > > if (!yearDir.isSet) {
> > > >   // yearDir is NULL, handle this here
> > > > }
> > > >
> > > > Let me know if this works for you; I haven't worked on UDFs for quite
> > > > some time now and they may have changed slightly since then.
> > > >
> > > > Thanks
> > > >
> > > >
> > > > On Fri, Jul 24, 2015 at 7:37 AM, Stefán Baxter <
> > > [email protected]>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I have this running now:
> > > > >
> > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
> > > > > dirInRange(cast('2015-04-10' as timestamp), cast('2015-07-11' as
> > > > > timestamp), COALESCE(dir0,'-'), COALESCE(dir1,'-'), COALESCE(dir2,'-'))
> > > > > order by occurred_at;"
> > > > >
> > > > >
> > > > > Observations:
> > > > >
> > > > >    - Being able to do this without using COALESCE would be more
> > > > >    readable
> > > > >
> > > > >    - This is being called for every record for every file in every
> > > > >    directory
> > > > >    - I would love for this expression to be evaluated only at "the
> > > > >    start of a new directory"
> > > > >
> > > > >    - This function always returns 1 and should never match dir0
> > > > >    - For some strange reason this query sometimes returns results
> > > > >    even though it should never do that (dir0 != 1)
> > > > >
> > > > > Regards,
> > > > >  -Stefan
> > > > >
> > > > >
> > > > > On Fri, Jul 24, 2015 at 12:12 PM, Stefán Baxter <
> > > > [email protected]
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I would like to share our intentions for organizing our data and
> > > > > > how we plan to construct queries for it.
> > > > > >
> > > > > > There are four main reasons for sharing this:
> > > > > >
> > > > > > a) I would like to sanity check the approach
> > > > > > b) I'm having a hard time writing a UDF to optimize this and need
> > > > > > a bit of help.
> > > > > >
> > > > > > c) This can perhaps benefit someone else
> > > > > >
> > > > > > d) Check if a UDF like the one I'm writing already exists
> > > > > >
> > > > > >
> > > > > > Relevant facts:
> > > > > >
> > > > > >    - We have multiple tenants and they collect variable amounts of
> > > > > >    data (from almost none to quite a lot)
> > > > > >    - Each tenant has multiple data-sources (aka time-series /
> > > > > >    tables)
> > > > > >    - Data is processed to Parquet at variable intervals depending
> > > > > >    on the tenant and his volume
> > > > > >    - Parquet files cannot be updated/appended to in Drill (Create
> > > > > >    from creates a new directory (aka table))
> > > > > >
> > > > > > The structure we intend to use:
> > > > > >
> > > > > >    - / <root-directory> / [processed|streaming] / <tenant> /
> > > > > >    <data-source> / <year> / <month> / <date>
> > > > > >    example:
> > > > > >    "/analytics/processed/some-tenant-name/events/2015/07/24"
> > > > > >
> > > > > >    - where *processed* is the sub-directory used for Parquet files
> > > > > >    and the *streaming* sub-directory is used for "raw materials"
> > > > > >    (json/csv/logs etc.)
> > > > > >    this structure also allows us to have dev-nodes with a local
> > > > > >    streaming directory and a remote (hdfs/s3) processed directory
> > > > > >
> > > > > > It will vary between tenants how often data is processed and how
> > > > > > data is stored:
> > > > > >
> > > > > >    - small tenant:
> > > > > >    /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
> > > > > >    - medium tenant:
> > > > > >    /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
> > > > > >    - ... (you get the picture)
> > > > > >
> > > > > >    *Important:* this means that in some cases we will have empty or
> > > > > >    missing sub-directories and in some cases we will have all the
> > > > > >    data in the root directory, but the queries we run will always
> > > > > >    be the same ones (regardless of tenant-specific storage
> > > > > >    "layout").
> > > > > >
> > > > > > When we do selects we use *union* to query both the *processed*
> > > > > > and *streaming* directory structures and then merge the data (if
> > > > > > needed (for grouping etc.)).
> > > > > >
> > > > > > This is all working just fine (so far, so good).
> > > > > >
> > > > > >
> > > > > > Now I'm creating a small UDF that is supposed to help with the
> > > > > > scanning of these directories.
> > > > > >
> > > > > > Requirements:
> > > > > >
> > > > > >    - Avoid scanning/opening files in directories that are
> > > > > >    irrelevant
> > > > > >    - Include files in every directory of full/partial relevance
> > > > > >    - Minimize the additional overhead of including this UDF in the
> > > > > >    SQL command.
> > > > > >
> > > > > >
> > > > > > The UDF is called dirInRange and it takes these parameters:
> > > > > >
> > > > > >    - @Param TimeStampHolder from;
> > > > > >    - @Param NullableTimeStampHolder to;
> > > > > >    - @Param NullableVarCharHolder yearDir;
> > > > > >    - @Param NullableVarCharHolder monthDir;
> > > > > >    - @Param NullableVarCharHolder dayDir;
> > > > > >
> > > > > > It currently returns an int (1 if in range, and other values
> > > > > > depending on what failed).
> > > > > >
> > > > > > The function is defined like this:
> > > > > >
> > > > > > @FunctionTemplate(name = "dirInRange", scope =
> > > > > > FunctionTemplate.FunctionScope.SIMPLE, nulls =
> > > > > > FunctionTemplate.NullHandling.NULL_IF_NULL)
> > > > > >
> > > > > > public static class dirInRange implements DrillSimpleFunc
> > > > > >
> > > > > > ... and it's called like this:
> > > > > >
> > > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > > dfs.tmp.`/analytics/processed/test/events` as t where
> > > > > > *dirInRange*(cast('2015-04-10' as timestamp), cast('2015-07-11' as
> > > > > > timestamp), dir0, dir1, dir2) = 1 order by occurred_at;"
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > This is not working, and I have several questions:
> > > > > >
> > > > > >    - Is it possible that this function is only called if/when
> > > > > >    dir0, dir1, and dir2 all exist?
> > > > > >    - If that is the case, would it not be better if these
> > > > > >    utility-variables existed with a null value until needed/used?
> > > > > >    - having a Nullable*Holder seems to play no role in this at all
> > > > > >
> > > > > >    - Is there anything I can do to have this expression evaluated
> > > > > >    only once per directory?
> > > > > >
> > > > > >    - Is there anything I can do to have this evaluated before the
> > > > > >    files in the directory are processed?
> > > > > >
> > > > > >    - How do I pass a null value into this?
> > > > > >    - this fails:  *dirInRange*(cast('2015-04-10' as
> > > > > >    timestamp),null,dir0,dir1,dir2)
> > > > > >    Error: PARSE ERROR: From line 1, column 82 to line 1, column 85:
> > > > > >    Illegal use of 'NULL'
> > > > > >
> > > > > > This will be a part of my mini UDF library (
> > > > > > https://github.com/acmeguy/asdrill), feel free to use/abuse it if
> > > > > > it helps.
> > > > > >
> > > > > > Regards,
> > > > > >  -Stefán
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Abdelhakim Deneche
> > > >
> > > > Software Engineer
> > > >
> > > >   <http://www.mapr.com/>
> > > >
> > > >
> > > > Now Available - Free Hadoop On-Demand Training
> > > > <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
> > > >
> > >
> >
>
