This is actually a known issue: constant folding is not working in the
select clause because of a costing problem. It currently only works in the
where clause.

https://issues.apache.org/jira/browse/DRILL-2218
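
For anyone who wants to poke at this from a client, here is a rough JDBC sketch
of the check (same zk=local setup as the session quoted below; it assumes a UDF
like Ted's random(lo, hi) is registered, so treat it as illustrative only):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ConstantFoldingCheck {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.drill.jdbc.Driver");
    try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
         Statement stmt = conn.createStatement()) {
      // Constant expression in the select list: the case DRILL-2218 covers,
      // so the UDF may be evaluated once per output row.
      runAndCount(stmt,
          "select a.y, random(1, 3) from (values (1,1),(1,2),(1,3)) a(x,y)");
      // Same expression moved to the where clause, where folding is reported
      // to work, so the UDF should be reduced to a constant at plan time.
      runAndCount(stmt,
          "select a.y from (values (1,1),(1,2),(1,3)) a(x,y) where random(1, 3) < 3");
    }
  }

  private static void runAndCount(Statement stmt, String sql) throws Exception {
    try (ResultSet rs = stmt.executeQuery(sql)) {
      int rows = 0;
      while (rs.next()) {
        rows++;
      }
      System.out.println(rows + " rows: " + sql);
    }
  }
}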

On Fri, Jul 24, 2015 at 4:13 PM, Ted Dunning <[email protected]> wrote:

> I think that constant reduction isn't entirely working in the presence of
> joins.  For example, I removed the isRandom annotation from my random
> number generator.
>
> You can see constant reduction working if I give a literal number:
>
> 0: jdbc:drill:zk=local> select b.x,a.y,random(1, 3) from (values
> > (1,1),(1,2),(1,3)) a(x,y) join (values 1) b(x) on a.x = b.x;
> > called with 1.0,3.0
> > called with 1.0,3.0
> > +----+----+---------------------+
> > | x  | y  |       EXPR$2        |
> > +----+----+---------------------+
> > | 1  | 1  | 2.4448566647542935  |
> > | 1  | 3  | 2.4448566647542935  |
> > | 1  | 2  | 2.4448566647542935  |
> > +----+----+---------------------+
> > 3 rows selected (0.386 seconds)
>
>
> The join here has no impact on the call to random and you can see that the
> value of EXPR$2 is the same for all output rows.
>
> Note also, however, that while a has three rows, b has only a single row, a
> literal constant. This query could be rewritten to be something like this
> if SQL understood literal tables:
>
> select B.x, A.y, B.r from (values (1,1),(1,2),(1,3)) A(x,y)
> > join (values 1, random(1)) B(x,r)
> > on a.x = b.x
>
>
> But I really think that SQL can't do such a propagation.  In any case, here
> is what Drill gets
>
> 0: jdbc:drill:zk=local> select b.x,a.y,random(b.x, 3) from (values
> > (1,1),(1,2),(1,3)) a(x,y) join (values 1) b(x) on a.x = b.x;
> > called with 1.0,3.0
> > called with 1.0,3.0
> > called with 1.0,3.0
> > +----+----+---------------------+
> > | x  | y  |       EXPR$2        |
> > +----+----+---------------------+
> > | 1  | 1  | 1.3263680093229913  |
> > | 1  | 3  | 1.3085219784739912  |
> > | 1  | 2  | 1.795856354981078   |
> > +----+----+---------------------+
> > 3 rows selected (2.088 seconds)
>
>
> Three rows and three values.
>
> I suspect that filtering on directories is likely to have the same issues.
>
>
> On Fri, Jul 24, 2015 at 11:03 AM, Jacques Nadeau <[email protected]>
> wrote:
>
> >   - This is being called for *every record for every file in every
> >    directory*
> >
> > Are you sure?  Constant reduction should take care of this.  @Jason, any
> > ideas why it might be failing?
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Fri, Jul 24, 2015 at 10:45 AM, Stefán Baxter <
> [email protected]
> > >
> > wrote:
> >
> > > Hi,
> > >
> > > thanks for the tips.
> > >
> > > Observation:
> > >
> > >    - This is being called for *every record for every file in every
> > >    directory*
> > >
> > > Can you please tell me what needs to be done to make sure this is only
> > > called once for each directory, preferably before the files in that
> > > directory are opened/scanned.
> > >
> > > Regards,
> > >   -Stefán
> > >
> > > On Fri, Jul 24, 2015 at 3:28 PM, Jacques Nadeau <[email protected]>
> > > wrote:
> > >
> > > > Two quick notes:
> > > >
> > > > - If you switch to internal null handling, you have to define separate
> > > > UDFs for each possible combination of nullable and non-nullable values.
> > > > - isSet is an integer, so your if clause would actually be:
> > > >
> > > > if (! (yearDir.isSet == 1) ) {
> > > >   // yearDir is NULL, handle this here
> > > > }
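> > > >
> > > > Roughly, a nullable-input variant then ends up looking like this (a minimal
> > > > sketch, not tested; the class and field names are just illustrative):
> > > >
> > > > import org.apache.drill.exec.expr.DrillSimpleFunc;
> > > > import org.apache.drill.exec.expr.annotations.FunctionTemplate;
> > > > import org.apache.drill.exec.expr.annotations.Output;
> > > > import org.apache.drill.exec.expr.annotations.Param;
> > > > import org.apache.drill.exec.expr.holders.IntHolder;
> > > > import org.apache.drill.exec.expr.holders.NullableVarCharHolder;
> > > >
> > > > @FunctionTemplate(name = "dirInRange",
> > > >     scope = FunctionTemplate.FunctionScope.SIMPLE,
> > > >     nulls = FunctionTemplate.NullHandling.INTERNAL)
> > > > public class DirInRangeNullableYear implements DrillSimpleFunc {
> > > >
> > > >   @Param NullableVarCharHolder yearDir;  // nullable input, so isSet applies
> > > >   @Output IntHolder out;
> > > >
> > > >   public void setup() { }
> > > >
> > > >   public void eval() {
> > > >     if (yearDir.isSet == 0) {
> > > >       out.value = 0;   // yearDir is NULL; decide what that should mean
> > > >     } else {
> > > >       out.value = 1;   // the real range check on the directory name goes here
> > > >     }
> > > >   }
> > > > }
> > > >
> > > > The non-nullable twin is the same class with a plain VarCharHolder param (no
> > > > isSet field), registered under the same function name.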
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Fri, Jul 24, 2015 at 8:10 AM, Abdel Hakim Deneche <
> > > > [email protected]>
> > > > wrote:
> > > >
> > > > > Hi Stefán,
> > > > >
> > > > > I think when you specify your UDF as NULL_IF_NULL it means Drill will
> > > > > handle null values automatically: if any argument passed to your UDF is
> > > > > NULL, the UDF won't be evaluated and Drill will return NULL instead.
> > > > >
> > > > > In your case your UDF needs to handle NULL values itself by setting:
> > > > >
> > > > > nulls = NullHandling.INTERNAL
> > > > >
> > > > > Then inside your UDF you can check if a passed argument is null like
> > > > > this:
> > > > >
> > > > > if (!yearDir.isSet) {
> > > > >   // yearDir is NULL, handle this here
> > > > > }
> > > > >
> > > > > Let me know if this works for you; I haven't worked on UDFs for quite
> > > > > some time now and they may have changed slightly since then.
> > > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > > On Fri, Jul 24, 2015 at 7:37 AM, Stefán Baxter <
> > > > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have this running now:
> > > > > >
> > > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > > dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
> > > > > > dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
> > > > > >
> > timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-'))
> > > > > order
> > > > > > by occurred_at;"
> > > > > >
> > > > > >
> > > > > > Observations:
> > > > > >
> > > > > >    - Being able to do this without using COALESCE would be more
> > > > > >    readable
> > > > > >
> > > > > >    - This is being called for every record for every file in every
> > > > > >    directory
> > > > > >    - I would love for this expression to be evaluated only at "the
> > > > > >    start of a new directory"
> > > > > >
> > > > > >    - This function always returns 1 and should never match dir0
> > > > > >    - For some strange reason this query sometimes returns results even
> > > > > >    though it should never do that (dir0 != 1)
> > > > > >
> > > > > > Regards,
> > > > > >  -Stefan
> > > > > >
> > > > > >
> > > > > > On Fri, Jul 24, 2015 at 12:12 PM, Stefán Baxter <
> > > > > [email protected]
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I would like to share our intentions for organizing our data
> and
> > > how
> > > > we
> > > > > > > plan to construct queries for it.
> > > > > > >
> > > > > > > There are four main reasons for sharing this:
> > > > > > >
> > > > > > > a) I would like to sanity check the approach
> > > > > > > b) I'm having a hard time writing a UDF to optimize this and
> > need a
> > > > bit
> > > > > > of
> > > > > > > help.
> > > > > > >
> > > > > > > c) This can perhaps benefit someone else
> > > > > > >
> > > > > > > d) Check if UDF like the one I'm writing already exists
> > > > > > >
> > > > > > >
> > > > > > > Relevant facts:
> > > > > > >
> > > > > > >    - We have multiple tenants and they collect variable amounts of
> > > > > > >    data (from almost none to quite a lot)
> > > > > > >    - Each tenant has multiple data-sources (aka time-series / tables)
> > > > > > >    - Data is processed to Parquet at variable intervals depending on
> > > > > > >    the tenant and his volume
> > > > > > >    - Parquet files cannot be updated/appended to in Drill (Create from
> > > > > > >    creates a new directory (aka table))
> > > > > > >
> > > > > > > The structure we intend to use:
> > > > > > >
> > > > > > >    - / <root-directory> / [processed|streaming] / <tenant> /
> > > > > > >    <data-source> / <year> / <month> / <date>
> > > > > > >    example: "/analytics/processed/some-tenant-name/events/2015/07/24"
> > > > > > >
> > > > > > >    - where *processed* is the sub-directory used for parquet files and
> > > > > > >    the *streaming* sub-directory is used for "raw materials"
> > > > > > >    (json/csv/logs etc.)
> > > > > > >    this structure also allows us to have dev-nodes with a local
> > > > > > >    streaming directory and a remote (hdfs/s3) processed directory
> > > > > > >
> > > > > > > It will vary between tenants how often data is processed and how
> > > > > > > data is stored:
> > > > > > >
> > > > > > >    - small tenant:
> > > > > > >    /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
> > > > > > >    - medium tenant:
> > > > > > >    /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
> > > > > > >    - ... (you get the picture)
> > > > > > >
> > > > > > >    *Important:* this means that in some cases we will have empty or
> > > > > > >    missing sub-directories and in some cases we will have all the data
> > > > > > >    in the root directory, but the queries we run will always be the
> > > > > > >    same ones (regardless of tenant-specific storage "layout").
> > > > > > >
> > > > > > > When we do selects we use *union* to query both the *processed* and
> > > > > > > *streaming* directory structures and then merge the data (if needed,
> > > > > > > for grouping etc.).
> > > > > > >
> > > > > > > This is all working just fine (so far, so good).
> > > > > > >
> > > > > > >
> > > > > > > Now I'm creating a small UDF that is supposed to help with the
> > > > > > > scanning of these directories.
> > > > > > >
> > > > > > > Requirements:
> > > > > > >
> > > > > > >    - Avoid scanning/opening files in directories that are irrelevant
> > > > > > >    - Include files in every directory of full/partial relevance
> > > > > > >    - Minimize the additional overhead of including this UDF in the SQL
> > > > > > >    command.
> > > > > > >
> > > > > > >
> > > > > > > The UDF is called dirInRange and it takes these parameters:
> > > > > > >
> > > > > > >    - @Param TimeStampHolder from;
> > > > > > >    - @Param NullableTimeStampHolder to;
> > > > > > >    - @Param NullableVarCharHolder yearDir;
> > > > > > >    - @Param NullableVarCharHolder monthDir;
> > > > > > >    - @Param NullableVarCharHolder dayDir;
> > > > > > >
> > > > > > > It currently returns an int (1 if in range, and other values depending
> > > > > > > on what failed).
> > > > > > >
> > > > > > > The function is defined like this
> > > > > > >
> > > > > > > @FunctionTemplate(name = "dirInRange", scope =
> > > > > > > FunctionTemplate.FunctionScope.SIMPLE, nulls =
> > > > > > > FunctionTemplate.NullHandling.NULL_IF_NULL)
> > > > > > >
> > > > > > > public static class dirInRange implements DrillSimpleFunc
> > > > > > >
> > > > > > > ... and it's called like this:
> > > > > > >
> > > > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > > > dfs.tmp.`/analytics/processed/test/events` as t where
> > > > > > *dirInRange*(cast('2015-04-10'
> > > > > > > as timestamp),cast('2015-07-11' as timestamp),dir0,dir1,dir2)
> = 1
> > > > order
> > > > > > by
> > > > > > > occurred_at;"
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
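> > > > > > > Putting those pieces together, the skeleton looks roughly like this (the
> > > > > > > output holder is assumed to be an IntHolder since it returns an int, and
> > > > > > > for this test the eval body just returns 1):
> > > > > > >
> > > > > > > @FunctionTemplate(name = "dirInRange",
> > > > > > >     scope = FunctionTemplate.FunctionScope.SIMPLE,
> > > > > > >     nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
> > > > > > > public static class dirInRange implements DrillSimpleFunc {
> > > > > > >
> > > > > > >   @Param TimeStampHolder from;
> > > > > > >   @Param NullableTimeStampHolder to;
> > > > > > >   @Param NullableVarCharHolder yearDir;
> > > > > > >   @Param NullableVarCharHolder monthDir;
> > > > > > >   @Param NullableVarCharHolder dayDir;
> > > > > > >   @Output IntHolder out;
> > > > > > >
> > > > > > >   public void setup() { }
> > > > > > >
> > > > > > >   public void eval() {
> > > > > > >     out.value = 1;   // always 1 for this test, so dir0 = dirInRange(...)
> > > > > > >                      // should never match
> > > > > > >   }
> > > > > > > }
> > > > > > >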
> > > > > > > This is not working and I have several questions:
> > > > > > >
> > > > > > >    - Is it possible that this function is only called if/when dir0,
> > > > > > >    dir1, and dir2 all exist?
> > > > > > >    - If that is the case, would it not be better if these
> > > > > > >    utility-variables existed with a null value until needed/used?
> > > > > > >    - Having a Nullable*Holder seems to play no role in this at all
> > > > > > >
> > > > > > >    - Is there anything I can do to have this expression evaluated only
> > > > > > >    once per directory?
> > > > > > >
> > > > > > >    - Is there anything I can do to have this evaluated before the files
> > > > > > >    in the directory are processed?
> > > > > > >
> > > > > > >    - How do I pass a null value into this?
> > > > > > >    - this fails: *dirInRange*(cast('2015-04-10' as
> > > > > > >    timestamp),null,dir0,dir1,dir2)
> > > > > > >    Error: PARSE ERROR: From line 1, column 82 to line 1, column 85:
> > > > > > >    Illegal use of 'NULL'
> > > > > > >
> > > > > > > This will be a part of my mini UDF library
> > > > > > > (https://github.com/acmeguy/asdrill), feel free to use/abuse it if it
> > > > > > > helps.
> > > > > > >
> > > > > > > Regards,
> > > > > > >  -Stefán
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Abdelhakim Deneche
> > > > >
> > > > > Software Engineer
> > > > >
> > > > >   <http://www.mapr.com/>
> > > > >
> > > >
> > >
> >
>
