Good find on the replace. I did some testing comparing regexp_replace,
`replace`, and substring, and in my use case the results are about the same.
Given the 162 partitions, perhaps it doesn't make that big of a difference.
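
For anyone comparing, the three expressions I was testing against a dir0 of
the form 'day=2015-04-01' were roughly:

regexp_replace(dir0, 'day=', '')
`replace`(dir0, 'day=', '')
substring(dir0, strpos(dir0, '=') + 1)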



On Tue, Sep 8, 2015 at 1:35 PM, Jason Altekruse <[email protected]>
wrote:

> For anyone who finds this thread, I noticed that we also have a `replace`
> function that does not use regex and can provide the exact same
> functionality without the copy. The modification I posted earlier works in
> this case, but would not be useful for general-purpose replace expressions.
>
> Unfortunately "replace" is considered a SQL keyword by our parser, see [1],
> so to invoke the function you will have to put it in `backticks`.
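>
> For example, something like this (the path is just a placeholder for the
> table being queried):
>
> select `replace`(dir0, 'day=', '') as `day` from dfs.`/warehouse/database/table`;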
>
> The documentation does not seem to mention this function, so I will add
> this to my JIRA to enhance the docs.
>
> [1] - https://issues.apache.org/jira/browse/DRILL-1441
>
> On Tue, Sep 8, 2015 at 11:22 AM, Jason Altekruse <[email protected]>
> wrote:
>
> > One thing you can do to speed up the expression evaluation is to use this
> > expression instead of regexp_replace. This will avoid copying each value
> > into a short-lived String object, which unfortunately is the only
> > interface available on the Java regex library we are using within the
> > function. We should probably look into updating these functions to try to
> > avoid the extra copy.
> >
> > substring(dir0, strpos(dir0, '=') + 1)
> >
> > For now it would probably be best to include some common alternatives to
> > simple regexes in the docs to help people avoid this slowdown when it
> > isn't needed. I'll open a JIRA for this.
> >
> >
> > On Tue, Sep 8, 2015 at 9:24 AM, John Omernik <[email protected]> wrote:
> >
> >> Aman - The reason I wanted to utilize regexp_replace was the use case of
> >> being able to take Hive-loaded and partitioned data and present it to a
> >> Drill user without requiring them to understand the dir0, dir1 semantics.
> >> Basically, a person using the data should be able to use the day=''
> >> syntax they may be used to, without having to grok and parse in their
> >> head what dir0='day=2015-04-01' means, etc.  But you are correct, the
> >> view is only on the Hive-created data, because I don't need the view on
> >> the Drill-loaded data.  I appreciate the exchange on this; it's very
> >> interesting and helpful to me (and hopefully others) in understanding the
> >> underlying info.
> >>
> >> Jacques: I will email you the profile directly. You can let the list here
> >> know about the findings, but I would prefer the profile itself stay off
> >> the userlist (lots of names and specifics in the profile).
> >>
> >> Thanks!
> >>
> >>
> >>
> >> On Mon, Sep 7, 2015 at 10:50 PM, Jacques Nadeau <[email protected]>
> >> wrote:
> >>
> >> > John,
> >> >
> >> > Can you post your profile? We should confirm that pruning is working
> >> > correctly and whether the time is being spent in planning or execution.
> >> >
> >> > thanks,
> >> > Jacques
> >> >
> >> > --
> >> > Jacques Nadeau
> >> > CTO and Co-Founder, Dremio
> >> >
> >> > On Mon, Sep 7, 2015 at 1:02 PM, John Omernik <[email protected]>
> wrote:
> >> >
> >> > > So interestingly enough, this particular table doesn't have 4k files
> >> > > in it; it's actually pretty small, in that there is only 1 file per
> >> > > partition.  (tiny?)  Thus there are only 162 files vs. the 30 that
> >> > > Drill created when reprocessing the table. That probably doesn't help
> >> > > either: given that this is such small data, the planning takes more
> >> > > time than the query.  It's cool that the team is looking to improve
> >> > > this; I found the ability to just have my data in Parquet partitioned
> >> > > by Drill to be a huge win as well. The enhancements sound like they
> >> > > will improve this even more. I would love to see as close to native
> >> > > Drill-loaded Parquet performance as possible with Hive-loaded tables;
> >> > > that would allow us to use Drill to query and Hive to load (using
> >> > > complex transforms, longer-running queries, etc.).
> >> > >
> >> > > I love drill :)
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Sep 7, 2015 at 2:23 PM, Aman Sinha <[email protected]>
> >> wrote:
> >> > >
> >> > > > Hi John,
> >> > > > the partition pruning *planning* time is indeed a function of the
> >> > > > number of files in the table. The execution time is only dependent
> >> > > > on the number of files in the specified partition.  In the
> >> > > > Drill-loaded Parquet files you had 30 files, whereas in the
> >> > > > Hive-loaded Parquet files you probably had 162 directories x 24
> >> > > > hours = about 4000 files?  Or somewhere in that range...
> >> > > >
> >> > > > During the query planning phase, Drill partition pruning will load
> >> > > > the full pathnames of the files in memory, including materializing
> >> > > > the partitioning columns such as 'day' into memory and applying the
> >> > > > `day` >= '2015-01-01' filter.  It turns out this process is
> >> > > > expensive when there are lots of files, even if they are spread out
> >> > > > over multiple directories.  I believe there's an enhancement JIRA to
> >> > > > make this process efficient by loading only directory names first
> >> > > > and then the files...if not, I will create a JIRA.
> >> > > >
> >> > > > Note that partition pruning is still a huge win for more complex
> >> > > > queries when the total execution time is substantially longer than
> >> > > > the planning time.  It is only for shorter-running queries against a
> >> > > > large number of files that the planning time becomes more dominant.
> >> > > > There is ongoing effort to improve that.
> >> > > >
> >> > > > Aman
> >> > > >
> >> > > >
> >> > > > On Mon, Sep 7, 2015 at 10:15 AM, John Omernik <[email protected]>
> >> > wrote:
> >> > > >
> >> > > > > As a follow-up to Jacques' email, I did some testing with Parquet
> >> > > > > files created and partitioned by Apache Hive. (Not using the
> >> > > > > metastore to read these files, just using the directories and
> >> > > > > reading the Parquet files directly.)
> >> > > > >
> >> > > > >
> >> > > > > table
> >> > > > > ---day=2015-09-06
> >> > > > > -------hour=00
> >> > > > > -------hour=01
> >> > > > > -------hour=02
> >> > > > > ---day=2015-09-07
> >> > > > > -------hour=00
> >> > > > > -------hour=01
> >> > > > > -------hour=02
> >> > > > >
> >> > > > >
> >> > > > > Basically, in this case, to use Hive's partition directory scheme
> >> > > > > (with Parquet and without the metastore) you would have to write
> >> > > > > queries such as:
> >> > > > >
> >> > > > > select * from table where dir0 >= 'day=2015-09-06' and dir1 = 'hour=01'
> >> > > > >
> >> > > > > or
> >> > > > >
> >> > > > > select * from table where dir0 = 'day=2015-09-06' and dir1 < 'hour=03'
> >> > > > >
> >> > > > > These are "doable" but are prone to user errors (what happens if
> >> > > > > they put a space between 'hour', the '=', and the value?), are
> >> > > > > non-intuitive (LIKEs become more complicated), etc.
> >> > > > >
> >> > > > >
> >> > > > > So what I did was use the regexp_replace function in Drill to
> >> > > > > create a view so that, instead of using dir0 or dir1 directly, I
> >> > > > > could just work with the "name" and the value... like this:
> >> > > > >
> >> > > > > regexp_replace(dir0, 'day=', '') as `day`,
> >> > > > > regexp_replace(dir1, 'hour=', '') as `hour`
> >> > > > >
> >> > > > > That allowed me to use the Hive directories easily and
> >> > > > > intuitively, without changing the directories.
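> >> > > > >
> >> > > > > For concreteness, the view was along these lines (the workspace,
> >> > > > > view name, path, and the col1/col2 columns are placeholders for
> >> > > > > the real ones):
> >> > > > >
> >> > > > > CREATE VIEW dfs.tmp.hive_table AS
> >> > > > > SELECT regexp_replace(dir0, 'day=', '') AS `day`,
> >> > > > >        regexp_replace(dir1, 'hour=', '') AS `hour`,
> >> > > > >        col1, col2
> >> > > > > FROM dfs.`/warehouse/database/table`;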
> >> > > > >
> >> > > > > I will say that performance wasn't great compared to natively
> >> > > > > loaded (Drill-loaded) Parquet files.
> >> > > > >
> >> > > > > For example, one query I ran on the Hive data was:
> >> > > > >
> >> > > > > select count(1) from table where day >= '2015-01-01'
> >> > > > >
> >> > > > > Using the Hive-loaded tables with Drill and the Drill view, it
> >> > > > > took 26 seconds.
> >> > > > >
> >> > > > > When I loaded the whole table into a new Parquet table in Drill
> >> > > > > (from the Hive view), and specified partition by `day`, `hour`,
> >> > > > > the same query ran in 1.08 seconds.  Not sure why this is;
> >> > > > > perhaps there is more work the Drill engine has to do, perhaps
> >> > > > > Hive isn't writing Parquet file stats well, or perhaps it's just
> >> > > > > more IO, in that with the Drill-created table there were 30 files
> >> > > > > created, while in the Hive table there were at least 162 unique
> >> > > > > partitions (not even counting files) given the directory
> >> > > > > structure.  Another example of the performance difference:
> >> > > > >
> >> > > > > select `day`, `hour` from drill_parquet_table where `day` >= '2015-01-01':
> >> > > > > 162 rows in 1.6 seconds
> >> > > > >
> >> > > > > select `day`, `hour` from hive_parquet_table where `day` >= '2015-01-01':
> >> > > > > 162 rows in 27.2 seconds
> >> > > > >
> >> > > > > Interesting stuff, but the regexp_replace view does still give
> >> > > > > partition pruning, based on testing.  I.e., on the Hive table,
> >> > > > > select count(1) from hivetable where `day` >= '2015-01-01' runs
> >> > > > > much faster than select count(1) from hivetable where `day` >=
> >> > > > > '2014-01-01', indicating to me that it is indeed not reading the
> >> > > > > directories older than 2015-01-01 on the >= '2015-01-01' query.
> >> > > > >
> >> > > > > * Note: my observations are those of a Drill rookie, so if Drill
> >> > > > > experts have any thoughts on what I wrote, I'd happily defer.  I
> >> > > > > would be interested in a Drill expert commenting on the speed of
> >> > > > > the Drill-loaded Parquet files vs. Hive-loaded Parquet files, and
> >> > > > > whether there is something I can do to make Hive-loaded Parquet
> >> > > > > less doggy comparatively, or if that is just a function of more
> >> > > > > files to read.
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Fri, Sep 4, 2015 at 4:48 PM, Jacques Nadeau <[email protected]>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > You can create a view that renames the columns to whatever you
> >> > > > > > like.  For example:
> >> > > > > >
> >> > > > > > CREATE VIEW mydata AS SELECT dir0 as "year", dir1 as "month",
> >> > > > > > dir2 as "day", dir3 as "hour", a, b, ..., z
> >> > > > > > FROM `/warehouse/database/table/`
> >> > > > > >
> >> > > > > > Then you can query: select * from mydata where year = 2012
> >> > > > > >
> >> > > > > > --
> >> > > > > > Jacques Nadeau
> >> > > > > > CTO and Co-Founder, Dremio
> >> > > > > >
> >> > > > > > On Fri, Sep 4, 2015 at 2:35 PM, Grant Overby (groverby) <
> >> > > > > > [email protected]>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > I’m using parquet files in hdfs. My files are stored thusly:
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > /warehouse/database/table/field0/field1/field2/field3/fileX.parquet
> >> > > > > > >
> >> > > > > > > I’d like to give a name to field0..3 that could be used in queries
> >> > > > > > > instead of dir0, dir1, dir2, dir3. Is this possible?
> >> > > > > > >
> >> > > > > > > Grant Overby
> >> > > > > > > Software Engineer
> >> > > > > > > Cisco.com <http://www.cisco.com/>
> >> > > > > > > [email protected]
> >> > > > > > > Mobile: 865 724 4910
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
