Re: Naming directories

Jason Altekruse Tue, 08 Sep 2015 11:23:23 -0700

One thing you can do to speed up the expression evaluation is to use this
expression instead of regex_replace. This will avoid copying each value
into a short lived String object which unfortunately is the only interface
available on the java regex library we are using within the function. We
should probably look into updating these functions to try to avoid the
extra copy.


substring(dir0, strpos(dir0, '=') + 1)

For now it would probably be best to include some common alternatives to
simple regexes in the docs to help people avoid this slowdown when it isn't
needed. I'll open a JIRA for this.


On Tue, Sep 8, 2015 at 9:24 AM, John Omernik <[email protected]> wrote:

> Aman - The reason I wanted to utilize the regex_replace was the use case of
> being able to take hive loaded and partitioned data, and present it to a
> drill user without having them to understand the dir0, dir1 semantics.
> Basically, a person using the data should be able to use the day='' that
> they may be used to without having them grok and parse in their head what
> dir0='day=2015-04-01' etc.   But you are correct, the view is only on the
> hive created data, because I don't need the view in the Drill loaded data.
> I appreciate the exchange on this, its very interesting and helpful to me
> (and hopefully others) in understanding the underlying info.
>
> Jacques:  I will email you the profile directly, you can let the post here
> know about the findings, but I would prefer the profile itself stay off the
> userlist (lots of names of specifics in the profile).
>
> Thanks!
>
>
>
> On Mon, Sep 7, 2015 at 10:50 PM, Jacques Nadeau <[email protected]>
> wrote:
>
> > John,
> >
> > Can you post your profile? We should confirm that pruning is working
> > correctly and whether the time is being spent in planning or execution.
> >
> > thanks,
> > Jacques
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Mon, Sep 7, 2015 at 1:02 PM, John Omernik <[email protected]> wrote:
> >
> > > So interestingly enough, this particular table doesn't have 4k files in
> > it,
> > > it's actually pretty small, in that there is only 1 file per partition.
> > >  (tiny?)  Thus there are only 162 files vs. the 30 that drill created
> > when
> > > reprocessing the table. That probably doesn't help either given that
> this
> > > is such small data, the planning takes more time than query.  It's cool
> > > that the team is looking to improve this, I found the ability to just
> > have
> > > my data in Parquet partitioned by drill to be a huge win as well.  The
> > > enhancements sound like they will enhance this even more, I would love
> to
> > > see as close to native drill loaded parquet performance as possible
> with
> > > Hive loaded tables, that would allow us to use drill to query, and hive
> > to
> > > load. (Using complex transforms, longer running queries etc).
> > >
> > > I love drill :)
> > >
> > >
> > >
> > > On Mon, Sep 7, 2015 at 2:23 PM, Aman Sinha <[email protected]>
> wrote:
> > >
> > > > Hi John,
> > > > the partition pruning *planning* time is indeed a function of the
> > number
> > > of
> > > > files in the table. The execution time is only dependent on the
> number
> > of
> > > > files in the specified partition.   In the Drill loaded Parquet files
> > you
> > > > had 30 files whereas in the Hive loaded parquet files you probably
> had
> > > 162
> > > > directories x 24 hours = about 4000 files  ? or somewhere in that
> > > range...
> > > >
> > > > During the query planning phase, Drill partition pruning will load
> the
> > > full
> > > > pathnames of the files in memory, including materializing the
> > > partitioning
> > > > columns such as 'day'  into memory and apply the `day` >=
> '2015-01-01`
> > > > filter.  It turns out this process is expensive when there are lots
> of
> > > > files even if they are spread out over multiple directories.   I
> > believe
> > > > there's an enhancement JIRA to make this process efficient by loading
> > > only
> > > > directory names first and then the files...if not, I will create a
> > JIRA.
> > > >
> > > > Note that partition pruning is still a huge win for more complex
> > queries
> > > > when the total execution time is substantially longer than the
> planning
> > > > time.  It is only for shorter running queries against large number of
> > > files
> > > > where the planning times becomes more dominant.  There is ongoing
> > effort
> > > to
> > > > improve that.
> > > >
> > > > Aman
> > > >
> > > >
> > > > On Mon, Sep 7, 2015 at 10:15 AM, John Omernik <[email protected]>
> > wrote:
> > > >
> > > > > As a follow-up to Jacques email, I did some testing with Parquet
> > files
> > > > > created and partitioned by Apache Hive. (Not using the metastore to
> > > read
> > > > > these files, just using the directories and reading the Parquet
> files
> > > > > directly).
> > > > >
> > > > > Consider that Hive's partition scheme makes directories that have
> > > > > partitionfield=partitionvalue as the directory name like this:
> > > > >
> > > > >
> > > > > table
> > > > > ---day=2015-09-06
> > > > > -------hour=00
> > > > > -------hour=01
> > > > > -------hour=02
> > > > > ---day=2015-09-07
> > > > > -------hour=00
> > > > > -------hour=01
> > > > > -------hour=02
> > > > >
> > > > >
> > > > > Basically in this case, to use hives partition directory scheme
> (with
> > > > > parquet and without the metastore)  you would have to write queries
> > > such
> > > > > as:
> > > > >
> > > > > select * from table where dir0 >= 'day=2015-09-06' and dir1 =
> > 'hour=01'
> > > > >
> > > > > or
> > > > >
> > > > > select * from table where dir0 = 'day=2015-09-06' and dir1 <
> > 'hour=03'
> > > > >
> > > > > These are "doable" but are prone to user errors (what happens they
> > put
> > > a
> > > > > space between hour, =, and the hour) are non intuitive (likes
> become
> > > more
> > > > > complicated)  etc.
> > > > >
> > > > >
> > > > > So what I did was use the regexp_replace function in Drill to
> create
> > a
> > > > view
> > > > > that instead of using dir0 or dir1 directly, I could just work with
> > the
> > > > > "name" and the value... like this
> > > > >
> > > > > regexp_replace(dir0, 'day=', '') as `day`, regexp_replace(dir1,
> > > 'hour=',
> > > > > '') as `hour`
> > > > >
> > > > > That allowed me to use the Hive directories easily and intuitively,
> > > > without
> > > > > changing the directories.
> > > > >
> > > > > I will say that performance wasn't great compared to natively
> loaded
> > > > (drill
> > > > > loaded) parquet files.
> > > > >
> > > > > For example, where I did one query on the hive data that was:
> > > > >
> > > > > select count(1)  from table where day >= '2015-01-01' using the
> hive
> > > > loaded
> > > > > tables with drill and the drill view it took 26 seconds
> > > > >
> > > > > When I loaded the whole table into a new parquet (from the hive
> view)
> > > > table
> > > > > in drill, and specified partition by `day`, `hour` the same query
> ran
> > > in
> > > > > 1.08 seconds.  Not sure why this is, perhaps there is more work the
> > > drill
> > > > > engine has to do, perhaps Hive isn't writing parquet file stats
> well,
> > > > > perhaps just more IO in that with the drill created table, there
> was
> > 30
> > > > > files created, in the hive table there was at least 162 unique
> > > partitions
> > > > > (not even counting files) given the directory structure.  Another
> > > example
> > > > > of performance difference:
> > > > >
> > > > > select `day`, `hour` from drill_parquet_table where `day` >=
> > > > '2015-01-01`:
> > > > > 162 rows in 1.6 seconds
> > > > >
> > > > > select `day`, `hour` from hive_parquet_table where `day` >=
> > > '2015-01-01`:
> > > > > 162 rows in 27.2 seconds
> > > > >
> > > > > Interesting stuff, but the regex_replace does give partition
> pruning
> > > > based
> > > > > on testing. I.e. on the hive table, select count(1) from hivetable
> > > where
> > > > > `day` >= '2015-01-01'  runs much faster than select count(1) from
> > > > hivetable
> > > > > where `day` >= '2014-01-01' indicating to me that is indeed not
> > reading
> > > > the
> > > > > directories that were older than 2015-01-01 on the >= '2015-01-01'
> > > query.
> > > > >
> > > > > * Note my observations are that of a drill rookie, so if drill
> > experts
> > > > have
> > > > > any thoughts on what I wrote about my observations, I'd happily
> > > defer.  I
> > > > > would be interested in a drill expert commenting on the speed of
> the
> > > > Drill
> > > > > loaded Parquet files vs Hive Loaded Parquet files, and if there is
> > > > > something I can do make Hive loaded parquet less  doggy
> > comparatively,
> > > or
> > > > > if that is just a function of more files to read.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Sep 4, 2015 at 4:48 PM, Jacques Nadeau <[email protected]
> >
> > > > wrote:
> > > > >
> > > > > > You can create a view that renames the columns to whatever you
> > like.
> > > > For
> > > > > > example:
> > > > > >
> > > > > > CREATE VIEW mydata AS SELECT dir0 as "year", dir1 as "month",
> dir2
> > as
> > > > > > "day", dir3 as "hour", a, b, ..., z FROM
> > > `/warehouse/database/table/`
> > > > > >
> > > > > > Then you can query: select * from mydata where year = 2012
> > > > > >
> > > > > > --
> > > > > > Jacques Nadeau
> > > > > > CTO and Co-Founder, Dremio
> > > > > >
> > > > > > On Fri, Sep 4, 2015 at 2:35 PM, Grant Overby (groverby) <
> > > > > > [email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > I’m using parquet files in hdfs. My files are stored thusly:
> > > > > > >
> > > > > > >
> > /warehouse/database/table/field0/field1/field2/field3/fileX.parquet
> > > > > > >
> > > > > > > I’d like to give a name to field0..3 that could be used in
> > queries
> > > in
> > > > > > > stead of dir0, dir1, dir2, dir3. Is this possible?
> > > > > > > [
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.cisco.com/web/europe/images/email/signature/est2014/logo_06.png?ct=1398192119726
> > > > > > > ]
> > > > > > >
> > > > > > > Grant Overby
> > > > > > > Software Engineer
> > > > > > > Cisco.com<http://www.cisco.com/>
> > > > > > > [email protected]<mailto:[email protected]>
> > > > > > > Mobile: 865 724 4910
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > [http://www.cisco.com/assets/swa/img/thinkbeforeyouprint.gif]
> > > Think
> > > > > > > before you print.
> > > > > > >
> > > > > > > This email may contain confidential and privileged material for
> > the
> > > > > sole
> > > > > > > use of the intended recipient. Any review, use, distribution or
> > > > > > disclosure
> > > > > > > by others is strictly prohibited. If you are not the intended
> > > > recipient
> > > > > > (or
> > > > > > > authorized to receive for the recipient), please contact the
> > sender
> > > > by
> > > > > > > reply email and delete all copies of this message.
> > > > > > >
> > > > > > > Please click here<
> > > > > > >
> > http://www.cisco.com/web/about/doing_business/legal/cri/index.html
> > > >
> > > > > for
> > > > > > > Company Registration Information.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Naming directories

Reply via email to