Re: Naming directories

John Omernik Mon, 07 Sep 2015 10:16:31 -0700

As a follow-up to Jacques email, I did some testing with Parquet files
created and partitioned by Apache Hive. (Not using the metastore to read
these files, just using the directories and reading the Parquet files
directly).

Consider that Hive's partition scheme makes directories that have
partitionfield=partitionvalue as the directory name like this:

table
---day=2015-09-06
-------hour=00
-------hour=01
-------hour=02
---day=2015-09-07
-------hour=00
-------hour=01
-------hour=02

Basically in this case, to use hives partition directory scheme (with
parquet and without the metastore)  you would have to write queries such as:

select * from table where dir0 >= 'day=2015-09-06' and dir1 = 'hour=01'

or

select * from table where dir0 = 'day=2015-09-06' and dir1 < 'hour=03'

These are "doable" but are prone to user errors (what happens they put a
space between hour, =, and the hour) are non intuitive (likes become more
complicated)  etc.

So what I did was use the regexp_replace function in Drill to create a view
that instead of using dir0 or dir1 directly, I could just work with the
"name" and the value... like this

regexp_replace(dir0, 'day=', '') as `day`, regexp_replace(dir1, 'hour=',
'') as `hour`

That allowed me to use the Hive directories easily and intuitively, without
changing the directories.

I will say that performance wasn't great compared to natively loaded (drill
loaded) parquet files.

For example, where I did one query on the hive data that was:

select count(1)  from table where day >= '2015-01-01' using the hive loaded
tables with drill and the drill view it took 26 seconds

When I loaded the whole table into a new parquet (from the hive view) table
in drill, and specified partition by `day`, `hour` the same query ran in
1.08 seconds.  Not sure why this is, perhaps there is more work the drill
engine has to do, perhaps Hive isn't writing parquet file stats well,
perhaps just more IO in that with the drill created table, there was 30
files created, in the hive table there was at least 162 unique partitions
(not even counting files) given the directory structure.  Another example
of performance difference:

select `day`, `hour` from drill_parquet_table where `day` >= '2015-01-01`:
162 rows in 1.6 seconds

select `day`, `hour` from hive_parquet_table where `day` >= '2015-01-01`:
162 rows in 27.2 seconds

Interesting stuff, but the regex_replace does give partition pruning based
on testing. I.e. on the hive table, select count(1) from hivetable where
`day` >= '2015-01-01'  runs much faster than select count(1) from hivetable
where `day` >= '2014-01-01' indicating to me that is indeed not reading the
directories that were older than 2015-01-01 on the >= '2015-01-01' query.

* Note my observations are that of a drill rookie, so if drill experts have
any thoughts on what I wrote about my observations, I'd happily defer.  I
would be interested in a drill expert commenting on the speed of the Drill
loaded Parquet files vs Hive Loaded Parquet files, and if there is
something I can do make Hive loaded parquet less  doggy comparatively, or
if that is just a function of more files to read.

On Fri, Sep 4, 2015 at 4:48 PM, Jacques Nadeau <[email protected]> wrote:

> You can create a view that renames the columns to whatever you like. For
> example:
>
> CREATE VIEW mydata AS SELECT dir0 as "year", dir1 as "month", dir2 as
> "day", dir3 as "hour", a, b, ..., z FROM  `/warehouse/database/table/`
>
> Then you can query: select * from mydata where year = 2012
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Fri, Sep 4, 2015 at 2:35 PM, Grant Overby (groverby) <
> [email protected]>
> wrote:
>
> > I’m using parquet files in hdfs. My files are stored thusly:
> >
> > /warehouse/database/table/field0/field1/field2/field3/fileX.parquet
> >
> > I’d like to give a name to field0..3 that could be used in queries in
> > stead of dir0, dir1, dir2, dir3. Is this possible?
> > [
> >
> http://www.cisco.com/web/europe/images/email/signature/est2014/logo_06.png?ct=1398192119726
> > ]
> >
> > Grant Overby
> > Software Engineer
> > Cisco.com<http://www.cisco.com/>
> > [email protected]<mailto:[email protected]>
> > Mobile: 865 724 4910
> >
> >
> >
> >
> >
> >
> > [http://www.cisco.com/assets/swa/img/thinkbeforeyouprint.gif] Think
> > before you print.
> >
> > This email may contain confidential and privileged material for the sole
> > use of the intended recipient. Any review, use, distribution or
> disclosure
> > by others is strictly prohibited. If you are not the intended recipient
> (or
> > authorized to receive for the recipient), please contact the sender by
> > reply email and delete all copies of this message.
> >
> > Please click here<
> > http://www.cisco.com/web/about/doing_business/legal/cri/index.html> for
> > Company Registration Information.
> >
> >
> >
> >
> >
>

Re: Naming directories

Reply via email to