John,
If there are only 162 files in the corresponding directories, it is likely the
time is being consumed in regexp_replace().  Did you try querying the
underlying directory structure directly, without the function in your view?
From what I understand you have a view for the Hive-created tables but not
for the Drill-created tables.. is that right?
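One quick way to check would be to filter on dir0 directly, keeping the
'day=' prefix so regexp_replace() is never evaluated (the path below is a
placeholder for your actual table location):

```sql
-- Hypothetical direct query: no view, no regexp_replace(); filters on the
-- raw dir0 value with the Hive-style 'day=' prefix left in place.
SELECT COUNT(1)
FROM dfs.`/path/to/hive/table`
WHERE dir0 >= 'day=2015-01-01';
```

If this is fast while the view is slow, the time is going into the function
evaluation rather than into pruning.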

I did a quick test with TPC-DS data where the store_sales table is
partitioned into 301 directories based on the ss_promo_sk column.  I did this
using Drill and a shell script rather than Hive: I used Drill CTAS to create
the 301 files and then moved them into their respective directories with a
shell script.  I don't have access to Hive right now, but I doubt it would
make a huge difference.
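For reference, the post-CTAS step can be sketched roughly like this.  It is a
self-contained toy: the directory names and the `ss_promo_sk=<key>.parquet`
file naming are made up for illustration, and the script fabricates two
placeholder files so it can run on its own; real CTAS output naming differs.

```shell
set -eu
SRC=ctas_out           # flat CTAS output directory (hypothetical name)
DST=ss_partition_demo  # target dir-per-key layout (hypothetical name)

# Fabricate two placeholder CTAS output files so the sketch is runnable;
# real CTAS output files would carry Drill's own names.
mkdir -p "$SRC"
touch "$SRC/ss_promo_sk=1.parquet" "$SRC/ss_promo_sk=2.parquet"

# Move each file into a directory named after its partition key, producing
# the directory-per-key layout that dir0-based pruning can use.
for f in "$SRC"/*.parquet; do
  key=$(basename "$f" .parquet)   # e.g. ss_promo_sk=1
  mkdir -p "$DST/$key"
  mv "$f" "$DST/$key/"
done
```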

The performance was generally in sub second range:
 0: jdbc:drill:zk=local> select count(1) from `ss_partition/ss3` where
ss_promo_sk > 99;
+----------+
|  EXPR$0  |
+----------+
| 1842941  |
+----------+
1 row selected (0.596 seconds)


On Mon, Sep 7, 2015 at 3:25 PM, Ted Dunning <[email protected]> wrote:

> John,
>
> What happens when you partition the data using drill in the same pattern as
> Hive uses?  Are the Drill produced parquet files still faster?
>
>
>
> On Mon, Sep 7, 2015 at 1:02 PM, John Omernik <[email protected]> wrote:
>
> > So interestingly enough, this particular table doesn't have 4k files in
> > it; it's actually pretty small, in that there is only 1 file per
> > partition (tiny?).  Thus there are only 162 files vs. the 30 that Drill
> > created when reprocessing the table.  That probably doesn't help either:
> > given that this is such small data, the planning takes more time than the
> > query.  It's cool that the team is looking to improve this; I found the
> > ability to just have my data in Parquet partitioned by Drill to be a huge
> > win as well.  The enhancements sound like they will improve this even
> > more.  I would love to see as close to native Drill-loaded Parquet
> > performance as possible with Hive-loaded tables; that would allow us to
> > use Drill to query and Hive to load (using complex transforms,
> > longer-running queries, etc.).
> >
> > I love drill :)
> >
> >
> >
> > On Mon, Sep 7, 2015 at 2:23 PM, Aman Sinha <[email protected]> wrote:
> >
> > > Hi John,
> > > the partition pruning *planning* time is indeed a function of the
> > > number of files in the table.  The execution time is only dependent on
> > > the number of files in the specified partition.  In the Drill-loaded
> > > Parquet files you had 30 files, whereas in the Hive-loaded Parquet
> > > files you probably had 162 directories x 24 hours = about 4000 files?
> > > Or somewhere in that range...
> > >
> > > During the query planning phase, Drill partition pruning will load the
> > > full pathnames of the files into memory, including materializing the
> > > partitioning columns such as 'day' into memory, and apply the
> > > `day` >= '2015-01-01' filter.  It turns out this process is expensive
> > > when there are lots of files, even if they are spread out over multiple
> > > directories.  I believe there's an enhancement JIRA to make this
> > > process efficient by loading only directory names first and then the
> > > files... if not, I will create a JIRA.
> > >
> > > Note that partition pruning is still a huge win for more complex
> > > queries when the total execution time is substantially longer than the
> > > planning time.  It is only for shorter-running queries against a large
> > > number of files that the planning time becomes more dominant.  There is
> > > ongoing effort to improve that.
> > >
> > > Aman
> > >
> > >
> > > On Mon, Sep 7, 2015 at 10:15 AM, John Omernik <[email protected]>
> wrote:
> > >
> > > > As a follow-up to Jacques' email, I did some testing with Parquet
> > > > files created and partitioned by Apache Hive (not using the metastore
> > > > to read these files, just using the directories and reading the
> > > > Parquet files directly).
> > > >
> > > > Consider that Hive's partition scheme makes directories that have
> > > > partitionfield=partitionvalue as the directory name, like this:
> > > >
> > > > table
> > > > ---day=2015-09-06
> > > > -------hour=00
> > > > -------hour=01
> > > > -------hour=02
> > > > ---day=2015-09-07
> > > > -------hour=00
> > > > -------hour=01
> > > > -------hour=02
> > > >
> > > > Basically, in this case, to use Hive's partition directory scheme
> > > > (with Parquet and without the metastore) you would have to write
> > > > queries such as:
> > > >
> > > > select * from table where dir0 >= 'day=2015-09-06' and dir1 = 'hour=01'
> > > >
> > > > or
> > > >
> > > > select * from table where dir0 = 'day=2015-09-06' and dir1 < 'hour=03'
> > > >
> > > > These are "doable" but are prone to user errors (what happens if they
> > > > put a space between hour, =, and the hour value?), are non-intuitive
> > > > (LIKEs become more complicated), etc.
> > > >
> > > >
> > > > So what I did was use the regexp_replace function in Drill to create
> > > > a view so that, instead of using dir0 or dir1 directly, I could just
> > > > work with the "name" and the value... like this:
> > > >
> > > > regexp_replace(dir0, 'day=', '') as `day`, regexp_replace(dir1,
> > > > 'hour=', '') as `hour`
> > > >
> > > > That allowed me to use the Hive directories easily and intuitively,
> > > > without changing the directories.
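> > > > Putting that together, the full view might look something like this
> > > > (the view name, table path, and trailing column list are
> > > > placeholders, not my actual schema):

```sql
-- Hypothetical view over the Hive-style directory layout: strips the
-- 'day='/'hour=' prefixes so queries can filter on clean `day`/`hour` values.
CREATE VIEW hive_table_view AS
SELECT regexp_replace(dir0, 'day=', '')  AS `day`,
       regexp_replace(dir1, 'hour=', '') AS `hour`,
       t.*
FROM dfs.`/path/to/hive/table` t;
```

> > > > With that view, filters like `day` >= '2015-01-01' read naturally.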
> > > >
> > > > I will say that performance wasn't great compared to natively loaded
> > > > (Drill-loaded) Parquet files.
> > > >
> > > > For example, one query I ran on the Hive data was:
> > > >
> > > > select count(1) from table where day >= '2015-01-01'
> > > >
> > > > Using the Hive-loaded tables with Drill and the Drill view, it took
> > > > 26 seconds.  When I loaded the whole table into a new Parquet table
> > > > in Drill (from the Hive view) and specified partition by `day`,
> > > > `hour`, the same query ran in 1.08 seconds.  Not sure why this is;
> > > > perhaps there is more work the Drill engine has to do, perhaps Hive
> > > > isn't writing Parquet file stats well, perhaps just more IO, in that
> > > > the Drill-created table had 30 files, while the Hive table had at
> > > > least 162 unique partitions (not even counting files) given the
> > > > directory structure.  Another example of the performance difference:
> > > >
> > > > select `day`, `hour` from drill_parquet_table where `day` >= '2015-01-01':
> > > > 162 rows in 1.6 seconds
> > > >
> > > > select `day`, `hour` from hive_parquet_table where `day` >= '2015-01-01':
> > > > 162 rows in 27.2 seconds
> > > >
> > > > Interesting stuff, but the regexp_replace approach does give
> > > > partition pruning, based on testing.  I.e., on the Hive table,
> > > > select count(1) from hivetable where `day` >= '2015-01-01' runs much
> > > > faster than select count(1) from hivetable where `day` >=
> > > > '2014-01-01', indicating to me that it is indeed not reading the
> > > > directories older than 2015-01-01 on the >= '2015-01-01' query.
> > > >
> > > > * Note my observations are those of a Drill rookie, so if Drill
> > > > experts have any thoughts on what I wrote, I'd happily defer.  I
> > > > would be interested in a Drill expert commenting on the speed of
> > > > Drill-loaded Parquet files vs. Hive-loaded Parquet files, and whether
> > > > there is something I can do to make Hive-loaded Parquet less doggy
> > > > comparatively, or if that is just a function of more files to read.
> > > >
> > > >
> > > > On Fri, Sep 4, 2015 at 4:48 PM, Jacques Nadeau <[email protected]>
> > > wrote:
> > > >
> > > > > You can create a view that renames the columns to whatever you
> like.
> > > For
> > > > > example:
> > > > >
> > > > > CREATE VIEW mydata AS SELECT dir0 as "year", dir1 as "month", dir2
> as
> > > > > "day", dir3 as "hour", a, b, ..., z FROM
> > `/warehouse/database/table/`
> > > > >
> > > > > Then you can query: select * from mydata where year = 2012
> > > > >
> > > > > --
> > > > > Jacques Nadeau
> > > > > CTO and Co-Founder, Dremio
> > > > >
> > > > > On Fri, Sep 4, 2015 at 2:35 PM, Grant Overby (groverby) <
> > > > > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > I’m using parquet files in hdfs. My files are stored thusly:
> > > > > >
> > > > > >
> /warehouse/database/table/field0/field1/field2/field3/fileX.parquet
> > > > > >
> > > > > > I’d like to give a name to field0..3 that could be used in
> > > > > > queries instead of dir0, dir1, dir2, dir3. Is this possible?
> > > > > >
> > > > > > Grant Overby
> > > > > > Software Engineer
> > > > > > Cisco.com<http://www.cisco.com/>
> > > > > > [email protected]<mailto:[email protected]>
> > > > > > Mobile: 865 724 4910
> > > > > >
> > > > >
> > > >
> > >
> >
>
