John, What happens when you partition the data using drill in the same pattern as Hive uses? Are the Drill produced parquet files still faster?
On Mon, Sep 7, 2015 at 1:02 PM, John Omernik <[email protected]> wrote: > So interestingly enough, this particular table doesn't have 4k files in it, > it's actually pretty small, in that there is only 1 file per partition. > (tiny?) Thus there are only 162 files vs. the 30 that drill created when > reprocessing the table. That probably doesn't help either given that this > is such small data, the planning takes more time than query. It's cool > that the team is looking to improve this, I found the ability to just have > my data in Parquet partitioned by drill to be a huge win as well. The > enhancements sound like they will enhance this even more, I would love to > see as close to native drill loaded parquet performance as possible with > Hive loaded tables, that would allow us to use drill to query, and hive to > load. (Using complex transforms, longer running queries etc). > > I love drill :) > > > > On Mon, Sep 7, 2015 at 2:23 PM, Aman Sinha <[email protected]> wrote: > > > Hi John, > > the partition pruning *planning* time is indeed a function of the number > of > > files in the table. The execution time is only dependent on the number of > > files in the specified partition. In the Drill loaded Parquet files you > > had 30 files whereas in the Hive loaded parquet files you probably had > 162 > > directories x 24 hours = about 4000 files ? or somewhere in that > range... > > > > During the query planning phase, Drill partition pruning will load the > full > > pathnames of the files in memory, including materializing the > partitioning > > columns such as 'day' into memory and apply the `day` >= '2015-01-01` > > filter. It turns out this process is expensive when there are lots of > > files even if they are spread out over multiple directories. I believe > > there's an enhancement JIRA to make this process efficient by loading > only > > directory names first and then the files...if not, I will create a JIRA. > > > > Note that partition pruning is still a huge win for more complex queries > > when the total execution time is substantially longer than the planning > > time. It is only for shorter running queries against large number of > files > > where the planning times becomes more dominant. There is ongoing effort > to > > improve that. > > > > Aman > > > > > > On Mon, Sep 7, 2015 at 10:15 AM, John Omernik <[email protected]> wrote: > > > > > As a follow-up to Jacques email, I did some testing with Parquet files > > > created and partitioned by Apache Hive. (Not using the metastore to > read > > > these files, just using the directories and reading the Parquet files > > > directly). > > > > > > Consider that Hive's partition scheme makes directories that have > > > partitionfield=partitionvalue as the directory name like this: > > > > > > > > > table > > > ---day=2015-09-06 > > > -------hour=00 > > > -------hour=01 > > > -------hour=02 > > > ---day=2015-09-07 > > > -------hour=00 > > > -------hour=01 > > > -------hour=02 > > > > > > > > > Basically in this case, to use hives partition directory scheme (with > > > parquet and without the metastore) you would have to write queries > such > > > as: > > > > > > select * from table where dir0 >= 'day=2015-09-06' and dir1 = 'hour=01' > > > > > > or > > > > > > select * from table where dir0 = 'day=2015-09-06' and dir1 < 'hour=03' > > > > > > These are "doable" but are prone to user errors (what happens they put > a > > > space between hour, =, and the hour) are non intuitive (likes become > more > > > complicated) etc. > > > > > > > > > So what I did was use the regexp_replace function in Drill to create a > > view > > > that instead of using dir0 or dir1 directly, I could just work with the > > > "name" and the value... like this > > > > > > regexp_replace(dir0, 'day=', '') as `day`, regexp_replace(dir1, > 'hour=', > > > '') as `hour` > > > > > > That allowed me to use the Hive directories easily and intuitively, > > without > > > changing the directories. > > > > > > I will say that performance wasn't great compared to natively loaded > > (drill > > > loaded) parquet files. > > > > > > For example, where I did one query on the hive data that was: > > > > > > select count(1) from table where day >= '2015-01-01' using the hive > > loaded > > > tables with drill and the drill view it took 26 seconds > > > > > > When I loaded the whole table into a new parquet (from the hive view) > > table > > > in drill, and specified partition by `day`, `hour` the same query ran > in > > > 1.08 seconds. Not sure why this is, perhaps there is more work the > drill > > > engine has to do, perhaps Hive isn't writing parquet file stats well, > > > perhaps just more IO in that with the drill created table, there was 30 > > > files created, in the hive table there was at least 162 unique > partitions > > > (not even counting files) given the directory structure. Another > example > > > of performance difference: > > > > > > select `day`, `hour` from drill_parquet_table where `day` >= > > '2015-01-01`: > > > 162 rows in 1.6 seconds > > > > > > select `day`, `hour` from hive_parquet_table where `day` >= > '2015-01-01`: > > > 162 rows in 27.2 seconds > > > > > > Interesting stuff, but the regex_replace does give partition pruning > > based > > > on testing. I.e. on the hive table, select count(1) from hivetable > where > > > `day` >= '2015-01-01' runs much faster than select count(1) from > > hivetable > > > where `day` >= '2014-01-01' indicating to me that is indeed not reading > > the > > > directories that were older than 2015-01-01 on the >= '2015-01-01' > query. > > > > > > * Note my observations are that of a drill rookie, so if drill experts > > have > > > any thoughts on what I wrote about my observations, I'd happily > defer. I > > > would be interested in a drill expert commenting on the speed of the > > Drill > > > loaded Parquet files vs Hive Loaded Parquet files, and if there is > > > something I can do make Hive loaded parquet less doggy comparatively, > or > > > if that is just a function of more files to read. > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Sep 4, 2015 at 4:48 PM, Jacques Nadeau <[email protected]> > > wrote: > > > > > > > You can create a view that renames the columns to whatever you like. > > For > > > > example: > > > > > > > > CREATE VIEW mydata AS SELECT dir0 as "year", dir1 as "month", dir2 as > > > > "day", dir3 as "hour", a, b, ..., z FROM > `/warehouse/database/table/` > > > > > > > > Then you can query: select * from mydata where year = 2012 > > > > > > > > -- > > > > Jacques Nadeau > > > > CTO and Co-Founder, Dremio > > > > > > > > On Fri, Sep 4, 2015 at 2:35 PM, Grant Overby (groverby) < > > > > [email protected]> > > > > wrote: > > > > > > > > > I’m using parquet files in hdfs. My files are stored thusly: > > > > > > > > > > /warehouse/database/table/field0/field1/field2/field3/fileX.parquet > > > > > > > > > > I’d like to give a name to field0..3 that could be used in queries > in > > > > > stead of dir0, dir1, dir2, dir3. Is this possible? > > > > > [ > > > > > > > > > > > > > > > http://www.cisco.com/web/europe/images/email/signature/est2014/logo_06.png?ct=1398192119726 > > > > > ] > > > > > > > > > > Grant Overby > > > > > Software Engineer > > > > > Cisco.com<http://www.cisco.com/> > > > > > [email protected]<mailto:[email protected]> > > > > > Mobile: 865 724 4910 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [http://www.cisco.com/assets/swa/img/thinkbeforeyouprint.gif] > Think > > > > > before you print. > > > > > > > > > > This email may contain confidential and privileged material for the > > > sole > > > > > use of the intended recipient. Any review, use, distribution or > > > > disclosure > > > > > by others is strictly prohibited. If you are not the intended > > recipient > > > > (or > > > > > authorized to receive for the recipient), please contact the sender > > by > > > > > reply email and delete all copies of this message. > > > > > > > > > > Please click here< > > > > > http://www.cisco.com/web/about/doing_business/legal/cri/index.html > > > > > for > > > > > Company Registration Information. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
