Interestingly enough, this particular table doesn't have 4k files in it; it's actually pretty small, with only 1 file per partition (tiny?). So there are only 162 files vs. the 30 that Drill created when reprocessing the table. That probably doesn't help either: with data this small, the planning takes more time than the query itself. It's cool that the team is looking to improve this. I've found the ability to just have my data in Parquet, partitioned by Drill, to be a huge win as well. The enhancements sound like they will improve this even more. I would love to see Hive-loaded tables get as close to native Drill-loaded Parquet performance as possible; that would let us use Drill to query and Hive to load (using complex transforms, longer-running queries, etc.). A rough sketch of the Drill-side reload follows; a sketch of the regexp_replace view from the earlier thread appears after the quoted messages.
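A minimal sketch of that Drill-side reload (the CTAS with PARTITION BY mentioned in the quoted thread), assuming a writable dfs.tmp workspace and a source view named hive_table_view; the workspace, view name, and the payload columns col_a/col_b are hypothetical stand-ins:

  -- Sketch only: dfs.tmp, hive_table_view, col_a, and col_b are placeholders.
  -- Drill writes the result as Parquet, partitioned on the listed columns.
  CREATE TABLE dfs.tmp.`drill_parquet_table`
  PARTITION BY (`day`, `hour`) AS
  SELECT `day`, `hour`, col_a, col_b
  FROM dfs.tmp.`hive_table_view`;

The same `day`/`hour` filters then prune directly against the Drill-written files, without the regexp_replace workaround described in the quoted thread.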
I love Drill :)

On Mon, Sep 7, 2015 at 2:23 PM, Aman Sinha <[email protected]> wrote:

> Hi John,
> the partition pruning *planning* time is indeed a function of the number of
> files in the table. The execution time is only dependent on the number of
> files in the specified partition. In the Drill-loaded Parquet files you had
> 30 files, whereas in the Hive-loaded Parquet files you probably had 162
> directories x 24 hours = about 4000 files? Or somewhere in that range...
>
> During the query planning phase, Drill partition pruning will load the full
> pathnames of the files into memory, including materializing the partitioning
> columns such as 'day' into memory, and apply the `day` >= '2015-01-01'
> filter. It turns out this process is expensive when there are lots of files,
> even if they are spread out over multiple directories. I believe there's an
> enhancement JIRA to make this process efficient by loading only directory
> names first and then the files... if not, I will create a JIRA.
>
> Note that partition pruning is still a huge win for more complex queries
> where the total execution time is substantially longer than the planning
> time. It is only for shorter-running queries against a large number of files
> that the planning time becomes more dominant. There is ongoing effort to
> improve that.
>
> Aman
>
> On Mon, Sep 7, 2015 at 10:15 AM, John Omernik <[email protected]> wrote:
>
> > As a follow-up to Jacques' email, I did some testing with Parquet files
> > created and partitioned by Apache Hive (not using the metastore to read
> > these files, just using the directories and reading the Parquet files
> > directly).
> >
> > Consider that Hive's partition scheme makes directories that have
> > partitionfield=partitionvalue as the directory name, like this:
> >
> > table
> > ---day=2015-09-06
> > -------hour=00
> > -------hour=01
> > -------hour=02
> > ---day=2015-09-07
> > -------hour=00
> > -------hour=01
> > -------hour=02
> >
> > Basically, in this case, to use Hive's partition directory scheme (with
> > Parquet and without the metastore) you would have to write queries such as:
> >
> > select * from table where dir0 >= 'day=2015-09-06' and dir1 = 'hour=01'
> >
> > or
> >
> > select * from table where dir0 = 'day=2015-09-06' and dir1 < 'hour=03'
> >
> > These are doable, but they are prone to user error (what happens when
> > someone puts a space between hour, =, and the hour?), they are
> > non-intuitive (LIKEs become more complicated), etc.
> >
> > So what I did was use the regexp_replace function in Drill to create a
> > view so that, instead of using dir0 or dir1 directly, I could just work
> > with the "name" and the value, like this:
> >
> > regexp_replace(dir0, 'day=', '') as `day`,
> > regexp_replace(dir1, 'hour=', '') as `hour`
> >
> > That allowed me to use the Hive directories easily and intuitively,
> > without changing the directories.
> >
> > I will say that performance wasn't great compared to natively loaded
> > (Drill-loaded) Parquet files.
> >
> > For example, I ran this query on the Hive data:
> >
> > select count(1) from table where day >= '2015-01-01'
> >
> > Using the Hive-loaded tables with Drill and the Drill view, it took 26
> > seconds. When I loaded the whole table into a new Parquet table in Drill
> > (from the Hive view) and specified partition by `day`, `hour`, the same
> > query ran in 1.08 seconds.
> > Not sure why this is. Perhaps there is more work the Drill engine has to
> > do, perhaps Hive isn't writing Parquet file stats well, or perhaps it's
> > just more IO: the Drill-created table had 30 files, while the Hive table
> > had at least 162 unique partitions (not even counting files) given the
> > directory structure. Another example of the performance difference:
> >
> > select `day`, `hour` from drill_parquet_table where `day` >= '2015-01-01':
> > 162 rows in 1.6 seconds
> >
> > select `day`, `hour` from hive_parquet_table where `day` >= '2015-01-01':
> > 162 rows in 27.2 seconds
> >
> > Interesting stuff, but the regexp_replace view does give partition pruning,
> > based on testing. I.e., on the Hive table, select count(1) from hivetable
> > where `day` >= '2015-01-01' runs much faster than select count(1) from
> > hivetable where `day` >= '2014-01-01', indicating to me that it is indeed
> > not reading the directories older than 2015-01-01 on the >= '2015-01-01'
> > query.
> >
> > * Note: my observations are those of a Drill rookie, so if Drill experts
> > have any thoughts on what I wrote, I'd happily defer. I would be interested
> > in a Drill expert commenting on the speed of Drill-loaded Parquet files vs.
> > Hive-loaded Parquet files, and whether there is something I can do to make
> > Hive-loaded Parquet less doggy by comparison, or if that is just a function
> > of more files to read.
> >
> > On Fri, Sep 4, 2015 at 4:48 PM, Jacques Nadeau <[email protected]> wrote:
> >
> > > You can create a view that renames the columns to whatever you like. For
> > > example:
> > >
> > > CREATE VIEW mydata AS SELECT dir0 as "year", dir1 as "month", dir2 as
> > > "day", dir3 as "hour", a, b, ..., z FROM `/warehouse/database/table/`
> > >
> > > Then you can query: select * from mydata where year = 2012
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Fri, Sep 4, 2015 at 2:35 PM, Grant Overby (groverby)
> > > <[email protected]> wrote:
> > >
> > > > I'm using Parquet files in HDFS. My files are stored thusly:
> > > >
> > > > /warehouse/database/table/field0/field1/field2/field3/fileX.parquet
> > > >
> > > > I'd like to give a name to field0..3 that could be used in queries
> > > > instead of dir0, dir1, dir2, dir3. Is this possible?
> > > >
> > > > Grant Overby
> > > > Software Engineer
> > > > Cisco.com
> > > > [email protected]
> > > > Mobile: 865 724 4910
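For reference, a minimal sketch of the regexp_replace view described in the quoted thread, assuming a writable dfs.tmp workspace; the table path, view name, and payload columns col_a/col_b are hypothetical, while the regexp_replace expressions come from John's message above:

  -- Sketch only: dfs.tmp, hive_table_view, the table path, col_a, and col_b
  -- are placeholders.
  CREATE VIEW dfs.tmp.`hive_table_view` AS
  SELECT
    regexp_replace(dir0, 'day=', '')  AS `day`,   -- strip the Hive 'day=' prefix from the first directory level
    regexp_replace(dir1, 'hour=', '') AS `hour`,  -- strip the 'hour=' prefix from the second level
    col_a,
    col_b
  FROM dfs.`/path/to/hive/table`;

A query such as select count(1) from dfs.tmp.`hive_table_view` where `day` >= '2015-01-01' then filters on the clean column names, and per the testing above the directory-based pruning still applies.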
