Good find on the replace. I did testing between regexp_replace, `replace`, and substring and in my use case, the results about the same. Give the 162 partitions, perhaps it doesn't make that big of a difference.
On Tue, Sep 8, 2015 at 1:35 PM, Jason Altekruse <[email protected]> wrote: > For anyone who finds this thread, I noticed that we also have a `replace` > method that does not use regex that will be able to provide the exact same > functionality without the copy. The modification I posted earlier works in > this case, but would not be useful for general purpose replace expressions. > > Unfortunately "replace" is a considered SQL keyword by our parser, see [1] > , so to invoke the function you will have to put it in `backticks`. > > The documentation does not seem to mention this function, so I will add > this to my JIRA to enhance the docs. > > [1] - https://issues.apache.org/jira/browse/DRILL-1441 > > On Tue, Sep 8, 2015 at 11:22 AM, Jason Altekruse <[email protected] > > > wrote: > > > One thing you can do to speed up the expression evaluation is to use this > > expression instead of regex_replace. This will avoid copying each value > > into a short lived String object which unfortunately is the only > interface > > available on the java regex library we are using within the function. We > > should probably look into updating these functions to try to avoid the > > extra copy. > > > > substring(dir0, strpos(dir0, '=') + 1) > > > > For now it would probably be best to include some common alternatives to > > simple regexes in the docs to help people avoid this slowdown when it > isn't > > needed. I'll open a JIRA for this. > > > > > > On Tue, Sep 8, 2015 at 9:24 AM, John Omernik <[email protected]> wrote: > > > >> Aman - The reason I wanted to utilize the regex_replace was the use case > >> of > >> being able to take hive loaded and partitioned data, and present it to a > >> drill user without having them to understand the dir0, dir1 semantics. > >> Basically, a person using the data should be able to use the day='' that > >> they may be used to without having them grok and parse in their head > what > >> dir0='day=2015-04-01' etc. But you are correct, the view is only on > the > >> hive created data, because I don't need the view in the Drill loaded > data. > >> I appreciate the exchange on this, its very interesting and helpful to > me > >> (and hopefully others) in understanding the underlying info. > >> > >> Jacques: I will email you the profile directly, you can let the post > here > >> know about the findings, but I would prefer the profile itself stay off > >> the > >> userlist (lots of names of specifics in the profile). > >> > >> Thanks! > >> > >> > >> > >> On Mon, Sep 7, 2015 at 10:50 PM, Jacques Nadeau <[email protected]> > >> wrote: > >> > >> > John, > >> > > >> > Can you post your profile? We should confirm that pruning is working > >> > correctly and whether the time is being spent in planning or > execution. > >> > > >> > thanks, > >> > Jacques > >> > > >> > -- > >> > Jacques Nadeau > >> > CTO and Co-Founder, Dremio > >> > > >> > On Mon, Sep 7, 2015 at 1:02 PM, John Omernik <[email protected]> > wrote: > >> > > >> > > So interestingly enough, this particular table doesn't have 4k files > >> in > >> > it, > >> > > it's actually pretty small, in that there is only 1 file per > >> partition. > >> > > (tiny?) Thus there are only 162 files vs. the 30 that drill > created > >> > when > >> > > reprocessing the table. That probably doesn't help either given that > >> this > >> > > is such small data, the planning takes more time than query. It's > >> cool > >> > > that the team is looking to improve this, I found the ability to > just > >> > have > >> > > my data in Parquet partitioned by drill to be a huge win as well. > The > >> > > enhancements sound like they will enhance this even more, I would > >> love to > >> > > see as close to native drill loaded parquet performance as possible > >> with > >> > > Hive loaded tables, that would allow us to use drill to query, and > >> hive > >> > to > >> > > load. (Using complex transforms, longer running queries etc). > >> > > > >> > > I love drill :) > >> > > > >> > > > >> > > > >> > > On Mon, Sep 7, 2015 at 2:23 PM, Aman Sinha <[email protected]> > >> wrote: > >> > > > >> > > > Hi John, > >> > > > the partition pruning *planning* time is indeed a function of the > >> > number > >> > > of > >> > > > files in the table. The execution time is only dependent on the > >> number > >> > of > >> > > > files in the specified partition. In the Drill loaded Parquet > >> files > >> > you > >> > > > had 30 files whereas in the Hive loaded parquet files you probably > >> had > >> > > 162 > >> > > > directories x 24 hours = about 4000 files ? or somewhere in that > >> > > range... > >> > > > > >> > > > During the query planning phase, Drill partition pruning will load > >> the > >> > > full > >> > > > pathnames of the files in memory, including materializing the > >> > > partitioning > >> > > > columns such as 'day' into memory and apply the `day` >= > >> '2015-01-01` > >> > > > filter. It turns out this process is expensive when there are > lots > >> of > >> > > > files even if they are spread out over multiple directories. I > >> > believe > >> > > > there's an enhancement JIRA to make this process efficient by > >> loading > >> > > only > >> > > > directory names first and then the files...if not, I will create a > >> > JIRA. > >> > > > > >> > > > Note that partition pruning is still a huge win for more complex > >> > queries > >> > > > when the total execution time is substantially longer than the > >> planning > >> > > > time. It is only for shorter running queries against large number > >> of > >> > > files > >> > > > where the planning times becomes more dominant. There is ongoing > >> > effort > >> > > to > >> > > > improve that. > >> > > > > >> > > > Aman > >> > > > > >> > > > > >> > > > On Mon, Sep 7, 2015 at 10:15 AM, John Omernik <[email protected]> > >> > wrote: > >> > > > > >> > > > > As a follow-up to Jacques email, I did some testing with Parquet > >> > files > >> > > > > created and partitioned by Apache Hive. (Not using the metastore > >> to > >> > > read > >> > > > > these files, just using the directories and reading the Parquet > >> files > >> > > > > directly). > >> > > > > > >> > > > > Consider that Hive's partition scheme makes directories that > have > >> > > > > partitionfield=partitionvalue as the directory name like this: > >> > > > > > >> > > > > > >> > > > > table > >> > > > > ---day=2015-09-06 > >> > > > > -------hour=00 > >> > > > > -------hour=01 > >> > > > > -------hour=02 > >> > > > > ---day=2015-09-07 > >> > > > > -------hour=00 > >> > > > > -------hour=01 > >> > > > > -------hour=02 > >> > > > > > >> > > > > > >> > > > > Basically in this case, to use hives partition directory scheme > >> (with > >> > > > > parquet and without the metastore) you would have to write > >> queries > >> > > such > >> > > > > as: > >> > > > > > >> > > > > select * from table where dir0 >= 'day=2015-09-06' and dir1 = > >> > 'hour=01' > >> > > > > > >> > > > > or > >> > > > > > >> > > > > select * from table where dir0 = 'day=2015-09-06' and dir1 < > >> > 'hour=03' > >> > > > > > >> > > > > These are "doable" but are prone to user errors (what happens > they > >> > put > >> > > a > >> > > > > space between hour, =, and the hour) are non intuitive (likes > >> become > >> > > more > >> > > > > complicated) etc. > >> > > > > > >> > > > > > >> > > > > So what I did was use the regexp_replace function in Drill to > >> create > >> > a > >> > > > view > >> > > > > that instead of using dir0 or dir1 directly, I could just work > >> with > >> > the > >> > > > > "name" and the value... like this > >> > > > > > >> > > > > regexp_replace(dir0, 'day=', '') as `day`, regexp_replace(dir1, > >> > > 'hour=', > >> > > > > '') as `hour` > >> > > > > > >> > > > > That allowed me to use the Hive directories easily and > >> intuitively, > >> > > > without > >> > > > > changing the directories. > >> > > > > > >> > > > > I will say that performance wasn't great compared to natively > >> loaded > >> > > > (drill > >> > > > > loaded) parquet files. > >> > > > > > >> > > > > For example, where I did one query on the hive data that was: > >> > > > > > >> > > > > select count(1) from table where day >= '2015-01-01' using the > >> hive > >> > > > loaded > >> > > > > tables with drill and the drill view it took 26 seconds > >> > > > > > >> > > > > When I loaded the whole table into a new parquet (from the hive > >> view) > >> > > > table > >> > > > > in drill, and specified partition by `day`, `hour` the same > query > >> ran > >> > > in > >> > > > > 1.08 seconds. Not sure why this is, perhaps there is more work > >> the > >> > > drill > >> > > > > engine has to do, perhaps Hive isn't writing parquet file stats > >> well, > >> > > > > perhaps just more IO in that with the drill created table, there > >> was > >> > 30 > >> > > > > files created, in the hive table there was at least 162 unique > >> > > partitions > >> > > > > (not even counting files) given the directory structure. > Another > >> > > example > >> > > > > of performance difference: > >> > > > > > >> > > > > select `day`, `hour` from drill_parquet_table where `day` >= > >> > > > '2015-01-01`: > >> > > > > 162 rows in 1.6 seconds > >> > > > > > >> > > > > select `day`, `hour` from hive_parquet_table where `day` >= > >> > > '2015-01-01`: > >> > > > > 162 rows in 27.2 seconds > >> > > > > > >> > > > > Interesting stuff, but the regex_replace does give partition > >> pruning > >> > > > based > >> > > > > on testing. I.e. on the hive table, select count(1) from > hivetable > >> > > where > >> > > > > `day` >= '2015-01-01' runs much faster than select count(1) > from > >> > > > hivetable > >> > > > > where `day` >= '2014-01-01' indicating to me that is indeed not > >> > reading > >> > > > the > >> > > > > directories that were older than 2015-01-01 on the >= > '2015-01-01' > >> > > query. > >> > > > > > >> > > > > * Note my observations are that of a drill rookie, so if drill > >> > experts > >> > > > have > >> > > > > any thoughts on what I wrote about my observations, I'd happily > >> > > defer. I > >> > > > > would be interested in a drill expert commenting on the speed of > >> the > >> > > > Drill > >> > > > > loaded Parquet files vs Hive Loaded Parquet files, and if there > is > >> > > > > something I can do make Hive loaded parquet less doggy > >> > comparatively, > >> > > or > >> > > > > if that is just a function of more files to read. > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > On Fri, Sep 4, 2015 at 4:48 PM, Jacques Nadeau < > >> [email protected]> > >> > > > wrote: > >> > > > > > >> > > > > > You can create a view that renames the columns to whatever you > >> > like. > >> > > > For > >> > > > > > example: > >> > > > > > > >> > > > > > CREATE VIEW mydata AS SELECT dir0 as "year", dir1 as "month", > >> dir2 > >> > as > >> > > > > > "day", dir3 as "hour", a, b, ..., z FROM > >> > > `/warehouse/database/table/` > >> > > > > > > >> > > > > > Then you can query: select * from mydata where year = 2012 > >> > > > > > > >> > > > > > -- > >> > > > > > Jacques Nadeau > >> > > > > > CTO and Co-Founder, Dremio > >> > > > > > > >> > > > > > On Fri, Sep 4, 2015 at 2:35 PM, Grant Overby (groverby) < > >> > > > > > [email protected]> > >> > > > > > wrote: > >> > > > > > > >> > > > > > > I’m using parquet files in hdfs. My files are stored thusly: > >> > > > > > > > >> > > > > > > > >> > /warehouse/database/table/field0/field1/field2/field3/fileX.parquet > >> > > > > > > > >> > > > > > > I’d like to give a name to field0..3 that could be used in > >> > queries > >> > > in > >> > > > > > > stead of dir0, dir1, dir2, dir3. Is this possible? > >> > > > > > > [ > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > http://www.cisco.com/web/europe/images/email/signature/est2014/logo_06.png?ct=1398192119726 > >> > > > > > > ] > >> > > > > > > > >> > > > > > > Grant Overby > >> > > > > > > Software Engineer > >> > > > > > > Cisco.com<http://www.cisco.com/> > >> > > > > > > [email protected]<mailto:[email protected]> > >> > > > > > > Mobile: 865 724 4910 > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > [ > http://www.cisco.com/assets/swa/img/thinkbeforeyouprint.gif] > >> > > Think > >> > > > > > > before you print. > >> > > > > > > > >> > > > > > > This email may contain confidential and privileged material > >> for > >> > the > >> > > > > sole > >> > > > > > > use of the intended recipient. Any review, use, distribution > >> or > >> > > > > > disclosure > >> > > > > > > by others is strictly prohibited. If you are not the > intended > >> > > > recipient > >> > > > > > (or > >> > > > > > > authorized to receive for the recipient), please contact the > >> > sender > >> > > > by > >> > > > > > > reply email and delete all copies of this message. > >> > > > > > > > >> > > > > > > Please click here< > >> > > > > > > > >> > http://www.cisco.com/web/about/doing_business/legal/cri/index.html > >> > > > > >> > > > > for > >> > > > > > > Company Registration Information. > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > > > >
