Re: Hive query on ORC table is really slow compared to Presto

2017-06-22 Thread Gopal Vijayaraghavan
> 1711647 -1032220119 Ok, so this is the hashCode skew issue, probably the one we already know about. https://github.com/apache/hive/commit/fcc737f729e60bba5a241cf0f607d44f7eac7ca4 String hashcode distribution is much better in master after that. Hopefully that fixes the distinct speed issue

Re: Hive query on ORC table is really slow compared to Presto

2017-06-21 Thread Mich Talebzadeh
With ORC tables have you tried set hive.vectorized.execution.enabled = true; set hive.vectorized.execution.reduce.enabled = true; SET hive.exec.parallel=true; -- set hive.optimize.ppd=true; HTH Dr Mich Talebzadeh LinkedIn *

Re: Hive query on ORC table is really slow compared to Presto

2017-06-21 Thread Premal Shah
Gopal, Thanx for the debugging steps. Here's the output *hive> select count(1) as collisions, hash(ip) from table group by hash(ip) order by collisions desc limit 10;* 4 -1432955330 4 -317748560 4 -1460629578 4 1486313154 4 -320519155 4 1875999753 4

Re: Hive query on ORC table is really slow compared to Presto

2017-06-14 Thread Gopal Vijayaraghavan
> SELECT COUNT(DISTINCT ip) FROM table - 71 seconds > SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds Ok, I misunderstood your gist. > While ip is more unique that id, ip runs many times faster than id. > > How can I debug this ? Nearly the same way - just replace "ip" with "id" in my

Re: Hive query on ORC table is really slow compared to Presto

2017-06-14 Thread Premal Shah
Hi Gopal, Thanx for the reply. I just want to clarify a few things. 1. The count distinct ip query runs fast and so it's not a problem 2. I would not expect the ip column to use DICTIONARY encoding too 3. I am more concerned about the count distinct id or count distinct master_id column which if

Re: Hive query on ORC table is really slow compared to Presto

2017-06-12 Thread Gopal Vijayaraghavan
Hi, I think this is worth fixing because this seems to be triggered by the data quality itself - so let me dig in a bit into a couple more scenarios. > hive.optimize.distinct.rewrite is True by default FYI, we're tackling the count(1) + count(distinct col) case in the Optimizer now (which

Re: Hive query on ORC table is really slow compared to Presto

2017-06-12 Thread Michael Segel
Silly question… What about using COUNT() and a GROUP BY() instead? I’m going from memory…. this may or may not work. Since you want the row_id only in order to de-dupe, right? On Jun 12, 2017, at 3:59 PM, Premal Shah > wrote: Thanx

Re: Hive query on ORC table is really slow compared to Presto

2017-06-12 Thread Premal Shah
Thanx Gopal. Sorry, took me a few days to respond. Here are some findings. hive.optimize.distinct.rewrite is True by default I do see Reducer 2 + 3. However, this might be worth mentioning. The distinct query on an ORC table takes a ton of time. I created a table with the TEXTFILE format from

Re: Hive query on ORC table is really slow compared to Presto

2017-04-04 Thread Gopal Vijayaraghavan
> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts; … > 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s] I'm hoping this is not rewriting to the approx_distinct() in Presto. > I got similar performance with Hive + LLAP too. This is a logical plan issue, so I don't know if LLAP helps a lot. A

Hive query on ORC table is really slow compared to Presto

2017-04-04 Thread Premal Shah
Hi, I have an ORC table with around 9 million rows. It has an ID column. We are running a query to make sure that are no duplicate IDs. This is the query *SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;* This is the output from the presto shell presto:test> SELECT COUNT(*), COUNT(DISTINCT