subject:"Hive query on ORC table is really slow compared to Presto"

Re: Hive query on ORC table is really slow compared to Presto

2017-06-22 Thread Gopal Vijayaraghavan

> 1711647 -1032220119 Ok, so this is the hashCode skew issue, probably the one we already know about. https://github.com/apache/hive/commit/fcc737f729e60bba5a241cf0f607d44f7eac7ca4 String hashcode distribution is much better in master after that. Hopefully that fixes the distinct speed issue

Re: Hive query on ORC table is really slow compared to Presto

2017-06-21 Thread Mich Talebzadeh

With ORC tables have you tried set hive.vectorized.execution.enabled = true; set hive.vectorized.execution.reduce.enabled = true; SET hive.exec.parallel=true; -- set hive.optimize.ppd=true; HTH Dr Mich Talebzadeh LinkedIn *

Re: Hive query on ORC table is really slow compared to Presto

2017-06-21 Thread Premal Shah

Gopal, Thanx for the debugging steps. Here's the output *hive> select count(1) as collisions, hash(ip) from table group by hash(ip) order by collisions desc limit 10;* 4 -1432955330 4 -317748560 4 -1460629578 4 1486313154 4 -320519155 4 1875999753 4

Re: Hive query on ORC table is really slow compared to Presto

2017-06-14 Thread Gopal Vijayaraghavan

> SELECT COUNT(DISTINCT ip) FROM table - 71 seconds > SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds Ok, I misunderstood your gist. > While ip is more unique that id, ip runs many times faster than id. > > How can I debug this ? Nearly the same way - just replace "ip" with "id" in my

Re: Hive query on ORC table is really slow compared to Presto

2017-06-14 Thread Premal Shah

Hi Gopal, Thanx for the reply. I just want to clarify a few things. 1. The count distinct ip query runs fast and so it's not a problem 2. I would not expect the ip column to use DICTIONARY encoding too 3. I am more concerned about the count distinct id or count distinct master_id column which if

Re: Hive query on ORC table is really slow compared to Presto

2017-06-12 Thread Gopal Vijayaraghavan

Hi, I think this is worth fixing because this seems to be triggered by the data quality itself - so let me dig in a bit into a couple more scenarios. > hive.optimize.distinct.rewrite is True by default FYI, we're tackling the count(1) + count(distinct col) case in the Optimizer now (which

Re: Hive query on ORC table is really slow compared to Presto

2017-06-12 Thread Michael Segel

Silly question… What about using COUNT() and a GROUP BY() instead? I’m going from memory…. this may or may not work. Since you want the row_id only in order to de-dupe, right? On Jun 12, 2017, at 3:59 PM, Premal Shah > wrote: Thanx

Re: Hive query on ORC table is really slow compared to Presto

2017-06-12 Thread Premal Shah

Thanx Gopal. Sorry, took me a few days to respond. Here are some findings. hive.optimize.distinct.rewrite is True by default I do see Reducer 2 + 3. However, this might be worth mentioning. The distinct query on an ORC table takes a ton of time. I created a table with the TEXTFILE format from

Re: Hive query on ORC table is really slow compared to Presto

2017-04-04 Thread Gopal Vijayaraghavan

> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts; … > 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s] I'm hoping this is not rewriting to the approx_distinct() in Presto. > I got similar performance with Hive + LLAP too. This is a logical plan issue, so I don't know if LLAP helps a lot. A

Hive query on ORC table is really slow compared to Presto

2017-04-04 Thread Premal Shah

Hi, I have an ORC table with around 9 million rows. It has an ID column. We are running a query to make sure that are no duplicate IDs. This is the query *SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;* This is the output from the presto shell presto:test> SELECT COUNT(*), COUNT(DISTINCT

Re: Hive query on ORC table is really slow compared to Presto

Re: Hive query on ORC table is really slow compared to Presto

Re: Hive query on ORC table is really slow compared to Presto

Re: Hive query on ORC table is really slow compared to Presto

Re: Hive query on ORC table is really slow compared to Presto

Re: Hive query on ORC table is really slow compared to Presto

Re: Hive query on ORC table is really slow compared to Presto

Re: Hive query on ORC table is really slow compared to Presto

Re: Hive query on ORC table is really slow compared to Presto

Hive query on ORC table is really slow compared to Presto

10 matches

Site Navigation

Mail list logo

Footer information