> 1711647 -1032220119
Ok, so this is the hashCode skew issue, probably the one we already know about.
https://github.com/apache/hive/commit/fcc737f729e60bba5a241cf0f607d44f7eac7ca4
String hashcode distribution is much better in master after that. Hopefully
that fixes the distinct speed issue
With ORC tables have you tried
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
SET hive.exec.parallel=true;
--
set hive.optimize.ppd=true;
HTH
Dr Mich Talebzadeh
LinkedIn *
Gopal,
Thanx for the debugging steps.
Here's the output
*hive> select count(1) as collisions, hash(ip) from table group by hash(ip)
order by collisions desc limit 10;*
4 -1432955330
4 -317748560
4 -1460629578
4 1486313154
4 -320519155
4 1875999753
4
> SELECT COUNT(DISTINCT ip) FROM table - 71 seconds
> SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds
Ok, I misunderstood your gist.
> While ip is more unique that id, ip runs many times faster than id.
>
> How can I debug this ?
Nearly the same way - just replace "ip" with "id" in my
Hi Gopal,
Thanx for the reply.
I just want to clarify a few things.
1. The count distinct ip query runs fast and so it's not a problem
2. I would not expect the ip column to use DICTIONARY encoding too
3. I am more concerned about the count distinct id or count distinct
master_id column which if
Hi,
I think this is worth fixing because this seems to be triggered by the data
quality itself - so let me dig in a bit into a couple more scenarios.
> hive.optimize.distinct.rewrite is True by default
FYI, we're tackling the count(1) + count(distinct col) case in the Optimizer
now (which
Silly question…
What about using COUNT() and a GROUP BY() instead?
I’m going from memory…. this may or may not work. Since you want the row_id
only in order to de-dupe, right?
On Jun 12, 2017, at 3:59 PM, Premal Shah
> wrote:
Thanx
Thanx Gopal.
Sorry, took me a few days to respond. Here are some findings.
hive.optimize.distinct.rewrite is True by default
I do see Reducer 2 + 3.
However, this might be worth mentioning. The distinct query on an ORC table
takes a ton of time. I created a table with the TEXTFILE format from
> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
…
> 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s]
I'm hoping this is not rewriting to the approx_distinct() in Presto.
> I got similar performance with Hive + LLAP too.
This is a logical plan issue, so I don't know if LLAP helps a lot.
A
Hi,
I have an ORC table with around 9 million rows. It has an ID column. We are
running a query to make sure that are no duplicate IDs. This is the query
*SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;*
This is the output from the presto shell
presto:test> SELECT COUNT(*), COUNT(DISTINCT
10 matches
Mail list logo