> 1711647 -1032220119

Ok, so this is the hashCode skew issue, probably the one we already know about.

https://github.com/apache/hive/commit/fcc737f729e60bba5a241cf0f607d44f7eac7ca4

String hashcode distribution is much better in master after that. Hopefully 
that fixes the distinct speed issue here.

> Turning off map side aggregations definitely helped the query on id . The 
> query time went to 1 minute from the earlier 3+ hours. 
> 
> Based on the output above, both id and name have a lot of collisions, but the 
> name query was fast earlier too which is interesting.

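The String equals check has a fast path that returns false as soon as the two 
lengths differ, so equal-width id columns (where a colliding pair always falls 
through to comparing the contents) and varying-width name columns might have 
very different performance characteristics.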
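
To make the length fast path concrete, here is a minimal sketch in Java. It is 
not Hive's actual comparison code; the class name, the method and the sample 
values are all made up for illustration. It only counts how many bytes an 
equals-style check has to look at before it can answer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Hive code: counts the byte comparisons an
// equals-style check performs for two values that landed in the same bucket.
final class LengthFastPathSketch {

    static int comparisons(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return 0; // fast path: lengths differ, no bytes are compared
        }
        int n = 0;
        for (int i = 0; i < a.length; i++) {
            n++;
            if (a[i] != b[i]) {
                break; // first mismatch ends the comparison
            }
        }
        return n;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Made-up equal-width ids that share a long common prefix.
        System.out.println(comparisons(bytes("id-0000001"), bytes("id-0000002")));
        // Made-up names of different widths: rejected by the length check alone.
        System.out.println(comparisons(bytes("alice"), bytes("bob")));
    }
}
```

With fixed-width ids that share a prefix, every colliding pair pays for most of 
the width; with names of varying width, many colliding pairs are rejected 
without touching the contents.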

The collisions also build a binary tree in each hash bucket (JEP 180), which is 
sensitive to the order of inserts for its CPU usage (balancing trees do a lot 
of rebalancing if you insert pre-sorted data into them).
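
If you want to poke at the insert-order effect outside Hive, something like 
the following sketch should work on Java 8 or later. It is an assumption-laden 
toy: the Key class, its constant hashCode and the key count are invented here 
to force every entry into one bucket, and compareTo calls are only a rough 
proxy for CPU cost (tree rotations are not counted). I have not included 
numbers because they depend on the JDK.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch, not Hive code: forces all keys into a single HashMap
// bucket and counts compareTo calls made while the bucket's tree is built.
final class CollidingBucketSketch {

    static long compareCalls = 0;

    static final class Key implements Comparable<Key> {
        final int id;

        Key(int id) { this.id = id; }

        // Constant hash: every key collides (an extreme case of skew).
        @Override public int hashCode() { return 42; }

        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }

        // HashMap uses this ordering inside a treeified bucket.
        @Override public int compareTo(Key other) {
            compareCalls++;
            return Integer.compare(id, other.id);
        }
    }

    // Builds keys 0..n-1 in ascending (pre-sorted) order.
    static List<Key> keys(int n) {
        List<Key> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(new Key(i));
        }
        return out;
    }

    // Inserts the keys in list order and returns the compareTo calls used.
    static long insertAll(List<Key> keys, Map<Key, Integer> map) {
        compareCalls = 0;
        for (Key k : keys) {
            map.put(k, k.id);
        }
        return compareCalls;
    }

    public static void main(String[] args) {
        List<Key> sorted = keys(5000);
        List<Key> shuffled = new ArrayList<>(sorted);
        Collections.shuffle(shuffled, new Random(1));

        System.out.println("sorted inserts:   "
                + insertAll(sorted, new HashMap<>()) + " compareTo calls");
        System.out.println("shuffled inserts: "
                + insertAll(shuffled, new HashMap<>()) + " compareTo calls");
    }
}
```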

All that code runs only if hive.map.aggr=true; if that is disabled, then all 
data is shuffled to reducers using a Murmur3 hash (Tez ReduceSinks are marked 
UNIFORM|AUTOPARALLEL to indicate this).

Cheers,
Gopal

