> 1711647 -1032220119

Ok, so this is the hashCode skew issue, probably the one we already know about.
https://github.com/apache/hive/commit/fcc737f729e60bba5a241cf0f607d44f7eac7ca4

String hashcode distribution is much better in master after that. Hopefully that fixes the distinct speed issue here.

> Turning off map side aggregations definitely helped the query on id. The
> query time went to 1 minute from the earlier 3+ hours.
>
> Based on the output above, both id and name have a lot of collisions, but the
> name query was fast earlier too which is interesting.

The String equals check has a fast path on length: strings of different lengths are rejected without comparing any characters. So equal-width id columns and different-width name columns might have very different performance characteristics.

The collisions also build a binary tree in each hash bucket (JEP 180), which is sensitive to the order of inserts for its CPU usage (balancing trees do a lot of rebalancing if you insert pre-sorted data into them).

All of that code runs only if hive.map.aggr=true. If that is disabled, all data is shuffled to reducers using a Murmur3 hash (Tez ReduceSinks are marked UNIFORM|AUTOPARALLEL to indicate this).

Cheers,
Gopal
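
For anyone who wants to see the collision effect in isolation, here is a minimal sketch. It uses a plain java.util.HashMap with String keys, not Hive's aggregation code, and the class and method names (CollisionSketch, collidingKeys, distinctHashCodes) are made up for illustration. It relies on the fact that "Aa" and "BB" have the same String.hashCode(), so concatenating those blocks gives equal-width keys that all share one hash code. That is the worst case for the equals check (no length mismatch to bail out on) and puts every key in one bucket.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CollisionSketch {

    // "Aa" and "BB" both hash to 2112, so every concatenation of these
    // two blocks produces keys of equal width with an identical hashCode().
    static List<String> collidingKeys(int blocks) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>(keys.size() * 2);
            for (String prefix : keys) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            keys = next;
        }
        return keys;
    }

    // Counts how many different hash codes the given keys produce.
    static int distinctHashCodes(List<String> keys) {
        Set<Integer> hashes = new HashSet<>();
        for (String k : keys) {
            hashes.add(k.hashCode());
        }
        return hashes.size();
    }

    public static void main(String[] args) {
        // 2^10 = 1024 distinct keys, each 20 characters wide.
        List<String> keys = collidingKeys(10);

        // Count occurrences per key, roughly what a map-side COUNT would do.
        Map<String, Integer> counts = new HashMap<>();
        for (String k : keys) {
            counts.merge(k, 1, Integer::sum);
        }

        System.out.println("keys=" + keys.size()
                + " distinct hash codes=" + distinctHashCodes(keys)
                + " map entries=" + counts.size());
    }
}
```

All 1024 keys are distinct but land in a single bucket, so every insert has to walk that bucket's tree and run full-width comparisons. Real id columns will not be this extreme, but the mechanism is the same.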