Gopal, Thanx for the debugging steps. Here's the output
*hive> select count(1) as collisions, hash(ip) from table group by hash(ip) order by collisions desc limit 10;* 4 -1432955330 4 -317748560 4 -1460629578 4 1486313154 4 -320519155 4 1875999753 4 -1410139032 4 1596671554 4 503687909 4 989075923 *hive> select count(1) as collisions, hash(id) from table group by hash(id) order by collisions desc limit 10;* 1711647 -1032220119 1439738 -1316837863 851204 -330948739 838145 535385402 512621 165206418 308968 -36549075 306190 -1568034366 302932 -1386594327 284935 -1991768757 218979 922811836 *hive> select count(1) as collisions, hash(name) from table group by hash(name) order by collisions desc limit 10;* 1712041 -1906315012 1439738 -1583171535 512630 193448621 340485 2094529 308988 68745436 306240 79997099 289465 -1824055323 218263 1074334059 216464 -466945424 Turning off map side aggregations definitely helped the query on *id . *The query time went to 1 minute from the earlier 3+ hours. Based on the output above, both id and name have a lot of collisions, but the name query was fast earlier too which is interesting. On Wed, Jun 14, 2017 at 10:34 AM, Gopal Vijayaraghavan <gop...@apache.org> wrote: > > > SELECT COUNT(DISTINCT ip) FROM table - 71 seconds > > SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds > > Ok, I misunderstood your gist. > > > While ip is more unique that id, ip runs many times faster than id. > > > > How can I debug this ? > > Nearly the same way - just replace "ip" with "id" in my exploratory > queries. > > count(distinct hash(id)) from the table? > > count count(1) as collisions, hash(id) from table group by hash(id) order > by collisions desc limit 10; > > And, if those show many collisions > > set tez.runtime.io.sort.mb=640; > set hive.map.aggr=false; > set tez.runtime.pipelined.shuffle=true; // this reduces failure tolerance > (i.e retries are more expensive, happy path is faster) > > select count(distinct id) from ip_table; > > Java's hashCode() implementation is pretty horrible (& Hive defaults to > using it). If you're seeing a high collision count, I think I might know > what's happening here. > > Cheers, > Gopal > > > -- Regards, Premal Shah.