Correct! It is the TCP listen overflow issue! Thanks for your help, Gopal V and Kuhu!
Using the command below, I can see many overflows:

[root@node-ana-coreLKpD0001 ~]# netstat -s | grep -i overflow
    96282 times the listen queue of a socket overflowed
    TCPTimeWaitOverflow: 2499680

net.core.somaxconn defaults to 128 on my cluster, which is too small. I ran about 10 queries concurrently with the following setting:

set hive.tez.auto.reducer.parallelism=true

which made the situation worse. After following Gopal V's instructions below, the long pause in the fetch phase disappeared.

------------------ Original ------------------
From: "Gopal V" <gop...@apache.org>
Send time: Monday, Jun 1, 2020 1:49 PM
To: "user" <user@tez.apache.org>
Subject: Re: tez shuffle fetch phase has long pause

> In Reduce5, i see long pause during fetch occasionally

This is likely the TCP listen overflow issue, but it just doesn't get reported as a packet loss issue because the retry works okay.

https://issues.apache.org/jira/browse/MAPREDUCE-6763

That's the fix to be applied on the YARN Shuffle handler. You can confirm the change by running

# ss -tln

I usually diagnose it by checking for TCP cookies in the dmesg or by looking at the snmp data.

# netstat -s | grep -i overflow

This issue also affects the HDFS NameNode and also usually goes unreported by users.

https://issues.apache.org/jira/browse/HADOOP-16504

The delay is usually 2 * tcp max-segment-length & is usually reduced by increasing the OS half-open connection count. I end up doing

# sysctl -w net.core.somaxconn=16384
# sysctl -w net.ipv4.tcp_fin_timeout=2

to speed up the retries & restarting daemons.

This affects Tez a little worse than MRv2, because the same JVM runs multiple instances of the same vertex sequentially, instead of a new JVM for every task (which runs way slower, reducing the concurrency of connections).

Cheers,
Gopal
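
PS for anyone else who hits this: a rough sketch of keeping the sysctl change across reboots and confirming it took effect (this assumes a Linux host that reads /etc/sysctl.d; the file name below is just an example):

# cat /etc/sysctl.d/99-listen-backlog.conf
net.core.somaxconn = 16384
net.ipv4.tcp_fin_timeout = 2
# sysctl --system
# sysctl net.core.somaxconn
net.core.somaxconn = 16384

After restarting the shuffle daemons, ss -tln should show the larger backlog in the Send-Q column for the listening socket. Note that the kernel caps the effective backlog at the smaller of net.core.somaxconn and the value the application passes to listen(), which is why the MAPREDUCE-6763 change on the shuffle handler side is needed as well. Once both are in place, the counters from "netstat -s | grep -i overflow" should stop climbing.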