In Reduce5, I see a long pause during fetch occasionally.
This is likely the TCP listen queue overflow issue; it just doesn't get
reported as a packet loss issue because the retry works okay.
https://issues.apache.org/jira/browse/MAPREDUCE-6763
That's the fix to be applied on the YARN Shuffle handler.
You can confirm the change by checking the listen backlog (the Send-Q
column for a listening socket) in the output of
# ss -tln
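To pick out just the shuffle handler's socket from that output, a small filter along these lines may help. This is a rough sketch, not part of the fix: listen_backlog is a hypothetical helper name, 13562 is assumed to be the shuffle port (the default for mapreduce.shuffle.port, so adjust if your cluster overrides it), and the column layout is assumed to be the usual "State Recv-Q Send-Q Local-Address:Port ..." that ss -tln prints.

```shell
# Hypothetical helper: read `ss -tln` output on stdin and print the accept
# queue depth (Recv-Q) and configured backlog (Send-Q) for one port.
listen_backlog() {
  awk -v port="$1" '
    $1 == "LISTEN" && $4 ~ (":" port "$") {
      print $4, "queued=" $2, "backlog=" $3
    }'
}

# Assumed port: 13562 (default mapreduce.shuffle.port).
if command -v ss >/dev/null 2>&1; then
  ss -tln | listen_backlog 13562
fi
```

If the backlog value still shows the old, small number after the patch, the daemon most likely has not been restarted since the change.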
I usually diagnose it by checking for SYN cookie messages in dmesg or by
looking at the SNMP counters.
# netstat -s | grep -i overflow
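On boxes where netstat is not installed, the same counters can be read straight from /proc/net/netstat. The following is only a sketch under that assumption: tcpext_counters is a hypothetical helper name, and it relies on the TcpExt section being a header line of counter names followed by a line of values, with ListenOverflows, ListenDrops and SyncookiesSent among them.

```shell
# Hypothetical helper: print selected TcpExt counters (comma-separated
# names in $1) from /proc/net/netstat-formatted text on stdin.
tcpext_counters() {
  awk -v want="$1" '
    BEGIN { n = split(want, names, ",") }
    $1 == "TcpExt:" {
      if (!seen) {
        # first TcpExt line: remember the column index of each name
        for (i = 2; i <= NF; i++) col[$i] = i
        seen = 1
      } else {
        # second TcpExt line: print the requested values
        for (j = 1; j <= n; j++)
          if (names[j] in col) print names[j] "=" $(col[names[j]])
      }
    }'
}

if [ -r /proc/net/netstat ]; then
  tcpext_counters ListenOverflows,ListenDrops,SyncookiesSent < /proc/net/netstat
fi
```

Running it twice a few seconds apart during a shuffle-heavy stage and comparing the numbers shows whether the overflow counter is still climbing.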
This issue also affects the HDFS NameNode, where it also usually goes
unreported by users.
https://issues.apache.org/jira/browse/HADOOP-16504
The delay is usually 2 * the TCP maximum segment lifetime (MSL) and is
usually reduced by increasing the OS half-open connection count.
I end up doing
# sysctl -w net.core.somaxconn=16384
# sysctl -w net.ipv4.tcp_fin_timeout=2
to speed up the retries, and then restarting the daemons so that their
listen sockets pick up the new backlog limit.
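Before and after changing anything, it can be useful to print the current values. A minimal sketch, assuming a Linux /proc/sys layout: show_sysctl and the SYSCTL_ROOT override are hypothetical names used only for this example, and net.ipv4.tcp_max_syn_backlog (the kernel's limit on half-open connections) is included as my own addition to the two keys set above.

```shell
# Hypothetical helper: print "key = value" for each sysctl key given,
# reading the files under /proc/sys directly (read-only, changes nothing).
show_sysctl() {
  root="${SYSCTL_ROOT:-/proc/sys}"   # SYSCTL_ROOT is an illustrative override
  for key in "$@"; do
    # net.core.somaxconn -> net/core/somaxconn
    path="$root/$(printf '%s' "$key" | tr . /)"
    if [ -r "$path" ]; then
      printf '%s = %s\n' "$key" "$(cat "$path")"
    else
      printf '%s = (not readable)\n' "$key"
    fi
  done
}

show_sysctl net.core.somaxconn net.ipv4.tcp_fin_timeout net.ipv4.tcp_max_syn_backlog
```

Note that values set with sysctl -w do not survive a reboot unless they are also written to the system's sysctl configuration.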
This affects Tez a little more than MRv2, because the same JVM runs
multiple instances of the same vertex sequentially, instead of starting a
new JVM for every task (which is much slower and so reduces the
concurrency of connections).
Cheers,
Gopal