This was indeed caused by the network backend dropping several
outgoing packets. I'm not sure why this wasn't "caught" by TCP.
We ended up setting send_queue_size=256 recv_queue_size=512 for
ib_ipoib and krcvqs=4 for hfi1. We also updated our OmniPath switch
firmware to the current
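
For anyone who wants to check whether the module parameters mentioned above are actually in effect on a node, here is a minimal sketch. It assumes the kernel exposes them under /sys/module/<module>/parameters/ (the usual location for readable module parameters) and that each value reads back as a single number. The helper names read_module_params and check are made up for illustration, and the expected values are just the ones quoted above.

```python
from pathlib import Path

# Values quoted in the message above; adjust for your own setup.
EXPECTED = {
    "ib_ipoib": {"send_queue_size": "256", "recv_queue_size": "512"},
    "hfi1": {"krcvqs": "4"},
}


def read_module_params(module, names, sysfs_root="/sys/module"):
    """Return {name: value or None} for the given parameters of a module.

    None means the file could not be read, e.g. the module is not loaded
    or the parameter is not exported (assumption: sysfs layout as above).
    """
    params_dir = Path(sysfs_root) / module / "parameters"
    result = {}
    for name in names:
        try:
            result[name] = (params_dir / name).read_text().strip()
        except OSError:
            result[name] = None
    return result


def check(expected=EXPECTED, sysfs_root="/sys/module"):
    """Return a list of (module, parameter, wanted, actual) mismatches."""
    mismatches = []
    for module, wanted in expected.items():
        actual = read_module_params(module, wanted, sysfs_root)
        for name, value in wanted.items():
            if actual[name] != value:
                mismatches.append((module, name, value, actual[name]))
    return mismatches


if __name__ == "__main__":
    for module, name, wanted, actual in check():
        print(f"{module}.{name}: wanted {wanted}, found {actual}")
```

Running it on each worker node prints one line per parameter that differs from the expected value and nothing when everything matches. The sysfs_root argument only exists so the check can be pointed at a test directory.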
Hello!
I have some weird problems with Spark running on top of YARN (Spark 2.2
on Cloudera CDH 5.12).
There are a lot of "java.net.SocketException: Network is unreachable"
errors in the executors. Part of a log file is here:
http://support.l3s.de/~zab/spark-errors.txt. The jobs also fail at
rather random