Re: Failing jobs with Spark 2.2 running on Yarn with HDFS

2017-08-31 Thread Jan-Hendrik Zab
This was indeed caused by the network backend, which dropped several outgoing packets. I'm not sure why this wasn't "caught" by TCP. We ended up setting send_queue_size=256 recv_queue_size=512 for ib_ipoib and krcvqs=4 for hfi1. We also updated our OmniPath switch firmware to the current

Failing jobs with Spark 2.2 running on Yarn with HDFS

2017-08-18 Thread Jan-Hendrik Zab
Hello! I'm seeing some weird problems with Spark running on top of Yarn (Spark 2.2 on Cloudera CDH 5.12). There are a lot of "java.net.SocketException: Network is unreachable" errors in the executors (part of a log file: http://support.l3s.de/~zab/spark-errors.txt), and the jobs also fail at rather random