[ https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shivaram Venkataraman updated SPARK-2563:
-----------------------------------------
    Description:
In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions.

If the connection attempt times out, the socket gets closed and from [1] we get a ClosedChannelException. We should check if the Socket was closed due to a timeout and open a new socket and try to connect.

FWIW, I was able to work around my problems by increasing the number of SYN retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries)

[1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573

  was:
In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. We should make the number of retries before failing configurable to handle these cases.

    Summary: Re-open sockets to handle connect timeouts  (was: Make number of connection retries configurable)

> Re-open sockets to handle connect timeouts
> ------------------------------------------
>
>                 Key: SPARK-2563
>                 URL: https://issues.apache.org/jira/browse/SPARK-2563
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Shivaram Venkataraman
>            Priority: Minor
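
The re-open behaviour proposed in the description could be sketched roughly as follows. This is only an illustration in plain Java NIO, not Spark's connection code: the class ReconnectSketch, the method connectWithRetry, and the parameters maxAttempts and timeoutMillis are hypothetical names chosen for this example. It also assumes a blocking connect with a Java-level timeout, whereas the issue concerns a channel that ends up closed after the connection attempt times out. The point it shows is that a closed channel is never reused; each retry opens a new one.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.SocketChannel;

// Hypothetical helper for illustration only; not part of Spark.
final class ReconnectSketch {

    // Tries to connect up to maxAttempts times, opening a new SocketChannel
    // for every attempt. timeoutMillis bounds each attempt (0 means no limit).
    static SocketChannel connectWithRetry(InetSocketAddress address,
                                          int maxAttempts,
                                          int timeoutMillis) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // A channel closed by a failed connect cannot be reused,
            // so every attempt starts from a freshly opened one.
            SocketChannel channel = SocketChannel.open();
            boolean connected = false;
            try {
                channel.socket().connect(address, timeoutMillis);
                connected = true;
                return channel;
            } catch (SocketTimeoutException | ClosedChannelException e) {
                // Assumption: only these two failures are worth retrying.
                // Any other IOException propagates to the caller unchanged.
                lastFailure = e;
            } finally {
                if (!connected) {
                    closeQuietly(channel);
                }
            }
        }
        // Reached only when every attempt timed out or hit a closed channel.
        throw lastFailure;
    }

    private static void closeQuietly(SocketChannel channel) {
        try {
            channel.close();
        } catch (IOException ignored) {
            // Nothing useful can be done if close itself fails.
        }
    }
}
```

A caller might write ReconnectSketch.connectWithRetry(new InetSocketAddress(host, port), 3, 60000), where host, port, and the numeric values are placeholders. Unlike raising tcp_syn_retries, which changes behaviour for every connection on the machine, a retry at this level would affect only the connections that opt in.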