I am running a Spark job on ~124 GB of data in an S3 bucket. The job usually runs fine, but it occasionally fails with the following exception during the first map stage, which reads and transforms the data from S3. Is there a config parameter I can set to increase this timeout?
14/08/23 04:45:46 WARN scheduler.TaskSetManager: Lost task 1379.0 in stage 1.0 (TID 1379, ip-10-237-195-11.ec2.internal): java.net.SocketTimeoutException: Read timed out
        java.net.SocketInputStream.socketRead0(Native Method)
        java.net.SocketInputStream.read(SocketInputStream.java:152)
        java.net.SocketInputStream.read(SocketInputStream.java:122)
        sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
        sun.security.ssl.InputRecord.read(InputRecord.java:480)
        sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
        sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
        sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
        java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
        java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170)
        java.io.FilterInputStream.read(FilterInputStream.java:133)
        org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:108)
        org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:76)
        org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:136)
        org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.read(NativeS3FileSystem.java:98)
        java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
        java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        java.io.DataInputStream.read(DataInputStream.java:100)
        org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
        org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
        org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:219)
        org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:188)
        org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
        org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:340)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
        org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
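For what it's worth, the read is going through the old s3n:// connector (NativeS3FileSystem, which is backed by JetS3t), so my guess is that the relevant knob is JetS3t's HTTP socket timeout rather than anything in Spark itself. JetS3t documents httpclient.socket-timeout-ms and httpclient.connection-timeout-ms properties, normally read from a jets3t.properties file on the classpath. Below is a minimal PySpark sketch of what I have in mind; note that setting these keys on the Hadoop configuration is speculative (I don't know whether s3n forwards them to JetS3t), and the bucket path and timeout values are placeholders, not my real job.

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("s3-read-job")
    sc = SparkContext(conf=conf)

    # JetS3t's defaults are 60s, I believe; try doubling them. It's unclear
    # whether the s3n connector forwards these Hadoop settings to JetS3t --
    # it may only read a jets3t.properties file from the classpath -- so
    # treat this as a guess, not a confirmed fix.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("httpclient.socket-timeout-ms", "120000")
    hadoop_conf.set("httpclient.connection-timeout-ms", "120000")

    # Placeholder path standing in for my actual bucket
    lines = sc.textFile("s3n://my-bucket/path/to/data/")
    print(lines.count())

If the Hadoop configuration route doesn't reach JetS3t, the fallback would be to ship a jets3t.properties file with those same two keys on the driver and executor classpath. Can anyone confirm which of these actually takes effect, or point me at a Spark-side setting I've missed?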