Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-14 Thread Ilya Ganelin
Hello all. Does anyone else have any suggestions? Even understanding where
this error comes from would help a lot.


Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-10 Thread Akhil Das
You could be hitting this issue
https://issues.apache.org/jira/browse/SPARK-3633 (or similar). You can
try the following workarounds:

conf.set("spark.core.connection.ack.wait.timeout", "600")
conf.set("spark.akka.frameSize", "50")
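
Here conf is the SparkConf used to build the SparkContext; a minimal sketch
of how it would typically be wired up (the app name is just a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Set the workaround properties on the SparkConf before creating the
    // SparkContext. The ack timeout is in seconds, the Akka frame size in MB.
    val conf = new SparkConf()
      .setAppName("als-job")  // placeholder name
      .set("spark.core.connection.ack.wait.timeout", "600")
      .set("spark.akka.frameSize", "50")
    val sc = new SparkContext(conf)
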
Also try reducing the number of partitions; you could be hitting the kernel's
ulimit on open file descriptors (check with ulimit -n). I faced this issue and
it went away when I dropped the partitions from 1600 to 200.

Thanks
Best Regards

On Fri, Oct 10, 2014 at 5:58 AM, Ilya Ganelin ilgan...@gmail.com wrote:

 Hi all – I could use some help figuring out a couple of exceptions I’ve
 been getting regularly.

 I have been running on a fairly large dataset (150 gigs). With smaller
 datasets I don't have any issues.

 My sequence of operations is as follows – unless otherwise specified, I am
 not caching:

 Map a 30 million row x 70 column string table down to approximately 30 million x 5 strings
 (for the textFile read I am using 1500 partitions)

 From that, map to ((a,b), score) and reduceByKey, numPartitions = 180

 Then extract the distinct values for A and the distinct values for B (I cache
 the output of distinct), numPartitions = 180

 Zip with index for A and for B (to remap strings to int)

 Join remapped ids with original table

 This is then fed into MLlib's ALS algorithm.
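
 In rough Scala terms the pipeline looks something like the sketch below;
 the delimiter, column positions, score parsing, and ALS parameters are
 placeholders rather than the real values (sc is the SparkContext):

     import org.apache.spark.SparkContext._
     import org.apache.spark.mllib.recommendation.{ALS, Rating}

     // Read the ~30M x 70 table and project it down to ((a, b), score) pairs.
     // Tab delimiter and field positions are assumptions for illustration.
     val raw = sc.textFile("/path/to/input", 1500)
     val scored = raw.map(_.split("\t"))
       .map(f => ((f(0), f(1)), f(2).toDouble))
       .reduceByKey(_ + _, 180)

     // Distinct values of A and B (cached), then zipWithIndex to remap the
     // string keys to integer ids.
     val aIds = scored.map(_._1._1).distinct(180).cache().zipWithIndex()
     val bIds = scored.map(_._1._2).distinct(180).cache().zipWithIndex()

     // Join the integer ids back onto the (a, b, score) triples.
     val ratings = scored.map { case ((a, b), s) => (a, (b, s)) }
       .join(aIds)
       .map { case (_, ((b, s), aId)) => (b, (aId, s)) }
       .join(bIds)
       .map { case (_, ((aId, s), bId)) => Rating(aId.toInt, bId.toInt, s) }

     // Feed the result into MLlib's ALS (rank/iterations/lambda are made up).
     val model = ALS.train(ratings, 10, 10, 0.01)
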

 I am running with:

 Spark version 1.0.2 with CDH 5.1

 numExecutors = 8, numCores = 14

 Memory = 12g

 MemoryFraction = 0.7

 KryoSerialization
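
 Expressed as configuration properties (assuming the memory fraction above
 refers to spark.storage.memoryFraction, and that executors and cores are
 passed to spark-submit on YARN as --num-executors 8 --executor-cores 14),
 that is roughly:

     // Illustrative reconstruction of the run configuration described above.
     val runConf = new SparkConf()
       .set("spark.executor.memory", "12g")
       .set("spark.storage.memoryFraction", "0.7")
       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
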

 My issue is that the code runs fine for a while but then will
 non-deterministically crash with either file IOExceptions or the following
 obscure error:

 14/10/08 13:29:59 INFO TaskSetManager: Loss was due to
 java.io.IOException: Filesystem closed [duplicate 10]

 14/10/08 13:30:08 WARN TaskSetManager: Loss was due to
 java.io.FileNotFoundException

 java.io.FileNotFoundException:
 /opt/cloudera/hadoop/1/yarn/nm/usercache/zjb238/appcache/application_1412717093951_0024/spark-local-20141008131827-c082/1c/shuffle_3_117_354
 (No such file or directory)

 Looking through the logs, I see the IOException in other places as well, but
 there it appears to be non-catastrophic. The FileNotFoundException, however,
 is. I have found the following Stack Overflow question, which at least seems
 to address the IOException:


 http://stackoverflow.com/questions/24038908/spark-fails-on-big-shuffle-jobs-with-java-io-ioexception-filesystem-closed

 But I have not found anything useful at all with regard to the appcache
 error.

 Any help would be much appreciated.



Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-10 Thread Ilya Ganelin
Thank you - I will try this. If I drop the partition count, am I not more
likely to hit memory issues, especially if the dataset is rather large?


Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-10 Thread Ilya Ganelin
Hi Akhil - I tried your suggestions and experimented with different partition
counts. Reducing the number of partitions appeared to lead to memory errors -
I saw the IOExceptions much sooner.

With the settings you provided, the program ran for longer but ultimately
crashed in the same way. I would like to understand what is going on
internally that leads to this.

Could this be related to garbage collection?
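
One way to check the GC theory (a sketch, assuming the
spark.executor.extraJavaOptions property available in Spark 1.0) would be to
turn on GC logging in the executors and correlate long pauses with the task
failures:

    // Illustrative only: enable verbose GC logging in the executors so that
    // long pauses can be lined up against the failure timestamps in the logs.
    val gcConf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")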