On Oct 9, 2014 10:18 AM, "Ilya Ganelin" <ilgan...@gmail.com> wrote:
Hi all,

I could use some help figuring out a couple of exceptions I’ve been getting regularly. I have been running on a fairly large dataset (150 GB). With smaller datasets I don't have any issues.

My sequence of operations is as follows (unless otherwise specified, I am not caching):

Map a 30 million row x 70 column string table to approximately 30 million x 5 strings. (To read it with textFile I am using 1500 partitions.)
From that, map to ((a, b), score) and reduceByKey, numPartitions = 180.
Then extract the distinct values for A and the distinct values for B, numPartitions = 180. (I cache the output of distinct.)
Zip with index for A and for B (to remap strings to ints).
Join the remapped ids with the original table.
This is then fed into MLlib's ALS algorithm.

I am running with:

Spark version 1.02 with CDH5.1
numExecutors = 8, numCores = 14
Memory = 12g
MemoryFraction = 0.7
KryoSerialization

My issue is that the code runs fine for a while but then crashes non-deterministically with either file IOExceptions or the following obscure error:

14/10/08 13:29:59 INFO TaskSetManager: Loss was due to java.io.IOException: Filesystem closed [duplicate 10]
14/10/08 13:30:08 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: /opt/cloudera/hadoop/1/yarn/nm/usercache/zjb238/appcache/application_1412717093951_0024/spark-local-20141008131827-c082/1c/shuffle_3_117_354 (No such file or directory)

Looking through the logs, I see the IOException in other places, but there it appears to be non-catastrophic. The FileNotFoundException, however, is fatal.

I have found the following Stack Overflow question, which at least seems to address the IOException:

http://stackoverflow.com/questions/24038908/spark-fails-on-big-shuffle-jobs-with-java-io-ioexception-filesystem-closed

But I have not found anything useful at all with regard to the app cache error.

Any help would be much appreciated.

-Ilya Ganelin
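
For readers following the sequence of operations described above, here is a minimal sketch of the remapping logic in plain Python. It is not the original Spark job and does not reproduce the failure. It only mirrors the data flow (reduce by key, distinct, zip with index, join) on in-memory lists. The function names are hypothetical, summing the scores in the reduce step is an assumption (the message does not say how scores are combined), and sorting before indexing is only there to make the ids deterministic.

```python
from collections import defaultdict


def reduce_by_key(pairs):
    """Combine scores that share a key; summing is an assumed reducer."""
    totals = defaultdict(float)
    for key, score in pairs:
        totals[key] += score
    return dict(totals)


def zip_with_index(values):
    """Assign each distinct string an integer id (sorted for determinism)."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def remap(rows):
    """Turn (a, b, score) string rows into (a_id, b_id, score) triples."""
    # Map to ((a, b), score) and reduce by key.
    scored = reduce_by_key(((a, b), score) for a, b, score in rows)
    # Distinct values of A and of B, each zipped with an index.
    a_ids = zip_with_index(a for a, _ in scored)
    b_ids = zip_with_index(b for _, b in scored)
    # Join the integer ids back onto the reduced table.
    return [(a_ids[a], b_ids[b], score) for (a, b), score in scored.items()]


# Hypothetical sample rows, already narrowed to (a, b, score).
rows = [("u1", "i1", 1.0), ("u1", "i1", 2.0), ("u2", "i1", 0.5), ("u1", "i2", 4.0)]
ratings = remap(rows)
```

The resulting (a_id, b_id, score) triples have the integer-keyed shape that a collaborative filtering routine such as ALS expects as input.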