Hi, folks. Has anyone encountered the following error before? I've been staring at it all day and can't figure out what it means.
In my client log, I get:

[INFO] 17 Dec 2013 22:31:09 - org.apache.spark.Logging$class - Lost TID 282 (task 3.0:63)
[INFO] 17 Dec 2013 22:31:09 - org.apache.spark.Logging$class - Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: /tmp/spark-local-20131217192829-da6b/2e/shuffle_1_63_88 (Too many open files)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at org.apache.spark.storage.DiskStore$DiskBlockObjectWriter.open(DiskStore.scala:58)
        at org.apache.spark.storage.DiskStore$DiskBlockObjectWriter.write(DiskStore.scala:107)
        at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$run$1.apply(ShuffleMapTask.scala:152)
        at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$run$1.apply(ShuffleMapTask.scala:149)
        at scala.collection.Iterator$class.foreach(Iterator.scala:772)
        at scala.collection.JavaConversions$JMapWrapperLike$$anon$2.foreach(JavaConversions.scala:781)
        at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:149)
        at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

In the worker log, I see:

13/12/17 19:31:07 INFO executor.Executor: Running task ID 282
13/12/17 19:31:07 INFO rdd.HadoopRDD: Input split: hdfs://dataset/part-00027:0+2311666
13/12/17 19:31:07 INFO spark.CacheManager: Computing partition org.apache.spark.rdd.HadoopPartition@6b4
13/12/17 19:31:07 INFO spark.CacheManager: Computing partition org.apache.spark.rdd.HadoopPartition@6af
13/12/17 19:31:07 INFO rdd.HadoopRDD: Input split: hdfs://dataset/part-00035:0+2311720
13/12/17 19:31:07 INFO rdd.HadoopRDD: Input split: hdfs://dataset/part-00030:0+2311094

Any clue what this means? I've checked open files on the worker node while the task is running (by running lsof | wc -l every 5 seconds), and I don't see even a blip - the count stays steady the whole time, with no sign of a problem.

--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: [email protected]
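P.S. For reference, here's the kind of check I'm running. Since ulimit -n is enforced per process, I realize a system-wide lsof count could stay flat even while one executor JVM climbs toward its own limit, so a per-process poll might be more telling. A rough sketch (the pid lookup via jps is just one way to find the executor process; adjust for your setup):

    # Grab the pid of the executor JVM - jps lists it by main class
    # name, so grepping for "executor" is an assumption about your setup.
    pid=$(jps | grep -i executor | awk '{print $1}')

    # Count that process's open file descriptors every 5 seconds,
    # by listing its fd table under /proc (Linux only).
    while sleep 5; do
        echo "$(date +%T) open fds: $(ls /proc/$pid/fd | wc -l)"
    done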
