Hi all,
I'm experiencing the following issue with Spark 0.8.0-incubating. I'm
running Spark in local mode on a multicore machine. As long as I use up
to 16 processors, everything works fine. If I try to run the same code
with 32 processors, I get the following two exceptions:
|18:18:32.268 [delete Spark local dirs] ERROR org.apache.spark.storage.DiskStore - Exception while deleting local spark dir: /ext/ceccarel/spark-graph/tmp/spark-local-20131129181824-2d99
java.io.IOException: Failed to list files for dir: /ext/ceccarel/spark-graph/tmp/spark-local-20131129181824-2d99
    at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:463) ~[spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:473) ~[spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at org.apache.spark.storage.DiskStore$$anon$1$$anonfun$run$2.apply(DiskStore.scala:303) [spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at org.apache.spark.storage.DiskStore$$anon$1$$anonfun$run$2.apply(DiskStore.scala:301) [spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:34) [spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at scala.collection.mutable.ArrayOps.foreach(ArrayOps.scala:38) [spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at org.apache.spark.storage.DiskStore$$anon$1.run(DiskStore.scala:301) [spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]|
and
|18:18:32.277 [spark-akka.actor.default-dispatcher-5] ERROR o.a.spark.scheduler.local.LocalActor - key not found: 67
java.util.NoSuchElementException: key not found: 67
    at scala.collection.MapLike$class.default(MapLike.scala:225) ~[spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at scala.collection.mutable.HashMap.default(HashMap.scala:45) ~[spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at scala.collection.MapLike$class.apply(MapLike.scala:135) ~[spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at scala.collection.mutable.HashMap.apply(HashMap.scala:45) ~[spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at org.apache.spark.scheduler.local.LocalScheduler.statusUpdate(LocalScheduler.scala:261) ~[spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at org.apache.spark.scheduler.local.LocalActor$$anonfun$receive$1.apply(LocalScheduler.scala:59) ~[spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at org.apache.spark.scheduler.local.LocalActor$$anonfun$receive$1.apply(LocalScheduler.scala:54) ~[spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at akka.actor.Actor$class.apply(Actor.scala:318) ~[spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at org.apache.spark.scheduler.local.LocalActor.apply(LocalScheduler.scala:52) ~[spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at akka.actor.ActorCell.invoke(ActorCell.scala:626) [spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197) [spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at akka.dispatch.Mailbox.run(Mailbox.scala:179) [spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516) [spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259) [spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975) [spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) [spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]
    at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) [spark-graph-assembly-0.4.0-SNAPSHOT.jar:0.4.0-SNAPSHOT]|
I have observed that these exceptions (there are several of the second
type) occur as soon as I perform some kind of shuffle operation. That
is, as soon as a |reduceByKey| operation is performed, I get the
exception. If I try to partition the dataset at the very beginning of
the algorithm, using either |HashPartitioner| or |RangePartitioner|, the
program fails immediately. So I guess that it's something related to
shuffling.
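
For reference, the pattern described above has roughly the following shape.
This is an illustrative sketch only, not the actual program: the object name,
the app name, the input data and the key function are all placeholders, and it
assumes "32 processors" means a |local[32]| master. It has not been checked
whether this exact snippet triggers the exceptions.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
// Implicit conversions that add reduceByKey/partitionBy to RDDs of pairs.
import org.apache.spark.SparkContext._

// Hypothetical object name, for illustration only.
object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    // "local[32]" requests 32 worker threads in local mode;
    // "local[16]" would be the setting that works.
    val sc = new SparkContext("local[32]", "shuffle-sketch")

    // Placeholder data: integers keyed by their value modulo 100.
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, 1))

    // Case 1: a shuffle triggered by reduceByKey.
    val counts = pairs.reduceByKey(_ + _).collect()

    // Case 2: explicit partitioning at the very beginning of the job.
    val partitioned = pairs.partitionBy(new HashPartitioner(32))

    println(counts.length + " keys, " + partitioned.count() + " records")
    sc.stop()
  }
}
```

Swapping |HashPartitioner| for |RangePartitioner| in case 2 would cover the
other partitioner mentioned above.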
Has anyone experienced something similar? Do you have any pointers on
how to solve this problem?
Thank you very much,
Matteo