Hi all, can someone help with this?
I'm encountering exactly the same issue in a very similar scenario with the same Spark version.

Thanks,
Alessandro

On Fri, Jul 18, 2014 at 8:30 PM, Shannon Quinn <squ...@gatech.edu> wrote:

> Hi all,
>
> I'm dealing with some strange error messages that I *think* come down to
> a memory issue, but I'm having a hard time pinning it down and could use
> some guidance from the experts.
>
> I have a 2-machine Spark (1.0.1) cluster. Both machines have 8 cores; one
> has 16GB of memory, the other 32GB (the latter is the master). My
> application involves computing pairwise pixel affinities in images,
> though the images I've tested so far only get as big as 1920x1200, and as
> small as 16x16.
>
> I did have to change a few memory and parallelism settings; otherwise I
> was getting explicit OutOfMemoryExceptions. In spark-defaults.conf:
>
> spark.executor.memory    14g
> spark.default.parallelism    32
> spark.akka.frameSize    1000
>
> In spark-env.sh:
>
> SPARK_DRIVER_MEMORY=10G
>
> With those settings, however, I get a bunch of WARN statements about
> "Lost TIDs" (no task is successfully completed), in addition to lost
> executors, repeated 4 times until I finally get the following error
> message and crash:
>
> ---
>
> 14/07/18 12:06:20 INFO TaskSchedulerImpl: Cancelling stage 0
> 14/07/18 12:06:20 INFO DAGScheduler: Failed to run collect at
> /home/user/Programming/PySpark-Affinities/affinity.py:243
> Traceback (most recent call last):
>   File "/home/user/Programming/PySpark-Affinities/affinity.py", line 243, in <module>
>     lambda x: np.abs(IMAGE.value[x[0]] - IMAGE.value[x[1]])
>   File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/pyspark/rdd.py", line 583, in collect
>     bytesInJava = self._jrdd.collect().iterator()
>   File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
>   File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o27.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0.0:13 failed 4 times, most recent failure: *TID 32 on host
> master.host.univ.edu failed for unknown reason*
> Driver stacktrace:
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>   at scala.Option.foreach(Option.scala:236)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> 14/07/18 12:06:20 INFO DAGScheduler: Executor lost: 4 (epoch 4)
> 14/07/18 12:06:20 INFO BlockManagerMasterActor: Trying to remove executor 4 from BlockManagerMaster.
> 14/07/18 12:06:20 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
> user@master:~/Programming/PySpark-Affinities$
>
> ---
>
> If I run the really small image (16x16) instead, it *appears* to run to
> completion (it gives me the output I expect without any exceptions being
> thrown). However, the stderr log for the app lists its state as "KILLED",
> with the final message an "ERROR CoarseGrainedExecutorBackend: Driver
> Disassociated". If I run any larger images, I get the exception pasted
> above.
>
> Furthermore, if I just do a spark-submit with master=local[*], aside from
> still needing to set the aforementioned memory options, it works for an
> image of *any* size (I've tested both machines independently; they both
> behave this way when running as local[*]), whereas running on the cluster
> results in the aforementioned crash at stage 0 with anything but the
> smallest images.
>
> Any ideas what is going on?
>
> Thank you very much in advance!
>
> Regards,
> Shannon
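
A note for anyone debugging the same thing: the log above shows the job dying inside a collect() at affinity.py:243, i.e. while shipping every pairwise result back to the driver. Below is a minimal sketch of the pattern the traceback suggests. The real affinity.py isn't shown, so the broadcast name (IMAGE), the use of cartesian() to build the pairs, and the final collect() are assumptions reconstructed from the traceback, not the poster's actual code:

---

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="affinity-sketch")

# Flatten the image to a 1-D array of pixel values and broadcast it
# to every executor (stand-in data; the real code loads an image).
image = np.random.rand(16 * 16)
IMAGE = sc.broadcast(image)

# All index pairs: n pixels yield n**2 pairs, so a 1920x1200 image
# (~2.3M pixels) produces on the order of 5 * 10**12 pairs.
indices = sc.parallelize(range(len(image)))
pairs = indices.cartesian(indices)

# The lambda from the traceback, applied to each (i, j) pair.
affinities = pairs.map(lambda x: np.abs(IMAGE.value[x[0]] - IMAGE.value[x[1]]))

# collect() materializes every result on the driver at once; for
# anything beyond a tiny image this alone can exhaust driver memory.
result = affinities.collect()

---

If the computation really does look like this, no driver memory setting will save a large image: the quadratic result set has to land somewhere other than the driver. Writing it out from the executors with affinities.saveAsTextFile(...), or restricting pairs to a local pixel neighborhood, avoids funneling everything through collect().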