Most likely the data is not "just too big". For most operations the data is
processed partition by partition, so what matters is whether each individual
partition fits in memory -- and your partitions may be too big. This is what
your last question hints at too:

> val numWorkers = 10
> val data = sc.textFile("somedirectory/data.csv", numWorkers)

This will work, but it is not quite what you want. The second parameter to
textFile is the number of partitions you want, not the number of workers.
Given the error you are seeing, I'd recommend asking for more partitions --
they will be smaller.
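
For example (the partition count here is just illustrative -- a common rule
of thumb is a few times the total number of cores in your cluster):

    val numPartitions = 200 // illustrative; pick based on your data size
    val data = sc.textFile("somedirectory/data.csv", numPartitions)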

Also make sure you set spark.executor.memory to match the memory capacity
of the worker machines.
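
For example, when constructing the SparkContext (the "8g" value and app name
are placeholders -- use the actual memory of your workers, leaving some
headroom for the OS):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp") // hypothetical app name
      .set("spark.executor.memory", "8g") // placeholder amount
    val sc = new SparkContext(conf)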


On Tue, Apr 22, 2014 at 11:09 PM, jaeholee <jho...@lbl.gov> wrote:

> Spark is running fine, but I get this message. Does this mean that my
> data is just too big?
>
> 14/04/22 17:06:20 ERROR TaskSchedulerImpl: Lost executor 2 on WORKER#2: OutOfMemoryError
> 14/04/22 17:06:20 ERROR TaskSetManager: Task 550.0:2 failed 4 times; aborting job
> org.apache.spark.SparkException: Job aborted: Task 550.0:2 failed 4 times (most recent failure: unknown)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
