Re: spark 0.8.0 fails on larger data set (Failed to run reduce at GradientDescent.scala:144)

Walrus theCat Mon, 16 Dec 2013 06:57:34 -0800

@Taka -- SPARK_MEM is set, but SPARK_WORKER_MEMORY is not.  Would that make
a difference?


@Patrick I've combed the logs, and the only thing that looks out of order
is the strange phenomenon I posted about: roughly 1/3 of the slaves never
actually launch.  They just log the command they should have run to launch,
and then apparently do nothing.  As far as I remember, none of the other
slaves were throwing error messages.
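
The symptom described above (an executor STDERR log that contains only the
"Spark Executor Command" banner and its closing line of "=" characters) can be
checked for mechanically. The following is a minimal sketch, not a diagnosis
tool from the Spark project. The function name launched_nothing, the work
directory /root/spark/work, and the <app>/<executor>/stderr layout are
assumptions for illustration; adjust them to the actual cluster.

```python
import sys
from pathlib import Path

BANNER = "Spark Executor Command:"


def launched_nothing(stderr_text: str) -> bool:
    """Return True if the log holds only the launch-command banner.

    The banner is expected to end with a line made purely of '=' characters.
    An executor that really started would write more lines after it.
    If there is no banner or no separator line, return False (undetermined).
    """
    lines = [ln.strip() for ln in stderr_text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(BANNER):
        return False
    separators = [i for i, ln in enumerate(lines) if set(ln) == {"="}]
    if not separators:
        return False
    # Silent executor: nothing was logged after the last separator line.
    return separators[-1] == len(lines) - 1


if __name__ == "__main__":
    # Assumed location of per-executor logs on a worker; pass another path
    # as the first argument if the cluster is laid out differently.
    work_dir = Path(sys.argv[1] if len(sys.argv) > 1 else "/root/spark/work")
    for log in sorted(work_dir.glob("*/*/stderr")):
        if launched_nothing(log.read_text(errors="replace")):
            print(f"banner only, no executor output: {log}")
```

Running it on each slave (or over logs copied to one machine) lists the
executors that printed their launch command and then went quiet, which makes
it easier to compare those hosts against the ones that started normally.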


On Thu, Dec 12, 2013 at 9:38 PM, Patrick Wendell <[email protected]> wrote:

> See if there are any logs on the slaves that suggest why the tasks are
> failing. Right now the master log is just saying "some stuff is failing"
> but it's not clear why.
>
>
> On Thu, Dec 12, 2013 at 9:36 AM, Taka Shinagawa <[email protected]> wrote:
>
>> How big is your data set?
>>
>> Did you set SPARK_MEM and SPARK_WORKER_MEMORY environmental variables?
>>
>>
>>
>> On Thu, Dec 12, 2013 at 9:07 AM, Walrus theCat <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I've had smashing success with Spark 0.7.x with this code, and this same
>>> code on Spark 0.8.0 using a smaller data set.  However, when I try to use a
>>> larger data set, some strange behavior occurs.
>>>
>>> I'm trying to do L2 regularization with Logistic Regression using the
>>> new ML Lib.
>>>
>>> Reading through the logs, everything looks and works fine with the
>>> smaller data set.  The larger data set, which works just fine with Spark
>>> 0.7.x, evidences some bizarre behavior.  8 of my 25 slaves had STDERR logs
>>> that looked something like this (only the command they should have
>>> executed):
>>>
>>> Spark Executor Command: "java" "-cp"
>>> ":/root/jars/aspectjrt.jar:/root/jars/aspectjweaver.jar:/root/jars/aws-java-sdk-1.4.5.jar:/root/jars/aws-java-sdk-1.4.5-javadoc.jar:/root/jars/aws-java-sdk-1.4.5-sources.jar:/root/jars/aws-java-sdk-flow-build-tools-1.4.5.jar:/root/jars/commons-codec-1.3.jar:/root/jars/commons-logging-1.1.1.jar:/root/jars/freemarker-2.3.18.jar:/root/jars/httpclient-4.1.1.jar:/root/jars/httpcore-4.1.jar:/root/jars/jackson-core-asl-1.8.7.jar:/root/jars/mail-1.4.3.jar:/root/jars/spring-beans-3.0.7.jar:/root/jars/spring-context-3.0.7.jar:/root/jars/spring-core-3.0.7.jar:/root/jars/stax-1.2.0.jar:/root/jars/stax-api-1.0.1.jar:/root/spark/conf:/root/spark/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar"
>>> "-Djava.library.path=/root/ephemeral-hdfs/lib/native/"
>>> "-Dspark.default.parallelism=400" "-Dspark.akka.threads=8"
>>> "-Dspark.local.dir=/mnt/spark" "-Dspark.worker.timeout=60000"
>>> "-Dspark.akka.timeout=60000"
>>> "-Dspark.storage.blockManagerHeartBeatMs=60000"
>>> "-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G"
>>> "-Xmx61G" "-Dspark.default.parallelism=400" "-Dspark.akka.threads=8"
>>> "-Dspark.local.dir=/mnt/spark" "-Dspark.worker.timeout=60000"
>>> "-Dspark.akka.timeout=60000"
>>> "-Dspark.storage.blockManagerHeartBeatMs=60000"
>>> "-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G"
>>> "-Xmx61G" "-Dspark.default.parallelism=400" "-Dspark.akka.threads=8"
>>> "-Dspark.local.dir=/mnt/spark" "-Dspark.worker.timeout=60000"
>>> "-Dspark.akka.timeout=60000"
>>> "-Dspark.storage.blockManagerHeartBeatMs=60000"
>>> "-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G"
>>> "-Xmx61G" "-Xms62464M" "-Xmx62464M"
>>> "org.apache.spark.executor.StandaloneExecutorBackend"
>>> "akka://[email protected]:34981/user/StandaloneScheduler"
>>> "33" "ip-10-33-139-73.ec2.internal" "8"
>>> ========================================
>>>
>>>
>>> The log starts complaining that it's losing executors and then dies in a
>>> ball of fire, no reference to anything in my code whatsoever.  Stack is
>>> below.  Please help!
>>>
>>> Thanks
>>>
>>> 13/12/12 16:23:12 INFO scheduler.DAGScheduler: Failed to run reduce at GradientDescent.scala:144
>>> Exception in thread "main" org.apache.spark.SparkException: Job failed: Error: Disconnected from Spark cluster
>>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
>>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
>>>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
>>>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
>>>     at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:379)
>>>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
>>>     at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)
>>>
>>>
>>>
>>>
>>
>
