See if there are any logs on the slaves that suggest why the tasks are
failing. Right now the master log is just saying "some stuff is failing"
but it's not clear why.
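Something like this can pull them in one pass (a sketch assuming the spark-ec2 standalone layout, where each worker keeps per-executor logs under /root/spark/work/<app-id>/<executor-id>/stderr and the master has a conf/slaves host list; both paths are assumptions, adjust for your install):

```shell
# Sketch: tail every executor stderr on each slave listed in conf/slaves.
# SLAVES_FILE and WORK_DIR are the spark-ec2 defaults (an assumption);
# adjust them for your layout.
SLAVES_FILE="${SLAVES_FILE:-/root/spark/conf/slaves}"
WORK_DIR="${WORK_DIR:-/root/spark/work}"

collect_executor_stderr() {
  while read -r host; do
    echo "=== $host ==="
    # Each app/executor pair gets its own stderr file under the work dir.
    ssh "$host" "tail -n 100 $WORK_DIR/*/*/stderr" 2>/dev/null
  done < "$SLAVES_FILE"
}
```

Running `collect_executor_stderr` from the master should show whether the executors hit an OOM or a lost-connection error before the master marked them as failed.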
On Thu, Dec 12, 2013 at 9:36 AM, Taka Shinagawa <[email protected]> wrote:

> How big is your data set?
>
> Did you set SPARK_MEM and SPARK_WORKER_MEMORY environmental variables?
>
>
>
> On Thu, Dec 12, 2013 at 9:07 AM, Walrus theCat <[email protected]> wrote:
>
>> Hi all,
>>
>> I've had smashing success with Spark 0.7.x with this code, and this same
>> code on Spark 0.8.0 using a smaller data set.  However, when I try to use a
>> larger data set, some strange behavior occurs.
>>
>> I'm trying to do L2-regularized logistic regression using the new MLlib.
>>
>> Reading through the logs, everything looks fine and works with the
>> smaller data set.  The larger data set, which works just fine with Spark
>> 0.7.x, triggers some bizarre behavior.  8 of my 25 slaves had STDERR
>> logs that contained nothing but the command they should have executed,
>> like this:
>>
>> Spark Executor Command: "java" "-cp"
>> ":/root/jars/aspectjrt.jar:/root/jars/aspectjweaver.jar:/root/jars/aws-java-sdk-1.4.5.jar:/root/jars/aws-java-sdk-1.4.5-javadoc.jar:/root/jars/aws-java-sdk-1.4.5-sources.jar:/root/jars/aws-java-sdk-flow-build-tools-1.4.5.jar:/root/jars/commons-codec-1.3.jar:/root/jars/commons-logging-1.1.1.jar:/root/jars/freemarker-2.3.18.jar:/root/jars/httpclient-4.1.1.jar:/root/jars/httpcore-4.1.jar:/root/jars/jackson-core-asl-1.8.7.jar:/root/jars/mail-1.4.3.jar:/root/jars/spring-beans-3.0.7.jar:/root/jars/spring-context-3.0.7.jar:/root/jars/spring-core-3.0.7.jar:/root/jars/stax-1.2.0.jar:/root/jars/stax-api-1.0.1.jar:/root/spark/conf:/root/spark/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar"
>> "-Djava.library.path=/root/ephemeral-hdfs/lib/native/"
>> "-Dspark.default.parallelism=400" "-Dspark.akka.threads=8"
>> "-Dspark.local.dir=/mnt/spark" "-Dspark.worker.timeout=60000"
>> "-Dspark.akka.timeout=60000"
>> "-Dspark.storage.blockManagerHeartBeatMs=60000"
>> "-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G"
>> "-Xmx61G" "-Dspark.default.parallelism=400" "-Dspark.akka.threads=8"
>> "-Dspark.local.dir=/mnt/spark" "-Dspark.worker.timeout=60000"
>> "-Dspark.akka.timeout=60000"
>> "-Dspark.storage.blockManagerHeartBeatMs=60000"
>> "-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G"
>> "-Xmx61G" "-Dspark.default.parallelism=400" "-Dspark.akka.threads=8"
>> "-Dspark.local.dir=/mnt/spark" "-Dspark.worker.timeout=60000"
>> "-Dspark.akka.timeout=60000"
>> "-Dspark.storage.blockManagerHeartBeatMs=60000"
>> "-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G"
>> "-Xmx61G" "-Xms62464M" "-Xmx62464M"
>> "org.apache.spark.executor.StandaloneExecutorBackend"
>> "akka://[email protected]:34981/user/StandaloneScheduler"
>> "33" "ip-10-33-139-73.ec2.internal" "8"
>> ========================================
>>
>>
>> The log starts complaining that it's losing executors and then dies in a
>> ball of fire, with no reference to anything in my code whatsoever.  The
>> stack trace is below.  Please help!
>>
>> Thanks
>>
>> 13/12/12 16:23:12 INFO scheduler.DAGScheduler: Failed to run reduce at
>> GradientDescent.scala:144
>> Exception in thread "main" org.apache.spark.SparkException: Job failed:
>> Error: Disconnected from Spark cluster
>>     at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
>>     at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
>>     at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
>>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>     at
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
>>     at
>> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:379)
>>     at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
>>     at
>> org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)
>>
>>
>>
>>
>
