Hi all,

I've had smashing success with this code on Spark 0.7.x, and with the
same code on Spark 0.8.0 using a smaller data set.  However, when I try a
larger data set, some strange behavior occurs.

I'm trying to do L2-regularized logistic regression using the new MLlib.
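
For reference, the training call is along these lines.  This is a
simplified sketch rather than my exact job: the input path and parsing
are placeholders, and the optimizer setters are written from memory of
the 0.8.x MLlib API, so names may be slightly off:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.LabeledPoint

val sc = new SparkContext("spark://<master>:7077", "L2LogisticRegression")

// Placeholder parsing: label followed by whitespace-separated features.
val data = sc.textFile("hdfs://<path>/training.txt").map { line =>
  val parts = line.split(' ').map(_.toDouble)
  LabeledPoint(parts.head, parts.tail)
}.cache()

// Mini-batch SGD with a squared-L2 updater for the regularization term.
val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(1.0)
  .setUpdater(new SquaredL2Updater)

val model = lr.run(data)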

Reading through the logs, everything looks fine and runs correctly with
the smaller data set.  The larger data set, which works just fine on
Spark 0.7.x, exhibits some bizarre behavior: the STDERR logs on 8 of my
25 slaves contained nothing but the launch command they should have
executed, like this:

Spark Executor Command: "java" "-cp"
":/root/jars/aspectjrt.jar:/root/jars/aspectjweaver.jar:/root/jars/aws-java-sdk-1.4.5.jar:/root/jars/aws-java-sdk-1.4.5-javadoc.jar:/root/jars/aws-java-sdk-1.4.5-sources.jar:/root/jars/aws-java-sdk-flow-build-tools-1.4.5.jar:/root/jars/commons-codec-1.3.jar:/root/jars/commons-logging-1.1.1.jar:/root/jars/freemarker-2.3.18.jar:/root/jars/httpclient-4.1.1.jar:/root/jars/httpcore-4.1.jar:/root/jars/jackson-core-asl-1.8.7.jar:/root/jars/mail-1.4.3.jar:/root/jars/spring-beans-3.0.7.jar:/root/jars/spring-context-3.0.7.jar:/root/jars/spring-core-3.0.7.jar:/root/jars/stax-1.2.0.jar:/root/jars/stax-api-1.0.1.jar:/root/spark/conf:/root/spark/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar"
"-Djava.library.path=/root/ephemeral-hdfs/lib/native/"
"-Dspark.default.parallelism=400" "-Dspark.akka.threads=8"
"-Dspark.local.dir=/mnt/spark" "-Dspark.worker.timeout=60000"
"-Dspark.akka.timeout=60000"
"-Dspark.storage.blockManagerHeartBeatMs=60000"
"-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G"
"-Xmx61G" "-Dspark.default.parallelism=400" "-Dspark.akka.threads=8"
"-Dspark.local.dir=/mnt/spark" "-Dspark.worker.timeout=60000"
"-Dspark.akka.timeout=60000"
"-Dspark.storage.blockManagerHeartBeatMs=60000"
"-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G"
"-Xmx61G" "-Dspark.default.parallelism=400" "-Dspark.akka.threads=8"
"-Dspark.local.dir=/mnt/spark" "-Dspark.worker.timeout=60000"
"-Dspark.akka.timeout=60000"
"-Dspark.storage.blockManagerHeartBeatMs=60000"
"-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G"
"-Xmx61G" "-Xms62464M" "-Xmx62464M"
"org.apache.spark.executor.StandaloneExecutorBackend"
"akka://spark@ip-10-233-26-113.ec2.internal:34981/user/StandaloneScheduler"
"33" "ip-10-33-139-73.ec2.internal" "8"
========================================


The driver log then starts complaining that it's losing executors, and
the job dies in a ball of fire with no reference to anything in my code
whatsoever.  The stack trace is below.  Please help!

Thanks

13/12/12 16:23:12 INFO scheduler.DAGScheduler: Failed to run reduce at GradientDescent.scala:144
Exception in thread "main" org.apache.spark.SparkException: Job failed: Error: Disconnected from Spark cluster
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:379)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
    at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)
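
For what it's worth, the reduce at GradientDescent.scala:144 is inside
MLlib's mini-batch SGD loop, not in my code.  A minimal self-contained
sketch of the pattern at that line (sample a mini-batch, compute
per-example gradients, reduce-sum them back to the driver), with a
hand-rolled stand-in for MLlib's LogisticGradient, would look like:

import org.apache.spark.SparkContext

// Toy (label, features) pairs, the same shape MLlib hands its optimizer.
val sc = new SparkContext("local", "GradientSumSketch")
val data = sc.parallelize(Seq(
  (1.0, Array(0.5, 2.0)),
  (0.0, Array(1.5, 0.3))))
val weights = Array(0.1, -0.2)

// Sample a mini-batch, compute per-example logistic gradients, then
// reduce-sum the gradients and losses back to the driver.
val (gradSum, lossSum) = data
  .sample(false, 1.0, 42)
  .map { case (label, features) =>
    // Stand-in for the gradient of the logistic loss at `weights`.
    val margin = features.zip(weights).map { case (x, w) => x * w }.sum
    val p = 1.0 / (1.0 + math.exp(-margin))
    val grad = features.map(_ * (p - label))
    val loss = if (label > 0.5) -math.log(p) else -math.log(1.0 - p)
    (grad, loss)
  }
  .reduce { case ((g1, l1), (g2, l2)) =>
    (g1.zip(g2).map { case (a, b) => a + b }, l1 + l2)
  }

println("gradient sum: " + gradSum.mkString(", "))
println("loss sum: " + lossSum)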
