Hi all, I've had smashing success running this code on Spark 0.7.x, and the same code works on Spark 0.8.0 with a smaller data set. However, when I try a larger data set, some strange behavior occurs.
I'm trying to do L2-regularized logistic regression using the new MLlib. Reading through the logs, everything looks fine and works with the smaller data set. The larger data set, which works just fine with Spark 0.7.x, shows some bizarre behavior: 8 of my 25 slaves had STDERR logs that contained only the command they should have executed, something like this (note the system properties and heap flags are repeated several times):

Spark Executor Command: "java" "-cp"
":/root/jars/aspectjrt.jar:/root/jars/aspectjweaver.jar:/root/jars/aws-java-sdk-1.4.5.jar:/root/jars/aws-java-sdk-1.4.5-javadoc.jar:/root/jars/aws-java-sdk-1.4.5-sources.jar:/root/jars/aws-java-sdk-flow-build-tools-1.4.5.jar:/root/jars/commons-codec-1.3.jar:/root/jars/commons-logging-1.1.1.jar:/root/jars/freemarker-2.3.18.jar:/root/jars/httpclient-4.1.1.jar:/root/jars/httpcore-4.1.jar:/root/jars/jackson-core-asl-1.8.7.jar:/root/jars/mail-1.4.3.jar:/root/jars/spring-beans-3.0.7.jar:/root/jars/spring-context-3.0.7.jar:/root/jars/spring-core-3.0.7.jar:/root/jars/stax-1.2.0.jar:/root/jars/stax-api-1.0.1.jar:/root/spark/conf:/root/spark/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar"
"-Djava.library.path=/root/ephemeral-hdfs/lib/native/"
"-Dspark.default.parallelism=400" "-Dspark.akka.threads=8" "-Dspark.local.dir=/mnt/spark"
"-Dspark.worker.timeout=60000" "-Dspark.akka.timeout=60000" "-Dspark.storage.blockManagerHeartBeatMs=60000"
"-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G" "-Xmx61G"
"-Dspark.default.parallelism=400" "-Dspark.akka.threads=8" "-Dspark.local.dir=/mnt/spark"
"-Dspark.worker.timeout=60000" "-Dspark.akka.timeout=60000" "-Dspark.storage.blockManagerHeartBeatMs=60000"
"-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G" "-Xmx61G"
"-Dspark.default.parallelism=400" "-Dspark.akka.threads=8" "-Dspark.local.dir=/mnt/spark"
"-Dspark.worker.timeout=60000" "-Dspark.akka.timeout=60000" "-Dspark.storage.blockManagerHeartBeatMs=60000"
"-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G" "-Xmx61G"
"-Xms62464M" "-Xmx62464M"
"org.apache.spark.executor.StandaloneExecutorBackend"
"akka://spark@ip-10-233-26-113.ec2.internal:34981/user/StandaloneScheduler" "33" "ip-10-33-139-73.ec2.internal" "8"
========================================

The log then starts complaining that it's losing executors and dies in a ball of fire, with no reference to anything in my code whatsoever. The stack trace is below. Please help! Thanks.

13/12/12 16:23:12 INFO scheduler.DAGScheduler: Failed to run reduce at GradientDescent.scala:144
Exception in thread "main" org.apache.spark.SparkException: Job failed: Error: Disconnected from Spark cluster
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:379)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)
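In case it's relevant: as far as I can tell, the repeated -Dspark.* flags in the executor command are the settings exported through SPARK_JAVA_OPTS on the cluster. For reference, the relevant conf/spark-env.sh lines would look roughly like this (a sketch reconstructed from the log above, not my exact file; the SPARK_MEM line is an assumption based on the 61G heap flags):

```shell
# Sketch of the settings implied by the executor command above.
# Property values are copied from the log; the exact file layout is an assumption.
export SPARK_JAVA_OPTS="-Dspark.default.parallelism=400 -Dspark.akka.threads=8 \
 -Dspark.local.dir=/mnt/spark -Dspark.worker.timeout=60000 -Dspark.akka.timeout=60000 \
 -Dspark.storage.blockManagerHeartBeatMs=60000 -Dspark.akka.retry.wait=60000 \
 -Dspark.akka.frameSize=10000"
# Assumed: executor heap size, matching the -Xms61G/-Xmx61G flags in the log.
export SPARK_MEM=61g
```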