It's in standalone mode.  The number of slaves is already the minimum my task
needs, i.e. any fewer and things will start blowing up.  I suppose I could pare
down my data set, try it with fewer nodes, and see at what threshold things go
badly...
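
If I go that route, a minimal sketch of the experiment from spark-shell (the
input path is just a placeholder) would be to rerun the job on progressively
smaller random samples until it stops failing:

    // assumes spark-shell, where `sc` is already defined
    val full = sc.textFile("hdfs:///path/to/data")   // placeholder path
    for (fraction <- Seq(1.0, 0.5, 0.25, 0.1)) {
      // fixed-seed random sample; rerun the regression on `subset` at each size
      val subset = full.sample(false, fraction, 42)
      println("fraction=" + fraction + " count=" + subset.count())
    }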


On Mon, Dec 16, 2013 at 10:44 PM, Taka Shinagawa <[email protected]> wrote:

> >>I've had smashing success with Spark 0.7.x with this code, and this
> same code on Spark 0.8.0 using a smaller data set.
> I'm curious to know the data set size above which you start seeing the error,
> as well as the value set for SPARK_MEM.
>
> Have you tested this in standalone mode and with fewer nodes? Do you
> see the same error?
>
>
>
> On Mon, Dec 16, 2013 at 6:56 AM, Walrus theCat <[email protected]> wrote:
>
>> @Taka -- SPARK_MEM is set, but not SPARK_WORKER_MEMORY.  Would that make a
>> difference?
>>
>> @Patrick I've combed the logs, and the only thing that looks out of order
>> is the strange phenomenon (which I posted about earlier) of roughly 1/3 of
>> the slaves not actually launching.  They just log the command they should
>> have run to launch, and then apparently do nothing.  I don't remember any of
>> the other slaves throwing error messages.
>>
>>
>> On Thu, Dec 12, 2013 at 9:38 PM, Patrick Wendell <[email protected]> wrote:
>>
>>> See if there are any logs on the slaves that suggest why the tasks are
>>> failing. Right now the master log just says "some stuff is failing",
>>> but it's not clear why.
>>>
>>>
>>> On Thu, Dec 12, 2013 at 9:36 AM, Taka Shinagawa 
>>> <[email protected]> wrote:
>>>
>>>> How big is your data set?
>>>>
>>>> Did you set the SPARK_MEM and SPARK_WORKER_MEMORY environment variables?
>>>>
>>>>
>>>>
>>>> On Thu, Dec 12, 2013 at 9:07 AM, Walrus theCat 
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've had smashing success with Spark 0.7.x with this code, and this
>>>>> same code on Spark 0.8.0 using a smaller data set.  However, when I try to
>>>>> use a larger data set, some strange behavior occurs.
>>>>>
>>>>> I'm trying to do L2 regularization with Logistic Regression using the
>>>>> new MLlib.
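>>>>>
>>>>> In case it helps, here is roughly the shape of the call involved.  The
>>>>> class names are from the 0.8.0 mllib.optimization package, but the exact
>>>>> runMiniBatchSGD signature is approximate and all parameter values below
>>>>> are placeholders, not what I actually run:
>>>>>
>>>>>     import org.apache.spark.mllib.optimization.{GradientDescent, LogisticGradient, SquaredL2Updater}
>>>>>
>>>>>     // label,feature1,feature2,... CSV; the parsing here is schematic
>>>>>     val data = sc.textFile("hdfs:///path/to/training.csv").map { line =>
>>>>>       val parts = line.split(',').map(_.toDouble)
>>>>>       (parts(0), parts.tail)
>>>>>     }.cache()
>>>>>
>>>>>     // L2-regularized logistic regression = logistic gradient + squared-L2
>>>>>     // updater, driven by mini-batch SGD.  The reduce at
>>>>>     // GradientDescent.scala:144 in the stack trace below is the
>>>>>     // per-iteration gradient aggregation inside this call.
>>>>>     val initialWeights = Array.fill(100)(0.0)   // placeholder dimension
>>>>>     val (weights, lossHistory) = GradientDescent.runMiniBatchSGD(
>>>>>       data, new LogisticGradient(), new SquaredL2Updater(),
>>>>>       1.0,    // stepSize (placeholder)
>>>>>       100,    // numIterations (placeholder)
>>>>>       0.1,    // regParam (placeholder)
>>>>>       1.0,    // miniBatchFraction (placeholder)
>>>>>       initialWeights)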
>>>>>
>>>>> Reading through the logs, everything looks fine with the
>>>>> smaller data set.  The larger data set, which works just fine with Spark
>>>>> 0.7.x, exhibits some bizarre behavior.  8 of my 25 slaves had STDERR logs
>>>>> that looked something like this (just the command they should have
>>>>> executed, and nothing else):
>>>>>
>>>>> Spark Executor Command: "java" "-cp"
>>>>> ":/root/jars/aspectjrt.jar:/root/jars/aspectjweaver.jar:/root/jars/aws-java-sdk-1.4.5.jar:/root/jars/aws-java-sdk-1.4.5-javadoc.jar:/root/jars/aws-java-sdk-1.4.5-sources.jar:/root/jars/aws-java-sdk-flow-build-tools-1.4.5.jar:/root/jars/commons-codec-1.3.jar:/root/jars/commons-logging-1.1.1.jar:/root/jars/freemarker-2.3.18.jar:/root/jars/httpclient-4.1.1.jar:/root/jars/httpcore-4.1.jar:/root/jars/jackson-core-asl-1.8.7.jar:/root/jars/mail-1.4.3.jar:/root/jars/spring-beans-3.0.7.jar:/root/jars/spring-context-3.0.7.jar:/root/jars/spring-core-3.0.7.jar:/root/jars/stax-1.2.0.jar:/root/jars/stax-api-1.0.1.jar:/root/spark/conf:/root/spark/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar"
>>>>> "-Djava.library.path=/root/ephemeral-hdfs/lib/native/"
>>>>> "-Dspark.default.parallelism=400" "-Dspark.akka.threads=8"
>>>>> "-Dspark.local.dir=/mnt/spark" "-Dspark.worker.timeout=60000"
>>>>> "-Dspark.akka.timeout=60000"
>>>>> "-Dspark.storage.blockManagerHeartBeatMs=60000"
>>>>> "-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G"
>>>>> "-Xmx61G" "-Dspark.default.parallelism=400" "-Dspark.akka.threads=8"
>>>>> "-Dspark.local.dir=/mnt/spark" "-Dspark.worker.timeout=60000"
>>>>> "-Dspark.akka.timeout=60000"
>>>>> "-Dspark.storage.blockManagerHeartBeatMs=60000"
>>>>> "-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G"
>>>>> "-Xmx61G" "-Dspark.default.parallelism=400" "-Dspark.akka.threads=8"
>>>>> "-Dspark.local.dir=/mnt/spark" "-Dspark.worker.timeout=60000"
>>>>> "-Dspark.akka.timeout=60000"
>>>>> "-Dspark.storage.blockManagerHeartBeatMs=60000"
>>>>> "-Dspark.akka.retry.wait=60000" "-Dspark.akka.frameSize=10000" "-Xms61G"
>>>>> "-Xmx61G" "-Xms62464M" "-Xmx62464M"
>>>>> "org.apache.spark.executor.StandaloneExecutorBackend"
>>>>> "akka://[email protected]:34981/user/StandaloneScheduler"
>>>>> "33" "ip-10-33-139-73.ec2.internal" "8"
>>>>> ========================================
>>>>>
>>>>>
>>>>> The log starts complaining that it's losing executors and then dies in
>>>>> a ball of fire, with no reference to anything in my code whatsoever.  The
>>>>> stack trace is below.  Please help!
>>>>>
>>>>> Thanks
>>>>>
>>>>> 13/12/12 16:23:12 INFO scheduler.DAGScheduler: Failed to run reduce at
>>>>> GradientDescent.scala:144
>>>>> Exception in thread "main" org.apache.spark.SparkException: Job
>>>>> failed: Error: Disconnected from Spark cluster
>>>>>     at
>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
>>>>>     at
>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
>>>>>     at
>>>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
>>>>>     at
>>>>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>>>     at
>>>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
>>>>>     at
>>>>> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:379)
>>>>>     at org.apache.spark.scheduler.DAGScheduler.org
>>>>> $apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
>>>>>     at
>>>>> org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
