Here is the application log for this Spark job.
http://pastebin.com/2UJS9L4e

Thanks,
Weifeng


From: "Aulakh, Sahib" <aula...@a9.com>
Date: Friday, May 20, 2016 at 12:43 PM
To: Ted Yu <yuzhih...@gmail.com>
Cc: Rodrick Brown <rodr...@orchardplatform.com>, Cui Weifeng <weife...@a9.com>, 
user <user@spark.apache.org>, "Zhao, Jun" <junz...@a9.com>
Subject: Re: Can not set spark dynamic resource allocation

Yes, it is YARN. We have configured the Spark shuffle service with the YARN NodeManager, but something must be off.

We will send you the app log via Pastebin.

Sent from my iPhone

On May 20, 2016, at 12:35 PM, Ted Yu <yuzhih...@gmail.com> wrote:
Since yarn-site.xml was cited, I assume the cluster runs YARN.

On Fri, May 20, 2016 at 12:30 PM, Rodrick Brown <rodr...@orchardplatform.com> wrote:
Is this YARN or Mesos? For the latter you need to start an external shuffle service.



On Fri, May 20, 2016 at 11:48 AM -0700, "Cui, Weifeng" <weife...@a9.com> wrote:

Hi guys,



Our team has a Hadoop 2.6.0 cluster with Spark 1.6.1. We wanted to enable dynamic resource allocation for Spark, so we followed the guide below. After the changes, all Spark jobs failed.
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

The test ran on a test cluster with 1 master machine (running the NameNode, ResourceManager and Hive server), 1 worker machine (running the DataNode and NodeManager) and 1 client machine (running the Spark shell).



What I updated in the config:



1. Update spark-defaults.conf

        spark.dynamicAllocation.enabled    true
        spark.shuffle.service.enabled      true



2. Update yarn-site.xml

        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle,spark_shuffle</value>
        </property>

        <property>
            <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
            <value>org.apache.spark.network.yarn.YarnShuffleService</value>
        </property>

        <property>
            <name>spark.shuffle.service.enabled</name>
            <value>true</value>
        </property>

3. Copy spark-1.6.1-yarn-shuffle.jar into a directory on yarn.application.classpath ($HADOOP_HOME/share/hadoop/yarn/*); we do this in our Python deployment code.

4. Restart the NameNode, DataNode, ResourceManager, NodeManager... restart everything.

5. The config is updated on all machines (ResourceManager and NodeManager): we update it in one place and copy it to every machine.
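To double-check steps 2 and 3, a mechanical sanity check like the following can help. This is a self-contained sketch: it fabricates a sample yarn-site.xml fragment and a stand-in jar under a temp directory so it can run anywhere; on a real NodeManager you would point it at $HADOOP_HOME/etc/hadoop/yarn-site.xml and $HADOOP_HOME/share/hadoop/yarn instead.

```shell
# Self-contained sketch: stand-ins are created under a temp dir so the
# checks themselves can run anywhere; swap in the real paths on a NodeManager.
work=$(mktemp -d)

# Stand-in for $HADOOP_HOME/etc/hadoop/yarn-site.xml (step 2).
cat > "$work/yarn-site.xml" <<'EOF'
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
EOF

# Stand-in for the jar copied in step 3.
touch "$work/spark-1.6.1-yarn-shuffle.jar"

# Check 1: aux-services must list spark_shuffle (a common mistake is
# overwriting mapreduce_shuffle instead of appending to it).
grep -A1 'yarn.nodemanager.aux-services<' "$work/yarn-site.xml" \
  | grep -q 'spark_shuffle' && echo "aux-services OK"

# Check 2: a spark-*-yarn-shuffle.jar must sit in a directory that is
# actually on the NodeManager's classpath.
ls "$work"/spark-*-yarn-shuffle.jar >/dev/null 2>&1 && echo "shuffle jar OK"
```

Running these two checks on every NodeManager host (against the real paths) would rule out the most common misconfigurations before looking at container logs.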



What I tested:



1. I started a Scala Spark shell and checked its configuration; spark.dynamicAllocation.enabled is true.

2. I used the following code:

        scala> val line = sc.textFile("/spark-events/application_1463681113470_0006")
        line: org.apache.spark.rdd.RDD[String] = /spark-events/application_1463681113470_0006 MapPartitionsRDD[1] at textFile at <console>:27

        scala> line.count   // This command just got stuck here



3. In the beginning there was only 1 executor (this one is for the driver), and after line.count I could see 3 executors, which then dropped back to 1.

4. Several jobs were launched and all of them failed. Tasks (for all stages): Succeeded/Total: 0/2 (4 failed)



Error messages:



I found the following message in the Spark web UI, and the same message in spark.log on the NodeManager machine.


ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container marked as failed: container_1463692924309_0002_01_000002 on host: xxxxxxxxxxxxxxx.com. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1463692924309_0002_01_000002
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1



Thanks a lot for the help. We can provide more information if needed.



Thanks,
Weifeng










