Since yarn-site.xml was cited, I assume the cluster runs YARN.
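If it is YARN, one quick sanity check from the client machine is whether anything is listening on the external shuffle service port on the worker. The service listens on spark.shuffle.service.port, which is 7337 by default. A minimal sketch that could be pasted into a Scala REPL, assuming that default port and a placeholder NodeManager hostname:

    import java.net.{InetSocketAddress, Socket}

    val nodeManagerHost = "worker-1.example.com" // placeholder (assumption): use the real worker host
    val shufflePort = 7337                       // spark.shuffle.service.port default

    val socket = new Socket()
    try {
      // A refused or timed-out connect usually means the spark_shuffle aux-service
      // did not start inside the NodeManager; a successful connect only shows the
      // port is open, not that the service is healthy.
      socket.connect(new InetSocketAddress(nodeManagerHost, shufflePort), 5000)
      println(s"$nodeManagerHost:$shufflePort is reachable")
    } catch {
      case e: java.io.IOException =>
        println(s"cannot reach $nodeManagerHost:$shufflePort: ${e.getMessage}")
    } finally {
      socket.close()
    }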
On Fri, May 20, 2016 at 12:30 PM, Rodrick Brown <rodr...@orchardplatform.com> wrote:

> Is this YARN or Mesos? For the latter you need to start an external
> shuffle service.
>
> On Fri, May 20, 2016 at 11:48 AM -0700, "Cui, Weifeng" <weife...@a9.com> wrote:
>
>> Hi guys,
>>
>> Our team has a Hadoop 2.6.0 cluster with Spark 1.6.1. We want to enable
>> dynamic resource allocation for Spark, and we followed the link below.
>> After the changes, all Spark jobs failed.
>>
>> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>
>> The test was on a test cluster with 1 master machine (running the
>> namenode, resourcemanager and hive server), 1 worker machine (running
>> the datanode and nodemanager) and 1 client machine (running spark-shell).
>>
>> What I updated in the config:
>>
>> 1. Updated spark-defaults.conf:
>>
>>    spark.dynamicAllocation.enabled true
>>    spark.shuffle.service.enabled   true
>>
>> 2. Updated yarn-site.xml:
>>
>>    <property>
>>      <name>yarn.nodemanager.aux-services</name>
>>      <value>mapreduce_shuffle,spark_shuffle</value>
>>    </property>
>>    <property>
>>      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
>>      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
>>    </property>
>>    <property>
>>      <name>spark.shuffle.service.enabled</name>
>>      <value>true</value>
>>    </property>
>>
>> 3. Copied spark-1.6.1-yarn-shuffle.jar to yarn.application.classpath
>>    ($HADOOP_HOME/share/hadoop/yarn/*); this copy is done in Python code.
>>
>> 4. Restarted the namenode, datanode, resourcemanager, nodemanager...
>>    restarted everything.
>>
>> 5. The config is updated in one place and copied to all machines, so the
>>    resourcemanager and nodemanager all see the same settings.
>>
>> What I tested:
>>
>> 1. I started a Scala spark-shell and checked its environment;
>>    spark.dynamicAllocation.enabled is true.
>>
>> 2. I ran the following code:
>>
>>    scala> val line = sc.textFile("/spark-events/application_1463681113470_0006")
>>    line: org.apache.spark.rdd.RDD[String] =
>>      /spark-events/application_1463681113470_0006 MapPartitionsRDD[1] at textFile at <console>:27
>>
>>    scala> line.count   // this command just got stuck here
>>
>> 3. In the beginning there was only 1 executor (for the driver); after
>>    line.count I could see 3 executors, then it dropped back to 1.
>>
>> 4. Several jobs were launched and all of them failed. Tasks (for all
>>    stages): Succeeded/Total: 0/2 (4 failed)
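One way to confirm what the running shell actually picked up is to read the settings back from the SparkContext. A minimal sketch for that same spark-shell session; the second argument to get() is only a fallback for the printout, and 7337 is simply the default for spark.shuffle.service.port unless it was overridden:

    // Inside the existing spark-shell; `sc` is the SparkContext the shell creates.
    val conf = sc.getConf
    // The second argument to get() is only a fallback value for this printout.
    println("spark.dynamicAllocation.enabled = " + conf.get("spark.dynamicAllocation.enabled", "<unset>"))
    println("spark.shuffle.service.enabled   = " + conf.get("spark.shuffle.service.enabled", "<unset>"))
    // Executors contact the NodeManager-side shuffle service on this port.
    println("spark.shuffle.service.port      = " + conf.get("spark.shuffle.service.port", "7337 (default)"))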
>> Error messages:
>>
>> I found the following message in the Spark web UI, and the same thing in
>> spark.log on the nodemanager machine:
>>
>>    ExecutorLostFailure (executor 1 exited caused by one of the running tasks)
>>    Reason: Container marked as failed: container_1463692924309_0002_01_000002
>>    on host: xxxxxxxxxxxxxxx.com. Exit status: 1. Diagnostics: Exception from
>>    container-launch.
>>    Container id: container_1463692924309_0002_01_000002
>>    Exit code: 1
>>    Stack trace: ExitCodeException exitCode=1:
>>        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>>        at org.apache.hadoop.util.Shell.run(Shell.java:455)
>>        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>>        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>>        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>>        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>        at java.lang.Thread.run(Thread.java:745)
>>
>>    Container exited with a non-zero exit code 1
>>
>> Thanks a lot for the help. We can provide more information if needed.
>>
>> Thanks,
>> Weifeng