[ https://issues.apache.org/jira/browse/YARN-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsuyoshi OZAWA resolved YARN-2476.
----------------------------------
Resolution: Duplicate
> Apps are scheduled in random order after RM failover
> ----------------------------------------------------
>
> Key: YARN-2476
> URL: https://issues.apache.org/jira/browse/YARN-2476
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.4.1
> Environment: Linux
> Reporter: Santosh Marella
> Labels: ha, high-availability, resourcemanager
>
> RM HA is configured with 2 RMs, using FileSystemRMStateStore as the RM state store.
> The FairScheduler allocation file is configured in yarn-site.xml:
>   <property>
>     <name>yarn.scheduler.fair.allocation.file</name>
>     <value>/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop/allocation-pools.xml</value>
>   </property>
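> For reference, the HA and state-store settings behind "RM HA is configured with 2 RMs,
> using FileSystemRMStateStore" would look roughly like the following in yarn-site.xml.
> This is an illustrative sketch only: the hostnames and state-store URI below are
> placeholders, not values taken from the cluster in this report (the RM ids rm1/rm2 match
> the failover log further down).
>   <property>
>     <name>yarn.resourcemanager.ha.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.ha.rm-ids</name>
>     <value>rm1,rm2</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.hostname.rm1</name>
>     <value><rm1-host></value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.hostname.rm2</name>
>     <value><rm2-host></value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.recovery.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.store.class</name>
>     <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.fs.state-store.uri</name>
>     <value><state-store-uri></value>
>   </property>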
> FS allocation-pools.xml:
> <?xml version="1.0"?>
> <allocations>
>   <queue name="dev">
>     <minResources>10000 mb,10vcores</minResources>
>     <maxResources>19000 mb,100vcores</maxResources>
>     <maxRunningApps>5525</maxRunningApps>
>     <weight>4.5</weight>
>     <schedulingPolicy>fair</schedulingPolicy>
>     <fairSharePreemptionTimeout>3600</fairSharePreemptionTimeout>
>   </queue>
>   <queue name="default">
>     <minResources>10000 mb,10vcores</minResources>
>     <maxResources>19000 mb,100vcores</maxResources>
>     <maxRunningApps>5525</maxRunningApps>
>     <weight>1.5</weight>
>     <schedulingPolicy>fair</schedulingPolicy>
>     <fairSharePreemptionTimeout>3600</fairSharePreemptionTimeout>
>   </queue>
>   <defaultMinSharePreemptionTimeout>600</defaultMinSharePreemptionTimeout>
>   <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
> </allocations>
> Submitted 10 sleep jobs to a FairScheduler queue using the command:
>   hadoop jar hadoop-mapreduce-examples-2.4.1-mapr-4.0.1-SNAPSHOT.jar sleep \
>     -Dmapreduce.job.queuename=root.dev -m 10 -r 10 -mt 10000 -rt 10000
> All the jobs were submitted by the same user, with the same priority, and to the
> same queue. No other jobs were running in the cluster. Jobs started executing in
> the order in which they were submitted (jobs 6 to 10 were active, while 11 to 15
> were waiting):
> root@perfnode131:/opt/mapr/hadoop/hadoop-2.4.1/logs# yarn application -list
> Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):10
> Application-Id                  Application-Name  Application-Type  User   Queue     State     Final-State  Progress  Tracking-URL
> application_1408572781346_0012  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
> application_1408572781346_0014  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
> application_1408572781346_0011  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
> application_1408572781346_0010  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode132:52799
> application_1408572781346_0008  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode131:33766
> application_1408572781346_0009  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode132:50964
> application_1408572781346_0007  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode134:52966
> application_1408572781346_0015  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
> application_1408572781346_0006  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    9.5%      http://perfnode134:34094
> application_1408572781346_0013  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
> Stopped RM1. There was a failover and RM2 became active. But the jobs seem to
> have started in a different order:
> root@perfnode131:~/scratch/raw_rm_logs_fs_hang# yarn application -list
> 14/08/21 07:26:13 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):10
> Application-Id                  Application-Name  Application-Type  User   Queue     State     Final-State  Progress  Tracking-URL
> application_1408572781346_0012  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode134:59351
> application_1408572781346_0014  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode132:37866
> application_1408572781346_0011  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode131:59744
> application_1408572781346_0010  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
> application_1408572781346_0008  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
> application_1408572781346_0009  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
> application_1408572781346_0007  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
> application_1408572781346_0015  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode134:39754
> application_1408572781346_0006  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
> application_1408572781346_0013  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode132:34714
> The problem is this:
> - The jobs that were previously in RUNNING state moved to ACCEPTED after failover.
> - The jobs that were previously in ACCEPTED state moved to RUNNING after failover.
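> The sketch below is purely illustrative and is not taken from the YARN source or from this
> cluster: it shows one plausible way such a reordering can happen when a fair-share style
> scheduler breaks ties between otherwise identical apps (same user, queue and priority) on
> application start time, and the start times seen by the scheduler change when apps are
> recovered after a failover. All class and field names in it are hypothetical.
>
>   import java.util.ArrayList;
>   import java.util.Comparator;
>   import java.util.List;
>   import java.util.Random;
>
>   public class RecoveryOrderSketch {
>       // Hypothetical stand-in for a schedulable application.
>       static class App {
>           final String id;
>           long startTime; // tie-breaker when weight/demand/priority are all equal
>           App(String id, long startTime) { this.id = id; this.startTime = startTime; }
>           public String toString() { return id; }
>       }
>
>       // With identical users, queues and priorities, ordering falls through to start time,
>       // then to the application id.
>       static final Comparator<App> BY_START_TIME = (a, b) -> {
>           int byTime = Long.compare(a.startTime, b.startTime);
>           return byTime != 0 ? byTime : a.id.compareTo(b.id);
>       };
>
>       public static void main(String[] args) {
>           List<App> apps = new ArrayList<>();
>           for (int i = 6; i <= 15; i++) {
>               // Before failover: start times reflect submission order (apps 6..15).
>               apps.add(new App(String.format("app_%04d", i), 1000L + i));
>           }
>           apps.sort(BY_START_TIME);
>           System.out.println("Before failover: " + apps);
>
>           // If recovery re-stamps start times (e.g. with the recovery time) instead of
>           // restoring the original submission times, the tie-break no longer reflects
>           // submission order, so a different set of apps reaches RUNNING first.
>           Random recoveredClock = new Random(42);
>           for (App a : apps) {
>               a.startTime = 2000L + recoveredClock.nextInt(100);
>           }
>           apps.sort(BY_START_TIME);
>           System.out.println("After failover:  " + apps);
>       }
>   }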
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)