Santosh Marella created YARN-2476:
-------------------------------------

             Summary: Apps are scheduled in random order after RM failover
                 Key: YARN-2476
                 URL: https://issues.apache.org/jira/browse/YARN-2476
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.4.1
         Environment: Linux
            Reporter: Santosh Marella


RM HA is configured with 2 RMs; FileSystemRMStateStore is used as the state store.
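
For reference, a minimal sketch of the yarn-site.xml settings implied by such a setup (these are the standard Hadoop 2.4 property names; the rm-ids match the "rm2" seen in the failover log below, but the state-store URI is a placeholder, not the actual value from this cluster):

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>
<property>
  <!-- placeholder URI; the actual state-store location is not shown in this report -->
  <name>yarn.resourcemanager.fs.state-store.uri</name>
  <value>hdfs:///rmstore</value>
</property>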

The FairScheduler allocation file is configured in yarn-site.xml:
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop/allocation-pools.xml</value>
</property>

FS allocation-pools.xml:
<?xml version="1.0"?>
<allocations>
  <queue name="dev">
    <minResources>10000 mb,10vcores</minResources>
    <maxResources>19000 mb,100vcores</maxResources>
    <maxRunningApps>5525</maxRunningApps>
    <weight>4.5</weight>
    <schedulingPolicy>fair</schedulingPolicy>
    <fairSharePreemptionTimeout>3600</fairSharePreemptionTimeout>
  </queue>
  <queue name="default">
    <minResources>10000 mb,10vcores</minResources>
    <maxResources>19000 mb,100vcores</maxResources>
    <maxRunningApps>5525</maxRunningApps>
    <weight>1.5</weight>
    <schedulingPolicy>fair</schedulingPolicy>
    <fairSharePreemptionTimeout>3600</fairSharePreemptionTimeout>
  </queue>
  <defaultMinSharePreemptionTimeout>600</defaultMinSharePreemptionTimeout>
  <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
</allocations>


Submitted 10 sleep jobs to a FS queue using the command:
hadoop jar hadoop-mapreduce-examples-2.4.1-mapr-4.0.1-SNAPSHOT.jar sleep -Dmapreduce.job.queuename=root.dev -m 10 -r 10 -mt 10000 -rt 10000
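
To queue up all 10 jobs, a loop along these lines can be used (a sketch; each submission is backgrounded so the jobs overlap, and staggered so the application IDs follow submission order):

for i in $(seq 1 10); do
  hadoop jar hadoop-mapreduce-examples-2.4.1-mapr-4.0.1-SNAPSHOT.jar sleep \
      -Dmapreduce.job.queuename=root.dev -m 10 -r 10 -mt 10000 -rt 10000 &
  sleep 5   # stagger submissions so app IDs reflect submission order
done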

All the jobs were submitted by the same user, with the same priority, and to the same queue. No other jobs were running in the cluster. Jobs started executing in the order in which they were submitted (applications 0006 to 0010 were active, while 0011 to 0015 were waiting):
root@perfnode131:/opt/mapr/hadoop/hadoop-2.4.1/logs# yarn application -list
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]): 10
Application-Id                  Application-Name  Application-Type  User   Queue     State     Final-State  Progress  Tracking-URL
application_1408572781346_0012  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
application_1408572781346_0014  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
application_1408572781346_0011  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
application_1408572781346_0010  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode132:52799
application_1408572781346_0008  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode131:33766
application_1408572781346_0009  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode132:50964
application_1408572781346_0007  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode134:52966
application_1408572781346_0015  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
application_1408572781346_0006  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    9.5%      http://perfnode134:34094
application_1408572781346_0013  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A


Stopped RM1. There was a failover and RM2 became active. But the jobs seem to have started in a different order:
root@perfnode131:~/scratch/raw_rm_logs_fs_hang# yarn application -list
14/08/21 07:26:13 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]): 10
Application-Id                  Application-Name  Application-Type  User   Queue     State     Final-State  Progress  Tracking-URL
application_1408572781346_0012  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode134:59351
application_1408572781346_0014  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode132:37866
application_1408572781346_0011  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode131:59744
application_1408572781346_0010  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
application_1408572781346_0008  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
application_1408572781346_0009  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
application_1408572781346_0007  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
application_1408572781346_0015  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode134:39754
application_1408572781346_0006  Sleep job         MAPREDUCE         userA  root.dev  ACCEPTED  UNDEFINED    0%        N/A
application_1408572781346_0013  Sleep job         MAPREDUCE         userA  root.dev  RUNNING   UNDEFINED    5%        http://perfnode132:34714



The problem is this:
- The jobs that were previously in RUNNING state moved to ACCEPTED after failover.
- The jobs that were previously in ACCEPTED state moved to RUNNING after failover.
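
A possible factor (an assumption, not verified against the 2.4.1 source): on recovery the new active RM reloads application state from FileSystemRMStateStore and re-adds the apps to the scheduler; if that happens in store-enumeration order rather than submission order, or if every app is re-registered with a fresh start time (FairScheduler policies tie-break on start time), the original ordering is lost. The store layout can be inspected directly; the /rmstore prefix below is a placeholder for the configured yarn.resourcemanager.fs.state-store.uri:

hadoop fs -ls /rmstore/FSRMStateRoot/RMAppRoot
# one sub-directory per application, e.g.:
#   /rmstore/FSRMStateRoot/RMAppRoot/application_1408572781346_0006
#   /rmstore/FSRMStateRoot/RMAppRoot/application_1408572781346_0007
#   ...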



--
This message was sent by Atlassian JIRA
(v6.2#6252)
