[
https://issues.apache.org/jira/browse/YARN-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251226#comment-16251226
]
wuchang commented on YARN-7474:
-------------------------------
[~yufeigu] [~templedf]
From the ResourceManager log, I see:
At 09:04, when the problem started to occur, all NodeManagers in my YARN cluster
had just been reserved. Below is the result of grepping *Making reservation*
from the log:
{code:java}
2017-11-11 09:00:30,343 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.106 app_id=application_1507795051888_183354
2017-11-11 09:00:30,346 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.105 app_id=application_1507795051888_183354
2017-11-11 09:00:30,401 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.84 app_id=application_1507795051888_183354
2017-11-11 09:00:30,412 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.85 app_id=application_1507795051888_183354
2017-11-11 09:00:30,535 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.102 app_id=application_1507795051888_183354
2017-11-11 09:00:30,687 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.86 app_id=application_1507795051888_183354
2017-11-11 09:00:30,824 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.108 app_id=application_1507795051888_183354
2017-11-11 09:00:30,865 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.104 app_id=application_1507795051888_183354
2017-11-11 09:00:30,991 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.103 app_id=application_1507795051888_183354
2017-11-11 09:00:31,232 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.107 app_id=application_1507795051888_183354
2017-11-11 09:00:31,249 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.101 app_id=application_1507795051888_183354
2017-11-11 09:00:34,547 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183358
2017-11-11 09:01:06,277 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183342
2017-11-11 09:01:16,525 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183342
2017-11-11 09:01:25,348 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183342
2017-11-11 09:01:28,351 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183342
2017-11-11 09:02:29,658 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183342
2017-11-11 09:04:14,788 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183376
2017-11-11 09:04:26,307 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183380
2017-11-11 09:04:51,200 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183383
{code}
So, I guess it is caused by a reservation deadlock: all nodes have been
reserved, the reserved containers can never be turned into allocated ones, and
newly submitted applications cannot make reservations anymore, so they all stay
pending. Thus, my YARN cluster becomes dead.
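To make the suspected deadlock concrete, below is a minimal, self-contained sketch of
the scheduling behaviour I have in mind. It is NOT the real FSAppAttempt/FairScheduler
code; the class, node names and request sizes are invented purely for illustration. The
key assumption (which matches my reading of the 2.7.2 fair scheduler) is that a node
holding a reservation is only offered to the application that reserved it.
{code:java}
import java.util.Arrays;
import java.util.List;

public class ReservationDeadlockSketch {

  /** A toy NodeManager: capacity, current usage, and at most one reservation. */
  static class Node {
    final String name;
    final int capacityMb;
    int usedMb;
    String reservedBy;   // app id holding the reservation, or null

    Node(String name, int capacityMb, int usedMb) {
      this.name = name;
      this.capacityMb = capacityMb;
      this.usedMb = usedMb;
    }

    int availableMb() {
      return capacityMb - usedMb;
    }
  }

  /** One scheduling attempt on a node heartbeat (rough analogue of assignContainer). */
  static void onNodeHeartbeat(Node node, String appId, int requestMb) {
    if (node.reservedBy != null && !node.reservedBy.equals(appId)) {
      // A reserved node is only offered to the app holding the reservation,
      // so every other app -- including newly submitted ones -- is skipped.
      System.out.printf("%s: skipped for %s (reserved by %s)%n",
          node.name, appId, node.reservedBy);
      return;
    }
    if (requestMb <= node.availableMb()) {
      node.usedMb += requestMb;
      node.reservedBy = null;   // reservation (if any) is fulfilled
      System.out.printf("%s: allocated %d MB to %s%n", node.name, requestMb, appId);
    } else {
      // Not enough room: "Making reservation: node=... app_id=..."
      node.reservedBy = appId;
      System.out.printf("%s: reservation made by %s for %d MB%n",
          node.name, appId, requestMb);
    }
  }

  public static void main(String[] args) {
    // Three nearly full nodes: nothing fits, so app_A reserves every node,
    // much like application_1507795051888_183354 reserved the nodes in the log above.
    List<Node> nodes = Arrays.asList(
        new Node("nm-1", 8192, 7000),
        new Node("nm-2", 8192, 7000),
        new Node("nm-3", 8192, 7000));

    for (Node n : nodes) {
      onNodeHeartbeat(n, "app_A", 4096);   // too large -> reservation
    }

    // A later application cannot get a container OR a reservation on any node,
    // even though every node still has ~1 GB free. Until some reservation is
    // released (e.g. by killing applications), the whole cluster is wedged.
    for (Node n : nodes) {
      onNodeHeartbeat(n, "app_B", 512);
    }
  }
}
{code}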
> Yarn resourcemanager stop allocating container when cluster resource is
> sufficient
> -----------------------------------------------------------------------------------
>
> Key: YARN-7474
> URL: https://issues.apache.org/jira/browse/YARN-7474
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.7.2
> Reporter: wuchang
> Priority: Critical
> Attachments: rm.log
>
>
> Hadoop Version: *2.7.2*
> My YARN cluster has *(1100GB, 368 vCores)* in total, with 15 NodeManagers.
> My cluster uses the fair scheduler and I have 4 queues for different kinds of jobs:
>
> {quote}
> <allocations>
>   <queue name="queue1">
>     <minResources>100000 mb, 30 vcores</minResources>
>     <maxResources>422280 mb, 132 vcores</maxResources>
>     <maxAMShare>0.5f</maxAMShare>
>     <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>     <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>     <maxRunningApps>50</maxRunningApps>
>   </queue>
>   <queue name="queue2">
>     <minResources>25000 mb, 20 vcores</minResources>
>     <maxResources>600280 mb, 150 vcores</maxResources>
>     <maxAMShare>0.6f</maxAMShare>
>     <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>     <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>     <maxRunningApps>50</maxRunningApps>
>   </queue>
>   <queue name="queue3">
>     <minResources>100000 mb, 30 vcores</minResources>
>     <maxResources>647280 mb, 132 vcores</maxResources>
>     <maxAMShare>0.8f</maxAMShare>
>     <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>     <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>     <maxRunningApps>50</maxRunningApps>
>   </queue>
>   <queue name="queue4">
>     <minResources>80000 mb, 20 vcores</minResources>
>     <maxResources>120000 mb, 30 vcores</maxResources>
>     <maxAMShare>0.5f</maxAMShare>
>     <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>     <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>     <maxRunningApps>50</maxRunningApps>
>   </queue>
> </allocations>
> {quote}
> From about 9:00 am, all newly submitted applications got stuck for nearly 5
> hours, but the cluster resource usage was only about *(600GB, 120 vCores)*,
> which means the cluster resources were still *sufficient*.
> *The resource usage of the whole YARN cluster AND of each single queue stayed
> unchanged for 5 hours*, which is really strange. Obviously, if it were a
> resource insufficiency problem, it would be impossible for the used resources
> of all queues to stay completely unchanged for 5 hours. So it is a problem of
> the ResourceManager.
> Since my cluster scale is not large, only 15 nodes with 1100GB memory, I
> exclude the possibility described in [YARN-4618].
>
> Besides that, none of the running applications ever finished; the YARN RM
> seemed static. The RM log had no more state-change logs about running
> applications, except for logs showing that more and more applications were
> submitted and became ACCEPTED, but they never went from ACCEPTED to RUNNING.
> The cluster seemed like a zombie.
>
> I have checked the ApplicationMaster log of one running but stuck
> application:
>
> {quote}
> 2017-11-11 09:04:55,896 INFO [IPC Server handler 0 on 42899]
> org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task
> report for MAP job_1507795051888_183385. Report-size will be 4
> 2017-11-11 09:04:55,957 INFO [IPC Server handler 0 on 42899]
> org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task
> report for REDUCE job_1507795051888_183385. Report-size will be 0
> 2017-11-11 09:04:56,037 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
> Scheduling: PendingReds:0 ScheduledMaps:4 ScheduledReds:0 AssignedMaps:0
> AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0
> HostLocal:0 RackLocal:0
> 2017-11-11 09:04:56,061 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources()
> for application_1507795051888_183385: ask=6 release= 0 newContainers=0
> finishedContainers=0 resourcelimit=<memory:109760, vCores:25> knownNMs=15
> 2017-11-11 13:58:56,736 INFO [IPC Server handler 0 on 42899]
> org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job
> job_1507795051888_183385 received from appuser (auth:SIMPLE) at 10.120.207.11
> {quote}
>
> You can see that at *2017-11-11 09:04:56,061* the AM sent a resource request to
> the ResourceManager, but the RM allocated zero containers. Then there were no
> more logs for 5 hours. At 13:58, I had to kill it manually.
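> To illustrate why the AM side then stays completely silent, here is a minimal
> sketch of an allocate heartbeat loop written against the public AMRMClient API
> (it is not the MR AM's actual RMContainerAllocator code, and the container
> sizes and priority are made up): as long as the RM keeps returning zero
> containers, the AM simply waits and asks again.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
> import org.apache.hadoop.yarn.api.records.Priority;
> import org.apache.hadoop.yarn.api.records.Resource;
> import org.apache.hadoop.yarn.client.api.AMRMClient;
> import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
>
> public class AllocateLoopSketch {
>   public static void main(String[] args) throws Exception {
>     AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
>     rm.init(new Configuration());
>     rm.start();
>     rm.registerApplicationMaster("", 0, "");
>
>     // Ask for 4 map-sized containers (size and priority are made up).
>     Resource capability = Resource.newInstance(4096, 1);
>     for (int i = 0; i < 4; i++) {
>       rm.addContainerRequest(new ContainerRequest(capability, null, null,
>           Priority.newInstance(20)));
>     }
>
>     while (true) {
>       AllocateResponse response = rm.allocate(0.0f);
>       if (response.getAllocatedContainers().isEmpty()) {
>         // The situation in the log above: headroom is reported
>         // (resourcelimit=<memory:109760, vCores:25>) but zero containers come
>         // back, so the AM just waits and asks again on the next heartbeat.
>         Thread.sleep(1000);
>         continue;
>       }
>       // ... launch the returned containers ...
>       break;
>     }
>   }
> }
> {code}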
>
> After 5 hours, I killed some pending applications and then everything
> recovered: the remaining cluster resources could be allocated again, and the
> ResourceManager seemed to be alive again.
>
> I have excluded the possibility of a restriction from the maxRunningApps and
> maxAMShare configs, because they only affect a single queue, while my
> problem is that applications across the whole YARN cluster get stuck.
>
>
>
> Also, I exclude the possibility of a ResourceManager full GC problem,
> because I checked it with gcutil: no full GC happened, and the ResourceManager
> memory is OK.
>
> So, could anyone give me some suggestions?
>