[
https://issues.apache.org/jira/browse/YARN-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251226#comment-16251226
]
wuchang commented on YARN-7474:
-------------------------------
[~yufeigu] [~templedf]
From the ResourceManager log, I see:
At 09:04, when the problem started to occur, all NodeManagers in my YARN cluster
had just been reserved. Below is the result of grepping *Making reservation*
from the log:
{code:java}
2017-11-11 09:00:30,343 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.106 app_id=application_1507795051888_183354
2017-11-11 09:00:30,346 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.105 app_id=application_1507795051888_183354
2017-11-11 09:00:30,401 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.84 app_id=application_1507795051888_183354
2017-11-11 09:00:30,412 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.85 app_id=application_1507795051888_183354
2017-11-11 09:00:30,535 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.102 app_id=application_1507795051888_183354
2017-11-11 09:00:30,687 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.86 app_id=application_1507795051888_183354
2017-11-11 09:00:30,824 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.108 app_id=application_1507795051888_183354
2017-11-11 09:00:30,865 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.104 app_id=application_1507795051888_183354
2017-11-11 09:00:30,991 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.103 app_id=application_1507795051888_183354
2017-11-11 09:00:31,232 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.107 app_id=application_1507795051888_183354
2017-11-11 09:00:31,249 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.101 app_id=application_1507795051888_183354
2017-11-11 09:00:34,547 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183358
2017-11-11 09:01:06,277 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183342
2017-11-11 09:01:16,525 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183342
2017-11-11 09:01:25,348 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183342
2017-11-11 09:01:28,351 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183342
2017-11-11 09:02:29,658 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183342
2017-11-11 09:04:14,788 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183376
2017-11-11 09:04:26,307 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183380
2017-11-11 09:04:51,200 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=10.120.117.100 app_id=application_1507795051888_183383
{code}
So, I guess it is caused by a reservation deadlock: all nodes have been
reserved, the reserved containers can never be turned into allocated ones, and
newly submitted applications cannot make reservations anymore, so they all stay
pending. Thus, my YARN cluster becomes dead.
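To make the suspected deadlock concrete, below is a minimal, self-contained sketch of
the scheduling behaviour I have in mind. It is NOT the real FSAppAttempt/FairScheduler
code; the class, node names and request sizes are invented purely for illustration. The
key assumption (which matches my reading of the 2.7.2 fair scheduler) is that a node
holding a reservation is only offered to the application that reserved it.
{code:java}
import java.util.Arrays;
import java.util.List;

public class ReservationDeadlockSketch {

  /** A toy NodeManager: capacity, current usage, and at most one reservation. */
  static class Node {
    final String name;
    final int capacityMb;
    int usedMb;
    String reservedBy;   // app id holding the reservation, or null

    Node(String name, int capacityMb, int usedMb) {
      this.name = name;
      this.capacityMb = capacityMb;
      this.usedMb = usedMb;
    }

    int availableMb() {
      return capacityMb - usedMb;
    }
  }

  /** One scheduling attempt on a node heartbeat (rough analogue of assignContainer). */
  static void onNodeHeartbeat(Node node, String appId, int requestMb) {
    if (node.reservedBy != null && !node.reservedBy.equals(appId)) {
      // A reserved node is only offered to the app holding the reservation,
      // so every other app -- including newly submitted ones -- is skipped.
      System.out.printf("%s: skipped for %s (reserved by %s)%n",
          node.name, appId, node.reservedBy);
      return;
    }
    if (requestMb <= node.availableMb()) {
      node.usedMb += requestMb;
      node.reservedBy = null;   // reservation (if any) is fulfilled
      System.out.printf("%s: allocated %d MB to %s%n", node.name, requestMb, appId);
    } else {
      // Not enough room: "Making reservation: node=... app_id=..."
      node.reservedBy = appId;
      System.out.printf("%s: reservation made by %s for %d MB%n",
          node.name, appId, requestMb);
    }
  }

  public static void main(String[] args) {
    // Three nearly full nodes: nothing fits, so app_A reserves every node,
    // much like application_1507795051888_183354 reserved the nodes in the log above.
    List<Node> nodes = Arrays.asList(
        new Node("nm-1", 8192, 7000),
        new Node("nm-2", 8192, 7000),
        new Node("nm-3", 8192, 7000));

    for (Node n : nodes) {
      onNodeHeartbeat(n, "app_A", 4096);   // too large -> reservation
    }

    // A later application cannot get a container OR a reservation on any node,
    // even though every node still has ~1 GB free. Until some reservation is
    // released (e.g. by killing applications), the whole cluster is wedged.
    for (Node n : nodes) {
      onNodeHeartbeat(n, "app_B", 512);
    }
  }
}
{code}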
> Yarn resourcemanager stop allocating container when cluster resource is
> sufficient
> -----------------------------------------------------------------------------------
>
> Key: YARN-7474
> URL: https://issues.apache.org/jira/browse/YARN-7474
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.7.2
> Reporter: wuchang
> Priority: Critical
> Attachments: rm.log
>
>
> Hadoop Version: *2.7.2*
> My YARN cluster has *(1100GB, 368 vCores)* in total, with 15 NodeManagers.
> My cluster uses the fair scheduler and I have 4 queues for different kinds of jobs:
>
> {quote}
> <allocations>
>   <queue name="queue1">
>     <minResources>100000 mb, 30 vcores</minResources>
>     <maxResources>422280 mb, 132 vcores</maxResources>
>     <maxAMShare>0.5f</maxAMShare>
>     <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>     <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>     <maxRunningApps>50</maxRunningApps>
>   </queue>
>   <queue name="queue2">
>     <minResources>25000 mb, 20 vcores</minResources>
>     <maxResources>600280 mb, 150 vcores</maxResources>
>     <maxAMShare>0.6f</maxAMShare>
>     <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>     <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>     <maxRunningApps>50</maxRunningApps>
>   </queue>
>   <queue name="queue3">
>     <minResources>100000 mb, 30 vcores</minResources>
>     <maxResources>647280 mb, 132 vcores</maxResources>
>     <maxAMShare>0.8f</maxAMShare>
>     <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>     <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>     <maxRunningApps>50</maxRunningApps>
>   </queue>
>   <queue name="queue4">
>     <minResources>80000 mb, 20 vcores</minResources>
>     <maxResources>120000 mb, 30 vcores</maxResources>
>     <maxAMShare>0.5f</maxAMShare>
>     <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>     <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>     <maxRunningApps>50</maxRunningApps>
>   </queue>
> </allocations>
> {quote}
> From about 9:00 am, all newly submitted applications got stuck for nearly 5
> hours, but the cluster resource usage was only about *(600GB, 120 vCores)*,
> which means the cluster resources were still *sufficient*.
> *The resource usage of the whole YARN cluster AND of each single queue stayed
> unchanged for 5 hours*, which is really strange. Obviously, if it were a
> resource insufficiency problem, it would be impossible for the used resources
> of all queues to stay completely unchanged for 5 hours. So it is a problem of
> the ResourceManager.
> Since my cluster scale is not large, only 15 nodes with 1100GB memory, I
> exclude the possibility described in [YARN-4618].
>
> Besides that, none of the running applications ever finished; the YARN RM
> seemed static. The RM log had no more state-change logs about running
> applications, except for logs showing that more and more applications were
> submitted and became ACCEPTED, but they never went from ACCEPTED to RUNNING.
> The cluster seemed like a zombie.
>
> I have checked the ApplicationMaster log of one running but stuck
> application:
>
> {quote}
> 2017-11-11 09:04:55,896 INFO [IPC Server handler 0 on 42899]
> org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task
> report for MAP job_1507795051888_183385. Report-size will be 4
> 2017-11-11 09:04:55,957 INFO [IPC Server handler 0 on 42899]
> org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task
> report for REDUCE job_1507795051888_183385. Report-size will be 0
> 2017-11-11 09:04:56,037 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
> Scheduling: PendingReds:0 ScheduledMaps:4 ScheduledReds:0 AssignedMaps:0
> AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0
> HostLocal:0 RackLocal:0
> 2017-11-11 09:04:56,061 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources()
> for application_1507795051888_183385: ask=6 release= 0 newContainers=0
> finishedContainers=0 resourcelimit=<memory:109760, vCores:25> knownNMs=15
> 2017-11-11 13:58:56,736 INFO [IPC Server handler 0 on 42899]
> org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job
> job_1507795051888_183385 received from appuser (auth:SIMPLE) at 10.120.207.11
> {quote}
>
> You can see that at *2017-11-11 09:04:56,061* the AM sent a resource request to
> the ResourceManager, but the RM allocated zero containers. Then there were no
> more logs for 5 hours. At 13:58, I had to kill it manually.
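> To illustrate why the AM side then stays completely silent, here is a minimal
> sketch of an allocate heartbeat loop written against the public AMRMClient API
> (it is not the MR AM's actual RMContainerAllocator code, and the container
> sizes and priority are made up): as long as the RM keeps returning zero
> containers, the AM simply waits and asks again.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
> import org.apache.hadoop.yarn.api.records.Priority;
> import org.apache.hadoop.yarn.api.records.Resource;
> import org.apache.hadoop.yarn.client.api.AMRMClient;
> import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
>
> public class AllocateLoopSketch {
>   public static void main(String[] args) throws Exception {
>     AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
>     rm.init(new Configuration());
>     rm.start();
>     rm.registerApplicationMaster("", 0, "");
>
>     // Ask for 4 map-sized containers (size and priority are made up).
>     Resource capability = Resource.newInstance(4096, 1);
>     for (int i = 0; i < 4; i++) {
>       rm.addContainerRequest(new ContainerRequest(capability, null, null,
>           Priority.newInstance(20)));
>     }
>
>     while (true) {
>       AllocateResponse response = rm.allocate(0.0f);
>       if (response.getAllocatedContainers().isEmpty()) {
>         // The situation in the log above: headroom is reported
>         // (resourcelimit=<memory:109760, vCores:25>) but zero containers come
>         // back, so the AM just waits and asks again on the next heartbeat.
>         Thread.sleep(1000);
>         continue;
>       }
>       // ... launch the returned containers ...
>       break;
>     }
>   }
> }
> {code}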
>
> After 5 hours, I killed some pending applications and then everything
> recovered: the remaining cluster resources could be allocated again, and the
> ResourceManager seemed to be alive again.
>
> I have excluded the possibility of a restriction from the maxRunningApps and
> maxAMShare configs, because they only affect a single queue, while my
> problem is that applications across the whole YARN cluster get stuck.
>
>
>
> Also, I exclude the possibility of a ResourceManager full GC problem,
> because I checked it with gcutil: no full GC happened, and the ResourceManager
> memory is OK.
>
> So, could anyone give me some suggestions?
>