[
https://issues.apache.org/jira/browse/YARN-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251092#comment-16251092
]
wuchang commented on YARN-7474:
-------------------------------
[~yufeigu] [~templedf] I have attached my ResourceManager log from the period when
the problem occurred.
> Yarn resourcemanager stop allocating container when cluster resource is
> sufficient
> -----------------------------------------------------------------------------------
>
> Key: YARN-7474
> URL: https://issues.apache.org/jira/browse/YARN-7474
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.7.2
> Reporter: wuchang
> Priority: Critical
> Attachments: rm.log
>
>
> Hadoop Version: *2.7.2*
> My YARN cluster has *(1100GB, 368 vCores)* in total, with 15 NodeManagers.
> My cluster uses the FairScheduler, and I have 4 queues for different kinds of jobs:
>
> {quote}
> <allocations>
> <queue name="queue1">
> <minResources>100000 mb, 30 vcores</minResources>
> <maxResources>422280 mb, 132 vcores</maxResources>
> <maxAMShare>0.5f</maxAMShare>
> <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
> <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
> <maxRunningApps>50</maxRunningApps>
> </queue>
> <queue name="queue2">
> <minResources>25000 mb, 20 vcores</minResources>
> <maxResources>600280 mb, 150 vcores</maxResources>
> <maxAMShare>0.6f</maxAMShare>
> <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
> <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
> <maxRunningApps>50</maxRunningApps>
> </queue>
> <queue name="queue3">
> <minResources>100000 mb, 30 vcores</minResources>
> <maxResources>647280 mb, 132 vcores</maxResources>
> <maxAMShare>0.8f</maxAMShare>
> <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
> <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
> <maxRunningApps>50</maxRunningApps>
> </queue>
>
> <queue name="queue4">
> <minResources>80000 mb, 20 vcores</minResources>
> <maxResources>120000 mb, 30 vcores</maxResources>
> <maxAMShare>0.5f</maxAMShare>
> <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
> <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
> <maxRunningApps>50</maxRunningApps>
> </queue>
> </allocations>
> {quote}
> From about 9:00 am, all newly submitted applications got stuck for nearly 5 hours,
> even though the cluster resource usage was only about *(600GB, 120 vCores)*, which
> means the cluster resources were still *sufficient*.
> *The resource usage of the whole YARN cluster AND of each single queue stayed
> unchanged for 5 hours*, which is really strange. Obviously, if it were a resource
> insufficiency problem, it would be impossible for the used resources of all queues
> to show no change at all for 5 hours. So it must be a problem in the ResourceManager.
> Since my cluster is not large, only 15 nodes with 1100GB of memory, I rule out the
> possibility described in [YARN-4618].
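>
> To make the per-queue claim concrete, the used resources of each queue can be read
> back from the RM. Below is a minimal sketch using the YarnClient API (the queue
> names are the ones from the allocation file above; it assumes the local
> yarn-site.xml points at this ResourceManager):
> {quote}
> import org.apache.hadoop.yarn.api.records.QueueInfo;
> import org.apache.hadoop.yarn.client.api.YarnClient;
> import org.apache.hadoop.yarn.conf.YarnConfiguration;
>
> public class QueueUsageCheck {
>   public static void main(String[] args) throws Exception {
>     YarnClient yarnClient = YarnClient.createYarnClient();
>     yarnClient.init(new YarnConfiguration());
>     yarnClient.start();
>     // Print capacity and current usage for each FairScheduler queue
>     for (String q : new String[] {"queue1", "queue2", "queue3", "queue4"}) {
>       QueueInfo info = yarnClient.getQueueInfo("root." + q);
>       System.out.println(q
>           + " capacity=" + info.getCapacity()
>           + " currentCapacity=" + info.getCurrentCapacity()
>           + " apps=" + info.getApplications().size());
>     }
>     yarnClient.stop();
>   }
> }
> {quote}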
>
> Besides that, none of the running applications ever seem to finish, and the YARN RM
> seems static: the RM log has no further state-change entries for running
> applications, only entries about more and more applications being submitted and
> becoming ACCEPTED, but never going from ACCEPTED to RUNNING.
> The cluster seems like a zombie.
>
> I have checked the ApplicationMaster log of some running but stuck applications:
>
> {quote}
> 2017-11-11 09:04:55,896 INFO [IPC Server handler 0 on 42899]
> org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task
> report for MAP job_1507795051888_183385. Report-size will be 4
> 2017-11-11 09:04:55,957 INFO [IPC Server handler 0 on 42899]
> org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task
> report for REDUCE job_1507795051888_183385. Report-size will be 0
> 2017-11-11 09:04:56,037 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
> Scheduling: PendingReds:0 ScheduledMaps:4 ScheduledReds:0 AssignedMaps:0
> AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0
> HostLocal:0 RackLocal:0
> 2017-11-11 09:04:56,061 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources()
> for application_1507795051888_183385: ask=6 release= 0 newContainers=0
> finishedContainers=0 resourcelimit=<memory:109760, vCores:25> knownNMs=15
> 2017-11-11 13:58:56,736 INFO [IPC Server handler 0 on 42899]
> org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job
> job_1507795051888_183385 received from appuser (auth:SIMPLE) at 10.120.207.11
> {quote}
>
> You can see that at *2017-11-11 09:04:56,061* it sent a resource request to the
> ResourceManager, but the RM allocated zero containers. After that there were no
> more logs for 5 hours. At 13:58, I had to kill it manually.
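>
> When the AM keeps asking but gets zero containers back, the FairScheduler's own
> view of the queues can be dumped from the RM REST API. A minimal sketch, assuming
> the RM web UI is on the default port 8088 (rm-host is a placeholder for the actual
> ResourceManager host):
> {quote}
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.net.HttpURLConnection;
> import java.net.URL;
>
> public class SchedulerDump {
>   public static void main(String[] args) throws Exception {
>     // /ws/v1/cluster/scheduler returns the scheduler's queue tree as JSON,
>     // including per-queue usedResources and maxResources.
>     URL url = new URL("http://rm-host:8088/ws/v1/cluster/scheduler");
>     HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>     conn.setRequestProperty("Accept", "application/json");
>     try (BufferedReader in = new BufferedReader(
>         new InputStreamReader(conn.getInputStream()))) {
>       String line;
>       while ((line = in.readLine()) != null) {
>         System.out.println(line);
>       }
>     }
>     conn.disconnect();
>   }
> }
> {quote}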
>
> After 5 hours, I killed some pending applications and then everything recovered:
> the remaining cluster resources could be allocated again, and the ResourceManager
> seemed to be alive again.
>
> I have ruled out the restrictions of the maxRunningApps and maxAMShare configs,
> because they only affect a single queue, while my problem is that applications in
> the whole YARN cluster get stuck.
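>
> One way to check this is to count ACCEPTED applications per queue; if they pile up
> in every queue rather than in just one, a per-queue limit such as maxRunningApps or
> maxAMShare is unlikely to be the cause. A minimal sketch with the YarnClient API
> (again assuming the local client config points at this ResourceManager):
> {quote}
> import java.util.EnumSet;
> import java.util.HashMap;
> import java.util.Map;
> import org.apache.hadoop.yarn.api.records.ApplicationReport;
> import org.apache.hadoop.yarn.api.records.YarnApplicationState;
> import org.apache.hadoop.yarn.client.api.YarnClient;
> import org.apache.hadoop.yarn.conf.YarnConfiguration;
>
> public class StuckAppsByQueue {
>   public static void main(String[] args) throws Exception {
>     YarnClient yarnClient = YarnClient.createYarnClient();
>     yarnClient.init(new YarnConfiguration());
>     yarnClient.start();
>     // Count applications that are stuck in ACCEPTED, grouped by queue
>     Map<String, Integer> acceptedPerQueue = new HashMap<String, Integer>();
>     for (ApplicationReport app :
>         yarnClient.getApplications(EnumSet.of(YarnApplicationState.ACCEPTED))) {
>       Integer count = acceptedPerQueue.get(app.getQueue());
>       acceptedPerQueue.put(app.getQueue(), count == null ? 1 : count + 1);
>     }
>     System.out.println("ACCEPTED applications per queue: " + acceptedPerQueue);
>     yarnClient.stop();
>   }
> }
> {quote}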
>
>
>
> Also, I rule out a ResourceManager full GC problem, because I checked with
> gcutil: no full GC happened, and the ResourceManager memory usage is OK.
>
> So, could anyone give me some suggestions?
>