[
https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106507#comment-15106507
]
Karam Singh commented on YARN-4606:
-----------------------------------
From offline discussion with [~wangda]:
After looking at the logs and code, I think I understand what happened:
The root cause is that we shouldn't activate an application while it is in the
pending state. This is not a new issue; at least branch-2.6 contains it.
Activating it increases #active-users in the queue, but the newly added active
user cannot get resources (because its application is still pending), and the
old user hits the user-limit (the newly added user lowers the user-limit).
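
To illustrate the effect, here is a minimal sketch of how a per-user limit
shrinks as #active-users grows. It is a simplified model, not the actual
LeafQueue code: the class and method names are hypothetical, the numbers are
made up for illustration, and the cap imposed by UserLimitFactor is left out.

```java
public class UserLimitSketch {

    /** Integer division rounded up. */
    static int ceilDiv(int a, int b) {
        return (a + b - 1) / b;
    }

    /**
     * Simplified per-user limit in containers (an assumption, not the real
     * scheduler formula): the larger of an even split among active users and
     * the configured minimum percentage of the queue's resources.
     */
    static int userLimit(int queueResources, int activeUsers, int userLimitPercent) {
        int evenShare = ceilDiv(queueResources, activeUsers);
        int minimumShare = ceilDiv(queueResources * userLimitPercent, 100);
        return Math.max(evenShare, minimumShare);
    }

    public static void main(String[] args) {
        int queueResources = 100;   // illustrative value, not from the report
        int userLimitPercent = 25;  // matches UserLimit=25 in the report
        for (int activeUsers = 1; activeUsers <= 4; activeUsers++) {
            System.out.println("active users = " + activeUsers
                    + ", user-limit = "
                    + userLimit(queueResources, activeUsers, userLimitPercent));
        }
    }
}
```

In this model, a user already holding 60 containers is under the limit while
alone in the queue, but over it as soon as a second user is counted as active.
If that second user's application is still pending, neither user can be given
more containers.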
> Sometimes Fairness in conjunction with UserLimitPercent and UserLimitFactor
> in queue leads to situation where it appears that applications in queue are
> getting starved or stuck
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-4606
> URL: https://issues.apache.org/jira/browse/YARN-4606
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, capacityscheduler
> Affects Versions: 2.8.0, 2.7.1
> Reporter: Karam Singh
>
> Encountered while studying the behaviour of fairness with UserLimitPercent
> and UserLimitFactor during the following test:
> Ran GridMix with queue settings: Capacity=10, MaxCap=80, UserLimit=25,
> UserLimitFactor=32, FairOrderingPolicy only. Encountered an application
> starvation situation where 33 applications were running (190 apps completed
> out of 761 apps; the queue can hold 345 containers) with a total of 45
> containers running. The 12 extra containers all belonged to one app (which
> had around 18000 tasks); all other apps had only their AM running, and no
> other containers were given to any of them. After that app finished, 32 AMs
> kept running without any task containers being launched.
> GridMix was run with the following settings:
> gridmix.client.pending.queue.depth=10, gridmix.job-submission.policy=REPLAY,
> gridmix.client.submit.threads=5, gridmix.submit.multiplier=0.0001,
> gridmix.job.type=SLEEPJOB, mapreduce.framework.name=yarn,
> mapreduce.job.queuename=hive1, mapred.job.queue.name=hive1,
> gridmix.sleep.max-map-time=5000, gridmix.sleep.max-reduce-time=5000,
> gridmix.user.resolve.class=org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver
> The users file for RoundRobinUserResolver contained 4 users.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)