[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2021-09-07 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410967#comment-17410967
 ] 

Hadoop QA commented on YARN-8958:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  8s{color} 
| {color:red}{color} | {color:red} YARN-8958 does not apply to trunk. Rebase 
required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for 
help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-8958 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12946245/YARN-8958.002.patch |
| Console output | 
https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1203/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |


This message was automatically generated.



> Schedulable entities leak in fair ordering policy when recovering containers 
> between remove app attempt and remove app
> --
>
> Key: YARN-8958
> URL: https://issues.apache.org/jira/browse/YARN-8958
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8958.001.patch, YARN-8958.002.patch
>
>
> We found an NPE in ClientRMService#getApplications when querying apps with a 
> specified queue. The cause is that one app can no longer be found by calling 
> RMContextImpl#getRMApps (it is finished and has been swapped out of memory) 
> but can still be queried from the fair ordering policy.
> To reproduce the schedulable entities leak in the fair ordering policy:
> (1) create app1 and launch container1 on node1
> (2) restart the RM
> (3) remove the app1 attempt; app1 is removed from the schedulable entities
> (4) recover container1 after node1 reconnects to the RM; the state of 
> container1 changes to COMPLETED, app1 is brought back into entitiesToReorder 
> after the container is released, and app1 is then added back into the 
> schedulable entities when the scheduler calls 
> FairOrderingPolicy#getAssignmentIterator
> (5) remove app1
> To solve this problem, we should make sure that schedulableEntities can only 
> be modified by adding or removing an app attempt; a new entity should not be 
> added into schedulableEntities by the reordering process.
> {code:java}
>   protected void reorderSchedulableEntity(S schedulableEntity) {
> //remove, update comparable data, and reinsert to update position in order
> schedulableEntities.remove(schedulableEntity);
> updateSchedulingResourceUsage(
>   schedulableEntity.getSchedulingResourceUsage());
> schedulableEntities.add(schedulableEntity);
>   }
> {code}
> The code above can be improved as follows, to make sure that only an 
> existing entity can be re-added into schedulableEntities.
> {code:java}
>   protected void reorderSchedulableEntity(S schedulableEntity) {
> //remove, update comparable data, and reinsert to update position in order
> boolean exists = schedulableEntities.remove(schedulableEntity);
> updateSchedulingResourceUsage(
>   schedulableEntity.getSchedulingResourceUsage());
> if (exists) {
>   schedulableEntities.add(schedulableEntity);
> } else {
>   LOG.info("Skip reordering non-existent schedulable entity: "
>   + schedulableEntity.getId());
> }
>   }
> {code}
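For illustration, the difference between the unguarded and guarded reorder can be modeled with plain Java collections. This is a toy sketch; `ReorderLeakDemo`, `reorderUnguarded`, and `reorderGuarded` are hypothetical stand-ins for the real ordering-policy classes, not YARN APIs:

```java
import java.util.TreeSet;

public class ReorderLeakDemo {
    // Stand-in for AbstractComparatorOrderingPolicy#schedulableEntities.
    static final TreeSet<String> schedulableEntities = new TreeSet<>();

    // Original behavior: unconditionally re-insert, even if the entity
    // had already been removed (e.g. by removeSchedulableEntity).
    static void reorderUnguarded(String entity) {
        schedulableEntities.remove(entity);
        schedulableEntities.add(entity); // leaks a removed entity back in
    }

    // Patched behavior: only re-insert entities that were still present.
    static void reorderGuarded(String entity) {
        boolean exists = schedulableEntities.remove(entity);
        if (exists) {
            schedulableEntities.add(entity);
        }
    }

    public static void main(String[] args) {
        // Steps (1)-(3): app1 exists, then its attempt is removed.
        schedulableEntities.add("app1");
        schedulableEntities.remove("app1");

        // Step (4): a recovered container completes and triggers a reorder.
        reorderUnguarded("app1");
        System.out.println(schedulableEntities.contains("app1")); // true: leaked

        // With the guard, the removed entity stays out.
        schedulableEntities.remove("app1");
        reorderGuarded("app1");
        System.out.println(schedulableEntities.contains("app1")); // false
    }
}
```

The guard relies on `TreeSet#remove` returning whether the element was actually present, so no extra `contains` lookup is needed.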



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2021-09-06 Thread Shingo Furuyama (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410467#comment-17410467
 ] 

Shingo Furuyama commented on YARN-8958:
---

Hello,

I'm trying to introduce the fair ordering policy in our environment, and I'm 
interested in the progress of this issue. Are there any updates?




[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-11-16 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689143#comment-16689143
 ] 

Tao Yang commented on YARN-8958:


Hi [~cheersyang],
I think the race condition may happen when an app is being moved to another 
queue (which calls AbstractComparatorOrderingPolicy#removeSchedulableEntity 
internally) while the same app is updating its requests (which calls 
AbstractComparatorOrderingPolicy#entityRequiresReordering internally); there 
is no common lock in this scenario. Right?




[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-11-12 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683566#comment-16683566
 ] 

Weiwei Yang commented on YARN-8958:
---

Hi [~Tao Yang]

When FairOrderingPolicy#containerAllocated and #containerReleased are invoked 
from {{LeafQueue}}, they both hold the writeLock of the {{LeafQueue}}; 
similarly, #addSchedulableEntity and #removeSchedulableEntity hold the same 
writeLock. In this case, how would this race condition happen?




[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-11-02 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672875#comment-16672875
 ] 

Tao Yang commented on YARN-8958:


{quote}
I am not sure about that... The cached usage is used by the FairComparator to 
determine the ordering of schedulable entities, so we need to make sure it is 
updated correctly.
{quote}
This update only refreshes the entity's cached pending/used values to its 
current pending/used values, which is why it cannot be updated incorrectly.
{quote}
so we don't need to change the #reorderSchedulableEntity logic, doesn't that 
make sense?
{quote}
I think this change solves most cases, but not all. In a race condition 
scenario, it is possible that the schedulable entity did exist when it was put 
into entitiesToReorder but was removed by another thread immediately 
afterwards. For example:
(1) Thread1 -> the synchronized block in 
AbstractComparatorOrderingPolicy#removeSchedulableEntity has just finished 
executing, but the schedulable entity has not yet been removed from 
schedulableEntities
(2) Thread2 -> the synchronized block in 
AbstractComparatorOrderingPolicy#entityRequiresReordering finishes executing 
and puts the schedulable entity back into entitiesToReorder
(3) Thread1 -> AbstractComparatorOrderingPolicy#removeSchedulableEntity 
removes the schedulable entity from schedulableEntities
Thoughts?
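The interleaving above can be replayed deterministically with a toy model, where sequential calls stand in for the two threads and plain collections for the real policy classes (`ContainsGuardRaceDemo` and its members are hypothetical names, not YARN APIs):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class ContainsGuardRaceDemo {
    // Stand-ins for the ordering policy's internal collections.
    static final TreeSet<String> schedulableEntities = new TreeSet<>();
    static final Map<String, String> entitiesToReorder = new HashMap<>();

    // The contains()-guarded variant: only queue entities still present.
    static void entityRequiresReordering(String entity) {
        if (schedulableEntities.contains(entity)) {
            entitiesToReorder.put(entity, entity);
        }
    }

    public static void main(String[] args) {
        schedulableEntities.add("app1");

        // Thread2: the contains() check passes while app1 is still present,
        // so app1 is queued for reordering.
        entityRequiresReordering("app1");

        // Thread1: removeSchedulableEntity now removes app1.
        schedulableEntities.remove("app1");

        // app1 sits in entitiesToReorder although it is no longer
        // schedulable; an unguarded reorder on the next
        // getAssignmentIterator call would re-insert it, reproducing the leak.
        System.out.println(entitiesToReorder.containsKey("app1")); // true
        System.out.println(schedulableEntities.contains("app1"));  // false
    }
}
```

This is why the guard inside reorderSchedulableEntity itself closes the window that a check at enqueue time cannot.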




[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-11-02 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672817#comment-16672817
 ] 

Weiwei Yang commented on YARN-8958:
---

Hi [~Tao Yang]
{quote}This call can only make the schedulable entity more correct, because 
internally it just refreshes the entity's own cached resource usage and does 
no harm to others.
{quote}
I am not sure about that... The cached usage is used by the FairComparator to 
determine the ordering of schedulable entities, so we need to make sure it is 
updated correctly. Going back to the fix, I came up with an alternative 
approach:
 # {{schedulableEntities}} maintains the full list of apps for the ordering 
policy; entities are added/removed when an app is added, removed, or updated;
 # {{entitiesToReorder}} maintains the apps that need to be re-ordered; 
entities are added/removed when a container is allocated, released, or updated 
(resource usage changes).

Under such a context, {{entitiesToReorder}} should be a subset of 
{{schedulableEntities}}. So why don't we ensure that with the following change?
{code:java}
protected void entityRequiresReordering(S schedulableEntity) {
  synchronized (entitiesToReorder) {
    if (schedulableEntities.contains(schedulableEntity)) {
      entitiesToReorder.put(schedulableEntity.getId(), schedulableEntity);
    }
  }
}
{code}
That way we don't need to change the #reorderSchedulableEntity logic. Doesn't 
that make sense?

 




[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-11-02 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672606#comment-16672606
 ] 

Tao Yang commented on YARN-8958:


Hi [~cheersyang], thanks for the review.
{quote}
Can we also only do the resource usage update when the schedulable entity 
exists? Otherwise, would the resource usage get incorrectly updated?
{quote}
I think it is OK to update the resource usage for a non-existent schedulable 
entity: the entity may not be finished but merely moved from the 
ordering-policy to the pending-ordering-policy, and it may still need this 
update. This call can only make the schedulable entity more correct, because 
internally it just refreshes the entity's own cached resource usage and does 
no harm to others.




[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-11-01 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672583#comment-16672583
 ] 

Weiwei Yang commented on YARN-8958:
---

Hi [~Tao Yang]

Can we also only do the resource usage update when the schedulable entity 
exists?
{code:java}
protected void reorderSchedulableEntity(S schedulableEntity) {
  if (schedulableEntities.remove(schedulableEntity)) {
    updateSchedulingResourceUsage(
        schedulableEntity.getSchedulingResourceUsage());
    schedulableEntities.add(schedulableEntity);
  } else {
    LOG.info("Skip reordering non-existent schedulable entity: "
        + schedulableEntity.getId());
  }
}
{code}
Otherwise, would the resource usage get incorrectly updated?
Please take a look, thanks.




[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-10-31 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671044#comment-16671044
 ] 

Weiwei Yang commented on YARN-8958:
---

Hi [~Tao Yang], no worries. I have seen it several times; it should not be 
caused by this patch.

The fix makes sense to me; I'll take one more look today. Thanks.

> Schedulable entities leak in fair ordering policy when recovering containers 
> between remove app attempt and remove app
> --
>
> Key: YARN-8958
> URL: https://issues.apache.org/jira/browse/YARN-8958
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8958.001.patch, YARN-8958.002.patch
>
>
> We found a NPE in ClientRMService#getApplications when querying apps with 
> specified queue. The cause is that there is one app which can't be found by 
> calling RMContextImpl#getRMApps(is finished and swapped out of memory) but 
> still can be queried from fair ordering policy.
> To reproduce schedulable entities leak in fair ordering policy:
> (1) create app1 and launch container1 on node1
> (2) restart RM
> (3) remove app1 attempt, app1 is removed from the schedulable entities.
> (4) recover container1 after node1 reconnected to RM, then the state of 
> contianer1 is changed to COMPLETED, app1 is bring back to entitiesToReorder 
> after container released, then app1 will be added back into schedulable 
> entities after calling FairOrderingPolicy#getAssignmentIterator by scheduler.
> (5) remove app1
> To solve this problem, we should make sure schedulableEntities can only be 
> affected by add or remove app attempt, new entity should not be added into 
> schedulableEntities by reordering process.
> {code:java}
>   protected void reorderSchedulableEntity(S schedulableEntity) {
> //remove, update comparable data, and reinsert to update position in order
> schedulableEntities.remove(schedulableEntity);
> updateSchedulingResourceUsage(
>   schedulableEntity.getSchedulingResourceUsage());
> schedulableEntities.add(schedulableEntity);
>   }
> {code}
> Related codes above can be improved as follow to make sure only existent 
> entity can be re-add into schedulableEntities.
> {code:java}
>   protected void reorderSchedulableEntity(S schedulableEntity) {
> //remove, update comparable data, and reinsert to update position in order
> boolean exists = schedulableEntities.remove(schedulableEntity);
> updateSchedulingResourceUsage(
>   schedulableEntity.getSchedulingResourceUsage());
> if (exists) {
>   schedulableEntities.add(schedulableEntity);
> } else {
>   LOG.info("Skip reordering non-existent schedulable entity: "
>   + schedulableEntity.getId());
> }
>   }
> {code}
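The effect of the guarded re-insert can be illustrated with a minimal, self-contained model: a `TreeSet` stands in for the real schedulableEntities, and `ReorderLeakDemo` with its two reorder variants is a hypothetical sketch for this discussion, not actual YARN code.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Minimal model of the leak: a reorder that blindly re-inserts vs. a
// guarded reorder that only re-inserts entities that were actually present.
public class ReorderLeakDemo {
  static TreeSet<String> schedulableEntities =
      new TreeSet<>(Comparator.naturalOrder());

  // Buggy reorder: re-adds the entity even if it was never present,
  // which is how a removed app leaks back into the set.
  static void reorderBuggy(String entity) {
    schedulableEntities.remove(entity);
    // (comparable data would be updated here)
    schedulableEntities.add(entity);
  }

  // Fixed reorder: only re-insert when remove() found the entity.
  static void reorderFixed(String entity) {
    boolean exists = schedulableEntities.remove(entity);
    if (exists) {
      schedulableEntities.add(entity);
    }
  }

  public static void main(String[] args) {
    schedulableEntities.add("app1");
    schedulableEntities.remove("app1");  // remove app attempt (step 3)
    reorderBuggy("app1");                // container recovery triggers reorder (step 4)
    System.out.println(schedulableEntities.contains("app1")); // true -> leaked

    schedulableEntities.clear();
    schedulableEntities.add("app1");
    schedulableEntities.remove("app1");
    reorderFixed("app1");
    System.out.println(schedulableEntities.contains("app1")); // false -> no leak
  }
}
```

The key point is that `TreeSet.remove` already reports whether the element was present, so the fix costs nothing extra.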



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-10-31 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671030#comment-16671030
 ] 

Tao Yang commented on YARN-8958:


There is no UT failure, but Hadoop QA still gave -1 for unit. 
[~cheersyang], can you help to look into what happened?




[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-10-31 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671024#comment-16671024
 ] 

Tao Yang commented on YARN-8958:


Thanks [~cheersyang] for the review.
{quote}
In testSchedulableEntitiesLeak, why the app attempt is finished, but then you 
try to recover a container for this app? I suppose by then all containers of 
this app attempt are done correct?
{quote}
This can happen after an RM restart, which is step 2 of the reproduction 
process. Removing the app attempt (step 3) may happen before the NM reconnects 
to the RM and recovers containers (step 4), so not all containers are done 
when the app attempt finishes.




[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-10-31 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670851#comment-16670851
 ] 

Hadoop QA commented on YARN-8958:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
28s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 28s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  9s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}105m 51s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}164m 21s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-8958 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12946245/YARN-8958.002.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 31032a523279 3.13.0-144-generic #193-Ubuntu SMP Thu Mar 15 
17:03:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 6668c19 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/22388/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/22388/testReport/ |
| Max. process+thread count | 944 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
| Console output | 

[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-10-31 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670258#comment-16670258
 ] 

Weiwei Yang commented on YARN-8958:
---

Hi [~Tao Yang]

Thanks for creating the issue and the fix. I am trying to understand this 
issue and have a question about the UT.

In {{testSchedulableEntitiesLeak}}, why is the app attempt finished, but then 
you try to recover a container for this app? I suppose by then all containers 
of this app attempt are done, correct?




[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-10-30 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16669111#comment-16669111
 ] 

Tao Yang commented on YARN-8958:


Attached v2 patch to fix the UT failures.
(1) Set 
{{yarn.resourcemanager.store.class=org.apache.hadoop.yarn.server.resourcemanager.recovery.MemoryRMStateStore}}
 for TestFairOrderingPolicy#testSchedulableEntitiesLeak to avoid the RM 
recovering apps from state left over by a former test case.
(2) TestCapacityScheduler#testAllocateReorder has always had a problem: it 
only activates one app but expects both apps. It could pass before because 
app2 was added into the schedulable entities by calling 
CapacityScheduler#allocate explicitly in this test case (adding app2 into 
entitiesToReorder and then into schedulableEntities) even though app2 was 
still not activated. This problem is exposed by this patch; if 
{{yarn.scheduler.capacity.maximum-am-resource-percent=1.0}} is set, both apps 
can be activated in this test case and it passes again.
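For illustration only, the two test-configuration tweaks described above can be sketched as follows; the property keys and values are taken from the comment, but the plain `Map` stands in for a Hadoop `Configuration` object, so this is not the actual test code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TestConfSketch {
  // Build the test configuration described above (sketch only).
  static Map<String, String> buildTestConf() {
    Map<String, String> conf = new LinkedHashMap<>();
    // (1) in-memory state store, so the test does not recover apps
    //     left over by an earlier test case
    conf.put("yarn.resourcemanager.store.class",
        "org.apache.hadoop.yarn.server.resourcemanager.recovery."
            + "MemoryRMStateStore");
    // (2) let AMs use the whole queue, so both apps can be activated
    conf.put("yarn.scheduler.capacity.maximum-am-resource-percent", "1.0");
    return conf;
  }

  public static void main(String[] args) {
    System.out.println(buildTestConf());
  }
}
```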




[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-10-30 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668954#comment-16668954
 ] 

Hadoop QA commented on YARN-8958:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 22s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 22s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}105m  2s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}161m 32s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.policy.TestFairOrderingPolicy |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-8958 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12946177/YARN-8958.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 8e70d66ca204 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7757331 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/22379/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/22379/testReport/ |
| Max. process+thread count | 894 (vs. ulimit of 1) |
| modules | C: 

[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2018-10-30 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668305#comment-16668305
 ] 

Tao Yang commented on YARN-8958:


Attached v1 patch for review.
