[jira] [Updated] (YARN-10934) LeafQueue activateApplications NPE

2021-09-07 Thread Yuan LUO (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuan LUO updated YARN-10934:

Summary: LeafQueue activateApplications NPE  (was: activateApplications NPE)

> LeafQueue activateApplications NPE
> --
>
> Key: YARN-10934
> URL: https://issues.apache.org/jira/browse/YARN-10934
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.3.1
>Reporter: Yuan LUO
>Priority: Major
>
> Our production YARN cluster runs Hadoop 3.3.1. We changed
> DefaultResourceCalculator to DominantResourceCalculator and restarted the RM;
> the RM then crashed with the exception stack below. I think this is a serious
> bug and hope someone can follow up and fix it.
> 2021-08-30 21:00:59,114 ERROR event.EventDispatcher 
> (MarkerIgnoringBase.java:error(159)) - Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.activateApplications(LeafQueue.java:868)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.removeApplicationAttempt(LeafQueue.java:1014)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.finishApplicationAttempt(LeafQueue.java:972)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1188)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1904)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
> at java.base/java.lang.Thread.run(Thread.java:834)
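For anyone reproducing this: the calculator switch described above is normally
made in capacity-scheduler.xml. The property name below is the standard
CapacityScheduler setting; treat the snippet as an illustrative fragment
rather than a full config.

{code:xml}
<!-- capacity-scheduler.xml: switch the scheduler from the default
     memory-only calculator to DRF. -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
{code}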



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10934) activateApplications NPE

2021-09-07 Thread Yuan LUO (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411636#comment-17411636
 ] 

Yuan LUO commented on YARN-10934:
-

[~snemeth] Thanks for your reply; I have fixed the title, it is an NPE. I will 
add some information in an attachment.

> activateApplications NPE
> 
>
> Key: YARN-10934
> URL: https://issues.apache.org/jira/browse/YARN-10934
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.3.1
>Reporter: Yuan LUO
>Priority: Major
>
> Our production YARN cluster runs Hadoop 3.3.1. We changed
> DefaultResourceCalculator to DominantResourceCalculator and restarted the RM;
> the RM then crashed with the exception stack below. I think this is a serious
> bug and hope someone can follow up and fix it.
> 2021-08-30 21:00:59,114 ERROR event.EventDispatcher 
> (MarkerIgnoringBase.java:error(159)) - Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.activateApplications(LeafQueue.java:868)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.removeApplicationAttempt(LeafQueue.java:1014)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.finishApplicationAttempt(LeafQueue.java:972)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1188)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1904)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
> at java.base/java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10934) activateApplications NPE

2021-09-07 Thread Yuan LUO (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuan LUO updated YARN-10934:

Summary: activateApplications NPE  (was: activateApplications NPL)

> activateApplications NPE
> 
>
> Key: YARN-10934
> URL: https://issues.apache.org/jira/browse/YARN-10934
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.3.1
>Reporter: Yuan LUO
>Priority: Major
>
> Our production YARN cluster runs Hadoop 3.3.1. We changed
> DefaultResourceCalculator to DominantResourceCalculator and restarted the RM;
> the RM then crashed with the exception stack below. I think this is a serious
> bug and hope someone can follow up and fix it.
> 2021-08-30 21:00:59,114 ERROR event.EventDispatcher 
> (MarkerIgnoringBase.java:error(159)) - Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.activateApplications(LeafQueue.java:868)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.removeApplicationAttempt(LeafQueue.java:1014)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.finishApplicationAttempt(LeafQueue.java:972)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1188)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1904)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
> at java.base/java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10929) Refrain from creating new Configuration object in AbstractManagedParentQueue#initializeLeafQueueConfigs

2021-09-07 Thread jackwangcs (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411624#comment-17411624
 ] 

jackwangcs commented on YARN-10929:
---

Hi [~snemeth], it seems I don't have permission to assign this issue to 
myself. Could you assign it to me?
Thanks.

> Refrain from creating new Configuration object in 
> AbstractManagedParentQueue#initializeLeafQueueConfigs
> ---
>
> Key: YARN-10929
> URL: https://issues.apache.org/jira/browse/YARN-10929
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> AbstractManagedParentQueue#initializeLeafQueueConfigs creates a new 
> CapacitySchedulerConfiguration with templated configs only. We should stop 
> doing this. 
> Also, this method sorts the config keys, but in the end the configs are added 
> to the Configuration object, which is essentially an enhanced Map, so the 
> sorted order is not preserved.
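A quick illustration of why that sorting is wasted effort, using plain JDK
types (Hadoop's Configuration is backed by java.util.Properties, which is
hashtable-based and keeps no insertion order):

{code:java}
import java.util.*;

public class SortThenInsert {
  public static void main(String[] args) {
    List<String> keys = new ArrayList<>(List.of("b.weight", "a.capacity", "c.acls"));
    Collections.sort(keys);                 // keys now sorted: a, b, c
    Map<String, String> sink = new HashMap<>();
    for (String k : keys) {
      sink.put(k, "value");                 // insertion order is not retained
    }
    System.out.println(sink.keySet());      // arbitrary iteration order
  }
}
{code}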



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Attachment: YARN-10935.001.patch

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit ends up much higher than the *{{Max Application Master 
> Resources}}* limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411398#comment-17411398
 ] 

Eric Payne commented on YARN-10935:
---

For example, in the following screenshot, the advertising queue is a child of 
root and the parent of 3 sub-queues. One of the sub-queues has consumed all of 
the advertising parent queue's resources. Two apps have been submitted to a 
second sub-queue: one is schedulable and one is not. The second app is 
non-schedulable because starting it would put the queue above the queue's AM 
limit:

 !Screen Shot 2021-09-07 at 12.49.52 PM.png! 

See that the second app can't start because of the following:

 !Screen Shot 2021-09-07 at 12.55.37 PM.png! 
Note that, in this example, the max queue AM limit should never go below 2GB 
memory and 16 vCores.
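To make the reported gap concrete, here is a simplified, hypothetical model of
the arithmetic; the class, method, and variable names below are made up for
illustration, and this is not the real LeafQueue AM-resource-limit code. The
per-user AM limit is derived from the queue's configured capacity, while the
queue AM limit shrinks with whatever the parent hierarchy has left, so it
collapses once one resource is exhausted:

{code:java}
// Hypothetical sketch only; NOT the actual CapacityScheduler computation.
public class AmLimitSketch {
  public static void main(String[] args) {
    // Illustrative numbers: a 20 GB queue with a 10% AM share.
    long queueMemMb = 20_480, usedMemMb = 20_480;   // all memory consumed
    double amPercent = 0.10;                        // maximum-am-resource-percent

    // Per-user AM limit from configured capacity: stays at ~2 GB.
    long perUserAmMemMb = (long) (queueMemMb * amPercent);

    // Queue AM limit from what remains in the hierarchy: collapses to 0.
    long remainingMemMb = Math.max(0, queueMemMb - usedMemMb);
    long queueAmMemMb = (long) (remainingMemMb * amPercent);

    System.out.printf("per-user: %d MB, queue: %d MB%n",
        perUserAmMemMb, queueAmMemMb);              // per-user: 2048 MB, queue: 0 MB
  }
}
{code}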


> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit ends up much higher than the *{{Max Application Master 
> Resources}}* limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Attachment: Screen Shot 2021-09-07 at 12.55.37 PM.png

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit ends up much higher than the *{{Max Application Master 
> Resources}}* limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Attachment: Screen Shot 2021-09-07 at 12.49.52 PM.png

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit ends up much higher than the *{{Max Application Master 
> Resources}}* limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Summary: AM Total Queue Limit goes below per-user AM Limit if parent is 
full.  (was: AM Total Queue Limit goes below per-uwer AM Limit if parent is 
full.)

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit ends up much higher than the *{{Max Application Master 
> Resources}}* limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10935) AM Total Queue Limit goes below per-uwer AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)
Eric Payne created YARN-10935:
-

 Summary: AM Total Queue Limit goes below per-uwer AM Limit if 
parent is full.
 Key: YARN-10935
 URL: https://issues.apache.org/jira/browse/YARN-10935
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, capacityscheduler
Reporter: Eric Payne


This happens when DRF is enabled and all of one resource is consumed but the 
second resource still has plenty available.

This is reproducible by setting up a parent queue where the capacity and max 
capacity are the same, with 2 or more sub-queues whose max capacity is 100%.

In one of the sub-queues, start a long-running app that consumes all resources 
in the parent queue's hierarchy. This app will consume all of the memory but 
not very many vcores (for example).

In a second queue, submit an app. The *{{Max Application Master Resources Per 
User}}* limit ends up much higher than the *{{Max Application Master 
Resources}}* limit.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10935) AM Total Queue Limit goes below per-uwer AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne reassigned YARN-10935:
-

Assignee: Eric Payne

> AM Total Queue Limit goes below per-uwer AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit ends up much higher than the *{{Max Application Master 
> Resources}}* limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10928) Support default queue properties of capacity scheduler to simplify configuration management

2021-09-07 Thread Weiwei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411357#comment-17411357
 ] 

Weiwei Yang commented on YARN-10928:


Sure, granted the contributor role to [~Weihao Zheng]. Thanks

> Support default queue properties of capacity scheduler to simplify 
> configuration management
> ---
>
> Key: YARN-10928
> URL: https://issues.apache.org/jira/browse/YARN-10928
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Weihao Zheng
>Assignee: Weihao Zheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In practice, one user often owns many queues in an organization's cluster 
> for different business uses. These queues often share the same properties, 
> such as minimum-user-limit-percent and user-limit-factor. Users have to write 
> one property for every queue they use if they want to customize these shared 
> properties. Adding default queue properties for these cases will simplify the 
> capacity scheduler's configuration file and make it easy to adjust queues' 
> common properties. 
>   
>   CHANGES:
> Add two properties as queue-level defaults in the capacity scheduler's 
> configuration:
>  * {{yarn.scheduler.capacity.minimum-user-limit-percent}}
>  * {{yarn.scheduler.capacity.user-limit-factor}}
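A hypothetical capacity-scheduler.xml fragment showing how the proposal could
read, assuming the proposed semantics that an explicit per-queue property
still overrides the global default (the queue name root.adhoc and the values
are made up for illustration):

{code:xml}
<!-- Global default proposed by this issue. -->
<property>
  <name>yarn.scheduler.capacity.user-limit-factor</name>
  <value>2</value>
</property>
<!-- Hypothetical per-queue override; explicit settings would still win. -->
<property>
  <name>yarn.scheduler.capacity.root.adhoc.user-limit-factor</name>
  <value>10</value>
</property>
{code}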



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10928) Support default queue properties of capacity scheduler to simplify configuration management

2021-09-07 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reassigned YARN-10928:
--

Assignee: Weihao Zheng

> Support default queue properties of capacity scheduler to simplify 
> configuration management
> ---
>
> Key: YARN-10928
> URL: https://issues.apache.org/jira/browse/YARN-10928
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Weihao Zheng
>Assignee: Weihao Zheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In practice, one user often owns many queues in an organization's cluster 
> for different business uses. These queues often share the same properties, 
> such as minimum-user-limit-percent and user-limit-factor. Users have to write 
> one property for every queue they use if they want to customize these shared 
> properties. Adding default queue properties for these cases will simplify the 
> capacity scheduler's configuration file and make it easy to adjust queues' 
> common properties. 
>   
>   CHANGES:
> Add two properties as queue-level defaults in the capacity scheduler's 
> configuration:
>  * {{yarn.scheduler.capacity.minimum-user-limit-percent}}
>  * {{yarn.scheduler.capacity.user-limit-factor}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10872) Replace getPropsWithPrefix calls in AutoCreatedQueueTemplate

2021-09-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-10872:
--
Labels: pull-request-available  (was: )

> Replace getPropsWithPrefix calls in AutoCreatedQueueTemplate
> 
>
> Key: YARN-10872
> URL: https://issues.apache.org/jira/browse/YARN-10872
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Benjamin Teke
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With the introduction of YARN-10838, it is now possible to optimise 
> AutoCreatedQueueTemplate and replace calls of getPropsWithPrefix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner

2021-09-07 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411333#comment-17411333
 ] 

Prabhu Joseph commented on YARN-10884:
--

Thanks [~Swathi Chandrashekar] for the patch. I have committed it to trunk.

> EntityGroupFSTimelineStore fails to parse log files which has empty owner
> -
>
> Key: YARN-10884
> URL: https://issues.apache.org/jira/browse/YARN-10884
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: SwathiChandrashekar
>Priority: Major
> Fix For: 3.3.1
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848], 
> the Wasb FileSystem sets the owner to empty during an append operation. 
> ATS 1.5 fails to read such files with the error below: 
> {code:java}
>  java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> It reads the file's owner to check ACLs. When ACL checks are disabled, this 
> is not required. I suggest falling back to an anonymous user when the owner 
> is empty:
> {code}
> if (owner.isEmpty()) {
>   user = "anonymous";
> } else {
>   user = owner;
> }
> {code}
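A minimal sketch of where such a guard could sit, assuming the surrounding
names (per the stack trace, the real change belongs around the
UserGroupInformation.createRemoteUser call in LogInfo; the class and method
below are made up to keep the snippet self-contained):

{code:java}
import org.apache.hadoop.security.UserGroupInformation;

public class OwnerGuardSketch {
  // 'owner' is the file owner read from the log file's status.
  static UserGroupInformation remoteUserFor(String owner) {
    String user = (owner == null || owner.isEmpty()) ? "anonymous" : owner;
    return UserGroupInformation.createRemoteUser(user);
  }
}
{code}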



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10884) EntityGroupFSTimelineStore fails to parse log files which has empty owner

2021-09-07 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10884:
-
Labels:   (was: pull-request-available)

> EntityGroupFSTimelineStore fails to parse log files which has empty owner
> -
>
> Key: YARN-10884
> URL: https://issues.apache.org/jira/browse/YARN-10884
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 3.3.1
>Reporter: Prabhu Joseph
>Assignee: SwathiChandrashekar
>Priority: Major
> Fix For: 3.3.1
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Due to [HADOOP-17848|https://issues.apache.org/jira/browse/HADOOP-17848], 
> the Wasb FileSystem sets the owner to empty during an append operation. 
> ATS 1.5 fails to read such files with the error below: 
> {code:java}
>  java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1271)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1258)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:141)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:114)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:701)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:675)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:888)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> It reads the file's owner to check ACLs. When ACL checks are disabled, this 
> is not required. I suggest falling back to an anonymous user when the owner 
> is empty:
> {code}
> if (owner.isEmpty()) {
>   user = "anonymous";
> } else {
>   user = owner;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10934) activateApplications NPL

2021-09-07 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411281#comment-17411281
 ] 

Szilard Nemeth edited comment on YARN-10934 at 9/7/21, 2:38 PM:


Hi [~luoyuan],
Can you attach a full yarn-site.xml config file here? Probably something 
other than the DominantResourceCalculator also comes into play here.
If you have sensitive info like queue names, you may mask it or replace the 
data with dummy values.

A question: what is "NPL" in the title? Did you mean NPE 
(NullPointerException) or something else?
Thanks.


was (Author: snemeth):
Hi [~luoyuan],
Can you attach a full yarn-site.xml config file here? Probably something 
other than the DominantResourceCalculator also comes into play here.
If you have sensitive info like queue names, you may mask it or replace the 
data with dummy values.

> activateApplications NPL
> 
>
> Key: YARN-10934
> URL: https://issues.apache.org/jira/browse/YARN-10934
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.3.1
>Reporter: Yuan LUO
>Priority: Major
>
> Our production YARN cluster runs Hadoop 3.3.1. We changed
> DefaultResourceCalculator to DominantResourceCalculator and restarted the RM;
> the RM then crashed with the exception stack below. I think this is a serious
> bug and hope someone can follow up and fix it.
> 2021-08-30 21:00:59,114 ERROR event.EventDispatcher 
> (MarkerIgnoringBase.java:error(159)) - Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.activateApplications(LeafQueue.java:868)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.removeApplicationAttempt(LeafQueue.java:1014)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.finishApplicationAttempt(LeafQueue.java:972)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1188)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1904)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
> at java.base/java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10934) activateApplications NPL

2021-09-07 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411281#comment-17411281
 ] 

Szilard Nemeth commented on YARN-10934:
---

Hi [~luoyuan],
Can you attach a full yarn-site.xml config file here? Probably something 
other than the DominantResourceCalculator also comes into play here.
If you have sensitive info like queue names, you may mask it or replace the 
data with dummy values.

> activateApplications NPL
> 
>
> Key: YARN-10934
> URL: https://issues.apache.org/jira/browse/YARN-10934
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.3.1
>Reporter: Yuan LUO
>Priority: Major
>
> Our production YARN cluster runs Hadoop 3.3.1. We changed
> DefaultResourceCalculator to DominantResourceCalculator and restarted the RM;
> the RM then crashed with the exception stack below. I think this is a serious
> bug and hope someone can follow up and fix it.
> 2021-08-30 21:00:59,114 ERROR event.EventDispatcher 
> (MarkerIgnoringBase.java:error(159)) - Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.activateApplications(LeafQueue.java:868)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.removeApplicationAttempt(LeafQueue.java:1014)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.finishApplicationAttempt(LeafQueue.java:972)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1188)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1904)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
> at java.base/java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10917) Investigate and simplify CapacitySchedulerConfigValidator#validateQueueHierarchy

2021-09-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-10917:
--
Labels: pull-request-available  (was: )

> Investigate and simplify 
> CapacitySchedulerConfigValidator#validateQueueHierarchy
> 
>
> Key: YARN-10917
> URL: https://issues.apache.org/jira/browse/YARN-10917
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Tamas Domok
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10934) activateApplications NPL

2021-09-07 Thread Yuan LUO (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411200#comment-17411200
 ] 

Yuan LUO commented on YARN-10934:
-

Hi [~zhuqi] [~gandras] [~bteke] [~taoyang], could you have a look at this 
issue? Thanks!

> activateApplications NPL
> 
>
> Key: YARN-10934
> URL: https://issues.apache.org/jira/browse/YARN-10934
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.3.1
>Reporter: Yuan LUO
>Priority: Major
>
> Our production YARN cluster runs Hadoop 3.3.1. We changed
> DefaultResourceCalculator to DominantResourceCalculator and restarted the RM;
> the RM then crashed with the exception stack below. I think this is a serious
> bug and hope someone can follow up and fix it.
> 2021-08-30 21:00:59,114 ERROR event.EventDispatcher 
> (MarkerIgnoringBase.java:error(159)) - Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.activateApplications(LeafQueue.java:868)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.removeApplicationAttempt(LeafQueue.java:1014)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.finishApplicationAttempt(LeafQueue.java:972)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1188)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1904)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
> at java.base/java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10934) activateApplications NPL

2021-09-07 Thread Yuan LUO (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuan LUO updated YARN-10934:

Description: 
Our production YARN cluster runs Hadoop 3.3.1. We changed 
DefaultResourceCalculator to DominantResourceCalculator and restarted the RM; 
the RM then crashed with the exception stack below. I think this is a serious 
bug and hope someone can follow up and fix it.

2021-08-30 21:00:59,114 ERROR event.EventDispatcher 
(MarkerIgnoringBase.java:error(159)) - Error in handling event type 
APP_ATTEMPT_REMOVED to the Event Dispatcher
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.activateApplications(LeafQueue.java:868)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.removeApplicationAttempt(LeafQueue.java:1014)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.finishApplicationAttempt(LeafQueue.java:972)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1188)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1904)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
at java.base/java.lang.Thread.run(Thread.java:834)

  was:
Our production YARN cluster runs Hadoop 3.3.1. We changed 
DefaultResourceCalculator to DominantResourceCalculator; our RM then crashed 
with the exception stack below:

2021-08-30 21:00:59,114 ERROR event.EventDispatcher 
(MarkerIgnoringBase.java:error(159)) - Error in handling event type 
APP_ATTEMPT_REMOVED to the Event Dispatcher
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.activateApplications(LeafQueue.java:868)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.removeApplicationAttempt(LeafQueue.java:1014)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.finishApplicationAttempt(LeafQueue.java:972)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1188)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1904)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
at java.base/java.lang.Thread.run(Thread.java:834)


> activateApplications NPL
> 
>
> Key: YARN-10934
> URL: https://issues.apache.org/jira/browse/YARN-10934
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.3.1
>Reporter: Yuan LUO
>Priority: Major
>
> Our production YARN cluster runs Hadoop 3.3.1. We changed
> DefaultResourceCalculator to DominantResourceCalculator and restarted the RM;
> the RM then crashed with the exception stack below. I think this is a serious
> bug and hope someone can follow up and fix it.
> 2021-08-30 21:00:59,114 ERROR event.EventDispatcher 
> (MarkerIgnoringBase.java:error(159)) - Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.activateApplications(LeafQueue.java:868)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.removeApplicationAttempt(LeafQueue.java:1014)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.finishApplicationAttempt(LeafQueue.java:972)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1188)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1904)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
> at java.base/java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Created] (YARN-10934) activateApplications NPL

2021-09-07 Thread Yuan LUO (Jira)
Yuan LUO created YARN-10934:
---

 Summary: activateApplications NPL
 Key: YARN-10934
 URL: https://issues.apache.org/jira/browse/YARN-10934
 Project: Hadoop YARN
  Issue Type: Bug
  Components: RM
Affects Versions: 3.3.1
Reporter: Yuan LUO


Our production YARN cluster runs Hadoop 3.3.1. We changed 
DefaultResourceCalculator to DominantResourceCalculator; our RM then crashed 
with the exception stack below:

2021-08-30 21:00:59,114 ERROR event.EventDispatcher 
(MarkerIgnoringBase.java:error(159)) - Error in handling event type 
APP_ATTEMPT_REMOVED to the Event Dispatcher
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.activateApplications(LeafQueue.java:868)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.removeApplicationAttempt(LeafQueue.java:1014)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.finishApplicationAttempt(LeafQueue.java:972)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1188)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1904)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
at java.base/java.lang.Thread.run(Thread.java:834)
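For readers unfamiliar with this class of failure, here is a generic
illustration of the pattern only; this is NOT the actual LeafQueue code, and
the class, map, and method names are made up. A per-partition lookup returns
null for a key that was never populated, and unguarded use of the boxed value
throws the NPE:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class NpePatternSketch {
  private final Map<String, Long> amLimitByPartition = new HashMap<>();

  long activate(String partition) {
    Long limit = amLimitByPartition.get(partition);  // null for unseen keys
    // return limit;                                 // unboxing null -> NPE
    return limit != null ? limit : 0L;               // defensive default
  }

  public static void main(String[] args) {
    System.out.println(new NpePatternSketch().activate("label-x"));  // 0
  }
}
{code}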



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app

2021-09-07 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410967#comment-17410967
 ] 

Hadoop QA commented on YARN-8958:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  8s{color} 
| {color:red}{color} | {color:red} YARN-8958 does not apply to trunk. Rebase 
required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for 
help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-8958 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12946245/YARN-8958.002.patch |
| Console output | 
https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1203/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |


This message was automatically generated.



> Schedulable entities leak in fair ordering policy when recovering containers 
> between remove app attempt and remove app
> --
>
> Key: YARN-8958
> URL: https://issues.apache.org/jira/browse/YARN-8958
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8958.001.patch, YARN-8958.002.patch
>
>
> We found an NPE in ClientRMService#getApplications when querying apps with a 
> specified queue. The cause is an app that can't be found by calling 
> RMContextImpl#getRMApps (it is finished and has been swapped out of memory) 
> but can still be queried from the fair ordering policy.
> To reproduce the schedulable entities leak in the fair ordering policy:
> (1) create app1 and launch container1 on node1
> (2) restart RM
> (3) remove app1's attempt; app1 is removed from the schedulable entities.
> (4) recover container1 after node1 reconnects to the RM; the state of 
> container1 changes to COMPLETED, app1 is brought back into entitiesToReorder 
> after the container is released, and app1 is then added back into the 
> schedulable entities when the scheduler calls 
> FairOrderingPolicy#getAssignmentIterator.
> (5) remove app1
> To solve this problem, we should make sure schedulableEntities can only be 
> affected by adding or removing an app attempt; a new entity should not be 
> added into schedulableEntities by the reordering process.
> {code:java}
>   protected void reorderSchedulableEntity(S schedulableEntity) {
> //remove, update comparable data, and reinsert to update position in order
> schedulableEntities.remove(schedulableEntity);
> updateSchedulingResourceUsage(
>   schedulableEntity.getSchedulingResourceUsage());
> schedulableEntities.add(schedulableEntity);
>   }
> {code}
> The code above can be improved as follows to make sure only an existing 
> entity can be re-added into schedulableEntities.
> {code:java}
>   protected void reorderSchedulableEntity(S schedulableEntity) {
> //remove, update comparable data, and reinsert to update position in order
> boolean exists = schedulableEntities.remove(schedulableEntity);
> updateSchedulingResourceUsage(
>   schedulableEntity.getSchedulingResourceUsage());
> if (exists) {
>   schedulableEntities.add(schedulableEntity);
> } else {
>   LOG.info("Skip reordering non-existent schedulable entity: "
>   + schedulableEntity.getId());
> }
>   }
> {code}
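A standalone demonstration of the leak mechanism described above, using a
plain TreeSet in place of the policy's comparator-ordered schedulableEntities
set:

{code:java}
import java.util.TreeSet;

public class ReorderLeakDemo {
  public static void main(String[] args) {
    TreeSet<String> schedulableEntities = new TreeSet<>();
    schedulableEntities.add("app1");
    schedulableEntities.remove("app1");   // removeApplicationAttempt already ran

    // Buggy reorder: remove (a no-op here) and unconditionally re-add.
    schedulableEntities.remove("app1");
    schedulableEntities.add("app1");      // app1 resurrected -> the leak

    System.out.println(schedulableEntities);  // prints [app1]
  }
}
{code}

With the guarded version in the second snippet above, the re-add only happens
when the remove actually found the entity, so a fully removed app stays
removed.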



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org