[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-06-19 Thread K G Bakthavachalam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867599#comment-16867599
 ] 

K G Bakthavachalam commented on YARN-6695:
--

[~Prabhu Joseph] could you provide brief information on how to reproduce this issue?

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Prabhu Joseph
>Priority: Critical
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-6695-002.patch, YARN-6695.001.patch
>
>
> When the RM publishes container events, i.e. when 
> *yarn.rm.system-metrics-publisher.emit-container-events* is enabled, there is 
> a race condition between processing those events and the appFinished event 
> that removes the appId from the collector list, which causes an NPE. See the 
> trace below, where the appId is removed from the collectors first and the 
> corresponding events are only processed afterwards. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
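A minimal, self-contained sketch of that interleaving, using illustrative types rather than the actual TimelineCollectorManager/TimelineServiceV2Publisher classes:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch only -- not the real Hadoop classes. It reproduces the
 * ordering seen in the log above: the application's collector is removed first
 * (appFinished), and a queued YARN_CONTAINER event is dispatched afterwards,
 * so the publisher dereferences a null collector and throws
 * NullPointerException.
 */
public class CollectorRaceSketch {
  private static final Map<String, StringBuilder> collectors = new ConcurrentHashMap<>();

  public static void main(String[] args) throws InterruptedException {
    String appId = "application_1496758895643_0005";
    collectors.put(appId, new StringBuilder());

    // appFinished path: the collector service for the application is removed.
    Thread appFinished = new Thread(() -> collectors.remove(appId));
    appFinished.start();
    appFinished.join();

    // Async dispatcher: a container event queued earlier is still processed.
    Thread publisher = new Thread(() -> {
      StringBuilder collector = collectors.get(appId); // null after removal
      collector.append("YARN_CONTAINER entity");       // -> NullPointerException
    });
    publisher.start();
    publisher.join();
  }
}
{code}

One way to avoid the NPE is to null-check the looked-up collector and drop (or just log) container events that arrive after the application's collector has already been removed.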






[jira] [Commented] (YARN-9448) Fix Opportunistic Scheduling for node local allocations.

2019-06-19 Thread K G Bakthavachalam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867516#comment-16867516
 ] 

K G Bakthavachalam commented on YARN-9448:
--

[~abmodi] could you provide brief information on how to reproduce the issue?

> Fix Opportunistic Scheduling for node local allocations.
> 
>
> Key: YARN-9448
> URL: https://issues.apache.org/jira/browse/YARN-9448
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9448.001.patch, YARN-9448.002.patch, 
> YARN-9448.003.patch, YARN-9448.004.patch
>
>
> Right now, an opportunistic container might not get allocated on a rack-local 
> node even if one is available.
> Nodes are currently blacklisted if any container other than a node-local 
> container is allocated on them. So if a container was previously allocated on 
> a node, that node is not considered at all, even when there is an ask for a 
> node-local request on it. 
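A minimal sketch of the behaviour described above, with illustrative names rather than the actual allocator code: a blanket blacklist check skips the node for every ask, whereas an explicit node-local ask for that node should still be honoured.

{code:java}
import java.util.List;
import java.util.Set;

/** Illustrative sketch only -- not the real opportunistic allocator classes. */
final class OpportunisticNodePickSketch {

  /** Current behaviour: a blacklisted node is never considered, so the node-local ask is lost. */
  static String pickNodeCurrent(List<String> nodes, Set<String> blacklist, String requestedHost) {
    for (String node : nodes) {
      if (blacklist.contains(node)) {
        continue;                       // skips the node even if requestedHost equals node
      }
      return node;
    }
    return null;                        // no allocation
  }

  /** Intended behaviour: an explicit node-local ask overrides the blacklist. */
  static String pickNodeFixed(List<String> nodes, Set<String> blacklist, String requestedHost) {
    for (String node : nodes) {
      if (node.equals(requestedHost)) {
        return node;                    // honour the node-local request
      }
      if (!blacklist.contains(node)) {
        return node;
      }
    }
    return null;
  }
}
{code}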






[jira] [Commented] (YARN-9435) Add Opportunistic Scheduler metrics in ResourceManager.

2019-05-13 Thread K G Bakthavachalam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839076#comment-16839076
 ] 

K G Bakthavachalam commented on YARN-9435:
--

[~abmodi]
destroy is not handled on the RM side, so when the RM is manually transitioned 
from standby to active, the metrics never get re-registered: INSTANCE is 
already non-null (and isInitialized is already true), so the null check below 
never passes again.

{code:java}
public static OpportunisticSchedulerMetrics getMetrics() {
  if (!isInitialized.get()) {
    synchronized (OpportunisticSchedulerMetrics.class) {
      if (INSTANCE == null) {
        INSTANCE = new OpportunisticSchedulerMetrics();
        registerMetrics();
        isInitialized.set(true);
      }
    }
  }
  return INSTANCE;
}
{code}
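A self-contained sketch of the point above, with an illustrative class name rather than the real OpportunisticSchedulerMetrics: once INSTANCE exists, the null check can never pass again, so a standby-to-active transition never re-registers the metrics source unless some reset hook (missing today) clears the instance.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Illustrative sketch only -- not the real RM classes.  registerMetrics() is
 * reached only on the very first getMetrics() call; without a reset such as
 * destroyMetrics(), INSTANCE survives the standby-to-active transition and the
 * metrics source is never registered again.
 */
final class MetricsSingletonSketch {
  private static volatile MetricsSingletonSketch INSTANCE;
  private static final AtomicBoolean isInitialized = new AtomicBoolean(false);

  static MetricsSingletonSketch getMetrics() {
    if (!isInitialized.get()) {
      synchronized (MetricsSingletonSketch.class) {
        if (INSTANCE == null) {
          INSTANCE = new MetricsSingletonSketch();
          registerMetrics();            // only ever reached once
          isInitialized.set(true);
        }
      }
    }
    return INSTANCE;
  }

  /** The kind of reset hook the comment above says is missing on the RM side. */
  static void destroyMetrics() {
    synchronized (MetricsSingletonSketch.class) {
      // unregister the metrics source here, then allow re-initialization
      INSTANCE = null;
      isInitialized.set(false);
    }
  }

  private static void registerMetrics() {
    System.out.println("metrics source registered");
  }
}
{code}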


> Add Opportunistic Scheduler metrics in ResourceManager.
> ---
>
> Key: YARN-9435
> URL: https://issues.apache.org/jira/browse/YARN-9435
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9435.001.patch, YARN-9435.002.patch, 
> YARN-9435.003.patch, YARN-9435.004.patch
>
>
> # Right now there are no metrics available for Opportunistic Scheduler at 
> ResourceManager. As part of this jira, we will add metrics like number of 
> allocated opportunistic containers, released opportunistic containers, node 
> level allocations, rack level allocations etc. for Opportunistic Scheduler.
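A hedged sketch of what such a metrics source could look like, using the standard Hadoop metrics2 annotations; the class and field names below are illustrative assumptions, not the actual YARN-9435 patch.

{code:java}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;

// Illustrative sketch only -- names and structure are assumptions.
@Metrics(about = "Opportunistic Scheduler Metrics", context = "yarn")
public class OpportunisticSchedulerMetricsSketch {

  @Metric("# of allocated opportunistic containers")
  MutableCounterLong allocatedOpportunisticContainers;

  @Metric("# of released opportunistic containers")
  MutableCounterLong releasedOpportunisticContainers;

  @Metric("# of node-local opportunistic allocations")
  MutableCounterLong nodeLocalAllocations;

  @Metric("# of rack-local opportunistic allocations")
  MutableCounterLong rackLocalAllocations;

  static OpportunisticSchedulerMetricsSketch register() {
    // Registering the annotated source populates the @Metric fields.
    return DefaultMetricsSystem.instance().register(
        "OpportunisticSchedulerMetricsSketch", "Opportunistic scheduler metrics",
        new OpportunisticSchedulerMetricsSketch());
  }

  void containerAllocated(boolean nodeLocal) {
    allocatedOpportunisticContainers.incr();
    if (nodeLocal) {
      nodeLocalAllocations.incr();
    }
  }
}
{code}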






[jira] [Updated] (YARN-9435) Add Opportunistic Scheduler metrics in ResourceManager.

2019-05-08 Thread K G Bakthavachalam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

K G Bakthavachalam updated YARN-9435:
-
Description: # Right now there are no metrics available for Opportunistic 
Scheduler at ResourceManager. As part of this jira, we will add metrics like 
number of allocated opportunistic containers, released opportunistic 
containers, node level allocations, rack level allocations etc. for 
Opportunistic Scheduler.  (was: Right now there are no metrics available for 
Opportunistic Scheduler at ResourceManager. As part of this jira, we will add 
metrics like number of allocated opportunistic containers, released 
opportunistic containers, node level allocations, rack level allocations etc. 
for Opportunistic Scheduler.)

> Add Opportunistic Scheduler metrics in ResourceManager.
> ---
>
> Key: YARN-9435
> URL: https://issues.apache.org/jira/browse/YARN-9435
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9435.001.patch, YARN-9435.002.patch, 
> YARN-9435.003.patch, YARN-9435.004.patch
>
>
> # Right now there are no metrics available for Opportunistic Scheduler at 
> ResourceManager. As part of this jira, we will add metrics like number of 
> allocated opportunistic containers, released opportunistic containers, node 
> level allocations, rack level allocations etc. for Opportunistic Scheduler.






[jira] [Commented] (YARN-9039) App ACLs are not validated when serving logs from LogWebService

2019-02-18 Thread K G Bakthavachalam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16770947#comment-16770947
 ] 

K G Bakthavachalam commented on YARN-9039:
--

[~suma.shivaprasad] 
Are there any updates on this jira?

> App ACLs are not validated when serving logs from LogWebService
> ---
>
> Key: YARN-9039
> URL: https://issues.apache.org/jira/browse/YARN-9039
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-9039.1.patch, YARN-9039.2.patch, YARN-9039.3.patch
>
>
> App ACLs are not being validated while serving logs through REST and UI2 via 
> the Log Webservice.






[jira] [Updated] (YARN-5566) Client-side NM graceful decom is not triggered when jobs finish

2018-10-27 Thread K G Bakthavachalam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

K G Bakthavachalam updated YARN-5566:
-
Description: 
I was testing the client-side NM graceful decommission and noticed that it was 
always waiting for the timeout, even if all jobs running on that node (or even 
the cluster) had already finished.

For example:
# JobA is running with at least one container on NodeA
# User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours --> 
NodeA enters DECOMMISSIONING state
# JobA finishes at 6:00am and there are no other jobs running on NodeA
# User's client reaches the timeout at 8:00am, and forcibly decommissions NodeA

NodeA should have decommissioned at 6:00am.

  was:
I was testing the client-side NM graceful decommission and noticed that it was 
always waiting for the timeout, even if all jobs running on that node (or even 
the cluster) had already finished.

For example:
# JobA is running with at least one container on NodeA
# User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours --> 
NodeA enters DECOMMISSIONING state
# JobA finishes at 6:00am and there are no other jobs running on NodeA
# User's client reaches the timeout at 8:00am, and forcibly decommissions NodeA

NodeA should have decommissioned at 6:00am.


> Client-side NM graceful decom is not triggered when jobs finish
> ---
>
> Key: YARN-5566
> URL: https://issues.apache.org/jira/browse/YARN-5566
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Fix For: 2.8.0, 3.0.0-alpha2
>
> Attachments: YARN-5566.001.patch, YARN-5566.002.patch, 
> YARN-5566.003.patch, YARN-5566.004.branch-2.8.addendum.patch, 
> YARN-5566.004.branch-2.8.patch, YARN-5566.004.patch
>
>
> I was testing the client-side NM graceful decommission and noticed that it 
> was always waiting for the timeout, even if all jobs running on that node (or 
> even the cluster) had already finished.
> For example:
> # JobA is running with at least one container on NodeA
> # User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours 
> --> NodeA enters DECOMMISSIONING state
> # JobA finishes at 6:00am and there are no other jobs running on NodeA
> # User's client reaches the timeout at 8:00am, and forcibly decommissions 
> NodeA
> NodeA should have decommissioned at 6:00am.






[jira] [Updated] (YARN-5566) Client-side NM graceful decom is not triggered when jobs finish

2018-10-27 Thread K G Bakthavachalam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

K G Bakthavachalam updated YARN-5566:
-
Description: 
I was testing the client-side NM graceful decommission and noticed that it was 
always waiting for the timeout, even if all jobs running on that node (or even 
the cluster) had already finished.

For example:
# JobA is running with at least one container on NodeA
# User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours --> 
NodeA enters DECOMMISSIONING state
# JobA finishes at 6:00am and there are no other jobs running on NodeA
# User's client reaches the timeout at 8:00am, and forcibly decommissions NodeA

NodeA should have decommissioned at 6:00am.

  was:
I was testing the client-side NM graceful decommission and noticed that it was 
always waiting for the timeout, even if all jobs running on that node (or even 
the cluster) had already finished.

For example:
# JobA is running with at least one container on NodeA
# User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours --> 
NodeA enters DECOMMISSIONING state
# JobA finishes at 6:00am and there are no other jobs running on NodeA
# User's client reaches the timeout at 8:00am, and forcibly decommissions NodeA

NodeA should have decommissioned at 6:00am.


> Client-side NM graceful decom is not triggered when jobs finish
> ---
>
> Key: YARN-5566
> URL: https://issues.apache.org/jira/browse/YARN-5566
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Fix For: 2.8.0, 3.0.0-alpha2
>
> Attachments: YARN-5566.001.patch, YARN-5566.002.patch, 
> YARN-5566.003.patch, YARN-5566.004.branch-2.8.addendum.patch, 
> YARN-5566.004.branch-2.8.patch, YARN-5566.004.patch
>
>
> I was testing the client-side NM graceful decommission and noticed that it 
> was always waiting for the timeout, even if all jobs running on that node (or 
> even the cluster) had already finished.
> For example:
> # JobA is running with at least one container on NodeA
> # User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours 
> --> NodeA enters DECOMMISSIONING state
> # JobA finishes at 6:00am and there are no other jobs running on NodeA
> # User's client reaches the timeout at 8:00am, and forcibly decommissions 
> NodeA
> NodeA should have decommissioned at 6:00am.
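A minimal sketch of the expected client-side behaviour, with illustrative names (this is not the actual rmadmin graceful-decommission code): the wait loop should also exit as soon as the decommissioning node has no running containers, instead of only when the timeout expires.

{code:java}
/** Illustrative sketch only -- not the real YARN client code. */
final class GracefulDecomWaitSketch {

  /** Minimal view of the node, assumed for the sketch. */
  interface NodeProbe {
    int runningContainers();
    void forceDecommission();
  }

  static void waitForGracefulDecom(NodeProbe node, long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (node.runningContainers() == 0) {
        node.forceDecommission();   // expected early exit (missing in the buggy behaviour)
        return;
      }
      Thread.sleep(1000L);
    }
    node.forceDecommission();       // fallback when the timeout expires
  }
}
{code}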






[jira] [Updated] (YARN-5566) Client-side NM graceful decom is not triggered when jobs finish

2018-10-27 Thread K G Bakthavachalam (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

K G Bakthavachalam updated YARN-5566:
-
Description: 
I was testing the client-side NM graceful decommission and noticed that it was 
always waiting for the timeout, even if all jobs running on that node (or even 
the cluster) had already finished.

For example:
# JobA is running with at least one container on NodeA
# User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours --> 
NodeA enters DECOMMISSIONING state
# JobA finishes at 6:00am and there are no other jobs running on NodeA
# User's client reaches the timeout at 8:00am, and forcibly decommissions NodeA

NodeA should have decommissioned at 6:00am.

  was:
I was testing the client-side NM graceful decommission and noticed that it was 
always waiting for the timeout, even if all jobs running on that node (or even 
the cluster) had already finished.

For example:
# JobA is running with at least one container on NodeA
# User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours --> 
NodeA enters DECOMMISSIONING state
# JobA finishes at 6:00am and there are no other jobs running on NodeA
# User's client reaches the timeout at 8:00am, and forcibly decommissions NodeA

NodeA should have decommissioned at 6:00am.


> Client-side NM graceful decom is not triggered when jobs finish
> ---
>
> Key: YARN-5566
> URL: https://issues.apache.org/jira/browse/YARN-5566
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Fix For: 2.8.0, 3.0.0-alpha2
>
> Attachments: YARN-5566.001.patch, YARN-5566.002.patch, 
> YARN-5566.003.patch, YARN-5566.004.branch-2.8.addendum.patch, 
> YARN-5566.004.branch-2.8.patch, YARN-5566.004.patch
>
>
> I was testing the client-side NM graceful decommission and noticed that it 
> was always waiting for the timeout, even if all jobs running on that node (or 
> even the cluster) had already finished.
> For example:
> # JobA is running with at least one container on NodeA
> # User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours 
> --> NodeA enters DECOMMISSIONING state
> # JobA finishes at 6:00am and there are no other jobs running on NodeA
> # User's client reaches the timeout at 8:00am, and forcibly decommissions 
> NodeA
> NodeA should have decommissioned at 6:00am.


