[jira] [Created] (YARN-10417) Issues in Script Based Node Attribute Error Handling

2020-08-31 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10417:


 Summary: Issues in Script Based Node Attribute Error Handling
 Key: YARN-10417
 URL: https://issues.apache.org/jira/browse/YARN-10417
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodeattibute
Reporter: Prabhu Joseph
Assignee: Tanu Ajmera


Below issues are seen in Script Based Node Attribute error handling:

1. The expected format printed in the log shows a double colon *::*, but the 
correct format uses only a single colon.

2. The log message *Execution of Node Labels script* is wrong; this code path 
runs the Node Attributes script, so the message should say so.

{code}
2020-08-31 09:09:34,649 WARN 
org.apache.hadoop.yarn.server.nodemanager.nodelabels.NodeDescriptorsScriptRunner:
 Execution of Node Labels script failed, Caught exception : Malformed output, 
expecting format NODE_ATTRIBUTE::ATTRIBUTE_NAME,ATTRIBUTE_TYPE,ATTRIBUTE_VALUE; 
but get HostGroup:STRING:compute
{code}
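
For illustration (assuming the single-colon format described in point 1), a valid 
output line from the node attributes script for the host group above would be:

{code}
NODE_ATTRIBUTE:HostGroup,STRING,compute
{code}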






[jira] [Updated] (YARN-1806) webUI update to allow end users to request thread dump

2020-08-26 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-1806:

Fix Version/s: 3.4.0

> webUI update to allow end users to request thread dump
> --
>
> Key: YARN-1806
> URL: https://issues.apache.org/jira/browse/YARN-1806
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ming Ma
>Assignee: Siddharth Ahuja
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-1806.001.patch
>
>
> Both the individual container page and the containers page will support this. 
> After the end user clicks on the request link, they can follow it to the stdout 
> page for the thread dump content.






[jira] [Commented] (YARN-1806) webUI update to allow end users to request thread dump

2020-08-26 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185001#comment-17185001
 ] 

Prabhu Joseph commented on YARN-1806:
-

This is very useful for debugging. Thanks [~sahuja] for the patch and 
[~akhilpb] for the review.

Have committed the patch to trunk.

> webUI update to allow end users to request thread dump
> --
>
> Key: YARN-1806
> URL: https://issues.apache.org/jira/browse/YARN-1806
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ming Ma
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: YARN-1806.001.patch
>
>
> Both the individual container page and the containers page will support this. 
> After the end user clicks on the request link, they can follow it to the stdout 
> page for the thread dump content.






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-08-24 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183261#comment-17183261
 ] 

Prabhu Joseph commented on YARN-10352:
--

Thanks [~bibinchundatt] for the review. Can you commit the patch when you get 
time? Thanks.

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch
>
>
> When Node Recovery is enabled, stopping an NM won't unregister it from the RM. 
> So the RM Active Nodes list will still contain those stopped nodes until the 
> NM Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 mins, 
> Multi Node Placement assigns containers on those nodes. It needs to exclude 
> nodes which have not heartbeated within the configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to 
> the Asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> enabled (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out as they are assigned to the 
> stopped NM worker0.






[jira] [Commented] (YARN-10360) Support Multi Node Placement in SingleConstraintAppPlacementAllocator

2020-08-24 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183147#comment-17183147
 ] 

Prabhu Joseph commented on YARN-10360:
--

Thanks [~sunilg], have committed the  [^YARN-10360-002.patch]  to trunk.

> Support Multi Node Placement in SingleConstraintAppPlacementAllocator
> -
>
> Key: YARN-10360
> URL: https://issues.apache.org/jira/browse/YARN-10360
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, multi-node-placement
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10360-001.patch, YARN-10360-002.patch
>
>
> Currently, placement constraints are not supported when Multi Node Placement 
> is enabled. This Jira is to add Support for Multi Node Placement in 
> SingleConstraintAppPlacementAllocator.






[jira] [Commented] (YARN-10360) Support Multi Node Placement in SingleConstraintAppPlacementAllocator

2020-08-23 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182958#comment-17182958
 ] 

Prabhu Joseph commented on YARN-10360:
--

Thanks [~sunilg] for the review.

Testcase failures are existing ones and tracked by YARN-9333 and YARN-9587.

> Support Multi Node Placement in SingleConstraintAppPlacementAllocator
> -
>
> Key: YARN-10360
> URL: https://issues.apache.org/jira/browse/YARN-10360
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, multi-node-placement
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10360-001.patch, YARN-10360-002.patch
>
>
> Currently, placement constraints are not supported when Multi Node Placement 
> is enabled. This Jira is to add Support for Multi Node Placement in 
> SingleConstraintAppPlacementAllocator.






[jira] [Commented] (YARN-10389) Option to override RMWebServices with custom WebService class

2020-08-11 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175554#comment-17175554
 ] 

Prabhu Joseph commented on YARN-10389:
--

Have committed  [^YARN-10389-008.patch]  to trunk.

> Option to override RMWebServices with custom WebService class
> -
>
> Key: YARN-10389
> URL: https://issues.apache.org/jira/browse/YARN-10389
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Tanu Ajmera
>Priority: Major
> Attachments: YARN-10389-001.patch, YARN-10389-002.patch, 
> YARN-10389-003.patch, YARN-10389-004.patch, YARN-10389-005.patch, 
> YARN-10389-006.patch, YARN-10389-007.patch, YARN-10389-008.patch
>
>
> YARN-8047 provides support to add custom WebServices as part of RMWebApp.  
> Since each WebService has to have a separate WebService Path, /ws/v1/cluster 
> root path cannot be used globally.
> Another alternative is to provide an option to override the RMWebServices 
> with custom WebServices implementation which can extend the RMWebService, 
> this way /ws/v1/cluster path can be used globally.






[jira] [Commented] (YARN-10389) Option to override RMWebServices with custom WebService class

2020-08-11 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175401#comment-17175401
 ] 

Prabhu Joseph commented on YARN-10389:
--

Thanks [~tanu.ajmera], the latest patch  [^YARN-10389-008.patch]  looks good. 
Will commit after jenkins result.

Thanks [~BilwaST] and [~sunilg] for the review. 


> Option to override RMWebServices with custom WebService class
> -
>
> Key: YARN-10389
> URL: https://issues.apache.org/jira/browse/YARN-10389
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Tanu Ajmera
>Priority: Major
> Attachments: YARN-10389-001.patch, YARN-10389-002.patch, 
> YARN-10389-003.patch, YARN-10389-004.patch, YARN-10389-005.patch, 
> YARN-10389-006.patch, YARN-10389-007.patch, YARN-10389-008.patch
>
>
> YARN-8047 provides support to add custom WebServices as part of RMWebApp.  
> Since each WebService has to have a separate WebService Path, /ws/v1/cluster 
> root path cannot be used globally.
> Another alternative is to provide an option to override the RMWebServices 
> with custom WebServices implementation which can extend the RMWebService, 
> this way /ws/v1/cluster path can be used globally.






[jira] [Comment Edited] (YARN-10389) Option to override RMWebServices with custom WebService class

2020-08-10 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174941#comment-17174941
 ] 

Prabhu Joseph edited comment on YARN-10389 at 8/10/20, 5:32 PM:


[~tanu.ajmera] Thanks for the patch. The patch looks good. One minor issue 
which is not related to this patch.

1. The below null check is not required 

{code}
 bindExternalClasses();
 if (rm != null)
{code}

rm gets accessed inside bindExternalClasses even before the null check, so there 
is no use in having the null check.

{code}
  private void bindExternalClasses() {
YarnConfiguration yarnConf = new YarnConfiguration(rm.getConfig());
{code}
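
To make the ordering concrete, here is a minimal, self-contained sketch (names are 
stand-ins, not the actual RMWebApp source) of why the guard can never be false:

{code:java}
public class DeadNullCheckSketch {
  private Object rm;                 // stand-in for the ResourceManager field

  private void bindExternalClasses() {
    rm.toString();                   // dereferences rm: throws NPE when rm is null
  }

  void setup() {
    bindExternalClasses();           // would already have thrown if rm were null
    if (rm != null) {                // so this check is always true when reached
      System.out.println("always reached when setup() returns normally");
    }
  }
}
{code}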




was (Author: prabhu joseph):
[~tanu.ajmera] Thanks for the patch. The patch looks good. One minor issue 
which is not related to this patch.

1. The below null check is not required 

{code}
 bindExternalClasses();
 if (rm != null)
{code}

rm gets accessed even before the null check from bindExternalClasses, so no use 
having the null check.

{code}
  private void bindExternalClasses() {
YarnConfiguration yarnConf = new YarnConfiguration(rm.getConfig());
{code}



> Option to override RMWebServices with custom WebService class
> -
>
> Key: YARN-10389
> URL: https://issues.apache.org/jira/browse/YARN-10389
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Tanu Ajmera
>Priority: Major
> Attachments: YARN-10389-001.patch, YARN-10389-002.patch, 
> YARN-10389-003.patch, YARN-10389-004.patch, YARN-10389-005.patch, 
> YARN-10389-006.patch, YARN-10389-007.patch
>
>
> YARN-8047 provides support to add custom WebServices as part of RMWebApp.  
> Since each WebService has to have a separate WebService Path, /ws/v1/cluster 
> root path cannot be used globally.
> Another alternative is to provide an option to override the RMWebServices 
> with custom WebServices implementation which can extend the RMWebService, 
> this way /ws/v1/cluster path can be used globally.






[jira] [Commented] (YARN-10389) Option to override RMWebServices with custom WebService class

2020-08-10 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174941#comment-17174941
 ] 

Prabhu Joseph commented on YARN-10389:
--

[~tanu.ajmera] Thanks for the patch. The patch looks good. One minor issue 
which is not related to this patch.

1. The below null check is not required 

{code}
 bindExternalClasses();
 if (rm != null)
{code}

rm gets accessed inside bindExternalClasses even before the null check, so there 
is no use in having the null check.

{code}
  private void bindExternalClasses() {
YarnConfiguration yarnConf = new YarnConfiguration(rm.getConfig());
{code}



> Option to override RMWebServices with custom WebService class
> -
>
> Key: YARN-10389
> URL: https://issues.apache.org/jira/browse/YARN-10389
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Tanu Ajmera
>Priority: Major
> Attachments: YARN-10389-001.patch, YARN-10389-002.patch, 
> YARN-10389-003.patch, YARN-10389-004.patch, YARN-10389-005.patch, 
> YARN-10389-006.patch, YARN-10389-007.patch
>
>
> YARN-8047 provides support to add custom WebServices as part of RMWebApp.  
> Since each WebService has to have a separate WebService Path, /ws/v1/cluster 
> root path cannot be used globally.
> Another alternative is to provide an option to override the RMWebServices 
> with custom WebServices implementation which can extend the RMWebService, 
> this way /ws/v1/cluster path can be used globally.






[jira] [Commented] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type

2020-08-08 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173691#comment-17173691
 ] 

Prabhu Joseph commented on YARN-10364:
--

Have pushed the patch to trunk.

> Absolute Resource [memory=0] is considered as Percentage config type
> 
>
> Key: YARN-10364
> URL: https://issues.apache.org/jira/browse/YARN-10364
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10364.001.patch, YARN-10364.002.patch, 
> YARN-10364.003.patch
>
>
> Absolute Resource [memory=0] is considered as Percentage config type. This 
> causes failure while converting queues from Percentage to Absolute Resources 
> automatically. 
> *Repro:*
> 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100%
> 2. While converting above to absolute resource automatically, capacity of 
> queue A = [memory=], A.B = [memory=0]
> This fails with below as A is considered as Absolute Resource whereas B is 
> considered as Percentage config type.
> {code}
> 2020-07-23 09:36:40,499 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Failed 
> to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should 
> use either percentage based capacityconfiguration or absolute resource 
> together for label:
> {code}






[jira] [Commented] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type

2020-08-08 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173676#comment-17173676
 ] 

Prabhu Joseph commented on YARN-10364:
--

Thanks [~sunilg] for the review. Will commit the patch shortly.


> Absolute Resource [memory=0] is considered as Percentage config type
> 
>
> Key: YARN-10364
> URL: https://issues.apache.org/jira/browse/YARN-10364
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10364.001.patch, YARN-10364.002.patch, 
> YARN-10364.003.patch
>
>
> Absolute Resource [memory=0] is considered as Percentage config type. This 
> causes failure while converting queues from Percentage to Absolute Resources 
> automatically. 
> *Repro:*
> 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100%
> 2. While converting above to absolute resource automatically, capacity of 
> queue A = [memory=], A.B = [memory=0]
> This fails with below as A is considered as Absolute Resource whereas B is 
> considered as Percentage config type.
> {code}
> 2020-07-23 09:36:40,499 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Failed 
> to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should 
> use either percentage based capacityconfiguration or absolute resource 
> together for label:
> {code}






[jira] [Created] (YARN-10389) Option to override RMWebServices with custom WebService class

2020-08-06 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10389:


 Summary: Option to override RMWebServices with custom WebService 
class
 Key: YARN-10389
 URL: https://issues.apache.org/jira/browse/YARN-10389
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 3.4.0
Reporter: Prabhu Joseph
Assignee: Tanu Ajmera


YARN-8047 provides support to add custom WebServices as part of RMWebApp. 
Since each WebService has to have a separate WebService path, the /ws/v1/cluster 
root path cannot be used globally.

An alternative is to provide an option to override RMWebServices with a custom 
WebServices implementation which can extend RMWebServices; this way the 
/ws/v1/cluster path can be used globally.
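
A rough sketch of the kind of override being proposed is below. This is only an 
illustration: the class name is hypothetical and the constructor signature is 
assumed to mirror the injectable RMWebServices constructor, not taken from any 
attached patch.

{code:java}
import com.google.inject.Inject;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.server.resourcemanager.ResourceManager;
import org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices;

// Hypothetical subclass; not part of the attached patches.
public class CustomRMWebServices extends RMWebServices {

  @Inject
  public CustomRMWebServices(final ResourceManager rm, Configuration conf) {
    super(rm, conf); // assumed super constructor
  }

  // Extra or overridden endpoints would go here, so the existing
  // /ws/v1/cluster root path keeps serving both stock and custom resources.
}
{code}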











[jira] [Commented] (YARN-10361) Make custom DAO classes configurable into RMWebApp#JAXBContextResolver

2020-08-06 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172109#comment-17172109
 ] 

Prabhu Joseph commented on YARN-10361:
--

Have committed the  [^YARN-10361.003.patch]  to trunk.

> Make custom DAO classes configurable into RMWebApp#JAXBContextResolver
> --
>
> Key: YARN-10361
> URL: https://issues.apache.org/jira/browse/YARN-10361
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10361.001.patch, YARN-10361.002.patch, 
> YARN-10361.003.patch
>
>
> YARN-8047 provides support to add custom WebServices as part of RMWebApp. But 
> the custom DAO classes need to be added into JAXBContextResolver. This Jira 
> is to make that configurable.






[jira] [Commented] (YARN-10361) Make custom DAO classes configurable into RMWebApp#JAXBContextResolver

2020-08-06 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172096#comment-17172096
 ] 

Prabhu Joseph commented on YARN-10361:
--

Thanks [~BilwaST] for the patch.

+1, will commit it shortly.

> Make custom DAO classes configurable into RMWebApp#JAXBContextResolver
> --
>
> Key: YARN-10361
> URL: https://issues.apache.org/jira/browse/YARN-10361
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10361.001.patch, YARN-10361.002.patch, 
> YARN-10361.003.patch
>
>
> YARN-8047 provides support to add custom WebServices as part of RMWebApp. But 
> the custom DAO classes need to be added into JAXBContextResolver. This Jira 
> is to make that configurable.






[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-08-04 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Attachment: YARN-10352-008.patch

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch
>
>
> When Node Recovery is enabled, stopping an NM won't unregister it from the RM. 
> So the RM Active Nodes list will still contain those stopped nodes until the 
> NM Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 mins, 
> Multi Node Placement assigns containers on those nodes. It needs to exclude 
> nodes which have not heartbeated within the configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to 
> the Asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> enabled (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out as they are assigned to the 
> stopped NM worker0.






[jira] [Commented] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications

2020-08-04 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170989#comment-17170989
 ] 

Prabhu Joseph commented on YARN-10377:
--

Thanks [~tarunparimi], have committed the patch to trunk.

> Clicking on queue in Capacity Scheduler legacy ui does not show any 
> applications
> 
>
> Key: YARN-10377
> URL: https://issues.apache.org/jira/browse/YARN-10377
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot 
> 2020-07-29 at 12.01.36 PM.png, YARN-10377.001.patch
>
>
> The issue is in the capacity scheduler 
> [http://rm-host:port/clustter/scheduler] page 
>  If I click on the root queue, I am able to see the applications.
>  !Screenshot 2020-07-29 at 12.01.28 PM.png!
> But the application disappears when I click on the leaf queue -> default. 
> This issue is not present in the older 2.7.0 versions and I am able to see 
> apps normally filtered by the leaf queue when clicking on it.
> !Screenshot 2020-07-29 at 12.01.36 PM.png!






[jira] [Updated] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call

2020-08-04 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10381:
-
Fix Version/s: 3.4.0

> Send out application attempt state along with other elements in the 
> application attempt object returned from appattempts REST API call
> --
>
> Key: YARN-10381
> URL: https://issues.apache.org/jira/browse/YARN-10381
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-ui-v2
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: YARN-10381.001.patch, YARN-10381.002.patch, 
> YARN-10381.003.patch
>
>
> The [ApplicationAttempts RM REST 
> API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API]
>  :
> {code}
> http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts
> {code}
> returns a collection of Application Attempt objects, where each application 
> attempt object contains elements like id, nodeId, startTime etc.
> This JIRA has been raised to send out the Application Attempt state as well, as 
> part of the application attempt information returned from this REST API call.






[jira] [Commented] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call

2020-08-04 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170985#comment-17170985
 ] 

Prabhu Joseph commented on YARN-10381:
--

Have committed the patch to trunk. 

> Send out application attempt state along with other elements in the 
> application attempt object returned from appattempts REST API call
> --
>
> Key: YARN-10381
> URL: https://issues.apache.org/jira/browse/YARN-10381
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-ui-v2
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Attachments: YARN-10381.001.patch, YARN-10381.002.patch, 
> YARN-10381.003.patch
>
>
> The [ApplicationAttempts RM REST 
> API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API]
>  :
> {code}
> http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts
> {code}
> returns a collection of Application Attempt objects, where each application 
> attempt object contains elements like id, nodeId, startTime etc.
> This JIRA has been raised to send out the Application Attempt state as well, as 
> part of the application attempt information returned from this REST API call.






[jira] [Commented] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call

2020-08-04 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170978#comment-17170978
 ] 

Prabhu Joseph commented on YARN-10381:
--

Thanks [~sahuja] for the patch and [~BilwaST] for the review.

The patch looks good, +1. Will commit it shortly.

> Send out application attempt state along with other elements in the 
> application attempt object returned from appattempts REST API call
> --
>
> Key: YARN-10381
> URL: https://issues.apache.org/jira/browse/YARN-10381
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-ui-v2
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Attachments: YARN-10381.001.patch, YARN-10381.002.patch, 
> YARN-10381.003.patch
>
>
> The [ApplicationAttempts RM REST 
> API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API]
>  :
> {code}
> http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts
> {code}
> returns a collection of Application Attempt objects, where each application 
> attempt object contains elements like id, nodeId, startTime etc.
> This JIRA has been raised to send out the Application Attempt state as well, as 
> part of the application attempt information returned from this REST API call.






[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-08-04 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Attachment: YARN-10352-007.patch

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch
>
>
> When Node Recovery is enabled, stopping an NM won't unregister it from the RM. 
> So the RM Active Nodes list will still contain those stopped nodes until the 
> NM Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 mins, 
> Multi Node Placement assigns containers on those nodes. It needs to exclude 
> nodes which have not heartbeated within the configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to 
> the Asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> enabled (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out as they are assigned to the 
> stopped NM worker0.






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-08-04 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170676#comment-17170676
 ] 

Prabhu Joseph commented on YARN-10352:
--

Thanks [~bibinchundatt] for reviewing.

bq. The custom iterator how much improvement we have against the 
Iterators.filter ?

Have used a custom iterator mainly to avoid an unnecessary null check required by 
FindBugs when using Iterators.filter with a predicate in [^YARN-10352-002.patch] - 
[Build 
Run|https://issues.apache.org/jira/browse/YARN-10352?focusedCommentId=17161295=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17161295]
 

{code}
node must be non-null but is marked as nullable At 
MultiNodeSortingManager.java:is marked as nullable At 
MultiNodeSortingManager.java:[lines 124-125]
{code}

Was using a Predicate like the one below:

{code}
private Predicate<N> heartbeatFilter = new Predicate<N>() {
  @Override
  public boolean apply(final N node) {
    long timeElapsedFromLastHeartbeat =
        Time.monotonicNow() - node.getLastHeartbeatMonotonicTime();
    return timeElapsedFromLastHeartbeat <= (nmHeartbeatInterval * 2);
  }
};
{code}
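
For reference, a self-contained sketch of how such a predicate plugs into Guava's 
Iterators.filter is below. It is not from any attached patch; it uses a stand-in 
Node type and the same two-heartbeat-interval threshold as the predicate above.

{code:java}
import com.google.common.base.Predicate;
import com.google.common.collect.Iterators;
import java.util.Arrays;
import java.util.Iterator;

public class HeartbeatFilterSketch {
  private static final long NM_HEARTBEAT_INTERVAL_MS = 1000L; // assumed default

  // Stand-in for SchedulerNode with only the field the predicate needs.
  static class Node {
    final long lastHeartbeatMonotonicTime;
    Node(long lastHeartbeatMonotonicTime) {
      this.lastHeartbeatMonotonicTime = lastHeartbeatMonotonicTime;
    }
  }

  public static void main(String[] args) {
    final long now = System.nanoTime() / 1_000_000L; // stand-in for Time.monotonicNow()
    Iterator<Node> nodes = Arrays.asList(
        new Node(now - 500L),      // recent heartbeat -> kept
        new Node(now - 60_000L))   // stale heartbeat  -> filtered out
        .iterator();

    Iterator<Node> live = Iterators.filter(nodes, new Predicate<Node>() {
      @Override
      public boolean apply(final Node node) {
        long elapsed = now - node.lastHeartbeatMonotonicTime;
        return elapsed <= (NM_HEARTBEAT_INTERVAL_MS * 2);
      }
    });

    System.out.println("nodes to schedule on: " + Iterators.size(live)); // prints 1
  }
}
{code}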


Let me know if the custom iterator approach is fine, or whether the FindBugs issue 
can be ignored. Will fix the other two comments. Thanks.


 

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch
>
>
> When Node Recovery is enabled, stopping an NM won't unregister it from the RM. 
> So the RM Active Nodes list will still contain those stopped nodes until the 
> NM Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 mins, 
> Multi Node Placement assigns containers on those nodes. It needs to exclude 
> nodes which have not heartbeated within the configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to 
> the Asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> enabled (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out as they are assigned to the 
> stopped NM worker0.






[jira] [Commented] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications

2020-08-03 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169987#comment-17169987
 ] 

Prabhu Joseph commented on YARN-10377:
--

[~tarunparimi] Thanks for the patch. The patch looks good.

Will commit it tomorrow if no other comments. 

> Clicking on queue in Capacity Scheduler legacy ui does not show any 
> applications
> 
>
> Key: YARN-10377
> URL: https://issues.apache.org/jira/browse/YARN-10377
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot 
> 2020-07-29 at 12.01.36 PM.png, YARN-10377.001.patch
>
>
> The issue is in the capacity scheduler 
> [http://rm-host:port/clustter/scheduler] page 
>  If I click on the root queue, I am able to see the applications.
>  !Screenshot 2020-07-29 at 12.01.28 PM.png!
> But the application disappears when I click on the leaf queue -> default. 
> This issue is not present in the older 2.7.0 versions and I am able to see 
> apps normally filtered by the leaf queue when clicking on it.
> !Screenshot 2020-07-29 at 12.01.36 PM.png!






[jira] [Commented] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type

2020-08-03 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169986#comment-17169986
 ] 

Prabhu Joseph commented on YARN-10364:
--

[~BilwaST] Latest patch looks good to me.

[~sunilg] Can you review the latest patch when you get time. Thanks.

> Absolute Resource [memory=0] is considered as Percentage config type
> 
>
> Key: YARN-10364
> URL: https://issues.apache.org/jira/browse/YARN-10364
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10364.001.patch, YARN-10364.002.patch, 
> YARN-10364.003.patch
>
>
> Absolute Resource [memory=0] is considered as Percentage config type. This 
> causes failure while converting queues from Percentage to Absolute Resources 
> automatically. 
> *Repro:*
> 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100%
> 2. While converting above to absolute resource automatically, capacity of 
> queue A = [memory=], A.B = [memory=0]
> This fails with below as A is considered as Absolute Resource whereas B is 
> considered as Percentage config type.
> {code}
> 2020-07-23 09:36:40,499 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Failed 
> to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should 
> use either percentage based capacityconfiguration or absolute resource 
> together for label:
> {code}






[jira] [Commented] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type

2020-08-03 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169867#comment-17169867
 ] 

Prabhu Joseph commented on YARN-10364:
--

The change looks good, thanks.

> Absolute Resource [memory=0] is considered as Percentage config type
> 
>
> Key: YARN-10364
> URL: https://issues.apache.org/jira/browse/YARN-10364
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10364.001.patch, YARN-10364.002.patch, 
> YARN-10364.003.patch
>
>
> Absolute Resource [memory=0] is considered as Percentage config type. This 
> causes failure while converting queues from Percentage to Absolute Resources 
> automatically. 
> *Repro:*
> 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100%
> 2. While converting above to absolute resource automatically, capacity of 
> queue A = [memory=], A.B = [memory=0]
> This fails with below as A is considered as Absolute Resource whereas B is 
> considered as Percentage config type.
> {code}
> 2020-07-23 09:36:40,499 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Failed 
> to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should 
> use either percentage based capacityconfiguration or absolute resource 
> together for label:
> {code}






[jira] [Commented] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type

2020-08-03 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169812#comment-17169812
 ] 

Prabhu Joseph commented on YARN-10364:
--

Thanks [~BilwaST] for the patch. Have a comment on the section below.

1. AbstractCSQueue#validateAbsoluteVsPercentageCapacityConfig will always 
succeed as it compares localType copied from this.capacityConfigType with the 
same this.capacityConfigType.

It has to compare the type (this.capacityConfigType) of the previous node label 
with the new one derived using the minResource of the next node label, like below:

{code}
private void validateAbsoluteVsPercentageCapacityConfig(
    String queuePath, String label) {
  CapacityConfigType localType =
      checkConfigTypeIsAbsoluteResource(queuePath, label)
          ? CapacityConfigType.ABSOLUTE_RESOURCE
          : CapacityConfigType.PERCENTAGE;
  if (!queuePath.equals("root")
      && !this.capacityConfigType.equals(localType)) {
    throw new IllegalArgumentException("Queue '" + getQueuePath()
        + "' should use either percentage based capacity"
        + " configuration or absolute resource.");
  }
}
{code}  


> Absolute Resource [memory=0] is considered as Percentage config type
> 
>
> Key: YARN-10364
> URL: https://issues.apache.org/jira/browse/YARN-10364
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10364.001.patch, YARN-10364.002.patch
>
>
> Absolute Resource [memory=0] is considered as Percentage config type. This 
> causes failure while converting queues from Percentage to Absolute Resources 
> automatically. 
> *Repro:*
> 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100%
> 2. While converting above to absolute resource automatically, capacity of 
> queue A = [memory=], A.B = [memory=0]
> This fails with below as A is considered as Absolute Resource whereas B is 
> considered as Percentage config type.
> {code}
> 2020-07-23 09:36:40,499 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Failed 
> to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should 
> use either percentage based capacityconfiguration or absolute resource 
> together for label:
> {code}






[jira] [Commented] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call

2020-08-02 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169602#comment-17169602
 ] 

Prabhu Joseph commented on YARN-10381:
--

Thanks [~sahuja] for the patch. Can you update the doc - [ApplicationAttempts 
RM REST 
API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API]
 - Elements of the appAttempt object, JSON and XML response section.
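
For the doc update, the appAttempt JSON example could gain a state element along 
the lines of the snippet below. This is illustrative only: the element name and 
the values shown are assumptions, not the exact output of the patch.

{code}
{
  "appAttempt": [
    {
      "id": 1,
      "nodeId": "worker0:45454",
      "startTime": 1596500000000,
      "appAttemptState": "RUNNING"
    }
  ]
}
{code}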

> Send out application attempt state along with other elements in the 
> application attempt object returned from appattempts REST API call
> --
>
> Key: YARN-10381
> URL: https://issues.apache.org/jira/browse/YARN-10381
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-ui-v2
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Attachments: YARN-10381.001.patch, YARN-10381.002.patch
>
>
> The [ApplicationAttempts RM REST 
> API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API]
>  :
> {code}
> http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts
> {code}
> returns a collection of Application Attempt objects, where each application 
> attempt object contains elements like id, nodeId, startTime etc.
> This JIRA has been raised to send out the Application Attempt state as well, as 
> part of the application attempt information returned from this REST API call.






[jira] [Updated] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-07-31 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10380:
-
Parent: YARN-5139
Issue Type: Sub-task  (was: Improvement)

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Priority: Critical
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is it the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In above logic, if we have thousands of node in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest looking at making entry-point for node-heartbeat, 
> async-scheduling (single node), and async-scheduling (multi-node) to be 
> different.
> Node-heartbeat and async-scheduling (single node) can be still similar and 
> share most of the code. 
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  






[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-07-31 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168893#comment-17168893
 ] 

Prabhu Joseph commented on YARN-10380:
--

[~wangda] Below are the other issues

1. YARN-10357 - Proactively relocate allocated containers from a stopped node

2. Handling difference in CandidateSet v.s. Multi-node sorter - 
https://issues.apache.org/jira/browse/YARN-10352?focusedCommentId=17161696=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17161696

3. NM does not unregister from RM when recovery is enabled. This causes RM to 
unnecessarily allocate on those nodes and later need to relocate (YARN-10357) 
if the nodes have not heartbeated. Relying on heartbeats won't be accurate if 
there are network delays; instead, the NM can unregister with a special flag set.

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Priority: Critical
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is it the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In above logic, if we have thousands of node in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest looking at making entry-point for node-heartbeat, 
> async-scheduling (single node), and async-scheduling (multi-node) to be 
> different.
> Node-heartbeat and async-scheduling (single node) can be still similar and 
> share most of the code. 
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  






[jira] [Commented] (YARN-10360) Support Multi Node Placement in SingleConstraintAppPlacementAllocator

2020-07-27 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165524#comment-17165524
 ] 

Prabhu Joseph commented on YARN-10360:
--

[~tangzhankun] Can you review this Jira when you get time? This adds a Multi Node 
Iterator inside SingleConstraintAppPlacementAllocator to serve 
SchedulingRequest when Multi Node Placement is enabled. Thanks.

> Support Multi Node Placement in SingleConstraintAppPlacementAllocator
> -
>
> Key: YARN-10360
> URL: https://issues.apache.org/jira/browse/YARN-10360
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, multi-node-placement
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10360-001.patch, YARN-10360-002.patch
>
>
> Currently, placement constraints are not supported when Multi Node Placement 
> is enabled. This Jira is to add Support for Multi Node Placement in 
> SingleConstraintAppPlacementAllocator.






[jira] [Updated] (YARN-10360) Support Multi Node Placement in SingleConstraintAppPlacementAllocator

2020-07-27 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10360:
-
Attachment: YARN-10360-002.patch

> Support Multi Node Placement in SingleConstraintAppPlacementAllocator
> -
>
> Key: YARN-10360
> URL: https://issues.apache.org/jira/browse/YARN-10360
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, multi-node-placement
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10360-001.patch, YARN-10360-002.patch
>
>
> Currently, placement constraints are not supported when Multi Node Placement 
> is enabled. This Jira is to add Support for Multi Node Placement in 
> SingleConstraintAppPlacementAllocator.






[jira] [Commented] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type

2020-07-27 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165502#comment-17165502
 ] 

Prabhu Joseph commented on YARN-10364:
--

[~BilwaST] Yes, right. One more place has a similar check, in 
AbstractCSQueue#validateAbsoluteVsPercentageCapacityConfig:

{code}
if (!minResource.equals(Resources.none())) {
  localType = CapacityConfigType.ABSOLUTE_RESOURCE;
}
{code}

Any config which matches RESOURCE_PATTERN like [memory=0] or [vcores=0] has to 
be considered as CapacityConfigType.ABSOLUTE_RESOURCE.
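
A minimal sketch of that suggestion, reusing the checkConfigTypeIsAbsoluteResource 
helper referenced elsewhere on this issue (not the actual patch):

{code:java}
// Classify by the configured pattern rather than by the parsed value, so that
// [memory=0] or [vcores=0] still counts as ABSOLUTE_RESOURCE.
if (checkConfigTypeIsAbsoluteResource(queuePath, label)) {
  localType = CapacityConfigType.ABSOLUTE_RESOURCE;
}
{code}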

Need thorough testing to make sure that considering configs [memory=0] or 
[vcores=0] as CapacityConfigType.ABSOLUTE_RESOURCE does not cause any failures 
for a ParentQueue, LeafQueue, ManagedParentQueue and AutoCreatedLeafQueue.






> Absolute Resource [memory=0] is considered as Percentage config type
> 
>
> Key: YARN-10364
> URL: https://issues.apache.org/jira/browse/YARN-10364
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Bilwa S T
>Priority: Major
>
> Absolute Resource [memory=0] is considered as Percentage config type. This 
> causes failure while converting queues from Percentage to Absolute Resources 
> automatically. 
> *Repro:*
> 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100%
> 2. While converting above to absolute resource automatically, capacity of 
> queue A = [memory=], A.B = [memory=0]
> This fails with below as A is considered as Absolute Resource whereas B is 
> considered as Percentage config type.
> {code}
> 2020-07-23 09:36:40,499 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Failed 
> to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should 
> use either percentage based capacityconfiguration or absolute resource 
> together for label:
> {code}






[jira] [Commented] (YARN-10366) Yarn rmadmin help message shows two labels for one node for --replaceLabelsOnNode

2020-07-27 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165478#comment-17165478
 ] 

Prabhu Joseph commented on YARN-10366:
--

Have committed to trunk. Will resolve the Jira.

> Yarn rmadmin help message shows two labels for one node for 
> --replaceLabelsOnNode
> -
>
> Key: YARN-10366
> URL: https://issues.apache.org/jira/browse/YARN-10366
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Tanu Ajmera
>Assignee: Tanu Ajmera
>Priority: Major
> Attachments: Screenshot 2020-07-24 at 4.07.10 PM.png, 
> YARN-10366-001.patch
>
>
> In the help message of “yarn rmadmin”, it looks like one node can be assigned 
> two labels, which is not consistent with “Each node can have only one node 
> label”.






[jira] [Commented] (YARN-10366) Yarn rmadmin help message shows two labels for one node for --replaceLabelsOnNode

2020-07-27 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165465#comment-17165465
 ] 

Prabhu Joseph commented on YARN-10366:
--

Thanks [~tanu.ajmera] for the patch.

+1. will commit it shortly.

> Yarn rmadmin help message shows two labels for one node for 
> --replaceLabelsOnNode
> -
>
> Key: YARN-10366
> URL: https://issues.apache.org/jira/browse/YARN-10366
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Tanu Ajmera
>Assignee: Tanu Ajmera
>Priority: Major
> Attachments: Screenshot 2020-07-24 at 4.07.10 PM.png, 
> YARN-10366-001.patch
>
>
> In the help message of “yarn rmadmin”, it looks like one node can be assigned 
> two labels, which is not consistent with “Each node can have only one node 
> label”.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-24 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164177#comment-17164177
 ] 

Prabhu Joseph commented on YARN-10319:
--

Thanks [~Tao Yang] and [~adam.antal] for the review.

Have committed the  [^YARN-10319-006.patch]  to trunk. Will resolve the Jira.

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch
>
>
> ActivitiesManager records a call flow for a given nodeId or a last call flow. 
> This is useful when debugging the issue live where the user queries with 
> right nodeId. But capturing last N scheduler activities during the issue 
> period can help to debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-24 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10319:
-
Fix Version/s: 3.4.0

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Fix For: 3.4.0
>
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch
>
>
> ActivitiesManager records a call flow for a given nodeId or a last call flow. 
> This is useful when debugging the issue live where the user queries with 
> right nodeId. But capturing last N scheduler activities during the issue 
> period can help to debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-23 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163411#comment-17163411
 ] 

Prabhu Joseph commented on YARN-10319:
--

[~adam.antal] The failed testcase is not related; let me know if the latest 
patch  [^YARN-10319-006.patch]  is fine. Thanks.

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch
>
>
> ActivitiesManager records a call flow for a given nodeId or a last call flow. 
> This is useful when debugging the issue live where the user queries with 
> right nodeId. But capturing last N scheduler activities during the issue 
> period can help to debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type

2020-07-23 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10364:


 Summary: Absolute Resource [memory=0] is considered as Percentage 
config type
 Key: YARN-10364
 URL: https://issues.apache.org/jira/browse/YARN-10364
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


Absolute Resource [memory=0] is considered as Percentage config type. This 
causes failure while converting queues from Percentage to Absolute Resources 
automatically. 

*Repro:*

1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100%

2. While converting above to absolute resource automatically, capacity of queue 
A = [memory=], A.B = [memory=0]

This fails with below as A is considered as Absolute Resource whereas B is 
considered as Percentage config type.

{code}
2020-07-23 09:36:40,499 WARN 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
CapacityScheduler configuration validation failed:java.io.IOException: Failed 
to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should use 
either percentage based capacityconfiguration or absolute resource together for 
label:
{code}
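
For reference, a sketch of the two configuration shapes involved (the memory and 
vcore numbers are illustrative, since the converted absolute values are elided 
above):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class QueueCapacityConfigShapes {
  public static void main(String[] args) {
    // Percentage based configuration (the starting point of the repro).
    Configuration percentage = new Configuration(false);
    percentage.set("yarn.scheduler.capacity.root.A.capacity", "100");
    percentage.set("yarn.scheduler.capacity.root.A.B.capacity", "0");
    percentage.set("yarn.scheduler.capacity.root.A.C.capacity", "100");

    // Automatically converted absolute-resource configuration. Queue A.B ends
    // up with a zero-valued absolute resource; if that value is classified as
    // a percentage instead of an absolute resource, parent and child appear
    // to use different config types and queue re-initialization fails with
    // the error above.
    Configuration absolute = new Configuration(false);
    absolute.set("yarn.scheduler.capacity.root.A.capacity",
        "[memory=8192,vcores=8]");
    absolute.set("yarn.scheduler.capacity.root.A.B.capacity",
        "[memory=0,vcores=0]");
    absolute.set("yarn.scheduler.capacity.root.A.C.capacity",
        "[memory=8192,vcores=8]");
  }
}
{code}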





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type

2020-07-23 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10364:
-
Affects Version/s: 3.4.0

> Absolute Resource [memory=0] is considered as Percentage config type
> 
>
> Key: YARN-10364
> URL: https://issues.apache.org/jira/browse/YARN-10364
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> Absolute Resource [memory=0] is considered as Percentage config type. This 
> causes failure while converting queues from Percentage to Absolute Resources 
> automatically. 
> *Repro:*
> 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100%
> 2. While converting above to absolute resource automatically, capacity of 
> queue A = [memory=], A.B = [memory=0]
> This fails with below as A is considered as Absolute Resource whereas B is 
> considered as Percentage config type.
> {code}
> 2020-07-23 09:36:40,499 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Failed 
> to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should 
> use either percentage based capacityconfiguration or absolute resource 
> together for label:
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-23 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163335#comment-17163335
 ] 

Prabhu Joseph commented on YARN-10352:
--

[~wangda] The failing testcase is not related to this fix. Let me know if the 
latest patch  [^YARN-10352-006.patch]  is fine; I will commit it if there are no 
comments. Thanks.

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-22 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Attachment: YARN-10352-006.patch

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10361) Make custom DAO classes configurable into RMWebApp#JAXBContextResolver

2020-07-22 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10361:


 Summary: Make custom DAO classes configurable into 
RMWebApp#JAXBContextResolver
 Key: YARN-10361
 URL: https://issues.apache.org/jira/browse/YARN-10361
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 3.4.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


YARN-8047 provides support to add custom WebServices as part of RMWebApp, but 
the custom DAO classes need to be added into JAXBContextResolver. This Jira is 
to make that configurable.
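
A rough sketch of the idea (not the actual RMWebApp code; the configuration 
property name below is hypothetical): the resolver builds its JAXBContext from a 
configurable list of DAO classes instead of a hard-coded one.

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.ws.rs.ext.ContextResolver;
import javax.ws.rs.ext.Provider;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import org.apache.hadoop.conf.Configuration;

@Provider
public class ConfigurableJAXBContextResolver
    implements ContextResolver<JAXBContext> {

  private final Set<Class<?>> types;
  private final JAXBContext context;

  public ConfigurableJAXBContextResolver(Configuration conf,
      Class<?>... builtInDaoClasses)
      throws JAXBException, ClassNotFoundException {
    Set<Class<?>> all = new HashSet<>(Arrays.asList(builtInDaoClasses));
    // Hypothetical property carrying extra DAO class names, so that custom
    // web services do not need to modify the resolver itself.
    for (String name : conf.getTrimmedStrings("yarn.webapp.custom-dao-classes")) {
      all.add(Class.forName(name));
    }
    this.types = all;
    this.context = JAXBContext.newInstance(all.toArray(new Class<?>[0]));
  }

  @Override
  public JAXBContext getContext(Class<?> objectType) {
    return types.contains(objectType) ? context : null;
  }
}
{code}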



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10360) Support Multi Node Placement in SingleConstraintAppPlacementAllocator

2020-07-22 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10360:
-
Attachment: YARN-10360-001.patch

> Support Multi Node Placement in SingleConstraintAppPlacementAllocator
> -
>
> Key: YARN-10360
> URL: https://issues.apache.org/jira/browse/YARN-10360
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, multi-node-placement
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10360-001.patch
>
>
> Currently, placement constraints are not supported when Multi Node Placement 
> is enabled. This Jira is to add Support for Multi Node Placement in 
> SingleConstraintAppPlacementAllocator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10360) Support Multi Node Placement in SingleConstraintAppPlacementAllocator

2020-07-22 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10360:


 Summary: Support Multi Node Placement in 
SingleConstraintAppPlacementAllocator
 Key: YARN-10360
 URL: https://issues.apache.org/jira/browse/YARN-10360
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler, multi-node-placement
Affects Versions: 3.4.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


Currently, placement constraints are not supported when Multi Node Placement is 
enabled. This Jira is to add Support for Multi Node Placement in 
SingleConstraintAppPlacementAllocator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-07-22 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10293:
-
Parent: YARN-5139
Issue Type: Sub-task  (was: Bug)

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch, YARN-10293-004.patch, YARN-10293-005.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and reserved container stays at reserved.
> After reservation the used capacity becomes 1.0f, below will be in a loop and 
> no new allocate or reserve happens. The reserved container cannot be 
> allocated as reserved node does not have space. node2 has space for 1GB, 
> 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting 
> called causing the Hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> 2020-05-21 12:13:33,243 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Allocation proposal accepted
> {code}
> CapacityScheduler#allocateOrReserveNewContainers won't be 

[jira] [Updated] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement

2020-07-22 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10259:
-
Parent: YARN-5139
Issue Type: Sub-task  (was: Bug)

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement
> ---
>
> Key: YARN-10259
> URL: https://issues.apache.org/jira/browse/YARN-10259
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10259-001.patch, YARN-10259-002.patch, 
> YARN-10259-003.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes 
> in CandidateNodeSet in MultiNodePlacement. 
> *Repro:*
> 1. MultiNode Placement Enabled.
> 2. Two nodes h1 and h2 with 8GB
> 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets 
> placed in h2.
> 4. Submit app3 AM which is reserved in h1
> 5. Kill app2 which frees space in h2.
> 6. app3 AM never gets ALLOCATED
> RM logs shows YARN-8127 fix rejecting the allocation proposal for app3 AM on 
> h2 as it expects the assignment to be on same node where reservation has 
> happened.
> {code}
> 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt 
> appattempt_1588684773609_0003_01 reserved container 
> container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 
> available= used=. This attempt 
> currently has 1 reserved containers at priority 0; currentReservation 
> 
> 2020-05-05 18:49:37,264 INFO  [AsyncDispatcher event handler] 
> fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved 
> container=container_1588684773609_0003_01_01, on node=host: h1:1234 
> #containers=1 available= used= 
> with resource=
>RESERVED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h1:1234; Resource=)]
>
> 2020-05-05 18:49:38,283 DEBUG [Time-limited test] 
> allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: 
> node=h2 application=application_1588684773609_0003 priority=0 
> pendingAsk=,repeat=1> 
> type=OFF_SWITCH
> 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate 
> from reserved container container_1588684773609_0003_01_01, but node is 
> not reserved
>ALLOCATED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h2:1234; Resource=)]
> {code}
> Attached testcase which reproduces the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10357) Proactively relocate allocated containers from a stopped node

2020-07-22 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10357:
-
Parent: YARN-5139
Issue Type: Sub-task  (was: Improvement)

> Proactively relocate allocated containers from a stopped node
> -
>
> Key: YARN-10357
> URL: https://issues.apache.org/jira/browse/YARN-10357
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, multi-node-placement
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> In a cloud environment, nodes can be frequently commissioned; if we always 
> wait for the 10 min timeout, it may not be good. It is better to improve the 
> logic by preempting containers that are newly allocated (but not yet acquired) 
> on an NM which has stopped heartbeating. With this, we can proactively 
> relocate containers to different nodes before the 10 min timeout.
> cc [~leftnoteasy]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-22 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Parent: YARN-5139
Issue Type: Sub-task  (was: Bug)

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-21 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162214#comment-17162214
 ] 

Prabhu Joseph commented on YARN-10319:
--

Thanks [~adam.antal] for the review.

Have addressed them in patch  [^YARN-10319-006.patch] .

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch
>
>
> ActivitiesManager records a call flow for a given nodeId or a last call flow. 
> This is useful when debugging the issue live where the user queries with 
> right nodeId. But capturing last N scheduler activities during the issue 
> period can help to debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-21 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10319:
-
Attachment: YARN-10319-006.patch

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch
>
>
> ActivitiesManager records a call flow for a given nodeId or a last call flow. 
> This is useful when debugging the issue live where the user queries with 
> right nodeId. But capturing last N scheduler activities during the issue 
> period can help to debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-21 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Attachment: YARN-10352-005.patch

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-21 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162199#comment-17162199
 ] 

Prabhu Joseph commented on YARN-10352:
--

Thanks [~wangda].

Have fixed the checkstyle issues and the failed testcase in  [^YARN-10352-005.patch] 

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-21 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161931#comment-17161931
 ] 

Prabhu Joseph commented on YARN-10352:
--

Thanks [~wangda] for the review comments. Have addressed them in  
[^YARN-10352-004.patch] 

bq.  With this, we can proactively relocate containers to different nodes 
before the 10 mins timeout. 

Yes right, have reported YARN-10357 to track this.

Currently the NM does not unregister from the RM when Node Recovery is enabled, 
so that stopping it does not affect the existing running containers. Instead, I 
think it can send unRegisterNM with a boolean set, which the RM can use to stop 
scheduling on that node and preempt allocated (but not acquired) containers 
without disturbing the running containers. The RM will then also have the right 
available cluster resources without counting the stopped nodes.

NodeStatusUpdaterImpl#serviceStop

{code}
if (this.registeredWithRM && !this.isStopped
    && !isNMUnderSupervisionWithRecoveryEnabled()
    && !context.getDecommissioned() && !failedToConnect) {
  unRegisterNM();
}
{code}
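
For illustration only (the boolean parameter below is hypothetical, not an 
existing NodeStatusUpdater API), the proposal amounts to a fragment like:

{code:java}
// Hypothetical variant of the shutdown path: always inform the RM that this
// NM is stopping, but carry a flag saying whether recovery is enabled so the
// RM can stop scheduling on the node and preempt allocated (but not acquired)
// containers without killing the containers that are already running.
boolean keepRunningContainers = isNMUnderSupervisionWithRecoveryEnabled();
if (this.registeredWithRM && !this.isStopped
    && !context.getDecommissioned() && !failedToConnect) {
  unRegisterNM(keepRunningContainers);  // hypothetical overload
}
{code}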



> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10357) Proactively relocate allocated containers from a stopped node

2020-07-21 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10357:


 Summary: Proactively relocate allocated containers from a stopped 
node
 Key: YARN-10357
 URL: https://issues.apache.org/jira/browse/YARN-10357
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: multi-node-placement, capacityscheduler
Affects Versions: 3.4.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


In a cloud environment, nodes can be frequently commissioned; if we always wait 
for the 10 min timeout, it may not be good. It is better to improve the logic by 
preempting containers that are newly allocated (but not yet acquired) on an NM 
which has stopped heartbeating. With this, we can proactively relocate 
containers to different nodes before the 10 min timeout.

cc [~leftnoteasy]





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-21 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Attachment: YARN-10352-004.patch

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161682#comment-17161682
 ] 

Prabhu Joseph commented on YARN-10352:
--

[~wangda] For each node in the list returned by 
{{CapacityScheduler#getNodesHeartbeated}}, only the allocation of reserved 
containers on that node happens.

Allocating or reserving new containers uses the multi-node candidates prepared 
by {{MultiNodeSorter#reSortClusterNodes}} (code snippet below), which passes the 
list to the configured {{MultiNodeLookupPolicy}} to perform the sorting in the 
background at every configured sorting interval. {{MultiNodeSortingManager}} 
filters that list while returning it to the 
{{RegularContainerAllocator#allocate}} call.

 
{code:java}
  Map<NodeId, SchedulerNode> nodesByPartition = new HashMap<>();
  List<SchedulerNode> nodes = ((AbstractYarnScheduler) rmContext
      .getScheduler()).getNodeTracker().getNodesPerPartition(label);
  if (nodes != null) {
    nodes.forEach(n -> nodesByPartition.put(n.getNodeID(), n));
    multiNodePolicy.addAndRefreshNodesSet(
        (Collection<N>) nodesByPartition.values(), label);
  }
{code}
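
A minimal, self-contained sketch of the kind of heartbeat filter being discussed 
(the types and the staleness multiplier are illustrative, not the actual 
scheduler code):

{code:java}
import java.util.ArrayList;
import java.util.List;

public class HeartbeatFilterSketch {

  static class Node {
    final String id;
    final long lastHeartbeatMs;
    Node(String id, long lastHeartbeatMs) {
      this.id = id;
      this.lastHeartbeatMs = lastHeartbeatMs;
    }
  }

  // Keep only nodes whose last heartbeat is recent enough; a stopped NM that
  // is still listed as active (recovery enabled, liveness monitor not yet
  // expired) is skipped for scheduling instead of receiving containers.
  static List<Node> nodesHeartbeated(List<Node> nodes, long nowMs,
      long heartbeatIntervalMs) {
    List<Node> alive = new ArrayList<>();
    for (Node n : nodes) {
      if (nowMs - n.lastHeartbeatMs <= 2 * heartbeatIntervalMs) {
        alive.add(n);
      }
    }
    return alive;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    List<Node> nodes = new ArrayList<>();
    nodes.add(new Node("worker0", now - 10 * 60 * 1000)); // stopped 10 min ago
    nodes.add(new Node("worker1", now - 500));            // heartbeated recently
    for (Node n : nodesHeartbeated(nodes, now, 1000)) {
      System.out.println(n.id); // prints only worker1
    }
  }
}
{code}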

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Attachment: YARN-10352-003.patch

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161163#comment-17161163
 ] 

Prabhu Joseph commented on YARN-10352:
--

Thanks [~bibinchundatt] for the inputs.

1. Removed iterating the nodes from {{ClusterNodeTracker}} and moved the 
filtering logic to {{CapacityScheduler}}.

2. Have added filter logic while returning the {{preferrednodeIterator}}.

3. {{reSortClusterNodes}} need not filter, as {{preferrednodeIterator}} does the 
same at the end. Let me know if this is fine.





> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Attachment: YARN-10352-002.patch

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-17 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Attachment: YARN-10352-001.patch

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-17 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Summary: Skip schedule on not heartbeated nodes in Multi Node Placement  
(was: MultiNode Placement assigns container on stopped NodeManagers)

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments

2020-07-16 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159389#comment-17159389
 ] 

Prabhu Joseph commented on YARN-10339:
--

Have committed the  [^YARN-10339.002.patch]  to trunk. Will resolve this Jira.

> Timeline Client in Nodemanager gets 403 errors when simple auth is used in 
> kerberos environments
> 
>
> Key: YARN-10339
> URL: https://issues.apache.org/jira/browse/YARN-10339
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10339.001.patch, YARN-10339.002.patch
>
>
> We get below errors in NodeManager logs whenever we set 
> yarn.timeline-service.http-authentication.type=simple in a cluster which has 
> kerberos enabled. There are use cases where simple auth is used only in 
> timeline server for convenience although kerberos is enabled.
> {code:java}
> 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl 
> (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline 
> server is not successful, HTTP error code: 403, Server response:
> {"exception":"ForbiddenException","message":"java.lang.Exception: The owner 
> of the posted timeline entities is not 
> set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}
> {code}
> This seems to affect the NM timeline publisher which uses 
> TimelineV2ClientImpl. Doing a simple auth directly to timeline service via 
> curl works fine. So this issue is in the authenticator configuration in 
> timeline client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments

2020-07-16 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159383#comment-17159383
 ] 

Prabhu Joseph commented on YARN-10339:
--

Thanks [~tarunparimi] for the patch.

+1, will commit it shortly.

> Timeline Client in Nodemanager gets 403 errors when simple auth is used in 
> kerberos environments
> 
>
> Key: YARN-10339
> URL: https://issues.apache.org/jira/browse/YARN-10339
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10339.001.patch, YARN-10339.002.patch
>
>
> We get below errors in NodeManager logs whenever we set 
> yarn.timeline-service.http-authentication.type=simple in a cluster which has 
> kerberos enabled. There are use cases where simple auth is used only in 
> timeline server for convenience although kerberos is enabled.
> {code:java}
> 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl 
> (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline 
> server is not successful, HTTP error code: 403, Server response:
> {"exception":"ForbiddenException","message":"java.lang.Exception: The owner 
> of the posted timeline entities is not 
> set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}
> {code}
> This seems to affect the NM timeline publisher which uses 
> TimelineV2ClientImpl. Doing a simple auth directly to timeline service via 
> curl works fine. So this issue is in the authenticator configuration in 
> timeline client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10352) MultiNode Placement assigns container on stopped NodeManagers

2020-07-16 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Summary: MultiNode Placement assigns container on stopped NodeManagers  
(was: MultiNode Placament assigns container on stopped NodeManagers)

> MultiNode Placement assigns container on stopped NodeManagers
> -
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10352) MultiNode Placament assigns container on stopped NodeManagers

2020-07-16 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Labels: capacityscheduler multi-node-placement  (was: )

> MultiNode Placament assigns container on stopped NodeManagers
> -
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
>
> When Node Recovery is enabled, stopping a NM does not unregister it from the 
> RM. The RM Active Nodes list will still contain those stopped nodes until the 
> NM Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement assigns containers on those nodes. It needs to 
> exclude nodes which have not heartbeated within the configured heartbeat 
> interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), 
> similar to the Asynchronous Capacity Scheduler Threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will time out as they are assigned to 
> the stopped NM worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10352) MultiNode Placament assigns container on stopped NodeManagers

2020-07-16 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Affects Version/s: 3.3.0

> MultiNode Placament assigns container on stopped NodeManagers
> -
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> When Node Recovery is enabled, stopping a NM does not unregister it from the 
> RM. The RM Active Nodes list will still contain those stopped nodes until the 
> NM Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement assigns containers on those nodes. It needs to 
> exclude nodes which have not heartbeated within the configured heartbeat 
> interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), 
> similar to the Asynchronous Capacity Scheduler Threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will time out as they are assigned to 
> the stopped NM worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10352) MultiNode Placament assigns container on stopped NodeManagers

2020-07-16 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10352:
-
Affects Version/s: 3.4.0

> MultiNode Placament assigns container on stopped NodeManagers
> -
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> When Node Recovery is enabled, stopping a NM does not unregister it from the 
> RM. The RM Active Nodes list will still contain those stopped nodes until the 
> NM Liveliness Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement assigns containers on those nodes. It needs to 
> exclude nodes which have not heartbeated within the configured heartbeat 
> interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), 
> similar to the Asynchronous Capacity Scheduler Threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will time out as they are assigned to 
> the stopped NM worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10352) MultiNode Placament assigns container on stopped NodeManagers

2020-07-16 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10352:


 Summary: MultiNode Placament assigns container on stopped 
NodeManagers
 Key: YARN-10352
 URL: https://issues.apache.org/jira/browse/YARN-10352
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


When Node Recovery is enabled, stopping a NM does not unregister it from the 
RM. The RM Active Nodes list will still contain those stopped nodes until the 
NM Liveliness Monitor expires them after the configured timeout 
(yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
minutes, Multi Node Placement assigns containers on those nodes. It needs to 
exclude nodes which have not heartbeated within the configured heartbeat 
interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), 
similar to the Asynchronous Capacity Scheduler Threads 
(CapacityScheduler#shouldSkipNodeSchedule).
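
A minimal sketch of the kind of heartbeat check described above (the parameter 
names are placeholders, not the actual CapacityScheduler code):

{code:java}
import org.apache.hadoop.util.Time;

public class NodeSkipCheck {
  // Skip placement on a node whose last heartbeat is older than the configured
  // NM heartbeat interval, e.g. a stopped NM that node recovery keeps listed.
  static boolean shouldSkipNodeSchedule(long lastHeartbeatMonotonicMs,
      long nmHeartbeatIntervalMs) {
    return Time.monotonicNow() - lastHeartbeatMonotonicMs > nmHeartbeatIntervalMs;
  }
}
{code}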


*Repro:*

1. Enable Multi Node Placement 
(yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery Enabled  
(yarn.node.recovery.enabled)

2. Have only one NM running say worker0

3. Stop worker0 and start any other NM say worker1

4. Submit a sleep job. The containers will time out as they are assigned to 
the stopped NM worker0.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8676) Incorrect progress index in old yarn UI

2020-07-13 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156533#comment-17156533
 ] 

Prabhu Joseph commented on YARN-8676:
-

[~Cyl] Can you share a screenshot of the RM UI before the patch that shows the 
issue, and a screenshot after the fix that shows it resolved? Thanks.

> Incorrect progress index in old yarn UI
> ---
>
> Key: YARN-8676
> URL: https://issues.apache.org/jira/browse/YARN-8676
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yeliang Cang
>Assignee: Yeliang Cang
>Priority: Critical
> Attachments: YARN-8676.001.patch
>
>
> The index of parseHadoopProgress index is wrong in 
> WebPageUtils#getAppsTableColumnDefs
> {code:java}
> if (isFairSchedulerPage) {
>  sb.append("[15]");
> } else if (isResourceManager) {
>  sb.append("[17]");
> } else {
>  sb.append("[9]");
> }
> {code}
> should be
> {code:java}
> if (isFairSchedulerPage) {
>  sb.append("[16]");
> } else if (isResourceManager) {
>  sb.append("[18]");
> } else {
>  sb.append("[11]");
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10349) YARN_AM_RM_TOKEN, Localizer token does not have a service name

2020-07-09 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10349:


 Summary: YARN_AM_RM_TOKEN, Localizer token does not have a service 
name
 Key: YARN-10349
 URL: https://issues.apache.org/jira/browse/YARN-10349
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


UGI Credentials#addToken silently overrides a token with the same service name 
(HADOOP-17121). This causes tokens like the YARN_AM_RM_TOKEN and the Localizer 
token, which have an empty service name, to get overridden. It is safer to have 
a service name for these tokens.
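
A small, self-contained illustration of the collision (the token kinds are 
hypothetical, not the actual YARN tokens): two tokens that both carry an empty 
service name share the same key inside Credentials, so the second addToken call 
silently replaces the first.

{code:java}
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public class EmptyServiceTokenDemo {
  public static void main(String[] args) {
    Credentials credentials = new Credentials();
    // Two unrelated tokens, both with an empty service name.
    Token<TokenIdentifier> first =
        new Token<>(new byte[0], new byte[0], new Text("KIND_A"), new Text(""));
    Token<TokenIdentifier> second =
        new Token<>(new byte[0], new byte[0], new Text("KIND_B"), new Text(""));
    credentials.addToken(first.getService(), first);
    credentials.addToken(second.getService(), second); // silently replaces "first"
    System.out.println("Tokens kept: " + credentials.numberOfTokens()); // prints 1
  }
}
{code}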



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10340) HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol

2020-07-08 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10340:
-
Parent: YARN-10025
Issue Type: Sub-task  (was: Bug)

> HsWebServices getContainerReport uses loginUser instead of remoteUser to 
> access ApplicationClientProtocol
> -
>
> Key: YARN-10340
> URL: https://issues.apache.org/jira/browse/YARN-10340
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Prabhu Joseph
>Assignee: Tarun Parimi
>Priority: Major
>
> HsWebServices getContainerReport uses loginUser instead of remoteUser to 
> access ApplicationClientProtocol
>  
> [http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs|http://pjoseph-secure-1.pjoseph-secure.root.hwx.site:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs]
> While accessing above link using systest user, the request fails saying 
> mapred user does not have access to the job
>  
> {code:java}
> 2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: 
> Could not obtain node HTTP address from provider.
> javax.ws.rs.WebApplicationException: 
> org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have 
> privilege to see this application application_1593997842459_0214
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowThrowable(WebServices.java:544)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:530)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getContainer(WebServices.java:405)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getNodeHttpAddress(WebServices.java:373)
> at 
> org.apache.hadoop.yarn.server.webapp.LogServlet.getContainerLogsInfo(LogServlet.java:268)
> at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.getContainerLogs(HsWebServices.java:461)
>  
> {code}
> On analysis, WebServices#getContainer was found to perform doAs with a UGI 
> created by createRemoteUser(end user) to access RM#ApplicationClientProtocol, 
> which does not work. It needs to use createProxyUser instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs

2020-07-08 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10345:
-
Parent: YARN-10025
Issue Type: Sub-task  (was: Bug)

> HsWebServices containerlogs does not honor ACLs for completed jobs
> --
>
> Key: YARN-10345
> URL: https://issues.apache.org/jira/browse/YARN-10345
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.3.0, 3.2.2, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png
>
>
> HsWebServices containerlogs does not honor ACLs. User who does not have 
> permission to view a job is allowed to view the job logs for completed jobs 
> from YARN UI2 through HsWebServices.
> *Repro:*
> Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + 
> HistoryServer runs as mapred
>  # Run a sample MR job using systest user
>  #  Once the job is complete, access the job logs using hue user from YARN 
> UI2.
> !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300!
>  
> YARN CLI works fine and does not allow hue user to view systest user job logs.
> {code:java}
> [hue@pjoseph-cm-2 /]$ 
> [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at 
> rmhostname:8032
> Permission denied: user=hue, access=EXECUTE, 
> inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs

2020-07-08 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10345:
-
Affects Version/s: (was: 3.2.0)
   3.2.2
   3.3.0

> HsWebServices containerlogs does not honor ACLs for completed jobs
> --
>
> Key: YARN-10345
> URL: https://issues.apache.org/jira/browse/YARN-10345
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.3.0, 3.2.2, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png
>
>
> HsWebServices containerlogs does not honor ACLs. User who does not have 
> permission to view a job is allowed to view the job logs for completed jobs 
> from YARN UI2 through HsWebServices.
> *Repro:*
> Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + 
> HistoryServer runs as mapred
>  # Run a sample MR job using systest user
>  #  Once the job is complete, access the job logs using hue user from YARN 
> UI2.
> !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300!
>  
> YARN CLI works fine and does not allow hue user to view systest user job logs.
> {code:java}
> [hue@pjoseph-cm-2 /]$ 
> [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at 
> rmhostname:8032
> Permission denied: user=hue, access=EXECUTE, 
> inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs

2020-07-08 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10345:
-
Description: 
HsWebServices containerlogs does not honor ACLs. User who does not have 
permission to view a job is allowed to view the job logs for completed jobs 
from YARN UI2 through HsWebServices.

*Repro:*

Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + 
HistoryServer runs as mapred
 # Run a sample MR job using systest user
 #  Once the job is complete, access the job logs using hue user from YARN UI2.

!Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300!

 

YARN CLI works fine and does not allow hue user to view systest user job logs.
{code:java}
[hue@pjoseph-cm-2 /]$ 
[hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at 
rmhostname:8032
Permission denied: user=hue, access=EXECUTE, 
inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
{code}

  was:
HsWebServices containerlogs does not honor ACLs. User who does not have 
permission to view a job is allowed to view the job logs from YARN UI2 through 
HsWebServices.

*Repro:*

Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + 
HistoryServer runs as mapred
 # Run a sample MR job using systest user
 #  Once the job is complete, access the job logs using hue user from YARN UI2.

!Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300!

 

YARN CLI works fine and does not allow hue user to view systest user job logs.
{code:java}
[hue@pjoseph-cm-2 /]$ 
[hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at 
rmhostname:8032
Permission denied: user=hue, access=EXECUTE, 
inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
{code}


> HsWebServices containerlogs does not honor ACLs for completed jobs
> --
>
> Key: YARN-10345
> URL: https://issues.apache.org/jira/browse/YARN-10345
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.2.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png
>
>
> HsWebServices containerlogs does not honor ACLs. User who does not have 
> permission to view a job is allowed to view the job logs for completed jobs 
> from YARN UI2 through HsWebServices.
> *Repro:*
> Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + 
> HistoryServer runs as mapred
>  # Run a sample MR job using systest user
>  #  Once the job is complete, access the job logs using hue user from YARN 
> UI2.
> !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300!
>  
> YARN CLI works fine and does not allow hue user to view systest user job logs.
> {code:java}
> [hue@pjoseph-cm-2 /]$ 
> [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at 
> rmhostname:8032
> Permission denied: user=hue, access=EXECUTE, 
> inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs

2020-07-08 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10345:
-
Description: 
HsWebServices containerlogs does not honor ACLs. User who does not have 
permission to view a job is allowed to view the job logs from YARN UI2 through 
HsWebServices.

*Repro:*

Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + 
HistoryServer runs as mapred
 # Run a sample MR job using systest user
 #  Once the job is complete, access the job logs using hue user from YARN UI2.

!Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300!

 

YARN CLI works fine and does not allow hue user to view systest user job logs.
{code:java}
[hue@pjoseph-cm-2 /]$ 
[hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at 
rmhostname:8032
Permission denied: user=hue, access=EXECUTE, 
inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
{code}

  was:
HsWebServices containerlogs does not honor ACLs. User who does not have 
permission to view a job is allowed to view the job logs from YARN UI2 through 
HsWebServices.

*Repro:*

Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + 
HistoryServer runs as mapred

1. Run a sample MR job using systest user
2. Once the job is complete, access the job logs using hue user from YARN UI2. 




YARN CLI works fine.
{code}
[hue@pjoseph-cm-2 /]$ 
[hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at 
rmhostname:8032
Permission denied: user=hue, access=EXECUTE, 
inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
{code}




> HsWebServices containerlogs does not honor ACLs for completed jobs
> --
>
> Key: YARN-10345
> URL: https://issues.apache.org/jira/browse/YARN-10345
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.2.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png
>
>
> HsWebServices containerlogs does not honor ACLs. User who does not have 
> permission to view a job is allowed to view the job logs from YARN UI2 
> through HsWebServices.
> *Repro:*
> Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + 
> HistoryServer runs as mapred
>  # Run a sample MR job using systest user
>  #  Once the job is complete, access the job logs using hue user from YARN 
> UI2.
> !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300!
>  
> YARN CLI works fine and does not allow hue user to view systest user job logs.
> {code:java}
> [hue@pjoseph-cm-2 /]$ 
> [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at 
> rmhostname:8032
> Permission denied: user=hue, access=EXECUTE, 
> inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs

2020-07-08 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10345:


 Summary: HsWebServices containerlogs does not honor ACLs for 
completed jobs
 Key: YARN-10345
 URL: https://issues.apache.org/jira/browse/YARN-10345
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.2.0, 3.4.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph
 Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png

HsWebServices containerlogs does not honor ACLs. User who does not have 
permission to view a job is allowed to view the job logs from YARN UI2 through 
HsWebServices.

*Repro:*

Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + 
HistoryServer runs as mapred

1. Run a sample MR job using systest user
2. Once the job is complete, access the job logs using hue user from YARN UI2. 




YARN CLI works fine.
{code}
[hue@pjoseph-cm-2 /]$ 
[hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at 
rmhostname:8032
Permission denied: user=hue, access=EXECUTE, 
inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
{code}
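
A minimal sketch of the kind of ACL guard the web path is missing (where it 
should be wired in is an assumption); the CLI effectively gets an equivalent 
check for free via the HDFS permissions on the aggregated log directory:

{code:java}
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.records.ApplicationAccessType;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.server.security.ApplicationACLsManager;

public class LogViewAclCheck {
  // Deny the log view unless the remote user passes the application's
  // VIEW_APP ACL; members of yarn.admin.acl are expected to pass as well.
  static boolean canViewLogs(ApplicationACLsManager aclsManager,
      String remoteUser, String appOwner, ApplicationId appId) {
    UserGroupInformation callerUgi =
        UserGroupInformation.createRemoteUser(remoteUser);
    return aclsManager.checkAccess(callerUgi, ApplicationAccessType.VIEW_APP,
        appOwner, appId);
  }
}
{code}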





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs

2020-07-08 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10345:
-
Attachment: Screen Shot 2020-07-08 at 12.54.21 PM.png

> HsWebServices containerlogs does not honor ACLs for completed jobs
> --
>
> Key: YARN-10345
> URL: https://issues.apache.org/jira/browse/YARN-10345
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.2.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png
>
>
> HsWebServices containerlogs does not honor ACLs. User who does not have 
> permission to view a job is allowed to view the job logs from YARN UI2 
> through HsWebServices.
> *Repro:*
> Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + 
> HistoryServer runs as mapred
> 1. Run a sample MR job using systest user
> 2. Once the job is complete, access the job logs using hue user from YARN 
> UI2. 
> YARN CLI works fine.
> {code}
> [hue@pjoseph-cm-2 /]$ 
> [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at 
> rmhostname:8032
> Permission denied: user=hue, access=EXECUTE, 
> inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8047) RMWebApp make external class pluggable

2020-07-08 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153309#comment-17153309
 ] 

Prabhu Joseph commented on YARN-8047:
-

Thanks [~BilwaST] for the patch.

Have committed the latest patch [^YARN-8047.006.patch] to trunk. Can you file a 
separate Jira to handle the test case?

> RMWebApp make external class pluggable
> --
>
> Key: YARN-8047
> URL: https://issues.apache.org/jira/browse/YARN-8047
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-8047-001.patch, YARN-8047-002.patch, 
> YARN-8047-003.patch, YARN-8047.004.patch, YARN-8047.005.patch, 
> YARN-8047.006.patch
>
>
> JIra should make sure we should be able to plugin webservices and web pages 
> of scheduler in Resourcemanager
> * RMWebApp allow to bind external classes
> * RMController allow to plugin scheduler classes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10340) HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol

2020-07-07 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153240#comment-17153240
 ] 

Prabhu Joseph commented on YARN-10340:
--

Thanks [~tarunparimi] for the analysis. 

bq. This creates a separate rpc client instance every time though which is not 
efficient.

This won't be a problem, as the newly added WebServices (YARN-10028) are used 
only by YARN UI2, unless a user opens a huge number of UI2 pages at a time. It 
is also the right way to achieve doAs for RPC calls.

> HsWebServices getContainerReport uses loginUser instead of remoteUser to 
> access ApplicationClientProtocol
> -
>
> Key: YARN-10340
> URL: https://issues.apache.org/jira/browse/YARN-10340
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Tarun Parimi
>Priority: Major
>
> HsWebServices getContainerReport uses loginUser instead of remoteUser to 
> access ApplicationClientProtocol
>  
> [http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs|http://pjoseph-secure-1.pjoseph-secure.root.hwx.site:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs]
> While accessing above link using systest user, the request fails saying 
> mapred user does not have access to the job
>  
> {code:java}
> 2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: 
> Could not obtain node HTTP address from provider.
> javax.ws.rs.WebApplicationException: 
> org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have 
> privilege to see this application application_1593997842459_0214
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowThrowable(WebServices.java:544)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:530)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getContainer(WebServices.java:405)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getNodeHttpAddress(WebServices.java:373)
> at 
> org.apache.hadoop.yarn.server.webapp.LogServlet.getContainerLogsInfo(LogServlet.java:268)
> at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.getContainerLogs(HsWebServices.java:461)
>  
> {code}
> On analysis, WebServices#getContainer was found to perform doAs with a UGI 
> created by createRemoteUser(end user) to access RM#ApplicationClientProtocol, 
> which does not work. It needs to use createProxyUser instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10337) TestRMHATimelineCollectors fails on hadoop trunk

2020-07-07 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10337:
-
Parent: YARN-9802
Issue Type: Sub-task  (was: Bug)

> TestRMHATimelineCollectors fails on hadoop trunk
> 
>
> Key: YARN-10337
> URL: https://issues.apache.org/jira/browse/YARN-10337
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: test, yarn
>Reporter: Ahmed Hussein
>Assignee: Bilwa S T
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10337.001.patch
>
>
> {{TestRMHATimelineCollectors}} has been failing on trunk. I see it frequently 
> in the qbt reports and the yetus reports
> {code:bash}
> [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 5.95 
> s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors
> [ERROR] 
> testRebuildCollectorDataOnFailover(org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors)
>   Time elapsed: 5.615 s  <<< ERROR!
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors.testRebuildCollectorDataOnFailover(TestRMHATimelineCollectors.java:105)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:80)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> [INFO]
> [INFO] Results:
> [INFO]
> [ERROR] Errors:
> [ERROR]   TestRMHATimelineCollectors.testRebuildCollectorDataOnFailover:105 
> NullPointer
> [INFO]
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
> [INFO]
> [ERROR] There are test failures.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10337) TestRMHATimelineCollectors fails on hadoop trunk

2020-07-07 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152680#comment-17152680
 ] 

Prabhu Joseph commented on YARN-10337:
--

Thanks [~BilwaST] for the patch. 

+1, have committed it to trunk. Will resolve the Jira.

> TestRMHATimelineCollectors fails on hadoop trunk
> 
>
> Key: YARN-10337
> URL: https://issues.apache.org/jira/browse/YARN-10337
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, yarn
>Reporter: Ahmed Hussein
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10337.001.patch
>
>
> {{TestRMHATimelineCollectors}} has been failing on trunk. I see it frequently 
> in the qbt reports and the yetus reports
> {code:bash}
> [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 5.95 
> s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors
> [ERROR] 
> testRebuildCollectorDataOnFailover(org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors)
>   Time elapsed: 5.615 s  <<< ERROR!
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors.testRebuildCollectorDataOnFailover(TestRMHATimelineCollectors.java:105)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:80)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> [INFO]
> [INFO] Results:
> [INFO]
> [ERROR] Errors:
> [ERROR]   TestRMHATimelineCollectors.testRebuildCollectorDataOnFailover:105 
> NullPointer
> [INFO]
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
> [INFO]
> [ERROR] There are test failures.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments

2020-07-07 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152527#comment-17152527
 ] 

Prabhu Joseph commented on YARN-10339:
--

[~tarunparimi] Thanks for the patch. The patch looks good. Can you fix the 
checkstyle issues and the failing test case?

> Timeline Client in Nodemanager gets 403 errors when simple auth is used in 
> kerberos environments
> 
>
> Key: YARN-10339
> URL: https://issues.apache.org/jira/browse/YARN-10339
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10339.001.patch
>
>
> We get below errors in NodeManager logs whenever we set 
> yarn.timeline-service.http-authentication.type=simple in a cluster which has 
> kerberos enabled. There are use cases where simple auth is used only in 
> timeline server for convenience although kerberos is enabled.
> {code:java}
> 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl 
> (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline 
> server is not successful, HTTP error code: 403, Server response:
> {"exception":"ForbiddenException","message":"java.lang.Exception: The owner 
> of the posted timeline entities is not 
> set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}
> {code}
> This seems to affect the NM timeline publisher which uses 
> TimelineV2ClientImpl. Doing a simple auth directly to timeline service via 
> curl works fine. So this issue is in the authenticator configuration in 
> timeline client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10340) HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol

2020-07-06 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152173#comment-17152173
 ] 

Prabhu Joseph commented on YARN-10340:
--

[~brahmareddy] This issue happens irrespective of the HADOOP-16095 change. It 
looks like the issue has been present for a long time.

*Repro:*

Setup: Secure cluster + HistoryServer runs as mapred user + yarn.admin.acl=yarn 
and ACL for queues are set to " "

1. Run a mapreduce sleep job as userA
2. Access 
http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs
 as userA after kinit.
3. The request fails with the below error in the HistoryServer logs

{code}
2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: 
Could not obtain node HTTP address from provider.
javax.ws.rs.WebApplicationException: 
org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have 
privilege to see this application application_1593997842459_0214
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516)
{code}






> HsWebServices getContainerReport uses loginUser instead of remoteUser to 
> access ApplicationClientProtocol
> -
>
> Key: YARN-10340
> URL: https://issues.apache.org/jira/browse/YARN-10340
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Tarun Parimi
>Priority: Major
>
> HsWebServices getContainerReport uses loginUser instead of remoteUser to 
> access ApplicationClientProtocol
>  
> [http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs|http://pjoseph-secure-1.pjoseph-secure.root.hwx.site:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs]
> While accessing above link using systest user, the request fails saying 
> mapred user does not have access to the job
>  
> {code:java}
> 2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: 
> Could not obtain node HTTP address from provider.
> javax.ws.rs.WebApplicationException: 
> org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have 
> privilege to see this application application_1593997842459_0214
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowThrowable(WebServices.java:544)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:530)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getContainer(WebServices.java:405)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getNodeHttpAddress(WebServices.java:373)
> at 
> org.apache.hadoop.yarn.server.webapp.LogServlet.getContainerLogsInfo(LogServlet.java:268)
> at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.getContainerLogs(HsWebServices.java:461)
>  
> {code}
> On analysis, WebServices#getContainer was found to perform doAs with a UGI 
> created by createRemoteUser(end user) to access RM#ApplicationClientProtocol, 
> which does not work. It needs to use createProxyUser instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10340) HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol

2020-07-06 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10340:


 Summary: HsWebServices getContainerReport uses loginUser instead 
of remoteUser to access ApplicationClientProtocol
 Key: YARN-10340
 URL: https://issues.apache.org/jira/browse/YARN-10340
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Prabhu Joseph
Assignee: Tarun Parimi


HsWebServices getContainerReport uses loginUser instead of remoteUser to access 
ApplicationClientProtocol

 

[http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs|http://pjoseph-secure-1.pjoseph-secure.root.hwx.site:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs]

While accessing above link using systest user, the request fails saying mapred 
user does not have access to the job

 
{code:java}
2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: 
Could not obtain node HTTP address from provider.
javax.ws.rs.WebApplicationException: 
org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have 
privilege to see this application application_1593997842459_0214
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)

at 
org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowThrowable(WebServices.java:544)
at 
org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:530)
at 
org.apache.hadoop.yarn.server.webapp.WebServices.getContainer(WebServices.java:405)
at 
org.apache.hadoop.yarn.server.webapp.WebServices.getNodeHttpAddress(WebServices.java:373)
at 
org.apache.hadoop.yarn.server.webapp.LogServlet.getContainerLogsInfo(LogServlet.java:268)
at 
org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.getContainerLogs(HsWebServices.java:461)
 
{code}

On analysis, WebServices#getContainer was found to perform doAs with a UGI 
created by createRemoteUser(end user) to access RM#ApplicationClientProtocol, 
which does not work. It needs to use createProxyUser instead.
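
A minimal sketch of the proxy-user approach (method and parameter names are 
illustrative, not the actual patch), assuming the caller already holds an 
ApplicationClientProtocol proxy and knows the remote user's name:

{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.GetContainerReportRequest;
import org.apache.hadoop.yarn.api.protocolrecords.GetContainerReportResponse;
import org.apache.hadoop.yarn.api.records.ContainerId;

public class ProxyUserContainerReport {
  // Run the RPC as the end user proxied by the service (login) user, so the RM
  // applies its ACL checks against the end user instead of the mapred user.
  static GetContainerReportResponse getContainerReportAs(String endUser,
      ApplicationClientProtocol clientProtocol, ContainerId containerId)
      throws Exception {
    UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
        endUser, UserGroupInformation.getLoginUser());
    return proxyUgi.doAs(
        (PrivilegedExceptionAction<GetContainerReportResponse>) () ->
            clientProtocol.getContainerReport(
                GetContainerReportRequest.newInstance(containerId)));
  }
}
{code}

Proxying as the end user may also require the history server principal to be 
allowed as a proxy user (hadoop.proxyuser.*) on the RM side.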



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-02 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150173#comment-17150173
 ] 

Prabhu Joseph commented on YARN-10319:
--

Thanks [~Tao Yang] and [~adam.antal] for reviewing.

Have addressed above comments in the latest patch  [^YARN-10319-005.patch] .

Document change can be viewed [here|https://tinyurl.com/y7mpok4q]

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch, YARN-10319-005.patch
>
>
> ActivitiesManager records a call flow for a given nodeId or the last call 
> flow. This is useful when debugging an issue live, where the user queries 
> with the right nodeId. But capturing the last N scheduler activities during 
> the issue period can help debug the issue offline.
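
A minimal, generic sketch of the "keep the last N" recording idea (the real 
ActivitiesManager types and method names differ):

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

public final class LastNRecorder<T> {
  private final Deque<T> buffer = new ArrayDeque<>();
  private final int capacity;

  public LastNRecorder(int capacity) {
    this.capacity = capacity;
  }

  // Record one scheduler activity, dropping the oldest once the buffer is full.
  public synchronized void record(T activity) {
    if (buffer.size() == capacity) {
      buffer.removeFirst();
    }
    buffer.addLast(activity);
  }

  // Copy of the last N activities for offline inspection (e.g. via a REST call).
  public synchronized Deque<T> snapshot() {
    return new ArrayDeque<>(buffer);
  }
}
{code}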



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-07-02 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10319:
-
Attachment: YARN-10319-005.patch

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch, YARN-10319-005.patch
>
>
> ActivitiesManager records a call flow for a given nodeId or the last call 
> flow. This is useful when debugging an issue live, where the user queries 
> with the right nodeId. But capturing the last N scheduler activities during 
> the issue period can help debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path

2020-07-01 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149595#comment-17149595
 ] 

Prabhu Joseph edited comment on YARN-10333 at 7/1/20, 5:41 PM:
---

[~sunil.gov...@gmail.com] Can you review this Jira when you get time. Thanks.

Have verified with below combinations:

*fs.defaultFS   Log Aggregation Path*
 hdfs://nm1       s3a://tmp/app-logs
 hdfs://nm1       abfs://tmp/app-logs
 hdfs://nm1       hdfs://nm2/tmp/app-logs
 hdfs://nm1       hdfs://nm1/tmp/app-logs


was (Author: prabhu joseph):
[~sunil.gov...@gmail.com] Can you review this Jira when you get time. Thanks.

Have verified with below combinations:

*fs.defaultFS Log Aggregation Path*
hdfs://nm1   s3a://tmp/app-logs
hdfs://nm1   abfs://tmp/app-logs
hdfs://nm1   hdfs://nm2/tmp/app-logs
hdfs://nm1   hdfs://nm1/tmp/app-logs

> YarnClient obtain Delegation Token for Log Aggregation Path
> ---
>
> Key: YARN-10333
> URL: https://issues.apache.org/jira/browse/YARN-10333
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10333-001.patch, YARN-10333-002.patch, 
> YARN-10333-003.patch
>
>
> There are use cases where the Yarn Log Aggregation Path is configured to a 
> FileSystem like S3 or ABFS, different from what is configured in fs.defaultFS 
> (HDFS). Log Aggregation fails as the client has a token only for fs.defaultFS 
> and not for the log aggregation path.
> This Jira is to improve YarnClient by obtaining a delegation token for the 
> log aggregation path and adding it to the credentials of the Container Launch 
> Context, similar to how it does for the Timeline Delegation Token.
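
A minimal sketch of the idea, assuming the remote log directory comes from 
yarn.nodemanager.remote-app-log-dir and the renewer is supplied by the caller 
(the actual patch may wire this differently):

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;

public class LogAggregationTokenHelper {
  // Obtain delegation tokens for the log-aggregation FileSystem (which may
  // differ from fs.defaultFS) and add them to the credentials that end up in
  // the ContainerLaunchContext.
  static void addLogAggregationTokens(Configuration conf,
      Credentials credentials, String renewer) throws IOException {
    Path remoteLogDir = new Path(
        conf.get("yarn.nodemanager.remote-app-log-dir", "/tmp/logs"));
    FileSystem logFs = remoteLogDir.getFileSystem(conf);
    logFs.addDelegationTokens(renewer, credentials);
  }
}
{code}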



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path

2020-07-01 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149595#comment-17149595
 ] 

Prabhu Joseph commented on YARN-10333:
--

[~sunil.gov...@gmail.com] Can you review this Jira when you get time? Thanks.

Have verified with the following combinations:

*fs.defaultFS Log Aggregation Path*
hdfs://nm1   s3a://tmp/app-logs
hdfs://nm1   abfs://tmp/app-logs
hdfs://nm1   hdfs://nm2/tmp/app-logs
hdfs://nm1   hdfs://nm1/tmp/app-logs

> YarnClient obtain Delegation Token for Log Aggregation Path
> ---
>
> Key: YARN-10333
> URL: https://issues.apache.org/jira/browse/YARN-10333
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10333-001.patch, YARN-10333-002.patch, 
> YARN-10333-003.patch
>
>
> There are use cases where Yarn Log Aggregation Path is configured to a 
> FileSystem like S3 or ABFS different from what is configured in fs.defaultFS 
> (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS 
> and not for log aggregation path.
> This Jira is to improve YarnClient by obtaining delegation token for log 
> aggregation path and add it to the Credential of Container Launch Context 
> similar to how it does for Timeline Delegation Token.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path

2020-07-01 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10333:
-
Attachment: YARN-10333-003.patch

> YarnClient obtain Delegation Token for Log Aggregation Path
> ---
>
> Key: YARN-10333
> URL: https://issues.apache.org/jira/browse/YARN-10333
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10333-001.patch, YARN-10333-002.patch, 
> YARN-10333-003.patch
>
>
> There are use cases where Yarn Log Aggregation Path is configured to a 
> FileSystem like S3 or ABFS different from what is configured in fs.defaultFS 
> (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS 
> and not for log aggregation path.
> This Jira is to improve YarnClient by obtaining delegation token for log 
> aggregation path and add it to the Credential of Container Launch Context 
> similar to how it does for Timeline Delegation Token.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path

2020-07-01 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10333:
-
Attachment: YARN-10333-002.patch

> YarnClient obtain Delegation Token for Log Aggregation Path
> ---
>
> Key: YARN-10333
> URL: https://issues.apache.org/jira/browse/YARN-10333
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10333-001.patch, YARN-10333-002.patch
>
>
> There are use cases where Yarn Log Aggregation Path is configured to a 
> FileSystem like S3 or ABFS different from what is configured in fs.defaultFS 
> (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS 
> and not for log aggregation path.
> This Jira is to improve YarnClient by obtaining delegation token for log 
> aggregation path and add it to the Credential of Container Launch Context 
> similar to how it does for Timeline Delegation Token.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path

2020-06-30 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10333:
-
Description: 
There are use cases where Yarn Log Aggregation Path is configured to a 
FileSystem like S3 or ABFS different from what is configured in fs.defaultFS 
(HDFS). Log Aggregation fails as the client has token only for fs.defaultFS and 
not for log aggregation path.

This Jira is to improve YarnClient by obtaining delegation token for log 
aggregation path and add it to the Credential of Container Launch Context 
similar to how it does for Timeline Delegation Token.

  was:
There are use cases where Yarn Log Aggregation Path is configured to a 
FileSystem like S3 or ABFS different from what is configured in fs.defaultFS 
(HDFS). Log Aggregation fails as the client has token only for fs.defaultFS and 
not for log aggregation path.

This Jira is to improve YarnClient by obtaining delegation token for log 
aggregation path and add it to the Credential of Container Launch Context 
similar to Timeline Delegation Token.


> YarnClient obtain Delegation Token for Log Aggregation Path
> ---
>
> Key: YARN-10333
> URL: https://issues.apache.org/jira/browse/YARN-10333
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10333-001.patch
>
>
> There are use cases where Yarn Log Aggregation Path is configured to a 
> FileSystem like S3 or ABFS different from what is configured in fs.defaultFS 
> (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS 
> and not for log aggregation path.
> This Jira is to improve YarnClient by obtaining delegation token for log 
> aggregation path and add it to the Credential of Container Launch Context 
> similar to how it does for Timeline Delegation Token.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path

2020-06-30 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10333:
-
Attachment: YARN-10333-001.patch

> YarnClient obtain Delegation Token for Log Aggregation Path
> ---
>
> Key: YARN-10333
> URL: https://issues.apache.org/jira/browse/YARN-10333
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10333-001.patch
>
>
> There are use cases where Yarn Log Aggregation Path is configured to a 
> FileSystem like S3 or ABFS different from what is configured in fs.defaultFS 
> (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS 
> and not for log aggregation path.
> This Jira is to improve YarnClient by obtaining delegation token for log 
> aggregation path and add it to the Credential of Container Launch Context 
> similar to Timeline Delegation Token.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path

2020-06-30 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10333:
-
Summary: YarnClient obtain Delegation Token for Log Aggregation Path  (was: 
YarnClient fetch DT for Log Aggregation Path)

> YarnClient obtain Delegation Token for Log Aggregation Path
> ---
>
> Key: YARN-10333
> URL: https://issues.apache.org/jira/browse/YARN-10333
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10333-001.patch
>
>
> There are use cases where Yarn Log Aggregation Path is configured to a 
> FileSystem like S3 or ABFS different from what is configured in fs.defaultFS 
> (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS 
> and not for log aggregation path.
> This Jira is to improve YarnClient by obtaining delegation token for log 
> aggregation path and add it to the Credential of Container Launch Context 
> similar to Timeline Delegation Token.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10333) YarnClient fetch DT for Log Aggregation Path

2020-06-30 Thread Prabhu Joseph (Jira)
Prabhu Joseph created YARN-10333:


 Summary: YarnClient fetch DT for Log Aggregation Path
 Key: YARN-10333
 URL: https://issues.apache.org/jira/browse/YARN-10333
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: log-aggregation
Affects Versions: 3.3.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


There are use cases where Yarn Log Aggregation Path is configured to a 
FileSystem like S3 or ABFS different from what is configured in fs.defaultFS 
(HDFS). Log Aggregation fails as the client has token only for fs.defaultFS and 
not for log aggregation path.

This Jira is to improve YarnClient by obtaining delegation token for log 
aggregation path and add it to the Credential of Container Launch Context 
similar to Timeline Delegation Token.
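
As an illustration of the proposed approach, below is a minimal client-side 
sketch (not the actual patch; the class name and the renewer principal are 
placeholders):

{code:java}
import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class LogAggregationTokenSketch {

  public static ContainerLaunchContext buildContext(Configuration conf,
      String rmRenewer) throws Exception {
    // The remote log aggregation root dir may live on a FileSystem
    // (S3A, ABFS, a second HDFS cluster) different from fs.defaultFS.
    Path remoteRootLogDir = new Path(conf.get(
        YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
        YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR));
    FileSystem logFs = remoteRootLogDir.getFileSystem(conf);

    // Collect a delegation token for that FileSystem into the credentials.
    Credentials credentials = new Credentials();
    logFs.addDelegationTokens(rmRenewer, credentials);

    // Serialize the credentials and attach them to the container launch
    // context, the same way other tokens are attached.
    DataOutputBuffer dob = new DataOutputBuffer();
    credentials.writeTokenStorageToStream(dob);
    ByteBuffer tokens = ByteBuffer.wrap(dob.getData(), 0, dob.getLength());

    ContainerLaunchContext clc =
        Records.newRecord(ContainerLaunchContext.class);
    clc.setTokens(tokens);
    return clc;
  }
}
{code}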



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-06-30 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148552#comment-17148552
 ] 

Prabhu Joseph commented on YARN-10319:
--

[~Tao Yang] Can you review the latest patch when you get time? Thanks.

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch
>
>
> ActivitiesManager records a call flow for a given nodeId or a last call flow. 
> This is useful when debugging the issue live where the user queries with 
> right nodeId. But capturing last N scheduler activities during the issue 
> period can help to debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-06-24 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143544#comment-17143544
 ] 

Prabhu Joseph commented on YARN-10319:
--

Thanks [~Tao Yang] for the detailed review.

Have addressed all of the above comments in [^YARN-10319-004.patch].

bq. The fetching approaches of activities and bulk-activities REST API are 
different (asynchronous or synchronous), I think we should elaborate this in 
the document.

Have updated the document with the following:

{code}
The scheduler bulk activities RESTful API can fetch scheduler activities info 
recorded for multiple scheduling cycles. This may take time to return, as it 
internally waits until it records the specified activitiesCount.
{code}
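
For reference, a rough usage sketch of the new endpoint (assumptions: the 
/ws/v1/cluster/scheduler/bulk-activities path and activitiesCount parameter 
from the documentation above, a placeholder RM host/port, and Java 11's 
HttpClient):

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class BulkActivitiesClientSketch {

  public static void main(String[] args) throws Exception {
    // Placeholder RM address; endpoint path and query parameter are taken
    // from the documentation snippet above.
    String uri = "http://rm-host:8088/ws/v1/cluster/scheduler/bulk-activities"
        + "?activitiesCount=10";

    HttpRequest request = HttpRequest.newBuilder(URI.create(uri))
        .header("Accept", "application/json")
        // The call returns only after the requested number of scheduling
        // cycles has been recorded, so allow a generous timeout.
        .timeout(Duration.ofMinutes(2))
        .GET()
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}
{code}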

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch
>
>
> ActivitiesManager records a call flow for a given nodeId or a last call flow. 
> This is useful when debugging the issue live where the user queries with 
> right nodeId. But capturing last N scheduler activities during the issue 
> period can help to debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager

2020-06-23 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10319:
-
Attachment: YARN-10319-004.patch

> Record Last N Scheduler Activities from ActivitiesManager
> -
>
> Key: YARN-10319
> URL: https://issues.apache.org/jira/browse/YARN-10319
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: activitiesmanager
> Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, 
> YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, 
> YARN-10319-004.patch
>
>
> ActivitiesManager records a call flow for a given nodeId or a last call flow. 
> This is useful when debugging the issue live where the user queries with 
> right nodeId. But capturing last N scheduler activities during the issue 
> period can help to debug the issue offline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8047) RMWebApp make external class pluggable

2020-06-23 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142799#comment-17142799
 ] 

Prabhu Joseph commented on YARN-8047:
-

[~BilwaST] Thanks for the patch. Below are some comments.

1. yarn-default.xml

   a. The description of yarn.http.rmwebapp.external.classes says "Used to 
specify custom web application pages". This does not look correct; can we 
change it to "Used to specify custom web services for ResourceManager"?

2. RMWebApp.java

   a. The code below is not necessary, as getClasses won't return null.
{code:java}
+if (externalClasses == null) {
+  return;
+}
{code}
3. RmController.java

   a. The line below needs to be removed.

     + System.out.println(schedulerName);

   b. This changes the existing behavior for custom schedulers: we need to 
show DefaultSchedulerPage.class if hadoop.http.rmwebapp.scheduler.page.class 
is not configured, and log a warning saying the custom page class was not 
found if the user has configured a class which does not exist (see the 
sketch after this list).

       + renderText("Not Found");

4. Can you also include a unit test case which shows that a custom web 
service and a custom page work fine?
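
Regarding 3.b, below is a rough sketch of the suggested fallback (illustrative 
only, not part of the patch; the config key name is taken from the current 
patch and may change):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.webapp.View;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class SchedulerPageResolver {

  private static final Logger LOG =
      LoggerFactory.getLogger(SchedulerPageResolver.class);

  // Key name as used in the current patch; may change before commit.
  private static final String SCHEDULER_PAGE_KEY =
      "hadoop.http.rmwebapp.scheduler.page.class";

  public static Class<? extends View> resolve(Configuration conf,
      Class<? extends View> defaultPage) {
    String configured = conf.get(SCHEDULER_PAGE_KEY);
    if (configured == null) {
      // Nothing configured: keep the existing behavior.
      return defaultPage;
    }
    try {
      return conf.getClassByName(configured).asSubclass(View.class);
    } catch (ClassNotFoundException e) {
      // Warn and fall back instead of rendering "Not Found".
      LOG.warn("Configured scheduler page class " + configured
          + " not found, falling back to " + defaultPage.getName());
      return defaultPage;
    }
  }
}
{code}

RmController could then call something like render(resolve(conf, 
DefaultSchedulerPage.class)) so the default page is kept when nothing custom 
is configured.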

> RMWebApp make external class pluggable
> --
>
> Key: YARN-8047
> URL: https://issues.apache.org/jira/browse/YARN-8047
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-8047-001.patch, YARN-8047-002.patch, 
> YARN-8047-003.patch, YARN-8047.004.patch, YARN-8047.005.patch
>
>
> JIra should make sure we should be able to plugin webservices and web pages 
> of scheduler in Resourcemanager
> * RMWebApp allow to bind external classes
> * RMController allow to plugin scheduler classes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10321) Break down TestUserGroupMappingPlacementRule#testMapping into test scenarios

2020-06-22 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141756#comment-17141756
 ] 

Prabhu Joseph commented on YARN-10321:
--

Thanks [~snemeth] for the patch and [~shuzirra] for the review.

The patch looks good, +1. Have committed it to trunk.

> Break down TestUserGroupMappingPlacementRule#testMapping into test scenarios
> 
>
> Key: YARN-10321
> URL: https://issues.apache.org/jira/browse/YARN-10321
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-10321.001.patch
>
>
> org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule#testMapping
>  is very large and hard to read/maintain and moreover, error-prone.
> We should break this testcase down into several separate testcases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10321) Break down TestUserGroupMappingPlacementRule#testMapping into test scenarios

2020-06-22 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10321:
-
Fix Version/s: 3.4.0

> Break down TestUserGroupMappingPlacementRule#testMapping into test scenarios
> 
>
> Key: YARN-10321
> URL: https://issues.apache.org/jira/browse/YARN-10321
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: YARN-10321.001.patch
>
>
> org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule#testMapping
>  is very large and hard to read/maintain and moreover, error-prone.
> We should break this testcase down into several separate testcases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


