[jira] [Commented] (YARN-8155) Improve ATSv2 client logging in RM and NM publisher

2018-06-14 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512147#comment-16512147
 ] 

Rohith Sharma K S commented on YARN-8155:
-

Committed to trunk/branch-3.1/branch-3.0. The branch-2 compilation is failing. 
[~abmodi], would you provide a branch-2 patch? 

> Improve ATSv2 client logging in RM and NM publisher
> ---
>
> Key: YARN-8155
> URL: https://issues.apache.org/jira/browse/YARN-8155
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-8155.001.patch, YARN-8155.002.patch, 
> YARN-8155.003.patch, YARN-8155.004.patch, YARN-8155.005.patch
>
>
> We see that NM logs are filled with larger stack trace of NotFoundException 
> if collector is removed from one of the NM and other NMs are still publishing 
> the entities.
>  
> This Jira is to improve the logging in NM so that we log with informative 
> message.
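As an aside on what an "informative message" could look like in practice, below is a minimal, hedged sketch (not the attached patch; PublisherClient and the method names are hypothetical stand-ins): catch the expected NotFoundException, log a single WARN line, and keep the full stack trace at DEBUG level only.
{code:java}
import java.io.IOException;
import javax.ws.rs.NotFoundException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical stand-in for the timeline client used by the NM publisher.
interface PublisherClient {
  void putEntityAsync(Object entity) throws IOException;
}

class EntityPublishSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(EntityPublishSketch.class);

  void publish(PublisherClient client, Object entity, String appId) {
    try {
      client.putEntityAsync(entity);
    } catch (NotFoundException e) {
      // Expected when the collector for this app was already removed:
      // log one informative line instead of the full stack trace.
      LOG.warn("Failed to publish entity for {}: collector not found ({})",
          appId, e.getMessage());
      LOG.debug("Full stack trace for failed publish", e);
    } catch (IOException e) {
      LOG.error("Error publishing entity for " + appId, e);
    }
  }
}
{code}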






[jira] [Updated] (YARN-8155) Improve ATSv2 client logging in RM and NM publisher

2018-06-14 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8155:

Summary: Improve ATSv2 client logging in RM and NM publisher  (was: Improve 
the logging in NMTimelinePublisher and TimelineCollectorWebService)

> Improve ATSv2 client logging in RM and NM publisher
> ---
>
> Key: YARN-8155
> URL: https://issues.apache.org/jira/browse/YARN-8155
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-8155.001.patch, YARN-8155.002.patch, 
> YARN-8155.003.patch, YARN-8155.004.patch, YARN-8155.005.patch
>
>
> We see that NM logs are filled with larger stack trace of NotFoundException 
> if collector is removed from one of the NM and other NMs are still publishing 
> the entities.
>  
> This Jira is to improve the logging in NM so that we log with informative 
> message.






[jira] [Updated] (YARN-8155) Improve ATSv2 client logging in RM and NM publisher

2018-06-14 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8155:

Issue Type: Improvement  (was: Bug)

> Improve ATSv2 client logging in RM and NM publisher
> ---
>
> Key: YARN-8155
> URL: https://issues.apache.org/jira/browse/YARN-8155
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rohith Sharma K S
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-8155.001.patch, YARN-8155.002.patch, 
> YARN-8155.003.patch, YARN-8155.004.patch, YARN-8155.005.patch
>
>
> We see that NM logs are filled with larger stack trace of NotFoundException 
> if collector is removed from one of the NM and other NMs are still publishing 
> the entities.
>  
> This Jira is to improve the logging in NM so that we log with informative 
> message.






[jira] [Commented] (YARN-8155) Improve the logging in NMTimelinePublisher and TimelineCollectorWebService

2018-06-14 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512030#comment-16512030
 ] 

Rohith Sharma K S commented on YARN-8155:
-

Committing shortly.. 

> Improve the logging in NMTimelinePublisher and TimelineCollectorWebService
> --
>
> Key: YARN-8155
> URL: https://issues.apache.org/jira/browse/YARN-8155
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-8155.001.patch, YARN-8155.002.patch, 
> YARN-8155.003.patch, YARN-8155.004.patch, YARN-8155.005.patch
>
>
> We see that NM logs are filled with larger stack trace of NotFoundException 
> if collector is removed from one of the NM and other NMs are still publishing 
> the entities.
>  
> This Jira is to improve the logging in NM so that we log with informative 
> message.






[jira] [Reopened] (YARN-8302) ATS v2 should handle HBase connection issue properly

2018-06-12 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S reopened YARN-8302:
-

Reopening the JIRA to discuss and figure out an alternative approach other than 
reducing the retry count.

> ATS v2 should handle HBase connection issue properly
> 
>
> Key: YARN-8302
> URL: https://issues.apache.org/jira/browse/YARN-8302
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Priority: Major
>
> ATS v2 call times out with below error when it can't connect to HBase 
> instance.
> {code}
> bash-4.2$ curl -i -k -s -1  -H 'Content-Type: application/json'  -H 'Accept: 
> application/json' --max-time 5   --negotiate -u : 
> 'https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/YARN_CONTAINER?fields=ALL&_=1526425686092'
> curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received
> {code}
> {code:title=ATS log}
> 2018-05-15 23:10:03,623 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=7, 
> retries=7, started=8165 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:13,651 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=8, 
> retries=8, started=18192 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:23,730 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=9, 
> retries=9, started=28272 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:33,788 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=10, 
> retries=10, started=38330 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1{code}
> There are two issues here.
> 1) Check why ATS can't connect to HBase.
> 2) In case of a connection error, the ATS call should not time out; it should 
> fail with a proper error.
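For reference, the "reducing retry count" approach that the reopen comment above steps back from would look roughly like the sketch below, assuming the HBase client configuration used by the timeline reader can be tuned. The property keys are standard HBase client settings; the values are purely illustrative.
{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch of the "reduce retry count" approach being reconsidered. The values
// bound how long a single HBase call can spend retrying before the timeline
// reader gives up and returns an error instead of hanging.
public class HBaseClientRetrySketch {
  public static Configuration tightenRetries(Configuration conf) {
    conf.setInt("hbase.client.retries.number", 3);           // default is much higher
    conf.setLong("hbase.client.pause", 1000L);               // ms between retries
    conf.setLong("hbase.rpc.timeout", 10000L);               // per-RPC timeout in ms
    conf.setLong("hbase.client.operation.timeout", 30000L);  // overall cap in ms
    return conf;
  }
}
{code}
The drawback is that these settings apply to every HBase read from the reader, not only the failing ones, which is presumably why an alternative approach is being sought.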






[jira] [Commented] (YARN-8411) Enable stopped system services to be started during RM start

2018-06-12 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510593#comment-16510593
 ] 

Rohith Sharma K S commented on YARN-8411:
-

Thanks [~billie.rinaldi] for the patch. It looks good to me as well. 
I have one general doubt about the start operation: does the service start identify 
whether the service is already running? In system services, the services are long 
running, and in the case of an RM restart or HA switch we are going to start the 
services by default. 

> Enable stopped system services to be started during RM start
> 
>
> Key: YARN-8411
> URL: https://issues.apache.org/jira/browse/YARN-8411
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
>Priority: Critical
> Attachments: YARN-8411.01.patch, YARN-8411.02.patch
>
>
> With YARN-8048, the RM can launch system services using the YARN service 
> framework. If the service app is in a stopped state, user intervention is 
> required to delete the service so that the RM can launch it again when the RM 
> is restarted. It would be an improvement for the RM to be able to start a 
> stopped system service.






[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773

2018-06-12 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510588#comment-16510588
 ] 

Rohith Sharma K S commented on YARN-8405:
-

bq. Could you also cherry pick to branch-2.9?
Done

> RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
> --
>
> Key: YARN-8405
> URL: https://issues.apache.org/jira/browse/YARN-8405
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.1.0
>Reporter: Rohith Sharma K S
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.4
>
> Attachments: YARN-8405.000.patch, YARN-8405.001.patch, 
> YARN-8405.002.patch, YARN-8405.003.patch
>
>
> HADOOP-14773 changes the ACL for 
> yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, 
> /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now  
> behavior changed from setting acls to parent node. As a result, parent node 
> /rmstore is set to default acl. 






[jira] [Updated] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773

2018-06-12 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8405:

Fix Version/s: 2.9.2

> RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
> --
>
> Key: YARN-8405
> URL: https://issues.apache.org/jira/browse/YARN-8405
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.1.0
>Reporter: Rohith Sharma K S
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.4
>
> Attachments: YARN-8405.000.patch, YARN-8405.001.patch, 
> YARN-8405.002.patch, YARN-8405.003.patch
>
>
> HADOOP-14773 changes the ACL for 
> yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, 
> /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now  
> behavior changed from setting acls to parent node. As a result, parent node 
> /rmstore is set to default acl. 






[jira] [Updated] (YARN-8413) Flow activity page is failing with "Timeline server failed with an error"

2018-06-12 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8413:

Fix Version/s: 3.1.1
   3.2.0

> Flow activity page is failing with "Timeline server failed with an error"
> -
>
> Key: YARN-8413
> URL: https://issues.apache.org/jira/browse/YARN-8413
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.1
>Reporter: Yesha Vora
>Assignee: Sunil Govindan
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8413.001.patch
>
>
> Flow activity page fails to load with "Timeline server failed with an error".
> This page uses the incorrect flow call 
> "https://localhost:8188/ws/v2/timeline/flows?_=1528755339836" and it is 
> failing to load.
> 1) It is using localhost instead of the ATS v2 hostname
> 2) It is using the ATS v1.5 http port instead of the ATS v2 https port
> The correct REST call is "https://<ATS v2 hostname>:<ATS v2 https port>/ws/v2/timeline/flows?_=1528755339836"
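A hedged illustration of resolving the flows endpoint from configuration rather than hardcoding the ATS v1.5 address: the property names below are the ATSv2 reader webapp address settings, but verify them against the cluster's yarn-site.xml; the snippet is a sketch only, not the attached UI patch.
{code:java}
import org.apache.hadoop.conf.Configuration;

// Build the flows URL from the ATSv2 reader address instead of hardcoding
// localhost:8188 (the ATS v1.5 address).
public class FlowsUrlSketch {
  public static String flowsUrl(Configuration conf, boolean https) {
    String addr = https
        ? conf.get("yarn.timeline-service.reader.webapp.https.address")
        : conf.get("yarn.timeline-service.reader.webapp.address");
    return (https ? "https://" : "http://") + addr + "/ws/v2/timeline/flows";
  }
}
{code}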






[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773

2018-06-12 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509513#comment-16509513
 ] 

Rohith Sharma K S commented on YARN-8405:
-

Test failures are unrelated to the patch. Committing shortly. 

> RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
> --
>
> Key: YARN-8405
> URL: https://issues.apache.org/jira/browse/YARN-8405
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.1.0
>Reporter: Rohith Sharma K S
>Assignee: Íñigo Goiri
>Priority: Major
> Attachments: YARN-8405.000.patch, YARN-8405.001.patch, 
> YARN-8405.002.patch, YARN-8405.003.patch
>
>
> HADOOP-14773 changes the ACL for 
> yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, 
> /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now  
> behavior changed from setting acls to parent node. As a result, parent node 
> /rmstore is set to default acl. 






[jira] [Commented] (YARN-8413) Flow activity page is failing with "Timeline server failed with an error"

2018-06-12 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509514#comment-16509514
 ] 

Rohith Sharma K S commented on YARN-8413:
-

+1 lgtm

> Flow activity page is failing with "Timeline server failed with an error"
> -
>
> Key: YARN-8413
> URL: https://issues.apache.org/jira/browse/YARN-8413
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.1
>Reporter: Yesha Vora
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8413.001.patch
>
>
> Flow activity page fails to load with "Timeline server failed with an error".
> This page uses the incorrect flow call 
> "https://localhost:8188/ws/v2/timeline/flows?_=1528755339836" and it is 
> failing to load.
> 1) It is using localhost instead of the ATS v2 hostname
> 2) It is using the ATS v1.5 http port instead of the ATS v2 https port
> The correct REST call is "https://<ATS v2 hostname>:<ATS v2 https port>/ws/v2/timeline/flows?_=1528755339836"






[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773

2018-06-11 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508491#comment-16508491
 ] 

Rohith Sharma K S commented on YARN-8405:
-

Ahh.. I see it :-)  Apologies for missing it. 

+1 lgtm.. pending jenkins

> RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
> --
>
> Key: YARN-8405
> URL: https://issues.apache.org/jira/browse/YARN-8405
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.1.0
>Reporter: Rohith Sharma K S
>Assignee: Íñigo Goiri
>Priority: Major
> Attachments: YARN-8405.000.patch, YARN-8405.001.patch, 
> YARN-8405.002.patch, YARN-8405.003.patch
>
>
> HADOOP-14773 changes the ACL for 
> yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, 
> /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now  
> behavior changed from setting acls to parent node. As a result, parent node 
> /rmstore is set to default acl. 






[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773

2018-06-11 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508412#comment-16508412
 ] 

Rohith Sharma K S commented on YARN-8405:
-

Thanks [~elgoiri] for the patch!
My concern with the patch is that ZKCuratorManager is a utility class that has 
already been released. Changing it would cause a problem, right? 

> RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
> --
>
> Key: YARN-8405
> URL: https://issues.apache.org/jira/browse/YARN-8405
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.1.0
>Reporter: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8405.000.patch, YARN-8405.001.patch, 
> YARN-8405.002.patch, YARN-8405.003.patch
>
>
> HADOOP-14773 changes the ACL for 
> yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, 
> /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now  
> behavior changed from setting acls to parent node. As a result, parent node 
> /rmstore is set to default acl. 






[jira] [Commented] (YARN-8303) YarnClient should contact TimelineReader for application/attempt/container report

2018-06-09 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507256#comment-16507256
 ] 

Rohith Sharma K S commented on YARN-8303:
-

All is yours:-) Go ahead..

> YarnClient should contact TimelineReader for application/attempt/container 
> report
> -
>
> Key: YARN-8303
> URL: https://issues.apache.org/jira/browse/YARN-8303
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Rohith Sharma K S
>Assignee: Abhishek Modi
>Priority: Critical
>
> YarnClient gets app/attempt/container information from the RM. If the RM doesn't 
> have it, the ahsClient is queried. When only ATSv2 is enabled, YarnClient returns 
> empty results. 
> YarnClient is used by many users, which results in empty information in the 
> app/attempt/container reports. 
> The proposal is to add an adapter in YarnClient so that app/attempt/container 
> reports can be generated via an AHSv2Client that calls the TimelineReader REST 
> API, gets the entity, and converts it into an app/attempt/container report.
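A rough sketch of the proposed adapter flow is shown below. Ahsv2ReportFetcher is a hypothetical stand-in for the client that would call the TimelineReader REST API and convert the returned entity into a report; it is not an existing API.
{code:java}
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Hypothetical fetcher that queries the TimelineReader REST API.
interface Ahsv2ReportFetcher {
  ApplicationReport fetchApplicationReport(ApplicationId appId)
      throws IOException, YarnException;
}

class ReportAdapterSketch {
  private final YarnClient rmClient;
  private final Ahsv2ReportFetcher ahsV2;

  ReportAdapterSketch(YarnClient rmClient, Ahsv2ReportFetcher ahsV2) {
    this.rmClient = rmClient;
    this.ahsV2 = ahsV2;
  }

  ApplicationReport getApplicationReport(ApplicationId appId)
      throws IOException, YarnException {
    try {
      return rmClient.getApplicationReport(appId);  // ask the RM first, as today
    } catch (ApplicationNotFoundException e) {
      // RM no longer tracks the app; fall back to ATSv2 via the TimelineReader.
      return ahsV2.fetchApplicationReport(appId);
    }
  }
}
{code}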






[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773

2018-06-08 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506794#comment-16506794
 ] 

Rohith Sharma K S commented on YARN-8405:
-

[~elgoiri] would you please update the patch with test case?

> RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
> --
>
> Key: YARN-8405
> URL: https://issues.apache.org/jira/browse/YARN-8405
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.1.0
>Reporter: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8405.000.patch, YARN-8405.001.patch, 
> YARN-8405.002.patch
>
>
> HADOOP-14773 changes the ACL for 
> yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, 
> /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now  
> behavior changed from setting acls to parent node. As a result, parent node 
> /rmstore is set to default acl. 






[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773

2018-06-08 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505866#comment-16505866
 ] 

Rohith Sharma K S commented on YARN-8405:
-

Thanks [~elgoiri] for the patch. 
# Since ZKCuratorManager is a util class, changing its public API would be an 
issue. Instead of changing the existing API, we can add a new method, 
_createRootDirRecursively(String path, List<ACL> zkAcl)_.

bq. do you have a test proposal?
You can change the permission values to verify it. Note that 
TestZKRMStateStore#testZKRootPathAcls verifies /rmstore/ZKRMStateRoot but 
not /rmstore. Below is the test code that fails without the patch. 
{code}
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
index 4cba2664d15..6c421157158 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
@@ -419,13 +419,14 @@ private static boolean verifyZKACL(String id, String scheme, int perm,
   public void testZKRootPathAcls() throws Exception {
     StateChangeRequestInfo req = new StateChangeRequestInfo(
         HAServiceProtocol.RequestSource.REQUEST_BY_USER);
-    String rootPath =
-        YarnConfiguration.DEFAULT_ZK_RM_STATE_STORE_PARENT_PATH + "/" +
-            ZKRMStateStore.ROOT_ZNODE_NAME;
+    String parentPath = YarnConfiguration.DEFAULT_ZK_RM_STATE_STORE_PARENT_PATH;
+    String rootPath = parentPath + "/" + ZKRMStateStore.ROOT_ZNODE_NAME;
 
     // Start RM with HA enabled
     Configuration conf =
         createHARMConf("rm1,rm2", "rm1", 1234, false, curatorTestingServer);
+    conf.set(YarnConfiguration.RM_ZK_ACL, "world:anyone:rwca");
+    int perm = 23; // rwca=1+2+4+16
     ResourceManager rm = new MockRM(conf);
     rm.start();
     rm.getRMContext().getRMAdminService().transitionToActive(req);
@@ -436,10 +437,16 @@ public void testZKRootPathAcls() throws Exception {
     verifyZKACL("digest", "localhost", Perms.CREATE | Perms.DELETE, acls);
     verifyZKACL(
         "world", "anyone", Perms.ALL ^ (Perms.CREATE | Perms.DELETE), acls);
+
+    acls =
+        ((ZKRMStateStore) rm.getRMContext().getStateStore()).getACL(parentPath);
+    assertEquals(1, acls.size());
+    assertEquals(perm, acls.get(0).getPerms());
     rm.close();
 
     // Now start RM with HA disabled. NoAuth Exception should not be thrown.
     conf.setBoolean(YarnConfiguration.RM_HA_ENABLED, false);
+    conf.set(YarnConfiguration.RM_ZK_ACL, YarnConfiguration.DEFAULT_RM_ZK_ACL);
     rm = new MockRM(conf);
     rm.start();
     rm.getRMContext().getRMAdminService().transitionToActive(req);
{code}

> RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
> --
>
> Key: YARN-8405
> URL: https://issues.apache.org/jira/browse/YARN-8405
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.1.0
>Reporter: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8405.000.patch, YARN-8405.001.patch, 
> YARN-8405.002.patch
>
>
> HADOOP-14773 changes the ACL for 
> yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, 
> /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now  
> behavior changed from setting acls to parent node. As a result, parent node 
> /rmstore is set to default acl. 






[jira] [Updated] (YARN-8397) Thread leak in ActivitiesManager

2018-06-07 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8397:

Summary: Thread leak in ActivitiesManager  (was: ActivitiesManager thread 
doesn't handles InterruptedException )

> Thread leak in ActivitiesManager
> 
>
> Key: YARN-8397
> URL: https://issues.apache.org/jira/browse/YARN-8397
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8397.01.patch
>
>
> It is observed that while using MiniYARNCluster, MiniYARNCluster#stop doesn't stop 
> the JVM. 
> The thread dump shows that the ActivitiesManager thread is in TIMED_WAITING state. 
> {code}
> "Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting 
> on condition [0x76f1]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
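A minimal sketch of the fix direction implied by the original summary (handle InterruptedException so the cleanup thread actually exits when the service is stopped). The class and method names are illustrative, not the actual ActivitiesManager code.
{code:java}
// Hedged sketch: the periodic cleanup thread must stop when interrupted
// (e.g. from a service stop), otherwise it keeps the JVM alive after
// MiniYARNCluster#stop.
class CleanupThreadSketch {
  private volatile boolean stopped = false;
  private Thread cleanupThread;

  void start() {
    cleanupThread = new Thread(() -> {
      while (!stopped && !Thread.currentThread().isInterrupted()) {
        try {
          // ... periodic cleanup work would go here ...
          Thread.sleep(5000L);
        } catch (InterruptedException e) {
          // Restore the interrupt flag and exit instead of looping on.
          Thread.currentThread().interrupt();
          return;
        }
      }
    }, "activities-cleanup");
    cleanupThread.setDaemon(true);
    cleanupThread.start();
  }

  void stop() {
    stopped = true;
    if (cleanupThread != null) {
      cleanupThread.interrupt();
    }
  }
}
{code}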






[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path acls has been changed since HADOOP-14773

2018-06-07 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504937#comment-16504937
 ] 

Rohith Sharma K S commented on YARN-8405:
-

HADOOP-14741 added ZKCuratorManager. The create() method below passes a null 
zkAcl, whereas in ZKRMStateStore this call should pass the zkAcl value. 
{code}
  /**
   * Create a ZNode.
   * @param path Path of the ZNode.
   * @return If the ZNode was created.
   * @throws Exception If it cannot contact Zookeeper.
   */
  public boolean create(final String path) throws Exception {
    return create(path, null);
  }
{code}

> RM zk-state-store.parent-path acls has been changed since HADOOP-14773
> --
>
> Key: YARN-8405
> URL: https://issues.apache.org/jira/browse/YARN-8405
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.1.0
>Reporter: Rohith Sharma K S
>Priority: Major
>
> HADOOP-14773 changes the ACL for 
> yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, 
> /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now  
> behavior changed from setting acls to parent node. As a result, parent node 
> /rmstore is set to default acl. 






[jira] [Updated] (YARN-8386) App log can not be viewed from Logs tab in secure cluster

2018-06-07 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8386:

Fix Version/s: 3.1.1
   3.2.0

>  App log can not be viewed from Logs tab in secure cluster
> --
>
> Key: YARN-8386
> URL: https://issues.apache.org/jira/browse/YARN-8386
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Assignee: Sunil Govindan
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8386.001.patch, YARN-8386.002.patch
>
>
> App Logs can not be viewed from UI2 logs tab.
> Steps:
> 1) Launch yarn service 
> 2) Let application finish and go to Logs tab to view AM log
> Here, service am api is failing with 401 authentication error.
> {code}
> Request URL: 
> http://xxx:8188/ws/v1/applicationhistory/containers/container_e09_1527737134553_0034_01_01/logs/serviceam.log?_=1527799590942
> Request Method: GET
> Status Code: 401 Authentication required
> Response:
> Error 401 Authentication required
> HTTP ERROR 401
> Problem accessing 
> /ws/v1/applicationhistory/containers/container_e09_1527737134553_0034_01_01/logs/serviceam.log.
> Reason: Authentication required
> {code}






[jira] [Commented] (YARN-8386) App log can not be viewed from Logs tab in secure cluster

2018-06-07 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504871#comment-16504871
 ] 

Rohith Sharma K S commented on YARN-8386:
-

Committed to trunk/branch-3.1. Cherry-picking to branch-3.0 is causing issues. 
[~sunilg], would you help cherry-pick it to branch-3.0?

I am keeping the JIRA open until we cherry-pick to branch-3.0. Is this also 
required for branch-2?

>  App log can not be viewed from Logs tab in secure cluster
> --
>
> Key: YARN-8386
> URL: https://issues.apache.org/jira/browse/YARN-8386
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Assignee: Sunil Govindan
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8386.001.patch, YARN-8386.002.patch
>
>
> App Logs can not be viewed from UI2 logs tab.
> Steps:
> 1) Launch yarn service 
> 2) Let application finish and go to Logs tab to view AM log
> Here, service am api is failing with 401 authentication error.
> {code}
> Request URL: 
> http://xxx:8188/ws/v1/applicationhistory/containers/container_e09_1527737134553_0034_01_01/logs/serviceam.log?_=1527799590942
> Request Method: GET
> Status Code: 401 Authentication required
> Response:
> Error 401 Authentication required
> HTTP ERROR 401
> Problem accessing 
> /ws/v1/applicationhistory/containers/container_e09_1527737134553_0034_01_01/logs/serviceam.log.
> Reason: Authentication required
> {code}






[jira] [Commented] (YARN-8386) App log can not be viewed from Logs tab in secure cluster

2018-06-07 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504857#comment-16504857
 ] 

Rohith Sharma K S commented on YARN-8386:
-

+1, tested the patch in secure and non-secure clusters. Committing shortly.

>  App log can not be viewed from Logs tab in secure cluster
> --
>
> Key: YARN-8386
> URL: https://issues.apache.org/jira/browse/YARN-8386
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8386.001.patch, YARN-8386.002.patch
>
>
> App Logs can not be viewed from UI2 logs tab.
> Steps:
> 1) Launch yarn service 
> 2) Let application finish and go to Logs tab to view AM log
> Here, service am api is failing with 401 authentication error.
> {code}
> Request URL: 
> http://xxx:8188/ws/v1/applicationhistory/containers/container_e09_1527737134553_0034_01_01/logs/serviceam.log?_=1527799590942
> Request Method: GET
> Status Code: 401 Authentication required
> Response:
> Error 401 Authentication required
> HTTP ERROR 401
> Problem accessing 
> /ws/v1/applicationhistory/containers/container_e09_1527737134553_0034_01_01/logs/serviceam.log.
> Reason: Authentication required
> {code}






[jira] [Updated] (YARN-8405) RM zk-state-store.parent-path acls has been changed since HADOOP-14773

2018-06-07 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8405:

Affects Version/s: 2.9.0

> RM zk-state-store.parent-path acls has been changed since HADOOP-14773
> --
>
> Key: YARN-8405
> URL: https://issues.apache.org/jira/browse/YARN-8405
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.1.0
>Reporter: Rohith Sharma K S
>Priority: Major
>
> HADOOP-14773 changes the ACL for 
> yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, 
> /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now  
> behavior changed from setting acls to parent node. As a result, parent node 
> /rmstore is set to default acl. 






[jira] [Updated] (YARN-8405) RM zk-state-store.parent-path acls has been changed since HADOOP-14773

2018-06-07 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8405:

Target Version/s: 2.10.0, 3.1.1

> RM zk-state-store.parent-path acls has been changed since HADOOP-14773
> --
>
> Key: YARN-8405
> URL: https://issues.apache.org/jira/browse/YARN-8405
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.1.0
>Reporter: Rohith Sharma K S
>Priority: Major
>
> HADOOP-14773 changes the ACL for 
> yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, 
> /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now  
> behavior changed from setting acls to parent node. As a result, parent node 
> /rmstore is set to default acl. 






[jira] [Updated] (YARN-8405) RM zk-state-store.parent-path acls has been changed since HADOOP-14773

2018-06-07 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8405:

Affects Version/s: 3.1.0

> RM zk-state-store.parent-path acls has been changed since HADOOP-14773
> --
>
> Key: YARN-8405
> URL: https://issues.apache.org/jira/browse/YARN-8405
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.1.0
>Reporter: Rohith Sharma K S
>Priority: Major
>
> HADOOP-14773 changes the ACL for 
> yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, 
> /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now  
> behavior changed from setting acls to parent node. As a result, parent node 
> /rmstore is set to default acl. 






[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path acls has been changed since HADOOP-14773

2018-06-07 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504659#comment-16504659
 ] 

Rohith Sharma K S commented on YARN-8405:
-

Below is the difference  
*Before* :
{code}
[zk: localhost:2181(CONNECTED) 0] getAcl /rmstore
'sasl,'rm
: cdrwa
{code}

*After*:
{code}
[zk: localhost:2181(CONNECTED) 1] getAcl /rmstore
'world,'anyone
: cdrwa
[zk: localhost:2181(CONNECTED) 2] getAcl /rmstore/ZKRMStateRoot
'sasl,'rm
: rwa
'digest,'ctr-e138-1518143905142-346048-01-08.test.site:C1u8x7GQW9SdBpprg1Gov7bAAf8=
: cd
{code}


The reason is that while creating the parent node recursively, ACLs are not set. 
Once the parent node is created, ACLs are set for further node creation. 
cc: [~subru] [~elgoiri]
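A hedged sketch of the _createRootDirRecursively(String path, List<ACL> zkAcl)_ idea suggested earlier in the thread: create each missing parent znode with the configured ACLs so intermediate nodes such as /rmstore do not fall back to the default ACL. It uses plain Curator calls and is illustrative only, not the committed change.
{code:java}
import java.util.List;
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.data.ACL;

class ParentPathCreateSketch {
  // Create every missing component of 'path', applying the given ACLs to
  // each znode that gets created along the way.
  static void createRootDirRecursively(CuratorFramework curator, String path,
      List<ACL> zkAcl) throws Exception {
    StringBuilder current = new StringBuilder();
    for (String part : path.split("/")) {
      if (part.isEmpty()) {
        continue;
      }
      current.append("/").append(part);
      if (curator.checkExists().forPath(current.toString()) == null) {
        curator.create()
            .withMode(CreateMode.PERSISTENT)
            .withACL(zkAcl)               // ACLs applied at every level
            .forPath(current.toString());
      }
    }
  }
}
{code}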

> RM zk-state-store.parent-path acls has been changed since HADOOP-14773
> --
>
> Key: YARN-8405
> URL: https://issues.apache.org/jira/browse/YARN-8405
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Major
>
> HADOOP-14773 changes the ACL for 
> yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, 
> /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now  
> behavior changed from setting acls to parent node. As a result, parent node 
> /rmstore is set to default acl. 






[jira] [Created] (YARN-8405) RM zk-state-store.parent-path acls has been changed since HADOOP-14773

2018-06-07 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-8405:
---

 Summary: RM zk-state-store.parent-path acls has been changed since 
HADOOP-14773
 Key: YARN-8405
 URL: https://issues.apache.org/jira/browse/YARN-8405
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Rohith Sharma K S


HADOOP-14773 changes the ACL for 
yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, 
/rmstore used set acls with yarn.resourcemanager.zk-acl value. But now  
behavior changed from setting acls to parent node. As a result, parent node 
/rmstore is set to default acl. 






[jira] [Commented] (YARN-8404) RM Event dispatcher is blocked if ATS1/1.5 server is not running.

2018-06-07 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504401#comment-16504401
 ] 

Rohith Sharma K S commented on YARN-8404:
-

cc: [~naganarasimha...@apache.org] [~jlowe] [~leftnoteasy] [~vinodkv] 
[~sjlee0] [~vrushalic] do you see any critical issues with making the 
appFinished flow asynchronous? 

> RM Event dispatcher is blocked if ATS1/1.5 server is not running. 
> --
>
> Key: YARN-8404
> URL: https://issues.apache.org/jira/browse/YARN-8404
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.0.2
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Blocker
> Attachments: YARN-8404.01.patch
>
>
> It is observed that if ATS1/1.5 daemon is not running, RM recovery is delayed 
> as long as timeline client get timed out for each applications. By default, 
> timed out will take around 5 mins. If completed applications are more then 
> amount of time RM will wait is *(number of completed applications in a 
> cluster * 5 minutes)* which is kind of hanged. 
> Primary reason for this behavior is YARN-3044 YARN-4129 which refactor 
> existing system metric publisher. This refactoring made appFinished event as 
> synchronous which was asynchronous earlier. 






[jira] [Commented] (YARN-8404) RM Event dispatcher is blocked if ATS1/1.5 server is not running.

2018-06-07 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504313#comment-16504313
 ] 

Rohith Sharma K S commented on YARN-8404:
-

Attached a patch that retains the original behavior, i.e. appFinished event 
processing is asynchronous. 
[~sunilg], please review the patch. 

> RM Event dispatcher is blocked if ATS1/1.5 server is not running. 
> --
>
> Key: YARN-8404
> URL: https://issues.apache.org/jira/browse/YARN-8404
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-8404.01.patch
>
>
> It is observed that if ATS1/1.5 daemon is not running, RM recovery is delayed 
> as long as timeline client get timed out for each applications. By default, 
> timed out will take around 5 mins. If completed applications are more then 
> amount of time RM will wait is *(number of completed applications in a 
> cluster * 5 minutes)* which is kind of hanged. 
> Primary reason for this behavior is YARN-3044 YARN-4129 which refactor 
> existing system metric publisher. This refactoring made appFinished event as 
> synchronous which was asynchronous earlier. 






[jira] [Updated] (YARN-8404) RM Event dispatcher is blocked if ATS1/1.5 server is not running.

2018-06-07 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8404:

Attachment: YARN-8404.01.patch

> RM Event dispatcher is blocked if ATS1/1.5 server is not running. 
> --
>
> Key: YARN-8404
> URL: https://issues.apache.org/jira/browse/YARN-8404
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-8404.01.patch
>
>
> It is observed that if ATS1/1.5 daemon is not running, RM recovery is delayed 
> as long as timeline client get timed out for each applications. By default, 
> timed out will take around 5 mins. If completed applications are more then 
> amount of time RM will wait is *(number of completed applications in a 
> cluster * 5 minutes)* which is kind of hanged. 
> Primary reason for this behavior is YARN-3044 YARN-4129 which refactor 
> existing system metric publisher. This refactoring made appFinished event as 
> synchronous which was asynchronous earlier. 






[jira] [Updated] (YARN-8404) RM Event dispatcher is blocked if ATS1/1.5 server is not running.

2018-06-07 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8404:

Summary: RM Event dispatcher is blocked if ATS1/1.5 server is not running.  
 (was: RM Recovery is delayed much if ATS1/1.5 server is not running. )

> RM Event dispatcher is blocked if ATS1/1.5 server is not running. 
> --
>
> Key: YARN-8404
> URL: https://issues.apache.org/jira/browse/YARN-8404
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Critical
>
> It is observed that if ATS1/1.5 daemon is not running, RM recovery is delayed 
> as long as timeline client get timed out for each applications. By default, 
> timed out will take around 5 mins. If completed applications are more then 
> amount of time RM will wait is *(number of completed applications in a 
> cluster * 5 minutes)* which is kind of hanged. 
> Primary reason for this behavior is YARN-3044 YARN-4129 which refactor 
> existing system metric publisher. This refactoring made appFinished event as 
> synchronous which was asynchronous earlier. 






[jira] [Commented] (YARN-8404) RM Recovery is delayed much if ATS1/1.5 server is not running.

2018-06-06 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504271#comment-16504271
 ] 

Rohith Sharma K S commented on YARN-8404:
-

Making TimelineServiceV1Publisher#appFinished synchronous is very dangerous. If 
the ATS1/1.5 daemon is down, the primary AsyncDispatcher thread is blocked, which 
makes the primary dispatcher event queue grow. 
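A minimal sketch of the behavior the attached patch aims to restore: publish appFinished from a dedicated thread so the primary dispatcher only enqueues work and is never blocked by a timeline client timeout. The class below is illustrative only, not the actual SystemMetricsPublisher code.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncAppFinishedPublisherSketch {
  // Single publisher thread; a slow or unreachable ATS1/1.5 only blocks here,
  // not the RM's primary AsyncDispatcher thread.
  private final ExecutorService publishExecutor =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "appFinished-publisher");
        t.setDaemon(true);
        return t;
      });

  void appFinished(Runnable publishCall) {
    // The caller (dispatcher) returns immediately; the blocking timeline PUT
    // happens on the publisher thread.
    publishExecutor.submit(publishCall);
  }

  void stop() {
    publishExecutor.shutdownNow();
  }
}
{code}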

> RM Recovery is delayed much if ATS1/1.5 server is not running. 
> ---
>
> Key: YARN-8404
> URL: https://issues.apache.org/jira/browse/YARN-8404
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Critical
>
> It is observed that if ATS1/1.5 daemon is not running, RM recovery is delayed 
> as long as timeline client get timed out for each applications. By default, 
> timed out will take around 5 mins. If completed applications are more then 
> amount of time RM will wait is *(number of completed applications in a 
> cluster * 5 minutes)* which is kind of hanged. 
> Primary reason for this behavior is YARN-3044 YARN-4129 which refactor 
> existing system metric publisher. This refactoring made appFinished event as 
> synchronous which was asynchronous earlier. 






[jira] [Moved] (YARN-8404) RM Recovery is delayed much if ATS1/1.5 server is not running.

2018-06-06 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S moved MAPREDUCE-7106 to YARN-8404:


Key: YARN-8404  (was: MAPREDUCE-7106)
Project: Hadoop YARN  (was: Hadoop Map/Reduce)

> RM Recovery is delayed much if ATS1/1.5 server is not running. 
> ---
>
> Key: YARN-8404
> URL: https://issues.apache.org/jira/browse/YARN-8404
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Critical
>
> It is observed that if ATS1/1.5 daemon is not running, RM recovery is delayed 
> as long as timeline client get timed out for each applications. By default, 
> timed out will take around 5 mins. If completed applications are more then 
> amount of time RM will wait is *(number of completed applications in a 
> cluster * 5 minutes)* which is kind of hanged. 
> Primary reason for this behavior is YARN-3044 YARN-4129 which refactor 
> existing system metric publisher. This refactoring made appFinished event as 
> synchronous which was asynchronous earlier. 






[jira] [Updated] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode

2018-06-06 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8399:

Fix Version/s: (was: 3.0.3)
   3.2.0

> NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
> --
>
> Key: YARN-8399
> URL: https://issues.apache.org/jira/browse/YARN-8399
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineservice
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8399.001.patch, YARN-8399.002.patch, 
> YARN-8399.003.patch
>
>
> Getting 403 GSS exception while accessing NM http port via curl. 
> {code:java}
> curl -k -i --negotiate -u: https://<NM host>:<NM port>/node
> HTTP/1.1 401 Authentication required
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Pragma: no-cache
> WWW-Authenticate: Negotiate
> Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly
> Cache-Control: must-revalidate,no-cache,no-store
> Content-Type: text/html;charset=iso-8859-1
> Content-Length: 264
> HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism 
> level: Request is a replay (34)){code}






[jira] [Updated] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode

2018-06-06 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8399:

Target Version/s: 2.10.0, 3.2.0, 3.1.1, 3.0.3  (was: 3.1.1)

> NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
> --
>
> Key: YARN-8399
> URL: https://issues.apache.org/jira/browse/YARN-8399
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineservice
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8399.001.patch, YARN-8399.002.patch, 
> YARN-8399.003.patch
>
>
> Getting 403 GSS exception while accessing NM http port via curl. 
> {code:java}
> curl -k -i --negotiate -u: https://<NM host>:<NM port>/node
> HTTP/1.1 401 Authentication required
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Pragma: no-cache
> WWW-Authenticate: Negotiate
> Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly
> Cache-Control: must-revalidate,no-cache,no-store
> Content-Type: text/html;charset=iso-8859-1
> Content-Length: 264
> HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism 
> level: Request is a replay (34)){code}






[jira] [Updated] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode

2018-06-06 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8399:

Fix Version/s: 3.0.3
   3.1.1

> NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
> --
>
> Key: YARN-8399
> URL: https://issues.apache.org/jira/browse/YARN-8399
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineservice
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8399.001.patch, YARN-8399.002.patch, 
> YARN-8399.003.patch
>
>
> Getting 403 GSS exception while accessing NM http port via curl. 
> {code:java}
> curl -k -i --negotiate -u: https://<NM host>:<NM port>/node
> HTTP/1.1 401 Authentication required
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Pragma: no-cache
> WWW-Authenticate: Negotiate
> Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly
> Cache-Control: must-revalidate,no-cache,no-store
> Content-Type: text/html;charset=iso-8859-1
> Content-Length: 264
> HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism 
> level: Request is a replay (34)){code}






[jira] [Commented] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode

2018-06-06 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504259#comment-16504259
 ] 

Rohith Sharma K S commented on YARN-8399:
-

I committed to trunk/branch-3.1. The patch fails to apply to branch-3.0/branch-2; I 
tried to resolve it, but the test compilation then failed. 
Since this impacts branch-3.0/branch-2 as well, I am keeping the JIRA open. 
[~sunilg], can you provide patches for branch-3.0 and branch-2?

> NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
> --
>
> Key: YARN-8399
> URL: https://issues.apache.org/jira/browse/YARN-8399
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineservice
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8399.001.patch, YARN-8399.002.patch, 
> YARN-8399.003.patch
>
>
> Getting 403 GSS exception while accessing NM http port via curl. 
> {code:java}
> curl -k -i --negotiate -u: https://<NM host>:<NM port>/node
> HTTP/1.1 401 Authentication required
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Pragma: no-cache
> WWW-Authenticate: Negotiate
> Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly
> Cache-Control: must-revalidate,no-cache,no-store
> Content-Type: text/html;charset=iso-8859-1
> Content-Length: 264
> HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism 
> level: Request is a replay (34)){code}






[jira] [Commented] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode

2018-06-06 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504228#comment-16504228
 ] 

Rohith Sharma K S commented on YARN-8399:
-

Forgot to commit yesterday; committing it now.

> NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
> --
>
> Key: YARN-8399
> URL: https://issues.apache.org/jira/browse/YARN-8399
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineservice
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8399.001.patch, YARN-8399.002.patch, 
> YARN-8399.003.patch
>
>
> Getting 403 GSS exception while accessing NM http port via curl. 
> {code:java}
> curl -k -i --negotiate -u: https://<NM host>:<NM port>/node
> HTTP/1.1 401 Authentication required
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Pragma: no-cache
> WWW-Authenticate: Negotiate
> Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly
> Cache-Control: must-revalidate,no-cache,no-store
> Content-Type: text/html;charset=iso-8859-1
> Content-Length: 264
> HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism 
> level: Request is a replay (34)){code}






[jira] [Commented] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode

2018-06-06 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503617#comment-16503617
 ] 

Rohith Sharma K S commented on YARN-8399:
-

+1 committing shortly

> NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
> --
>
> Key: YARN-8399
> URL: https://issues.apache.org/jira/browse/YARN-8399
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineservice
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8399.001.patch, YARN-8399.002.patch, 
> YARN-8399.003.patch
>
>
> Getting 403 GSS exception while accessing NM http port via curl. 
> {code:java}
> curl -k -i --negotiate -u: https://:/node
> HTTP/1.1 401 Authentication required
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Pragma: no-cache
> WWW-Authenticate: Negotiate
> Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly
> Cache-Control: must-revalidate,no-cache,no-store
> Content-Type: text/html;charset=iso-8859-1
> Content-Length: 264
> HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism 
> level: Request is a replay (34)){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8401) Yarnui2 not working with out internet connection

2018-06-06 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503604#comment-16503604
 ] 

Rohith Sharma K S commented on YARN-8401:
-

[~sunilg] I remember we faced a similar issue with the normal HTTP server startup 
of the RM as well, because the Jetty server tried to download from java.sun.com. I 
guess that was an issue with the DNS entry? 
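
For context only (this is not the YARN-8401 patch itself): the stack trace quoted below shows Xerces resolving the DTD referenced from web.xml over the network, which is why startup tries to reach java.sun.com. As a hedged, general-purpose sketch, external DTD loading can be disabled on a JAXP parser like this:
{code:java}
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hedged illustration only (not the actual YARN-8401 fix): configure a JAXP/Xerces
// parser so it does not fetch external DTDs such as the web-app DTD hosted on
// java.sun.com, making XML parsing independent of internet access.
public class OfflineXmlParsingSketch {
  public static SAXParser newOfflineParser() throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setValidating(false);
    // Xerces-specific feature: skip downloading external DTDs during parsing.
    factory.setFeature(
        "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    return factory.newSAXParser();
  }
}
{code}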

> Yarnui2 not working with out internet connection
> 
>
> Key: YARN-8401
> URL: https://issues.apache.org/jira/browse/YARN-8401
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-8401.001.patch
>
>
> {code}
> 2018-06-06 21:10:58,611 WARN org.eclipse.jetty.webapp.WebAppContext: Failed 
> startup of context 
> o.e.j.w.WebAppContext@108a46d6{/ui2,file:///opt/HA/310/install/hadoop/resourcemanager/share/hadoop/yarn/webapps/ui2/,null}
> java.net.UnknownHostException: java.sun.com
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at java.net.Socket.connect(Socket.java:538)
> at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
> at sun.net.www.http.HttpClient.(HttpClient.java:211)
> at sun.net.www.http.HttpClient.New(HttpClient.java:308)
> at sun.net.www.http.HttpClient.New(HttpClient.java:326)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
> at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1512)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)
> at 
> com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:646)
> at 
> com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1300)
> at 
> com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1267)
> at 
> com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:263)
> at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1164)
> at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1050)
> at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:964)
> at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
> at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
> at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
> at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
> at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
> at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
> at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
> at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
> at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:333)
> at org.eclipse.jetty.xml.XmlParser.parse(XmlParser.java:255)
> at org.eclipse.jetty.webapp.Descriptor.parse(Descriptor.java:54)
> at 
> org.eclipse.jetty.webapp.WebDescriptor.parse(WebDescriptor.java:207)
> at org.eclipse.jetty.webapp.MetaData.setWebXml(MetaData.java:189)
> at 
> org.eclipse.jetty.webapp.WebXmlConfiguration.preConfigure(WebXmlConfiguration.java:60)
> at 
> org.eclipse.jetty.webapp.WebAppContext.preConfigure(WebAppContext.java:485)
> at 
> org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:521)
> at 
> 

[jira] [Commented] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode

2018-06-06 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503346#comment-16503346
 ] 

Rohith Sharma K S commented on YARN-8399:
-

+1

> NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
> --
>
> Key: YARN-8399
> URL: https://issues.apache.org/jira/browse/YARN-8399
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineservice
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8399.001.patch, YARN-8399.002.patch
>
>
> Getting 403 GSS exception while accessing NM http port via curl. 
> {code:java}
> curl -k -i --negotiate -u: https://:/node
> HTTP/1.1 401 Authentication required
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Pragma: no-cache
> WWW-Authenticate: Negotiate
> Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly
> Cache-Control: must-revalidate,no-cache,no-store
> Content-Type: text/html;charset=iso-8859-1
> Content-Length: 264
> HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism 
> level: Request is a replay (34)){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6904) [ATSv2] Fix findbugs warnings

2018-06-06 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503179#comment-16503179
 ] 

Rohith Sharma K S commented on YARN-6904:
-

We can close this JIRA, as it was happening on the YARN-5355 branch, which is quite 
old. All of those changes were rebased against trunk and corrected before the branch 
was merged into trunk. 

> [ATSv2] Fix findbugs warnings
> -
>
> Key: YARN-6904
> URL: https://issues.apache.org/jira/browse/YARN-6904
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: YARN-5355
>Reporter: Rohith Sharma K S
>Assignee: Abhishek Modi
>Priority: Major
>
> Many extant findbugs warnings are reported on branch YARN-5355 
> [Jenkins|https://issues.apache.org/jira/browse/YARN-6130?focusedCommentId=16105786=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16105786]
> These need to be investigated and fixed one by one. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8155) Improve the logging in NMTimelinePublisher and TimelineCollectorWebService

2018-06-06 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502937#comment-16502937
 ] 

Rohith Sharma K S commented on YARN-8155:
-

Looks fine to me. [~vrushalic] [~haibochen] would you take a look at the 
patch? I will commit it later today if there are no more objections. 

> Improve the logging in NMTimelinePublisher and TimelineCollectorWebService
> --
>
> Key: YARN-8155
> URL: https://issues.apache.org/jira/browse/YARN-8155
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-8155.001.patch, YARN-8155.002.patch, 
> YARN-8155.003.patch, YARN-8155.004.patch
>
>
> We see that NM logs are filled with larger stack trace of NotFoundException 
> if collector is removed from one of the NM and other NMs are still publishing 
> the entities.
>  
> This Jira is to improve the logging in NM so that we log with informative 
> message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode

2018-06-05 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502364#comment-16502364
 ] 

Rohith Sharma K S commented on YARN-8399:
-

Should we make this change at the auxiliary service level?

> NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
> --
>
> Key: YARN-8399
> URL: https://issues.apache.org/jira/browse/YARN-8399
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineservice
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8399.001.patch
>
>
> Getting 403 GSS exception while accessing NM http port via curl. 
> {code:java}
> curl -k -i --negotiate -u: https://:/node
> HTTP/1.1 401 Authentication required
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Date: Tue, 05 Jun 2018 17:59:00 GMT
> Pragma: no-cache
> WWW-Authenticate: Negotiate
> Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly
> Cache-Control: must-revalidate,no-cache,no-store
> Content-Type: text/html;charset=iso-8859-1
> Content-Length: 264
> HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism 
> level: Request is a replay (34)){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8396) Click on an individual container continuously spins and doesn't load the page

2018-06-05 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501840#comment-16501840
 ] 

Rohith Sharma K S commented on YARN-8396:
-

Thanks to [~sunilg] for the patch! 

> Click on an individual container continuously spins and doesn't load the page
> -
>
> Key: YARN-8396
> URL: https://issues.apache.org/jira/browse/YARN-8396
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Charan Hebri
>Assignee: Sunil Govindan
>Priority: Blocker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: Screen Shot 2018-05-31 at 3.24.09 PM.png, 
> YARN-8396.001.patch
>
>
> For a running application, a click on an individual container leads to an 
> infinite spinner which doesn't load the corresponding page. To reproduce, 
> with a running application click:
> Nodes -> \{Node_HTTP_Address} -> List of Containers on this Node -> 
> \{Container_id}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8397) ActivitiesManager thread doesn't handles InterruptedException

2018-06-05 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501737#comment-16501737
 ] 

Rohith Sharma K S commented on YARN-8397:
-

The findbugs error is from the fair scheduler and is not related to this patch.

> ActivitiesManager thread doesn't handles InterruptedException 
> --
>
> Key: YARN-8397
> URL: https://issues.apache.org/jira/browse/YARN-8397
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8397.01.patch
>
>
> It is observed while using MiniYARNCluster, MiniYARNCluster#stop doesn't stop 
> JVM. 
> Thread dump shows that ActivitiesManager is in timed_waiting state. 
> {code}
> "Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting 
> on condition [0x76f1]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8155) Improve the logging in NMTimelinePublisher and TimelineCollectorWebService

2018-06-05 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501603#comment-16501603
 ] 

Rohith Sharma K S commented on YARN-8155:
-

TimelineServiceV2Publisher prints the exception message, which could fill up the logs 
very quickly. We should probably log the full exception only at debug level, not at 
info level; see the sketch below. 
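
A minimal sketch of what I mean, assuming an SLF4J-style logger; the class, method, and entity names here are only illustrative and are not the actual publisher code:
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hedged sketch only: keep a concise message at INFO so logs do not fill with
// stack traces, and emit the full exception only at DEBUG.
public class PublisherLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(PublisherLoggingSketch.class);

  void publish(String entityId) {
    try {
      doPutEntity(entityId);
    } catch (Exception e) {
      // Short, informative message at INFO level.
      LOG.info("Error when publishing entity {}: {}", entityId, e.getMessage());
      // Full stack trace only when DEBUG is enabled.
      LOG.debug("Stack trace for failed publish of entity {}", entityId, e);
    }
  }

  private void doPutEntity(String entityId) throws Exception {
    // Placeholder for the real putEntity call.
  }
}
{code}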

> Improve the logging in NMTimelinePublisher and TimelineCollectorWebService
> --
>
> Key: YARN-8155
> URL: https://issues.apache.org/jira/browse/YARN-8155
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-8155.001.patch, YARN-8155.002.patch, 
> YARN-8155.003.patch
>
>
> We see that NM logs are filled with larger stack trace of NotFoundException 
> if collector is removed from one of the NM and other NMs are still publishing 
> the entities.
>  
> This Jira is to improve the logging in NM so that we log with informative 
> message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8397) ActivitiesManager thread doesn't handles InterruptedException

2018-06-05 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S reassigned YARN-8397:
---

Assignee: Rohith Sharma K S

> ActivitiesManager thread doesn't handles InterruptedException 
> --
>
> Key: YARN-8397
> URL: https://issues.apache.org/jira/browse/YARN-8397
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8397.01.patch
>
>
> It is observed while using MiniYARNCluster, MiniYARNCluster#stop doesn't stop 
> JVM. 
> Thread dump shows that ActivitiesManager is in timed_waiting state. 
> {code}
> "Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting 
> on condition [0x76f1]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8397) ActivitiesManager thread doesn't handles InterruptedException

2018-06-05 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501582#comment-16501582
 ] 

Rohith Sharma K S commented on YARN-8397:
-

There were two issues fixed in this patch: 
# ActivitiesManager doesn't handle InterruptedException. 
# Since ActivitiesManager was not stopped anywhere, in an RM HA scenario there was 
one thread leak on every RM switch.

> ActivitiesManager thread doesn't handles InterruptedException 
> --
>
> Key: YARN-8397
> URL: https://issues.apache.org/jira/browse/YARN-8397
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8397.01.patch
>
>
> It is observed while using MiniYARNCluster, MiniYARNCluster#stop doesn't stop 
> JVM. 
> Thread dump shows that ActivitiesManager is in timed_waiting state. 
> {code}
> "Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting 
> on condition [0x76f1]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8397) ActivitiesManager thread doesn't handles InterruptedException

2018-06-05 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8397:

Attachment: YARN-8397.01.patch

> ActivitiesManager thread doesn't handles InterruptedException 
> --
>
> Key: YARN-8397
> URL: https://issues.apache.org/jira/browse/YARN-8397
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8397.01.patch
>
>
> It is observed while using MiniYARNCluster, MiniYARNCluster#stop doesn't stop 
> JVM. 
> Thread dump shows that ActivitiesManager is in timed_waiting state. 
> {code}
> "Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting 
> on condition [0x76f1]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8397) ActivitiesManager thread doesn't handles InterruptedException

2018-06-05 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-8397:
---

 Summary: ActivitiesManager thread doesn't handles 
InterruptedException 
 Key: YARN-8397
 URL: https://issues.apache.org/jira/browse/YARN-8397
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Rohith Sharma K S


It is observed while using MiniYARNCluster, MiniYARNCluster#stop doesn't stop 
JVM. 
Thread dump shows that ActivitiesManager is in timed_waiting state. 
{code}
"Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting on 
condition [0x76f1]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142)
at java.lang.Thread.run(Thread.java:748)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8397) ActivitiesManager thread doesn't handles InterruptedException

2018-06-05 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501568#comment-16501568
 ] 

Rohith Sharma K S commented on YARN-8397:
-

This is because ActivitiesManager#serviceStop interrupts the thread and returns, but 
the exception handling in the run method ignores the exception and continues the 
loop (a corrected sketch follows the snippet below).
{code}
    cleanUpThread = new Thread(new Runnable() {
  @Override
  public void run() {
    while (true) {
  // some code
  try {
    Thread.sleep(5000);
  } catch (Exception e) {
    // ignore
  }
    }
  }
    });
{code}
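
For illustration, a minimal sketch of how the loop could exit on interrupt, assuming serviceStop interrupts the thread; this is only a sketch of the idea, not the exact patch, and the real code has additional work where the "// some code" placeholder appears:
{code:java}
    // Hedged sketch: exit the cleanup loop on interrupt instead of swallowing it.
    cleanUpThread = new Thread(new Runnable() {
      @Override
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          // some code
          try {
            Thread.sleep(5000);
          } catch (InterruptedException e) {
            // Restore the interrupt flag and stop the loop so the JVM can exit.
            Thread.currentThread().interrupt();
            return;
          }
        }
      }
    });
{code}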

> ActivitiesManager thread doesn't handles InterruptedException 
> --
>
> Key: YARN-8397
> URL: https://issues.apache.org/jira/browse/YARN-8397
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Major
>
> It is observed while using MiniYARNCluster, MiniYARNCluster#stop doesn't stop 
> JVM. 
> Thread dump shows that ActivitiesManager is in timed_waiting state. 
> {code}
> "Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting 
> on condition [0x76f1]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8155) Improve the logging in NMTimelinePublisher and TimelineCollectorWebService

2018-06-04 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499958#comment-16499958
 ] 

Rohith Sharma K S commented on YARN-8155:
-

Thanks Abhishek Modi for the patch; it looks reasonable to me. Would you add a 
similar change in TimelineServiceV2Publisher as well?

TimelineCollectorWebService:
# Catching NotFoundException and converting it into a WebApplicationException 
changes the return code. We should still retain the NOT_FOUND return code, right? 
(A hedged sketch of what I mean follows below.)

[~vrushalic] [~haibo.chen] would you take a look at this patch please? 
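
A rough sketch of the point about the return code; the lookup method, logger style, and exception types are assumptions for illustration and may differ from the actual TimelineCollectorWebService code:
{code:java}
// Hedged sketch: log a short warning without the stack trace, but still surface
// the missing-collector case as 404 to the client rather than a generic 500.
try {
  collector = getCollector(appId); // illustrative lookup; real method name differs
} catch (NotFoundException e) {
  LOG.warn("Application {} not found; its collector may already have been removed.",
      appId);
  // Keep the NOT_FOUND status code for the client.
  throw new WebApplicationException(Response.Status.NOT_FOUND.getStatusCode());
}
{code}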

> Improve the logging in NMTimelinePublisher and TimelineCollectorWebService
> --
>
> Key: YARN-8155
> URL: https://issues.apache.org/jira/browse/YARN-8155
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-8155.001.patch, YARN-8155.002.patch
>
>
> We see that NM logs are filled with larger stack trace of NotFoundException 
> if collector is removed from one of the NM and other NMs are still publishing 
> the entities.
>  
> This Jira is to improve the logging in NM so that we log with informative 
> message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8319) More YARN pages need to honor yarn.resourcemanager.display.per-user-apps

2018-06-01 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498857#comment-16498857
 ] 

Rohith Sharma K S commented on YARN-8319:
-

[~sunilg] if it is applicable, please feel free to backport this to 
branch-2 as well. 

> More YARN pages need to honor yarn.resourcemanager.display.per-user-apps
> 
>
> Key: YARN-8319
> URL: https://issues.apache.org/jira/browse/YARN-8319
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Sunil Govindan
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8319.001.patch, YARN-8319.002.patch, 
> YARN-8319.003.patch, YARN-8319.addendum.001.patch
>
>
> When this config is on
>  - Per queue page on UI2 should filter app list by user
>  -- TODO: Verify the same with UI1 Per-queue page
>  - ATSv2 with UI2 should filter list of all users' flows and flow activities
>  - Per Node pages
>  -- Listing of apps and containers on a per-node basis should filter apps and 
> containers by user.
> To this end, because this is no longer just for resourcemanager, we should 
> also deprecate {{yarn.resourcemanager.display.per-user-apps}} in favor of 
> {{yarn.webapp.filter-app-list-by-user}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8372) ApplicationAttemptNotFoundException should be handled correctly by Distributed Shell App Master

2018-06-01 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497823#comment-16497823
 ] 

Rohith Sharma K S commented on YARN-8372:
-

Thanks [~suma.shivaprasad] for the patch. The approach looks good to me. Would you 
look at the test and checkstyle errors?

> ApplicationAttemptNotFoundException should be handled correctly by 
> Distributed Shell App Master
> ---
>
> Key: YARN-8372
> URL: https://issues.apache.org/jira/browse/YARN-8372
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell
>Reporter: Charan Hebri
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-8372.1.patch, YARN-8372.2.patch
>
>
> {noformat}
> try {
>   response = client.allocate(progress);
> } catch (ApplicationAttemptNotFoundException e) {
> handler.onShutdownRequest();
> LOG.info("Shutdown requested. Stopping callback.");
> return;{noformat}
> is a code snippet from AMRMClientAsyncImpl. The corresponding 
> onShutdownRequest call for the Distributed Shell App master,
> {noformat}
> @Override
> public void onShutdownRequest() {
>   done = true;
> }{noformat}
> Due to the above change, the current behavior is that whenever an application 
> attempt fails due to a NM restart (NM where the DS AM is running), an 
> ApplicationAttemptNotFoundException is thrown and all containers for that 
> attempt including the ones that are running on other NMs are killed by the AM 
> and marked as COMPLETE. The subsequent attempt spawns new containers just 
> like a new attempt. This behavior is different to a Map Reduce application 
> where the containers are not killed.
> cc [~rohithsharma]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError

2018-05-31 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8383:

Target Version/s: 2.8.5

I see this behavior only in 2.8.4, not in the 2.9.x release line. It occurs every 
time; just download 2.8.4 and configure the timeline service 1.5 properties. 

The JsonFactory class comes from jackson-core-x.x.x.jar. In hadoop-2.8.4 this jar is 
not found, but in hadoop-2.9.x I see the jar available under hdfs/lib.

*In hadoop-2.8.4:* Though tools/lib has jackson-core-2.2.3.jar, it won't be 
loaded at daemon start.
{code:java}
HW12723:hadoop-2.8.4 rsharmaks$ find ./ -iname "jackson-core-*.jar"
.//share/hadoop/common/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/tools/lib/jackson-core-2.2.3.jar
.//share/hadoop/tools/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar
{code}
*In hadoop-2.9.x:* Observe that jackson-core-2.7.8.jar is present in hdfs/lib, 
which is loaded for the timeline server. Though tools/lib also has this jar, that 
copy won't be loaded at daemon start. 
{code:java}
HW12723:hadoop-2.9.0 rsharmaks$ find ./ -iname "jackson-core-*.jar"
.//share/hadoop/common/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/hdfs/lib/jackson-core-2.7.8.jar
.//share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-2.7.8.jar
.//share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/tools/lib/jackson-core-2.7.8.jar
.//share/hadoop/tools/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar

HW12723:hadoop-2.9.0 rsharmaks$ cd ../hadoop-2.9.1
HW12723:hadoop-2.9.1 rsharmaks$ find ./ -iname "jackson-core-*.jar"
.//share/hadoop/common/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/hdfs/lib/jackson-core-2.7.8.jar
.//share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-2.7.8.jar
.//share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/tools/lib/jackson-core-2.7.8.jar
.//share/hadoop/tools/lib/jackson-core-asl-1.9.13.jar
.//share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar
{code}

I couldn't get the 2.8.3 release artifact to verify it. 
cc [~jlowe]

> TimelineServer 1.5 start fails with NoClassDefFoundError
> 
>
> Key: YARN-8383
> URL: https://issues.apache.org/jira/browse/YARN-8383
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.4
>Reporter: Rohith Sharma K S
>Priority: Blocker
>
> TimelineServer 1.5 start fails with NoClassDefFoundError.
> {noformat}
> 2018-05-31 22:10:58,548 FATAL 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
>  Error starting ApplicationHistoryServer
> java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory
>   at 
> org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306)
>   at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271)
>   at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
>   at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393)
>   at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239)
>   at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180)

[jira] [Created] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError

2018-05-31 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-8383:
---

 Summary: TimelineServer 1.5 start fails with NoClassDefFoundError
 Key: YARN-8383
 URL: https://issues.apache.org/jira/browse/YARN-8383
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.4
Reporter: Rohith Sharma K S


TimelineServer 1.5 start fails with NoClassDefFoundError.
{noformat}
2018-05-31 22:10:58,548 FATAL 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer:
 Error starting ApplicationHistoryServer
java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory
at 
org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306)
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239)
at 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180)
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:190)
Caused by: java.lang.ClassNotFoundException: 
com.fasterxml.jackson.core.JsonFactory
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 15 more

{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8368) yarn app start cli should print applicationId

2018-05-31 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496188#comment-16496188
 ] 

Rohith Sharma K S commented on YARN-8368:
-

Thanks [~billie.rinaldi] for reviewing and committing the patch.

> yarn app start cli should print applicationId
> -
>
> Key: YARN-8368
> URL: https://issues.apache.org/jira/browse/YARN-8368
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Rohith Sharma K S
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8368.01.patch, YARN-8368.02.patch
>
>
> yarn app start cli should print the application Id similar to yarn launch cmd.
> {code:java}
> bash-4.2$ yarn app -start hbase-app-test
> WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of 
> YARN_LOGFILE.
> WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of 
> YARN_PID_DIR.
> 18/05/24 15:15:53 INFO client.RMProxy: Connecting to ResourceManager at 
> xxx/xxx:8050
> 18/05/24 15:15:54 INFO client.RMProxy: Connecting to ResourceManager at 
> xxx/xxx:8050
> 18/05/24 15:15:55 INFO client.ApiServiceClient: Service hbase-app-test is 
> successfully started.{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8380) Support shared mounts in docker runtime

2018-05-31 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496185#comment-16496185
 ] 

Rohith Sharma K S commented on YARN-8380:
-

Just FYI: whenever a docker volume is mounted with the shared option, I see an error 
like *_docker: Error response from daemon: linux mounts: Could not find source mount 
of /var/lib/kubelet_*. Do you know any reason for this? 

> Support shared mounts in docker runtime
> ---
>
> Key: YARN-8380
> URL: https://issues.apache.org/jira/browse/YARN-8380
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
>Priority: Major
>
> The docker run command supports the mount type shared, but currently we are 
> only supporting ro and rw mount types in the docker runtime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8372) ApplicationAttemptNotFoundException should be handled correctly by Distributed Shell App Master

2018-05-29 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494690#comment-16494690
 ] 

Rohith Sharma K S commented on YARN-8372:
-

The DS app master should handle the shutdown request properly, deciding whether to 
clean up or not based on the attempt-number check (a minimal sketch follows below). 
The current behavior cleans up all the running containers!
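
A rough, hedged sketch of the idea only; the fields appAttemptID, maxAppAttempts, keepContainers, done, and LOG are assumptions for illustration and this is not the committed fix:
{code:java}
@Override
public void onShutdownRequest() {
  // Hedged sketch: only treat the shutdown request as terminal clean-up when no
  // further attempt can take over; otherwise leave the running containers alone.
  if (keepContainers && appAttemptID.getAttemptId() < maxAppAttempts) {
    LOG.info("Shutdown requested; keeping running containers for the next attempt.");
  } else {
    done = true;
  }
}
{code}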

> ApplicationAttemptNotFoundException should be handled correctly by 
> Distributed Shell App Master
> ---
>
> Key: YARN-8372
> URL: https://issues.apache.org/jira/browse/YARN-8372
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell
>Reporter: Charan Hebri
>Priority: Major
>
> {noformat}
> try {
>   response = client.allocate(progress);
> } catch (ApplicationAttemptNotFoundException e) {
> handler.onShutdownRequest();
> LOG.info("Shutdown requested. Stopping callback.");
> return;{noformat}
> is a code snippet from AMRMClientAsyncImpl. The corresponding 
> onShutdownRequest call for the Distributed Shell App master,
> {noformat}
> @Override
> public void onShutdownRequest() {
>   done = true;
> }{noformat}
> Due to the above change, the current behavior is that whenever an application 
> attempt fails due to a NM restart (NM where the DS AM is running), an 
> ApplicationAttemptNotFoundException is thrown and all containers for that 
> attempt including the ones that are running on other NMs are killed by the AM 
> and marked as COMPLETE. The subsequent attempt spawns new containers just 
> like a new attempt. This behavior is different to a Map Reduce application 
> where the containers are not killed.
> cc [~rohithsharma]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8369) Javadoc build failed due to "bad use of '>'"

2018-05-28 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492584#comment-16492584
 ] 

Rohith Sharma K S commented on YARN-8369:
-

On compiling the javadoc further, it also fails in CapacitySchedulerPreemptionUtils; 
we need to fix this too (a javadoc-safe example follows the error output below).
{code:java}
[ERROR] 
/opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/CapacitySchedulerPreemptionUtils.java:139:
 error: malformed HTML
[ERROR]*stop preempt container when any major resource type <= 
0 for to-
[ERROR] ^
[ERROR] 
/opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/CapacitySchedulerPreemptionUtils.java:143:
 error: malformed HTML
[ERROR]*stop preempt container when: all major resource type <= 
0 for
{code}
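
For reference, the usual way to make such comments javadoc-safe is to escape or wrap the comparison operators; a small illustrative example (not the actual CapacitySchedulerPreemptionUtils text):
{code:java}
/**
 * Hedged illustration only: wrap raw comparison operators in {@literal ...} or
 * write them as HTML entities so the javadoc tool does not report malformed HTML.
 *
 * Stop preempting a container when any major resource type is {@literal <=} 0,
 * and stop when all major resource types are &lt;= 0.
 */
void preemptionJavadocSketch() {
}
{code}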

> Javadoc build failed due to "bad use of '>'"
> 
>
> Key: YARN-8369
> URL: https://issues.apache.org/jira/browse/YARN-8369
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: build, docs
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
> Attachments: YARN-8369.1.patch
>
>
> {noformat}
> $ mvn javadoc:javadoc --projects 
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common
> ...
> [ERROR] 
> /hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:263:
>  error: bad use of '>'
> [ERROR]* included) has a >0 value.
> [ERROR]  ^
> [ERROR] 
> /hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:266:
>  error: bad use of '>'
> [ERROR]* @return returns true if any resource is >0
> [ERROR]  ^
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8371) Javadoc error in ResourceCalculator

2018-05-28 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S resolved YARN-8371.
-
Resolution: Duplicate

> Javadoc error in ResourceCalculator
> ---
>
> Key: YARN-8371
> URL: https://issues.apache.org/jira/browse/YARN-8371
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
>
> Hadoop package build fails with java doc error
> {code}
> [ERROR] 
> /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:263:
>  error: bad use of '>'
> [ERROR]* included) has a >0 value.
> [ERROR]  ^
> [ERROR] 
> /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:266:
>  error: bad use of '>'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8292) Fix the dominant resource preemption cannot happen when some of the resource vector becomes negative

2018-05-28 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492565#comment-16492565
 ] 

Rohith Sharma K S commented on YARN-8292:
-

This causes java doc errors in trunk and branch-3.1. See YARN-8371

> Fix the dominant resource preemption cannot happen when some of the resource 
> vector becomes negative
> 
>
> Key: YARN-8292
> URL: https://issues.apache.org/jira/browse/YARN-8292
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8292.001.patch, YARN-8292.002.patch, 
> YARN-8292.003.patch, YARN-8292.004.patch, YARN-8292.005.patch, 
> YARN-8292.006.patch, YARN-8292.007.patch, YARN-8292.008.patch, 
> YARN-8292.009.patch
>
>
> This is an example of the problem: 
>   
> {code}
> //   guaranteed,  max,used,   pending
> "root(=[30:18:6  30:18:6 12:12:6 1:1:1]);" + //root
> "-a(=[10:6:2 10:6:2  6:6:3   0:0:0]);" + // a
> "-b(=[10:6:2 10:6:2  6:6:3   0:0:0]);" + // b
> "-c(=[10:6:2 10:6:2  0:0:0   1:1:1])"; // c
> {code}
> There're 3 resource types. Total resource of the cluster is 30:18:6
> For both of a/b, there're 3 containers running, each of container is 2:2:1.
> Queue c uses 0 resource, and have 1:1:1 pending resource.
> Under existing logic, preemption cannot happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8371) Javadoc error in ResourceCalculator

2018-05-28 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8371:

Target Version/s: 3.2.0, 3.1.1

> Javadoc error in ResourceCalculator
> ---
>
> Key: YARN-8371
> URL: https://issues.apache.org/jira/browse/YARN-8371
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
>
> Hadoop package build fails with java doc error
> {code}
> [ERROR] 
> /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:263:
>  error: bad use of '>'
> [ERROR]* included) has a >0 value.
> [ERROR]  ^
> [ERROR] 
> /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:266:
>  error: bad use of '>'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8371) Javadoc error in ResourceCalculator

2018-05-28 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8371:

Priority: Critical  (was: Major)

> Javadoc error in ResourceCalculator
> ---
>
> Key: YARN-8371
> URL: https://issues.apache.org/jira/browse/YARN-8371
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
>
> Hadoop package build fails with java doc error
> {code}
> [ERROR] 
> /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:263:
>  error: bad use of '>'
> [ERROR]* included) has a >0 value.
> [ERROR]  ^
> [ERROR] 
> /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:266:
>  error: bad use of '>'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8371) Javadoc error in ResourceCalculator

2018-05-28 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-8371:
---

 Summary: Javadoc error in ResourceCalculator
 Key: YARN-8371
 URL: https://issues.apache.org/jira/browse/YARN-8371
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Rohith Sharma K S


Hadoop package build fails with java doc error
{code}
[ERROR] 
/opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:263:
 error: bad use of '>'
[ERROR]* included) has a >0 value.
[ERROR]  ^
[ERROR] 
/opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:266:
 error: bad use of '>'
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8068) Application Priority field causes NPE in app timeline publish when Hadoop 2.7 based clients to 2.8+

2018-05-28 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8068:

Issue Type: Sub-task  (was: Bug)
Parent: YARN-8347

> Application Priority field causes NPE in app timeline publish when Hadoop 2.7 
> based clients to 2.8+
> ---
>
> Key: YARN-8068
> URL: https://issues.apache.org/jira/browse/YARN-8068
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.8.3
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Blocker
> Fix For: 3.1.0, 2.10.0, 2.9.2, 3.0.3
>
> Attachments: YARN-8068.001.patch
>
>
> TimelineServiceV1Publisher#appCreated will cause an NPE because we use it like below
> {code:java}
> entityInfo.put(ApplicationMetricsConstants.APPLICATION_PRIORITY_INFO, 
> app.getApplicationPriority().getPriority());{code}
> We have to handle this case during recovery.
>  
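
A minimal sketch of the kind of null guard meant above; this is hedged and not necessarily the exact change in the attached patch:
{code:java}
// Hedged sketch: guard against a null priority coming from 2.7-based clients or
// from recovered applications. Constant and method names follow the snippet above
// and may differ from the committed patch.
Priority priority = app.getApplicationPriority();
entityInfo.put(ApplicationMetricsConstants.APPLICATION_PRIORITY_INFO,
    priority == null ? 0 : priority.getPriority());
{code}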



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8368) yarn app start cli should print applicationId

2018-05-28 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8368:

Attachment: YARN-8368.02.patch

> yarn app start cli should print applicationId
> -
>
> Key: YARN-8368
> URL: https://issues.apache.org/jira/browse/YARN-8368
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-8368.01.patch, YARN-8368.02.patch
>
>
> yarn app start cli should print the application Id similar to yarn launch cmd.
> {code:java}
> bash-4.2$ yarn app -start hbase-app-test
> WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of 
> YARN_LOGFILE.
> WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of 
> YARN_PID_DIR.
> 18/05/24 15:15:53 INFO client.RMProxy: Connecting to ResourceManager at 
> xxx/xxx:8050
> 18/05/24 15:15:54 INFO client.RMProxy: Connecting to ResourceManager at 
> xxx/xxx:8050
> 18/05/24 15:15:55 INFO client.ApiServiceClient: Service hbase-app-test is 
> successfully started.{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8368) yarn app start cli should print applicationId

2018-05-27 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8368:

Attachment: YARN-8368.01.patch

> yarn app start cli should print applicationId
> -
>
> Key: YARN-8368
> URL: https://issues.apache.org/jira/browse/YARN-8368
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-8368.01.patch
>
>
> yarn app start cli should print the application Id similar to yarn launch cmd.
> {code:java}
> bash-4.2$ yarn app -start hbase-app-test
> WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of 
> YARN_LOGFILE.
> WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of 
> YARN_PID_DIR.
> 18/05/24 15:15:53 INFO client.RMProxy: Connecting to ResourceManager at 
> xxx/xxx:8050
> 18/05/24 15:15:54 INFO client.RMProxy: Connecting to ResourceManager at 
> xxx/xxx:8050
> 18/05/24 15:15:55 INFO client.ApiServiceClient: Service hbase-app-test is 
> successfully started.{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8368) yarn app start cli should print applicationId

2018-05-27 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S reassigned YARN-8368:
---

Assignee: Rohith Sharma K S

> yarn app start cli should print applicationId
> -
>
> Key: YARN-8368
> URL: https://issues.apache.org/jira/browse/YARN-8368
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-8368.01.patch
>
>
> yarn app start cli should print the application Id similar to yarn launch cmd.
> {code:java}
> bash-4.2$ yarn app -start hbase-app-test
> WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of 
> YARN_LOGFILE.
> WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of 
> YARN_PID_DIR.
> 18/05/24 15:15:53 INFO client.RMProxy: Connecting to ResourceManager at 
> xxx/xxx:8050
> 18/05/24 15:15:54 INFO client.RMProxy: Connecting to ResourceManager at 
> xxx/xxx:8050
> 18/05/24 15:15:55 INFO client.ApiServiceClient: Service hbase-app-test is 
> successfully started.{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8346) Upgrading to 3.1 kills running containers with error "Opportunistic container queue is full"

2018-05-24 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489047#comment-16489047
 ] 

Rohith Sharma K S commented on YARN-8346:
-

Backported to 2.9 as well. 

> Upgrading to 3.1 kills running containers with error "Opportunistic container 
> queue is full"
> 
>
> Key: YARN-8346
> URL: https://issues.apache.org/jira/browse/YARN-8346
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0, 3.0.2
>Reporter: Rohith Sharma K S
>Assignee: Jason Lowe
>Priority: Blocker
> Fix For: 3.1.0, 2.10.0, 3.2.0, 2.9.2, 3.0.3
>
> Attachments: YARN-8346.001.patch
>
>
> It is observed while rolling upgrade from 2.8.4 to 3.1 release, all the 
> running containers are killed and second attempt is launched for that 
> application. The diagnostics message is "Opportunistic container queue is 
> full" which is the reason for container killed. 
> In NM log, I see below logs for after container is recovered.
> {noformat}
> 2018-05-23 17:18:50,655 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler:
>  Opportunistic container [container_e06_1527075664705_0001_01_01] will 
> not be queued at the NMsince max queue length [0] has been reached
> {noformat}
> Following steps are executed for rolling upgrade
> # Install 2.8.4 cluster and launch a MR job with distributed cache enabled.
> # Stop 2.8.4 RM. Start 3.1.0 RM with same configuration.
> # Stop 2.8.4 NM batch by batch. Start 3.1.0 NM batch by batch. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8346) Upgrading to 3.1 kills running containers with error "Opportunistic container queue is full"

2018-05-24 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8346:

Fix Version/s: 2.9.2

> Upgrading to 3.1 kills running containers with error "Opportunistic container 
> queue is full"
> 
>
> Key: YARN-8346
> URL: https://issues.apache.org/jira/browse/YARN-8346
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0, 3.0.2
>Reporter: Rohith Sharma K S
>Assignee: Jason Lowe
>Priority: Blocker
> Fix For: 3.1.0, 2.10.0, 3.2.0, 2.9.2, 3.0.3
>
> Attachments: YARN-8346.001.patch
>
>
> It is observed while rolling upgrade from 2.8.4 to 3.1 release, all the 
> running containers are killed and second attempt is launched for that 
> application. The diagnostics message is "Opportunistic container queue is 
> full" which is the reason for container killed. 
> In NM log, I see below logs for after container is recovered.
> {noformat}
> 2018-05-23 17:18:50,655 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler:
>  Opportunistic container [container_e06_1527075664705_0001_01_01] will 
> not be queued at the NMsince max queue length [0] has been reached
> {noformat}
> Following steps are executed for rolling upgrade
> # Install 2.8.4 cluster and launch a MR job with distributed cache enabled.
> # Stop 2.8.4 RM. Start 3.1.0 RM with same configuration.
> # Stop 2.8.4 NM batch by batch. Start 3.1.0 NM batch by batch. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8319) More YARN pages need to honor yarn.resourcemanager.display.per-user-apps

2018-05-24 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488667#comment-16488667
 ] 

Rohith Sharma K S commented on YARN-8319:
-

+1, will commit it shortly

> More YARN pages need to honor yarn.resourcemanager.display.per-user-apps
> 
>
> Key: YARN-8319
> URL: https://issues.apache.org/jira/browse/YARN-8319
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8319.001.patch, YARN-8319.002.patch, 
> YARN-8319.003.patch
>
>
> When this config is on
>  - Per queue page on UI2 should filter app list by user
>  -- TODO: Verify the same with UI1 Per-queue page
>  - ATSv2 with UI2 should filter list of all users' flows and flow activities
>  - Per Node pages
>  -- Listing of apps and containers on a per-node basis should filter apps and 
> containers by user.
> To this end, because this is no longer just for resourcemanager, we should 
> also deprecate {{yarn.resourcemanager.display.per-user-apps}} in favor of 
> {{yarn.webapp.filter-app-list-by-user}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8346) Upgrading to 3.1 kills running containers with error "Opportunistic container queue is full"

2018-05-24 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488526#comment-16488526
 ] 

Rohith Sharma K S commented on YARN-8346:
-

committing shortly

> Upgrading to 3.1 kills running containers with error "Opportunistic container 
> queue is full"
> 
>
> Key: YARN-8346
> URL: https://issues.apache.org/jira/browse/YARN-8346
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0, 3.0.2
>Reporter: Rohith Sharma K S
>Assignee: Jason Lowe
>Priority: Blocker
> Attachments: YARN-8346.001.patch
>
>
> It is observed while rolling upgrade from 2.8.4 to 3.1 release, all the 
> running containers are killed and second attempt is launched for that 
> application. The diagnostics message is "Opportunistic container queue is 
> full" which is the reason for container killed. 
> In NM log, I see below logs for after container is recovered.
> {noformat}
> 2018-05-23 17:18:50,655 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler:
>  Opportunistic container [container_e06_1527075664705_0001_01_01] will 
> not be queued at the NMsince max queue length [0] has been reached
> {noformat}
> Following steps are executed for rolling upgrade
> # Install 2.8.4 cluster and launch a MR job with distributed cache enabled.
> # Stop 2.8.4 RM. Start 3.1.0 RM with same configuration.
> # Stop 2.8.4 NM batch by batch. Start 3.1.0 NM batch by batch. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8346) Upgrading to 3.1 kills running containers with error "Opportunistic container queue is full"

2018-05-23 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488439#comment-16488439
 ] 

Rohith Sharma K S commented on YARN-8346:
-

Thanks [~jlowe] for the quick turnaround. I verified the patch in a cluster and it is working as expected.

I am +1 for the patch.

> Upgrading to 3.1 kills running containers with error "Opportunistic container 
> queue is full"
> 
>
> Key: YARN-8346
> URL: https://issues.apache.org/jira/browse/YARN-8346
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0, 3.0.2
>Reporter: Rohith Sharma K S
>Assignee: Jason Lowe
>Priority: Blocker
> Attachments: YARN-8346.001.patch
>
>
> It is observed while rolling upgrade from 2.8.4 to 3.1 release, all the 
> running containers are killed and second attempt is launched for that 
> application. The diagnostics message is "Opportunistic container queue is 
> full" which is the reason for container killed. 
> In NM log, I see below logs for after container is recovered.
> {noformat}
> 2018-05-23 17:18:50,655 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler:
>  Opportunistic container [container_e06_1527075664705_0001_01_01] will 
> not be queued at the NMsince max queue length [0] has been reached
> {noformat}
> Following steps are executed for rolling upgrade
> # Install 2.8.4 cluster and launch a MR job with distributed cache enabled.
> # Stop 2.8.4 RM. Start 3.1.0 RM with same configuration.
> # Stop 2.8.4 NM batch by batch. Start 3.1.0 NM batch by batch. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8346) Upgrading to 3.1 kills running containers with error "Opportunistic container queue is full"

2018-05-23 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487175#comment-16487175
 ] 

Rohith Sharma K S commented on YARN-8346:
-

In ContainerScheduler#enqueueContainer, the execution type is not set for a container recovered from 2.8.4, so the container falls into the else branch with a queue length of zero. This sends a kill event for the container, with the result that running containers are killed.

{code}
private boolean enqueueContainer(Container container) {
  boolean isGuaranteedContainer = container.getContainerTokenIdentifier().
      getExecutionType() == ExecutionType.GUARANTEED;

  boolean isQueued;
  if (isGuaranteedContainer) {
    queuedGuaranteedContainers.put(container.getContainerId(), container);
    isQueued = true;
  } else {
    if (queuedOpportunisticContainers.size() < maxOppQueueLength) {
      LOG.info("Opportunistic container {} will be queued at the NM.",
          container.getContainerId());
      queuedOpportunisticContainers.put(
          container.getContainerId(), container);
      isQueued = true;
    } else {
      LOG.info("Opportunistic container [{}] will not be queued at the NM" +
          "since max queue length [{}] has been reached",
          container.getContainerId(), maxOppQueueLength);
      container.sendKillEvent(
          ContainerExitStatus.KILLED_BY_CONTAINER_SCHEDULER,
          "Opportunistic container queue is full.");
      isQueued = false;
    }
  }
{code}


Since the opportunistic container feature also exists in 2.9, I think this would be an issue when upgrading to 2.9 as well.
cc:/ [~jlowe] [~arun.sur...@gmail.com] 
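
For discussion, a rough sketch of one possible direction (purely illustrative, not the actual patch; the class and helper names below are made up): when the recovered token carries no execution type, default it to GUARANTEED so the container is re-queued instead of killed.

{code:java}
// Standalone illustration only -- stand-ins for the real YARN classes.
enum ExecutionType { GUARANTEED, OPPORTUNISTIC }

public class RecoveredContainerQueueingSketch {

  // Treat a missing (null) execution type, e.g. from a 2.8.x container token,
  // as GUARANTEED so recovery does not route the container into the
  // opportunistic queue (whose max length is 0 by default).
  static ExecutionType effectiveType(ExecutionType typeFromToken) {
    return typeFromToken == null ? ExecutionType.GUARANTEED : typeFromToken;
  }

  public static void main(String[] args) {
    // A container recovered from a 2.8.4 token has no execution type set.
    // Prints GUARANTEED -> the container would be re-queued, not killed.
    System.out.println(effectiveType(null));
  }
}
{code}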

> Upgrading to 3.1 kills running containers with error "Opportunistic container 
> queue is full"
> 
>
> Key: YARN-8346
> URL: https://issues.apache.org/jira/browse/YARN-8346
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Major
>
> It is observed while rolling upgrade from 2.8.4 to 3.1 release, all the 
> running containers are killed and second attempt is launched for that 
> application. The diagnostics message is "Opportunistic container queue is 
> full" which is the reason for container killed. 
> In NM log, I see below logs for after container is recovered.
> {noformat}
> 2018-05-23 17:18:50,655 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler:
>  Opportunistic container [container_e06_1527075664705_0001_01_01] will 
> not be queued at the NMsince max queue length [0] has been reached
> {noformat}
> Following steps are executed for rolling upgrade
> # Install 2.8.4 cluster and launch a MR job with distributed cache enabled.
> # Stop 2.8.4 RM. Start 3.1.0 RM with same configuration.
> # Stop 2.8.4 NM batch by batch. Start 3.1.0 NM batch by batch. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8346) Upgrading to 3.1 kills running containers with error "Opportunistic container queue is full"

2018-05-23 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-8346:
---

 Summary: Upgrading to 3.1 kills running containers with error 
"Opportunistic container queue is full"
 Key: YARN-8346
 URL: https://issues.apache.org/jira/browse/YARN-8346
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Rohith Sharma K S


It is observed that during a rolling upgrade from the 2.8.4 to the 3.1 release, all running containers are killed and a second attempt is launched for the application. The diagnostics message, "Opportunistic container queue is full", is given as the reason the containers were killed.

In the NM log, I see the following after a container is recovered.
{noformat}
2018-05-23 17:18:50,655 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler:
 Opportunistic container [container_e06_1527075664705_0001_01_01] will not 
be queued at the NMsince max queue length [0] has been reached
{noformat}

The following steps were executed for the rolling upgrade:
# Install 2.8.4 cluster and launch a MR job with distributed cache enabled.
# Stop 2.8.4 RM. Start 3.1.0 RM with same configuration.
# Stop 2.8.4 NM batch by batch. Start 3.1.0 NM batch by batch. 




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8155) Improve the logging in NMTimelinePublisher and TimelineCollectorWebService

2018-05-17 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478908#comment-16478908
 ] 

Rohith Sharma K S commented on YARN-8155:
-

[~abmodi] do you have time to update the patch? 

> Improve the logging in NMTimelinePublisher and TimelineCollectorWebService
> --
>
> Key: YARN-8155
> URL: https://issues.apache.org/jira/browse/YARN-8155
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Abhishek Modi
>Priority: Major
>
> We see that NM logs are filled with larger stack trace of NotFoundException 
> if collector is removed from one of the NM and other NMs are still publishing 
> the entities.
>  
> This Jira is to improve the logging in NM so that we log with informative 
> message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8297) Incorrect ATS Url used for Wire encrypted cluster

2018-05-17 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8297:

Attachment: (was: Ambari.txt)

> Incorrect ATS Url used for Wire encrypted cluster
> -
>
> Key: YARN-8297
> URL: https://issues.apache.org/jira/browse/YARN-8297
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Assignee: Sunil G
>Priority: Blocker
> Attachments: YARN-8297.001.patch
>
>
> "Service" page uses incorrect web url for ATS in wire encrypted env. For ATS 
> urls, it uses https protocol with http port.
> This issue causes all ATS call to fail and UI does not display component 
> details.
> url used: 
> https://xxx:8198/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320
> expected url : 
> https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8297) Incorrect ATS Url used for Wire encrypted cluster

2018-05-17 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8297:

Attachment: Ambari.txt

> Incorrect ATS Url used for Wire encrypted cluster
> -
>
> Key: YARN-8297
> URL: https://issues.apache.org/jira/browse/YARN-8297
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Assignee: Sunil G
>Priority: Blocker
> Attachments: YARN-8297.001.patch
>
>
> "Service" page uses incorrect web url for ATS in wire encrypted env. For ATS 
> urls, it uses https protocol with http port.
> This issue causes all ATS call to fail and UI does not display component 
> details.
> url used: 
> https://xxx:8198/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320
> expected url : 
> https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8297) Incorrect ATS Url used for Wire encrypted cluster

2018-05-17 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478880#comment-16478880
 ] 

Rohith Sharma K S commented on YARN-8297:
-

OK, +1 committing shortly

> Incorrect ATS Url used for Wire encrypted cluster
> -
>
> Key: YARN-8297
> URL: https://issues.apache.org/jira/browse/YARN-8297
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Assignee: Sunil G
>Priority: Blocker
> Attachments: YARN-8297.001.patch
>
>
> "Service" page uses incorrect web url for ATS in wire encrypted env. For ATS 
> urls, it uses https protocol with http port.
> This issue causes all ATS call to fail and UI does not display component 
> details.
> url used: 
> https://xxx:8198/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320
> expected url : 
> https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8293) In YARN Services UI, "User Name for service" should be completely removed in secure clusters

2018-05-17 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478844#comment-16478844
 ] 

Rohith Sharma K S commented on YARN-8293:
-

Thanks [~sunilg] for the patch. I tested it and it looks fine to me. [~eyang], would you be able to commit this patch today?

> In YARN Services UI, "User Name for service" should be completely removed in 
> secure clusters
> 
>
> Key: YARN-8293
> URL: https://issues.apache.org/jira/browse/YARN-8293
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Major
> Attachments: YARN-8293.001.patch
>
>
> "User Name for service" should be completely removed in secure clusters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8297) Incorrect ATS Url used for Wire encrypted cluster

2018-05-17 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478841#comment-16478841
 ] 

Rohith Sharma K S commented on YARN-8297:
-

One doubt in the patch: the Ajax call is made with async set to true. Could this lead to an issue? Should it be false instead?

> Incorrect ATS Url used for Wire encrypted cluster
> -
>
> Key: YARN-8297
> URL: https://issues.apache.org/jira/browse/YARN-8297
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Assignee: Sunil G
>Priority: Blocker
> Attachments: YARN-8297.001.patch
>
>
> "Service" page uses incorrect web url for ATS in wire encrypted env. For ATS 
> urls, it uses https protocol with http port.
> This issue causes all ATS call to fail and UI does not display component 
> details.
> url used: 
> https://xxx:8198/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320
> expected url : 
> https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8302) ATS v2 should handle HBase connection issue properly

2018-05-16 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S resolved YARN-8302.
-
Resolution: Won't Fix

Closing the JIRA as Won't Fix since this is a configuration issue. The HBase client timeout can be decreased by tuning the configurations listed in the comments above.

> ATS v2 should handle HBase connection issue properly
> 
>
> Key: YARN-8302
> URL: https://issues.apache.org/jira/browse/YARN-8302
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Priority: Major
>
> ATS v2 call times out with below error when it can't connect to HBase 
> instance.
> {code}
> bash-4.2$ curl -i -k -s -1  -H 'Content-Type: application/json'  -H 'Accept: 
> application/json' --max-time 5   --negotiate -u : 
> 'https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/YARN_CONTAINER?fields=ALL&_=1526425686092'
> curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received
> {code}
> {code:title=ATS log}
> 2018-05-15 23:10:03,623 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=7, 
> retries=7, started=8165 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:13,651 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=8, 
> retries=8, started=18192 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:23,730 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=9, 
> retries=9, started=28272 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:33,788 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=10, 
> retries=10, started=38330 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1{code}
> There are two issues here.
> 1) Check why ATS can't connect to HBase
> 2) In case of connection error,  ATS call should not get timeout. It should 
> fail with proper error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8302) ATS v2 should handle HBase connection issue properly

2018-05-16 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478532#comment-16478532
 ] 

Rohith Sharma K S commented on YARN-8302:
-

If HBase is down for any reason, the HBase client retries for 20 minutes with the default configuration loaded. Reducing the default value of *hbase.client.retries.number* from 15 to 7 drastically decreased the retry window from 20 minutes to about 1.5 minutes.
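
For reference, a minimal sketch of the override I tried (illustrative only; in a real deployment this key normally goes into the hbase-site.xml loaded by the timeline reader rather than being set in code):

{code:java}
import org.apache.hadoop.conf.Configuration;

// Illustrative sketch: reduce the HBase client retry count for the ATSv2 reader.
public class TimelineHBaseRetryTuning {
  public static Configuration withReducedRetries() {
    Configuration conf = new Configuration();
    // Default is 15 retries, which kept the reader retrying for ~20 minutes;
    // 7 brought the failure down to roughly 1.5 minutes in my test.
    conf.setInt("hbase.client.retries.number", 7);
    // Related knobs, intentionally left at their defaults here:
    // hbase.client.pause, hbase.rpc.timeout, zookeeper.session.timeout,
    // zookeeper.recovery.retry, zookeeper.recovery.retry.intervalmill
    return conf;
  }
}
{code}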

> ATS v2 should handle HBase connection issue properly
> 
>
> Key: YARN-8302
> URL: https://issues.apache.org/jira/browse/YARN-8302
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Priority: Major
>
> ATS v2 call times out with below error when it can't connect to HBase 
> instance.
> {code}
> bash-4.2$ curl -i -k -s -1  -H 'Content-Type: application/json'  -H 'Accept: 
> application/json' --max-time 5   --negotiate -u : 
> 'https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/YARN_CONTAINER?fields=ALL&_=1526425686092'
> curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received
> {code}
> {code:title=ATS log}
> 2018-05-15 23:10:03,623 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=7, 
> retries=7, started=8165 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:13,651 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=8, 
> retries=8, started=18192 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:23,730 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=9, 
> retries=9, started=28272 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:33,788 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=10, 
> retries=10, started=38330 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1{code}
> There are two issues here.
> 1) Check why ATS can't connect to HBase
> 2) In case of connection error,  ATS call should not get timeout. It should 
> fail with proper error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-5742) Serve aggregated logs of historical apps from timeline service

2018-05-16 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477177#comment-16477177
 ] 

Rohith Sharma K S edited comment on YARN-5742 at 5/16/18 9:46 AM:
--

I have created separate JIRAs: YARN-8304 for pulling the log servlet out of AHSWebService, and YARN-8303 for converting TimelineEntity to ApplicationReport/ApplicationAttemptReport/ContainerReport.

We can keep this JIRA only for plugging the log servlet into TimelineReader.


was (Author: rohithsharma):
I have created a separate JIRA for pulling put log servlet from AHSWebService 
i.e YARN-8304 and for converting TimelineEntity to 
ApplicationReport/ApplicationAttemptReport/ContainerReport. 

This JIRA we can keep it only plugging log servlet into TimelineReader. 

> Serve aggregated logs of historical apps from timeline service
> --
>
> Key: YARN-5742
> URL: https://issues.apache.org/jira/browse/YARN-5742
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Varun Saxena
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-5742-POC-v0.patch
>
>
> ATSv1.5 daemon has servlet to serve aggregated logs. But enabling only ATSv2, 
> does not serve logs from CLI and UI for completed application. Log serving 
> story has completely broken in ATSv2.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5742) Serve aggregated logs of historical apps from timeline service

2018-05-16 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477177#comment-16477177
 ] 

Rohith Sharma K S commented on YARN-5742:
-

I have created a separate JIRA, YARN-8304, for pulling the log servlet out of AHSWebService and for converting TimelineEntity to ApplicationReport/ApplicationAttemptReport/ContainerReport.

We can keep this JIRA only for plugging the log servlet into TimelineReader.

> Serve aggregated logs of historical apps from timeline service
> --
>
> Key: YARN-5742
> URL: https://issues.apache.org/jira/browse/YARN-5742
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Varun Saxena
>Assignee: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-5742-POC-v0.patch
>
>
> ATSv1.5 daemon has servlet to serve aggregated logs. But enabling only ATSv2, 
> does not serve logs from CLI and UI for completed application. Log serving 
> story has completely broken in ATSv2.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8304) Provide generic log servlet for serving logs

2018-05-16 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-8304:
---

 Summary: Provide generic log servlet for serving logs
 Key: YARN-8304
 URL: https://issues.apache.org/jira/browse/YARN-8304
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Rohith Sharma K S


AHSWebService has log-serving REST APIs, i.e. getContainerLog* and getLogs, which are used to view container logs from the UI. They are tightly coupled with ApplicationBaseProtocol, and these APIs exist in AHS, whereas ATSv2 is designed only around REST APIs.

The proposal is to add a generic log servlet that could be plugged into the ATSv1.5 or ATSv2.0 reader.
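
As a purely illustrative sketch of what "generic" means here (class name, parameters and behaviour are hypothetical, not a proposed API): a plain servlet with no dependency on ApplicationBaseProtocol that any daemon could register.

{code:java}
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical sketch of a pluggable, protocol-independent log servlet.
public class GenericContainerLogServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    String containerId = req.getParameter("containerId");
    String fileName = req.getParameter("fileName");
    if (containerId == null) {
      resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "containerId is required");
      return;
    }
    // A real implementation would locate the aggregated log in the remote log
    // directory and stream it back; here we only echo what would be served.
    resp.setContentType("text/plain");
    resp.getWriter().println(
        "Would serve " + fileName + " for container " + containerId);
  }
}
{code}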




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8303) YarnClient should contact TimelineReader for application/attempt/container report

2018-05-16 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-8303:
---

 Summary: YarnClient should contact TimelineReader for 
application/attempt/container report
 Key: YARN-8303
 URL: https://issues.apache.org/jira/browse/YARN-8303
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Rohith Sharma K S


YarnClient gets app/attempt/container information from the RM. If the RM does not have it, the ahsClient is queried. When only ATSv2 is enabled, YarnClient returns empty results.

YarnClient is used by many users, so this results in empty app/attempt/container reports.

The proposal is to add an adapter in the YARN client so that app/attempt/container reports can be generated from an AHSv2Client that calls the TimelineReader REST API, gets the entity, and converts it into an app/attempt/container report.
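
A purely illustrative sketch of the kind of REST call such an adapter would make (the class name is hypothetical; the host, port and endpoint shape are inferred from the timeline reader URLs quoted elsewhere in this thread; mapping the returned JSON entity to an ApplicationReport is left out):

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: fetch the app-level TimelineEntity over REST.
public class AHSv2ClientSketch {
  public static void main(String[] args) throws Exception {
    String appId = args.length > 0 ? args[0] : "application_1526357251888_0022";
    URL url = new URL(
        "http://timeline-reader-host:8198/ws/v2/timeline/apps/" + appId);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/json");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        // The JSON entity returned here would be converted into a report.
        System.out.println(line);
      }
    } finally {
      conn.disconnect();
    }
  }
}
{code}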



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7933) [atsv2 read acls] Add TimelineWriter#writeDomain

2018-05-16 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-7933:

Attachment: YARN-7933.06.patch

> [atsv2 read acls] Add TimelineWriter#writeDomain 
> -
>
> Key: YARN-7933
> URL: https://issues.apache.org/jira/browse/YARN-7933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vrushali C
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-7933.01.patch, YARN-7933.02.patch, 
> YARN-7933.03.patch, YARN-7933.04.patch, YARN-7933.05.patch, YARN-7933.06.patch
>
>
>  
> Add an API TimelineWriter#writeDomain for writing the domain info 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8302) ATS v2 should handle HBase connection issue properly

2018-05-15 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476872#comment-16476872
 ] 

Rohith Sharma K S commented on YARN-8302:
-

Thanks [~yeshavora] for creating the issue. I am able to reproduce it, but the query does not hang. With the default HBase timeout configuration, the HBase client retries for 20 minutes and then exits with error code 500. I feel this is the right behavior. If you want to reduce the timeout, the following HBase configurations need to be tuned:
# ZK session timeout (*zookeeper.session.timeout*)
# RPC timeout (*hbase.rpc.timeout*)
# RecoverableZookeeper retry count and retry wait (*zookeeper.recovery.retry*, 
*zookeeper.recovery.retry.intervalmill*)
# Client retry count and wait (*hbase.client.retries.number*, 
*hbase.client.pause*) 

Note that reducing the timeout too much will lead to errors during temporary network glitches. Twenty minutes of retries should be feasible.

Do you want to reduce the retry timeout? 

> ATS v2 should handle HBase connection issue properly
> 
>
> Key: YARN-8302
> URL: https://issues.apache.org/jira/browse/YARN-8302
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Priority: Major
>
> ATS v2 call times out with below error when it can't connect to HBase 
> instance.
> {code}
> bash-4.2$ curl -i -k -s -1  -H 'Content-Type: application/json'  -H 'Accept: 
> application/json' --max-time 5   --negotiate -u : 
> 'https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/YARN_CONTAINER?fields=ALL&_=1526425686092'
> curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received
> {code}
> {code:title=ATS log}
> 2018-05-15 23:10:03,623 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=7, 
> retries=7, started=8165 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:13,651 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=8, 
> retries=8, started=18192 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:23,730 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=9, 
> retries=9, started=28272 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:33,788 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=10, 
> retries=10, started=38330 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1{code}
> There are two issues here.
> 1) Check why ATS can't connect to HBase
> 2) In case of connection error,  ATS call should not get timeout. It should 
> fail with proper error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8302) ATS v2 should handle HBase connection issue properly

2018-05-15 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476873#comment-16476873
 ] 

Rohith Sharma K S commented on YARN-8302:
-

cc:/ [~vrushalic] [~haibo.chen]

> ATS v2 should handle HBase connection issue properly
> 
>
> Key: YARN-8302
> URL: https://issues.apache.org/jira/browse/YARN-8302
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Priority: Major
>
> ATS v2 call times out with below error when it can't connect to HBase 
> instance.
> {code}
> bash-4.2$ curl -i -k -s -1  -H 'Content-Type: application/json'  -H 'Accept: 
> application/json' --max-time 5   --negotiate -u : 
> 'https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/YARN_CONTAINER?fields=ALL&_=1526425686092'
> curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received
> {code}
> {code:title=ATS log}
> 2018-05-15 23:10:03,623 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=7, 
> retries=7, started=8165 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:13,651 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=8, 
> retries=8, started=18192 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:23,730 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=9, 
> retries=9, started=28272 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:33,788 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=10, 
> retries=10, started=38330 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1{code}
> There are two issues here.
> 1) Check why ATS can't connect to HBase
> 2) In case of connection error,  ATS call should not get timeout. It should 
> fail with proper error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8130) Race condition when container events are published for KILLED applications

2018-05-15 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475340#comment-16475340
 ] 

Rohith Sharma K S commented on YARN-8130:
-

Thanks to [~haibochen] and [~vrushalic] for the review. I back-ported to branch-3.1/branch-3.0/branch-2 as well.

> Race condition when container events are published for KILLED applications
> --
>
> Key: YARN-8130
> URL: https://issues.apache.org/jira/browse/YARN-8130
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Reporter: Charan Hebri
>Assignee: Rohith Sharma K S
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 3.0.3
>
> Attachments: YARN-8130.01.patch, YARN-8130.02.patch, 
> YARN-8130.03.patch
>
>
> There seems to be a race condition happening when an application is KILLED 
> and the corresponding container event information is being published. For 
> completed containers, a YARN_CONTAINER_FINISHED event is generated but for 
> some containers in a KILLED application this information is missing. Below is 
> a node manager log snippet,
> {code:java}
> 2018-04-09 08:44:54,474 INFO  shuffle.ExternalShuffleBlockResolver 
> (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application 
> application_1523259757659_0003 removed, cleanupLocalDirs = false
> 2018-04-09 08:44:54,478 INFO  application.ApplicationImpl 
> (ApplicationImpl.java:handle(632)) - Application 
> application_1523259757659_0003 transitioned from 
> APPLICATION_RESOURCES_CLEANINGUP to FINISHED
> 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher 
> (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been 
> removed before the entity could be published for 
> TimelineEntity[type='YARN_CONTAINER', 
> id='container_1523259757659_0003_01_02']
> 2018-04-09 08:44:54,478 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just 
> finished : application_1523259757659_0003
> 2018-04-09 08:44:54,488 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_01. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:54,492 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_02. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:55,470 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(192)) - The collector service for 
> application_1523259757659_0003 was removed
> 2018-04-09 08:44:55,472 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:handle(1572)) - couldn't find application 
> application_1523259757659_0003 while processing FINISH_APPS event. The 
> ResourceManager allocated resources for this application to the NodeManager 
> but no active containers were found to process{code}
> The container id specified in the log, 
> *container_1523259757659_0003_01_02* is the one that has the finished 
> event missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8130) Race condition when container events are published for KILLED applications

2018-05-15 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8130:

Fix Version/s: 3.0.3
   3.1.1
   2.10.0

> Race condition when container events are published for KILLED applications
> --
>
> Key: YARN-8130
> URL: https://issues.apache.org/jira/browse/YARN-8130
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Reporter: Charan Hebri
>Assignee: Rohith Sharma K S
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 3.0.3
>
> Attachments: YARN-8130.01.patch, YARN-8130.02.patch, 
> YARN-8130.03.patch
>
>
> There seems to be a race condition happening when an application is KILLED 
> and the corresponding container event information is being published. For 
> completed containers, a YARN_CONTAINER_FINISHED event is generated but for 
> some containers in a KILLED application this information is missing. Below is 
> a node manager log snippet,
> {code:java}
> 2018-04-09 08:44:54,474 INFO  shuffle.ExternalShuffleBlockResolver 
> (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application 
> application_1523259757659_0003 removed, cleanupLocalDirs = false
> 2018-04-09 08:44:54,478 INFO  application.ApplicationImpl 
> (ApplicationImpl.java:handle(632)) - Application 
> application_1523259757659_0003 transitioned from 
> APPLICATION_RESOURCES_CLEANINGUP to FINISHED
> 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher 
> (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been 
> removed before the entity could be published for 
> TimelineEntity[type='YARN_CONTAINER', 
> id='container_1523259757659_0003_01_02']
> 2018-04-09 08:44:54,478 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just 
> finished : application_1523259757659_0003
> 2018-04-09 08:44:54,488 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_01. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:54,492 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_02. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:55,470 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(192)) - The collector service for 
> application_1523259757659_0003 was removed
> 2018-04-09 08:44:55,472 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:handle(1572)) - couldn't find application 
> application_1523259757659_0003 while processing FINISH_APPS event. The 
> ResourceManager allocated resources for this application to the NodeManager 
> but no active containers were found to process{code}
> The container id specified in the log, 
> *container_1523259757659_0003_01_02* is the one that has the finished 
> event missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7933) [atsv2 read acls] Add TimelineWriter#writeDomain

2018-05-14 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474013#comment-16474013
 ] 

Rohith Sharma K S commented on YARN-7933:
-

bq. Isn't  it the case that a TimelineClient must be able to authenticate with 
the TimelineCollector first before it can post data to that TimelineCollector?
The intention I added here follows the ATSv1.5 approach, i.e. if the same client or a different client publishes the same domain id, the collector needs to check the ACLs for the domain, i.e. the owner. In our design I was not sure whether we should check the owner for the domain id, so I added a TODO. If this does not end up in our design, we can remove it at any point.

bq. Where do we check the delegation token on inside 
PerNodeTimelineCollectorService?
Timeline token verification happens at the filter layer when the HTTP connection is established, i.e. even before the request reaches the servlets. See the classes NodeTimelineCollectorManager#startWebApp and TimelineAuthenticationFilter.



> [atsv2 read acls] Add TimelineWriter#writeDomain 
> -
>
> Key: YARN-7933
> URL: https://issues.apache.org/jira/browse/YARN-7933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vrushali C
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-7933.01.patch, YARN-7933.02.patch, 
> YARN-7933.03.patch, YARN-7933.04.patch, YARN-7933.05.patch
>
>
>  
> Add an API TimelineWriter#writeDomain for writing the domain info 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8130) Race condition when container events are published for KILLED applications

2018-05-11 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472786#comment-16472786
 ] 

Rohith Sharma K S commented on YARN-8130:
-

Test failures are unrelated to this patch. [~haibochen]/[~vrushalic], could you please help commit this?

> Race condition when container events are published for KILLED applications
> --
>
> Key: YARN-8130
> URL: https://issues.apache.org/jira/browse/YARN-8130
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Reporter: Charan Hebri
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8130.01.patch, YARN-8130.02.patch, 
> YARN-8130.03.patch
>
>
> There seems to be a race condition happening when an application is KILLED 
> and the corresponding container event information is being published. For 
> completed containers, a YARN_CONTAINER_FINISHED event is generated but for 
> some containers in a KILLED application this information is missing. Below is 
> a node manager log snippet,
> {code:java}
> 2018-04-09 08:44:54,474 INFO  shuffle.ExternalShuffleBlockResolver 
> (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application 
> application_1523259757659_0003 removed, cleanupLocalDirs = false
> 2018-04-09 08:44:54,478 INFO  application.ApplicationImpl 
> (ApplicationImpl.java:handle(632)) - Application 
> application_1523259757659_0003 transitioned from 
> APPLICATION_RESOURCES_CLEANINGUP to FINISHED
> 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher 
> (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been 
> removed before the entity could be published for 
> TimelineEntity[type='YARN_CONTAINER', 
> id='container_1523259757659_0003_01_02']
> 2018-04-09 08:44:54,478 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just 
> finished : application_1523259757659_0003
> 2018-04-09 08:44:54,488 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_01. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:54,492 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_02. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:55,470 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(192)) - The collector service for 
> application_1523259757659_0003 was removed
> 2018-04-09 08:44:55,472 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:handle(1572)) - couldn't find application 
> application_1523259757659_0003 while processing FINISH_APPS event. The 
> ResourceManager allocated resources for this application to the NodeManager 
> but no active containers were found to process{code}
> The container id specified in the log, 
> *container_1523259757659_0003_01_02* is the one that has the finished 
> event missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8130) Race condition when container events are published for KILLED applications

2018-05-11 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472274#comment-16472274
 ] 

Rohith Sharma K S commented on YARN-8130:
-

Makes sense. I updated the patch as per the comments. [~haibochen], could you take a look at the attached patch?

> Race condition when container events are published for KILLED applications
> --
>
> Key: YARN-8130
> URL: https://issues.apache.org/jira/browse/YARN-8130
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Reporter: Charan Hebri
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8130.01.patch, YARN-8130.02.patch, 
> YARN-8130.03.patch
>
>
> There seems to be a race condition happening when an application is KILLED 
> and the corresponding container event information is being published. For 
> completed containers, a YARN_CONTAINER_FINISHED event is generated but for 
> some containers in a KILLED application this information is missing. Below is 
> a node manager log snippet,
> {code:java}
> 2018-04-09 08:44:54,474 INFO  shuffle.ExternalShuffleBlockResolver 
> (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application 
> application_1523259757659_0003 removed, cleanupLocalDirs = false
> 2018-04-09 08:44:54,478 INFO  application.ApplicationImpl 
> (ApplicationImpl.java:handle(632)) - Application 
> application_1523259757659_0003 transitioned from 
> APPLICATION_RESOURCES_CLEANINGUP to FINISHED
> 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher 
> (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been 
> removed before the entity could be published for 
> TimelineEntity[type='YARN_CONTAINER', 
> id='container_1523259757659_0003_01_02']
> 2018-04-09 08:44:54,478 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just 
> finished : application_1523259757659_0003
> 2018-04-09 08:44:54,488 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_01. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:54,492 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_02. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:55,470 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(192)) - The collector service for 
> application_1523259757659_0003 was removed
> 2018-04-09 08:44:55,472 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:handle(1572)) - couldn't find application 
> application_1523259757659_0003 while processing FINISH_APPS event. The 
> ResourceManager allocated resources for this application to the NodeManager 
> but no active containers were found to process{code}
> The container id specified in the log, 
> *container_1523259757659_0003_01_02* is the one that has the finished 
> event missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8130) Race condition when container events are published for KILLED applications

2018-05-11 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-8130:

Attachment: YARN-8130.03.patch

> Race condition when container events are published for KILLED applications
> --
>
> Key: YARN-8130
> URL: https://issues.apache.org/jira/browse/YARN-8130
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Reporter: Charan Hebri
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8130.01.patch, YARN-8130.02.patch, 
> YARN-8130.03.patch
>
>
> There seems to be a race condition happening when an application is KILLED 
> and the corresponding container event information is being published. For 
> completed containers, a YARN_CONTAINER_FINISHED event is generated but for 
> some containers in a KILLED application this information is missing. Below is 
> a node manager log snippet,
> {code:java}
> 2018-04-09 08:44:54,474 INFO  shuffle.ExternalShuffleBlockResolver 
> (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application 
> application_1523259757659_0003 removed, cleanupLocalDirs = false
> 2018-04-09 08:44:54,478 INFO  application.ApplicationImpl 
> (ApplicationImpl.java:handle(632)) - Application 
> application_1523259757659_0003 transitioned from 
> APPLICATION_RESOURCES_CLEANINGUP to FINISHED
> 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher 
> (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been 
> removed before the entity could be published for 
> TimelineEntity[type='YARN_CONTAINER', 
> id='container_1523259757659_0003_01_02']
> 2018-04-09 08:44:54,478 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just 
> finished : application_1523259757659_0003
> 2018-04-09 08:44:54,488 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_01. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:54,492 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_02. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:55,470 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(192)) - The collector service for 
> application_1523259757659_0003 was removed
> 2018-04-09 08:44:55,472 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:handle(1572)) - couldn't find application 
> application_1523259757659_0003 while processing FINISH_APPS event. The 
> ResourceManager allocated resources for this application to the NodeManager 
> but no active containers were found to process{code}
> The container id specified in the log, 
> *container_1523259757659_0003_01_02*, is the one whose 
> YARN_CONTAINER_FINISHED event is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8130) Race condition when container events are published for KILLED applications

2018-05-11 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472189#comment-16472189
 ] 

Rohith Sharma K S commented on YARN-8130:
-

Do you mean we should register another event type in NMTimelinePublisher that 
removes the appId while processing it?
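
As a rough illustration of that idea (hypothetical names only, not the 
committed patch): if removal of the appId mapping is itself turned into an 
event handled by NMTimelinePublisher's own dispatcher, ordering on the single 
queue guarantees it runs only after the container FINISHED events that were 
enqueued before it.

{code:java}
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class OrderedDispatcherSketch {

  interface Event { void handle(); }

  private final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();
  private final Map<String, String> appIdToCollector = new ConcurrentHashMap<>();

  void register(String appId, String collectorAddr) {
    appIdToCollector.put(appId, collectorAddr);
  }

  void enqueue(Event e) { queue.add(e); }

  /** Enqueued for each container FINISHED; the mapping is still present. */
  Event containerFinished(String appId, String containerId) {
    return () -> System.out.println("publish FINISHED for " + containerId
        + " via " + appIdToCollector.get(appId));
  }

  /**
   * Enqueued when the application finishes. Because the single dispatcher
   * queue is drained in order, the appId mapping is removed only after the
   * container events enqueued before it have been handled.
   */
  Event applicationFinished(String appId) {
    return () -> appIdToCollector.remove(appId);
  }

  void drain() throws InterruptedException {
    while (!queue.isEmpty()) {
      queue.take().handle();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    OrderedDispatcherSketch d = new OrderedDispatcherSketch();
    d.register("app_1", "node:port");                        // placeholder ids
    d.enqueue(d.containerFinished("app_1", "container_1"));  // queued first
    d.enqueue(d.applicationFinished("app_1"));               // removal queued last
    d.drain();                                               // publishes, then removes
  }
}
{code}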

> Race condition when container events are published for KILLED applications
> --
>
> Key: YARN-8130
> URL: https://issues.apache.org/jira/browse/YARN-8130
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Reporter: Charan Hebri
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8130.01.patch, YARN-8130.02.patch
>
>
> There seems to be a race condition happening when an application is KILLED 
> and the corresponding container event information is being published. For 
> completed containers, a YARN_CONTAINER_FINISHED event is generated but for 
> some containers in a KILLED application this information is missing. Below is 
> a node manager log snippet,
> {code:java}
> 2018-04-09 08:44:54,474 INFO  shuffle.ExternalShuffleBlockResolver 
> (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application 
> application_1523259757659_0003 removed, cleanupLocalDirs = false
> 2018-04-09 08:44:54,478 INFO  application.ApplicationImpl 
> (ApplicationImpl.java:handle(632)) - Application 
> application_1523259757659_0003 transitioned from 
> APPLICATION_RESOURCES_CLEANINGUP to FINISHED
> 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher 
> (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been 
> removed before the entity could be published for 
> TimelineEntity[type='YARN_CONTAINER', 
> id='container_1523259757659_0003_01_02']
> 2018-04-09 08:44:54,478 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just 
> finished : application_1523259757659_0003
> 2018-04-09 08:44:54,488 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_01. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:54,492 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_02. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:55,470 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(192)) - The collector service for 
> application_1523259757659_0003 was removed
> 2018-04-09 08:44:55,472 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:handle(1572)) - couldn't find application 
> application_1523259757659_0003 while processing FINISH_APPS event. The 
> ResourceManager allocated resources for this application to the NodeManager 
> but no active containers were found to process{code}
> The container id specified in the log, 
> *container_1523259757659_0003_01_02*, is the one whose 
> YARN_CONTAINER_FINISHED event is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8130) Race condition when container events are published for KILLED applications

2018-05-11 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472177#comment-16472177
 ] 

Rohith Sharma K S commented on YARN-8130:
-

bq. we just generate a new event upon Application_FINISHED that is handled by 
the dispatcher inside NMTimelinePublisher?
Sorry, I didn't get it. Could you explain it a bit more?

> Race condition when container events are published for KILLED applications
> --
>
> Key: YARN-8130
> URL: https://issues.apache.org/jira/browse/YARN-8130
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Reporter: Charan Hebri
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8130.01.patch, YARN-8130.02.patch
>
>
> There seems to be a race condition happening when an application is KILLED 
> and the corresponding container event information is being published. For 
> completed containers, a YARN_CONTAINER_FINISHED event is generated but for 
> some containers in a KILLED application this information is missing. Below is 
> a node manager log snippet,
> {code:java}
> 2018-04-09 08:44:54,474 INFO  shuffle.ExternalShuffleBlockResolver 
> (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application 
> application_1523259757659_0003 removed, cleanupLocalDirs = false
> 2018-04-09 08:44:54,478 INFO  application.ApplicationImpl 
> (ApplicationImpl.java:handle(632)) - Application 
> application_1523259757659_0003 transitioned from 
> APPLICATION_RESOURCES_CLEANINGUP to FINISHED
> 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher 
> (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been 
> removed before the entity could be published for 
> TimelineEntity[type='YARN_CONTAINER', 
> id='container_1523259757659_0003_01_02']
> 2018-04-09 08:44:54,478 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just 
> finished : application_1523259757659_0003
> 2018-04-09 08:44:54,488 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_01. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:54,492 INFO  logaggregation.AppLogAggregatorImpl 
> (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs 
> for container container_1523259757659_0003_01_02. Current good log dirs 
> are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:55,470 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(192)) - The collector service for 
> application_1523259757659_0003 was removed
> 2018-04-09 08:44:55,472 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:handle(1572)) - couldn't find application 
> application_1523259757659_0003 while processing FINISH_APPS event. The 
> ResourceManager allocated resources for this application to the NodeManager 
> but no active containers were found to process{code}
> The container id specified in the log, 
> *container_1523259757659_0003_01_02*, is the one whose 
> YARN_CONTAINER_FINISHED event is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7933) [atsv2 read acls] Add TimelineWriter#writeDomain

2018-05-11 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-7933:

Attachment: YARN-7933.05.patch

> [atsv2 read acls] Add TimelineWriter#writeDomain 
> -
>
> Key: YARN-7933
> URL: https://issues.apache.org/jira/browse/YARN-7933
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vrushali C
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-7933.01.patch, YARN-7933.02.patch, 
> YARN-7933.03.patch, YARN-7933.04.patch, YARN-7933.05.patch
>
>
>  
> Add an API, TimelineWriter#writeDomain, for writing the domain info.
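
A hypothetical sketch of what such an addition could look like; the type names 
and the signature below are assumptions made purely for illustration and are 
not taken from the attached patches.

{code:java}
public class WriteDomainSketch {

  /** Simplified stand-in for an ATSv2 domain (read/write ACL) record. */
  static class DomainInfo {
    String id;
    String readers;  // users/groups allowed to read entities in this domain
    String writers;  // users/groups allowed to write entities in this domain
  }

  /** Simplified stand-in for a writer response. */
  static class WriteResponse { }

  /** The kind of addition being discussed: a dedicated domain write path. */
  interface DomainCapableWriter {
    /**
     * Persist the domain (ACL) information for a cluster so that a reader
     * can later enforce read access on entities belonging to the domain.
     */
    WriteResponse writeDomain(String clusterId, DomainInfo domain);
  }
}
{code}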



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


