[jira] [Commented] (YARN-8155) Improve ATSv2 client logging in RM and NM publisher
[ https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512147#comment-16512147 ]

Rohith Sharma K S commented on YARN-8155:
-----------------------------------------

Committed to trunk/branch-3.1/branch-3.0. The branch-2 compilation is failing; [~abmodi], would you provide a branch-2 patch?

> Improve ATSv2 client logging in RM and NM publisher
> ---------------------------------------------------
>
>                 Key: YARN-8155
>                 URL: https://issues.apache.org/jira/browse/YARN-8155
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Rohith Sharma K S
>            Assignee: Abhishek Modi
>            Priority: Major
>         Attachments: YARN-8155.001.patch, YARN-8155.002.patch, YARN-8155.003.patch, YARN-8155.004.patch, YARN-8155.005.patch
>
> We see that NM logs are filled with large stack traces of NotFoundException when the collector is removed from one NM while other NMs are still publishing entities.
>
> This Jira is to improve the logging in the NM so that we log an informative message instead.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8155) Improve ATSv2 client logging in RM and NM publisher
[ https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-8155:
------------------------------------
    Summary: Improve ATSv2 client logging in RM and NM publisher  (was: Improve the logging in NMTimelinePublisher and TimelineCollectorWebService)
[jira] [Updated] (YARN-8155) Improve ATSv2 client logging in RM and NM publisher
[ https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-8155:
------------------------------------
    Issue Type: Improvement  (was: Bug)
[jira] [Commented] (YARN-8155) Improve the logging in NMTimelinePublisher and TimelineCollectorWebService
[ https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512030#comment-16512030 ]

Rohith Sharma K S commented on YARN-8155:
-----------------------------------------

Committing shortly.
[jira] [Reopened] (YARN-8302) ATS v2 should handle HBase connection issue properly
[ https://issues.apache.org/jira/browse/YARN-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S reopened YARN-8302:
-------------------------------------

Reopening the JIRA to discuss and figure out an alternative approach other than reducing the retry count.

> ATS v2 should handle HBase connection issue properly
> ----------------------------------------------------
>
>                 Key: YARN-8302
>                 URL: https://issues.apache.org/jira/browse/YARN-8302
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: ATSv2
>    Affects Versions: 3.1.0
>            Reporter: Yesha Vora
>            Priority: Major
>
> The ATSv2 call times out with the error below when it can't connect to the HBase instance.
> {code}
> bash-4.2$ curl -i -k -s -1 -H 'Content-Type: application/json' -H 'Accept: application/json' --max-time 5 --negotiate -u : 'https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/YARN_CONTAINER?fields=ALL&_=1526425686092'
> curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received
> {code}
> {code:title=ATS log}
> 2018-05-15 23:10:03,623 INFO client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=7, retries=7, started=8165 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: xxx/xxx:17020, details=row 'prod.timelineservice.app_flow, ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:13,651 INFO client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=8, retries=8, started=18192 ms ago, ... (same connection-refused message)
> 2018-05-15 23:10:23,730 INFO client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=9, retries=9, started=28272 ms ago, ... (same connection-refused message)
> 2018-05-15 23:10:33,788 INFO client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=10, retries=10, started=38330 ms ago, ... (same connection-refused message)
> {code}
> There are two issues here:
> 1) Check why ATS can't connect to HBase.
> 2) In case of a connection error, the ATS call should not time out; it should fail with a proper error.
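A back-of-the-envelope sketch of why the caller sees a timeout rather than a clean error: the log above shows the HBase client pausing roughly 10 seconds between retries once its backoff is capped, so a blocking read stays in the retry loop far longer than the 5-second `--max-time` on the curl call. This is illustrative only, not ATSv2 or HBase code; the 10-second pause is an assumption read off the log timestamps.

```java
public class RetryBudget {
    // Rough lower bound on how long a blocking HBase client call stays in
    // its retry loop, assuming backoff has been capped at a fixed pause
    // (~10s, matching the gaps between "Call exception" lines above).
    static long minBlockedMillis(int retries, long cappedPauseMillis) {
        return retries * cappedPauseMillis;
    }

    public static void main(String[] args) {
        long blocked = minBlockedMillis(10, 10_000L); // 10 retries, ~10s apart
        long clientTimeoutMillis = 5_000L;            // curl --max-time 5
        // The reader is still retrying long after the REST client gave up,
        // which is why the caller sees a timeout instead of a clean error.
        System.out.println("reader blocked ~" + blocked + " ms vs "
            + clientTimeoutMillis + " ms client timeout");
    }
}
```

Reducing the retry count only shrinks this budget, which is why the reopen asks for an alternative: fail fast with a proper error when the connection is refused.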
[jira] [Commented] (YARN-8411) Enable stopped system services to be started during RM start
[ https://issues.apache.org/jira/browse/YARN-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510593#comment-16510593 ]

Rohith Sharma K S commented on YARN-8411:
-----------------------------------------

Thanks [~billie.rinaldi] for the patch. It looks good to me as well. I have one generic doubt about the start operation: does service start identify whether the service is already running? System services are long-running, and in case of an RM restart or HA switch we start the services again by default.

> Enable stopped system services to be started during RM start
> ------------------------------------------------------------
>
>                 Key: YARN-8411
>                 URL: https://issues.apache.org/jira/browse/YARN-8411
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Billie Rinaldi
>            Assignee: Billie Rinaldi
>            Priority: Critical
>         Attachments: YARN-8411.01.patch, YARN-8411.02.patch
>
> With YARN-8048, the RM can launch system services using the YARN service framework. If the service app is in a stopped state, user intervention is required to delete the service so that the RM can launch it again when the RM is restarted. It would be an improvement for the RM to be able to start a stopped system service.
[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
[ https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510588#comment-16510588 ]

Rohith Sharma K S commented on YARN-8405:
-----------------------------------------

bq. Could you also cherry pick to branch-2.9?
Done.

> RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
> ----------------------------------------------------------------------
>
>                 Key: YARN-8405
>                 URL: https://issues.apache.org/jira/browse/YARN-8405
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.9.0, 3.1.0
>            Reporter: Rohith Sharma K S
>            Assignee: Íñigo Goiri
>            Priority: Major
>             Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.4
>         Attachments: YARN-8405.000.patch, YARN-8405.001.patch, YARN-8405.002.patch, YARN-8405.003.patch
>
> HADOOP-14773 changed the ACL handling for yarn.resourcemanager.zk-state-store.parent-path. Before HADOOP-14773, /rmstore was created with ACLs from the yarn.resourcemanager.zk-acl value, but the ACLs are no longer set on the parent node. As a result, the parent node /rmstore ends up with the default ACL.
[jira] [Updated] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
[ https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-8405:
------------------------------------
    Fix Version/s: 2.9.2
[jira] [Updated] (YARN-8413) Flow activity page is failing with "Timeline server failed with an error"
[ https://issues.apache.org/jira/browse/YARN-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-8413:
------------------------------------
    Fix Version/s: 3.1.1
                   3.2.0

> Flow activity page is failing with "Timeline server failed with an error"
> --------------------------------------------------------------------------
>
>                 Key: YARN-8413
>                 URL: https://issues.apache.org/jira/browse/YARN-8413
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn-ui-v2
>    Affects Versions: 3.1.1
>            Reporter: Yesha Vora
>            Assignee: Sunil Govindan
>            Priority: Major
>             Fix For: 3.2.0, 3.1.1
>         Attachments: YARN-8413.001.patch
>
> The flow activity page fails to load with "Timeline server failed with an error".
> The page issues an incorrect flows call, "https://localhost:8188/ws/v2/timeline/flows?_=1528755339836", which fails to load:
> 1) It uses localhost instead of the ATSv2 hostname.
> 2) It uses the ATS v1.5 HTTP port instead of the ATSv2 HTTPS port.
> The correct REST call is "https://<ATSv2 host>:<https port>/ws/v2/timeline/flows?_=1528755339836".
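For illustration, the fix the description asks for amounts to building the flows call from the timeline reader's own host and HTTPS port instead of the hard-coded localhost:8188. A minimal sketch; the class name, host, and port below are hypothetical placeholders, not the actual UI code or real defaults:

```java
public class FlowsUrlSketch {
    // Build the ATSv2 flows REST URL from an explicit reader host and HTTPS
    // port, rather than assuming localhost and the ATS v1.5 HTTP port.
    static String flowsUrl(String readerHost, int httpsPort, long cacheBuster) {
        return "https://" + readerHost + ":" + httpsPort
            + "/ws/v2/timeline/flows?_=" + cacheBuster;
    }

    public static void main(String[] args) {
        // e.g. a timeline reader assumed to run on timeline-reader.example.com:8199
        System.out.println(flowsUrl("timeline-reader.example.com", 8199, 1528755339836L));
    }
}
```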
[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
[ https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509513#comment-16509513 ]

Rohith Sharma K S commented on YARN-8405:
-----------------------------------------

The test failures are unrelated to the patch. Committing shortly.
[jira] [Commented] (YARN-8413) Flow activity page is failing with "Timeline server failed with an error"
[ https://issues.apache.org/jira/browse/YARN-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509514#comment-16509514 ]

Rohith Sharma K S commented on YARN-8413:
-----------------------------------------

+1, lgtm.
[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
[ https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508491#comment-16508491 ]

Rohith Sharma K S commented on YARN-8405:
-----------------------------------------

Ahh.. I see it :-) Apologies for missing it. +1, lgtm, pending Jenkins.
[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
[ https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508412#comment-16508412 ]

Rohith Sharma K S commented on YARN-8405:
-----------------------------------------

Thanks [~elgoiri] for the patch! My concern is that ZKCuratorManager is a utils class that has already been released, so changing its public API would cause a problem, right?
[jira] [Commented] (YARN-8303) YarnClient should contact TimelineReader for application/attempt/container report
[ https://issues.apache.org/jira/browse/YARN-8303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16507256#comment-16507256 ]

Rohith Sharma K S commented on YARN-8303:
-----------------------------------------

All is yours :-) Go ahead.

> YarnClient should contact TimelineReader for application/attempt/container report
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-8303
>                 URL: https://issues.apache.org/jira/browse/YARN-8303
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Rohith Sharma K S
>            Assignee: Abhishek Modi
>            Priority: Critical
>
> YarnClient gets app/attempt/container information from the RM. If the RM doesn't have it, the AHS client is queried. When only ATSv2 is enabled, YarnClient returns empty results.
>
> YarnClient is used by many users, which results in empty information for app/attempt/container reports.
>
> The proposal is to have an adapter in YarnClient so that app/attempt/container reports can be generated from an AHSv2Client, which calls the TimelineReader REST API, gets the entity, and converts it into an app/attempt/container report.
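The adapter proposed in the description can be sketched as a simple fallback: ask the RM first, and only if it no longer has the report, fall back to a reader-backed source (the AHSv2 client doing the REST call to the TimelineReader). All names below are hypothetical and for illustration only; this is not the actual YarnClient API:

```java
import java.util.Optional;
import java.util.function.Function;

public class ReportAdapterSketch {
    // Fallback adapter: prefer the RM's answer; if the RM has forgotten the
    // app (e.g. it finished long ago), consult the timeline-reader-backed
    // source instead. Both sources are modelled as plain functions here.
    static <R> Optional<R> getReport(String appId,
                                     Function<String, Optional<R>> rmSource,
                                     Function<String, Optional<R>> timelineSource) {
        Optional<R> fromRm = rmSource.apply(appId);
        return fromRm.isPresent() ? fromRm : timelineSource.apply(appId);
    }

    public static void main(String[] args) {
        // The RM no longer knows the finished app; the timeline reader does.
        Optional<String> report = getReport("application_1_0001",
            id -> Optional.empty(),
            id -> Optional.of("report-from-timeline-reader"));
        System.out.println(report.get());
    }
}
```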
[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
[ https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16506794#comment-16506794 ]

Rohith Sharma K S commented on YARN-8405:
-----------------------------------------

[~elgoiri], would you please update the patch with a test case?
[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path ACLs has been changed since HADOOP-14773
[ https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505866#comment-16505866 ]

Rohith Sharma K S commented on YARN-8405:
-----------------------------------------

Thanks [~elgoiri] for the patch.
# Since ZKCuratorManager is in a util class, changing the public API would be an issue. Instead of changing the existing API, we can add a new method _createRootDirRecursively(String path, List<ACL> zkAcl)_.

bq. do you have a test proposal?
You can change the permission values to verify it. Note that TestZKRMStateStore#testZKRootPathAcls verifies /rmstore/ZKRMStateRoot but not /rmstore. Below is test code that fails without the patch.
{code}
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
index 4cba2664d15..6c421157158 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
@@ -419,13 +419,14 @@ private static boolean verifyZKACL(String id, String scheme, int perm,
   public void testZKRootPathAcls() throws Exception {
     StateChangeRequestInfo req = new StateChangeRequestInfo(
         HAServiceProtocol.RequestSource.REQUEST_BY_USER);
-    String rootPath =
-        YarnConfiguration.DEFAULT_ZK_RM_STATE_STORE_PARENT_PATH + "/" +
-        ZKRMStateStore.ROOT_ZNODE_NAME;
+    String parentPath = YarnConfiguration.DEFAULT_ZK_RM_STATE_STORE_PARENT_PATH;
+    String rootPath = parentPath + "/" + ZKRMStateStore.ROOT_ZNODE_NAME;
 
     // Start RM with HA enabled
     Configuration conf =
         createHARMConf("rm1,rm2", "rm1", 1234, false, curatorTestingServer);
+    conf.set(YarnConfiguration.RM_ZK_ACL, "world:anyone:rwca");
+    int perm = 23; // rwca=1+2+4+16
     ResourceManager rm = new MockRM(conf);
     rm.start();
     rm.getRMContext().getRMAdminService().transitionToActive(req);
@@ -436,10 +437,16 @@ public void testZKRootPathAcls() throws Exception {
     verifyZKACL("digest", "localhost", Perms.CREATE | Perms.DELETE, acls);
     verifyZKACL(
         "world", "anyone", Perms.ALL ^ (Perms.CREATE | Perms.DELETE), acls);
+
+    acls =
+        ((ZKRMStateStore) rm.getRMContext().getStateStore()).getACL(parentPath);
+    assertEquals(1, acls.size());
+    assertEquals(perm, acls.get(0).getPerms());
     rm.close();
 
     // Now start RM with HA disabled. NoAuth Exception should not be thrown.
     conf.setBoolean(YarnConfiguration.RM_HA_ENABLED, false);
+    conf.set(YarnConfiguration.RM_ZK_ACL, YarnConfiguration.DEFAULT_RM_ZK_ACL);
     rm = new MockRM(conf);
     rm.start();
     rm.getRMContext().getRMAdminService().transitionToActive(req);
{code}
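The magic number in the test above (int perm = 23; // rwca=1+2+4+16) is just the OR of ZooKeeper's permission bits for the ACL string world:anyone:rwca. A small standalone sketch of that arithmetic; the bit values mirror org.apache.zookeeper.ZooDefs.Perms, while the helper itself is hypothetical:

```java
public class ZkPermBits {
    // Same bit values as org.apache.zookeeper.ZooDefs.Perms.
    static final int READ = 1, WRITE = 2, CREATE = 4, DELETE = 8, ADMIN = 16;

    // Convert an ACL permission string such as "rwca" into its bit mask.
    static int fromString(String perms) {
        int bits = 0;
        for (char c : perms.toCharArray()) {
            switch (c) {
                case 'r': bits |= READ; break;
                case 'w': bits |= WRITE; break;
                case 'c': bits |= CREATE; break;
                case 'd': bits |= DELETE; break;
                case 'a': bits |= ADMIN; break;
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // "rwca" -> READ|WRITE|CREATE|ADMIN = 1+2+4+16 = 23
        System.out.println(fromString("rwca"));
    }
}
```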
[jira] [Updated] (YARN-8397) Thread leak in ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-8397:
------------------------------------
    Summary: Thread leak in ActivitiesManager  (was: ActivitiesManager thread doesn't handle InterruptedException)

> Thread leak in ActivitiesManager
> --------------------------------
>
>                 Key: YARN-8397
>                 URL: https://issues.apache.org/jira/browse/YARN-8397
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>            Priority: Major
>         Attachments: YARN-8397.01.patch
>
> It is observed that when using MiniYARNCluster, MiniYARNCluster#stop doesn't stop the JVM.
> The thread dump shows that the ActivitiesManager thread is in TIMED_WAITING state:
> {code}
> "Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting on condition [0x76f1]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path acls has been changed since HADOOP-14773
[ https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504937#comment-16504937 ]

Rohith Sharma K S commented on YARN-8405:
-----------------------------------------

HADOOP-14741 added ZKCuratorManager. The create() method below sends a null zkAcl, but in ZKRMStateStore this method should send the zkAcl value.
{code}
  /**
   * Create a ZNode.
   * @param path Path of the ZNode.
   * @return If the ZNode was created.
   * @throws Exception If it cannot contact Zookeeper.
   */
  public boolean create(final String path) throws Exception {
    return create(path, null);
  }
{code}
[jira] [Updated] (YARN-8386) App log can not be viewed from Logs tab in secure cluster
[ https://issues.apache.org/jira/browse/YARN-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-8386:
------------------------------------
    Fix Version/s: 3.1.1
                   3.2.0

> App log can not be viewed from Logs tab in secure cluster
> ----------------------------------------------------------
>
>                 Key: YARN-8386
>                 URL: https://issues.apache.org/jira/browse/YARN-8386
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn-ui-v2
>    Affects Versions: 3.1.0
>            Reporter: Yesha Vora
>            Assignee: Sunil Govindan
>            Priority: Critical
>             Fix For: 3.2.0, 3.1.1
>         Attachments: YARN-8386.001.patch, YARN-8386.002.patch
>
> App logs can not be viewed from the UI2 Logs tab.
> Steps:
> 1) Launch a YARN service.
> 2) Let the application finish and go to the Logs tab to view the AM log.
> Here, the service AM API fails with a 401 authentication error:
> {code}
> Request URL: http://xxx:8188/ws/v1/applicationhistory/containers/container_e09_1527737134553_0034_01_01/logs/serviceam.log?_=1527799590942
> Request Method: GET
> Status Code: 401 Authentication required
> Response:
> Error 401 Authentication required
> HTTP ERROR 401
> Problem accessing /ws/v1/applicationhistory/containers/container_e09_1527737134553_0034_01_01/logs/serviceam.log.
> Reason: Authentication required
> {code}
[jira] [Commented] (YARN-8386) App log can not be viewed from Logs tab in secure cluster
[ https://issues.apache.org/jira/browse/YARN-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504871#comment-16504871 ]

Rohith Sharma K S commented on YARN-8386:
-----------------------------------------

Committed to trunk/branch-3.1. Cherry-picking to branch-3.0 is causing an issue; [~sunilg], would you help cherry-pick to branch-3.0? I am keeping the Jira open until we cherry-pick to branch-3.0. Is this also required for branch-2?
[jira] [Commented] (YARN-8386) App log can not be viewed from Logs tab in secure cluster
[ https://issues.apache.org/jira/browse/YARN-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504857#comment-16504857 ]

Rohith Sharma K S commented on YARN-8386:
-----------------------------------------

+1, tested the patch in both secure and non-secure clusters. Committing shortly.
[jira] [Updated] (YARN-8405) RM zk-state-store.parent-path acls has been changed since HADOOP-14773
[ https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8405: Affects Version/s: 2.9.0 > RM zk-state-store.parent-path acls has been changed since HADOOP-14773 > -- > > Key: YARN-8405 > URL: https://issues.apache.org/jira/browse/YARN-8405 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.1.0 >Reporter: Rohith Sharma K S >Priority: Major > > HADOOP-14773 changes the ACL for > yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, > /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now > behavior changed from setting acls to parent node. As a result, parent node > /rmstore is set to default acl. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8405) RM zk-state-store.parent-path acls has been changed since HADOOP-14773
[ https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8405: Target Version/s: 2.10.0, 3.1.1 > RM zk-state-store.parent-path acls has been changed since HADOOP-14773 > -- > > Key: YARN-8405 > URL: https://issues.apache.org/jira/browse/YARN-8405 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.1.0 >Reporter: Rohith Sharma K S >Priority: Major > > HADOOP-14773 changes the ACL for > yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, > /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now > behavior changed from setting acls to parent node. As a result, parent node > /rmstore is set to default acl. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8405) RM zk-state-store.parent-path acls has been changed since HADOOP-14773
[ https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8405: Affects Version/s: 3.1.0 > RM zk-state-store.parent-path acls has been changed since HADOOP-14773 > -- > > Key: YARN-8405 > URL: https://issues.apache.org/jira/browse/YARN-8405 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.1.0 >Reporter: Rohith Sharma K S >Priority: Major > > HADOOP-14773 changes the ACL for > yarn.resourcemanager.zk-state-store.parent-path. Earlier to HADOOP-14773, > /rmstore used set acls with yarn.resourcemanager.zk-acl value. But now > behavior changed from setting acls to parent node. As a result, parent node > /rmstore is set to default acl. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8405) RM zk-state-store.parent-path acls has been changed since HADOOP-14773
[ https://issues.apache.org/jira/browse/YARN-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504659#comment-16504659 ] Rohith Sharma K S commented on YARN-8405: - Below is the difference. *Before*: {code} [zk: localhost:2181(CONNECTED) 0] getAcl /rmstore 'sasl,'rm : cdrwa {code} *After*: {code} [zk: localhost:2181(CONNECTED) 1] getAcl /rmstore 'world,'anyone : cdrwa [zk: localhost:2181(CONNECTED) 2] getAcl /rmstore/ZKRMStateRoot 'sasl,'rm : rwa 'digest,'ctr-e138-1518143905142-346048-01-08.test.site:C1u8x7GQW9SdBpprg1Gov7bAAf8= : cd {code} The reason is that while creating the parent node recursively, ACLs are not set; once the parent node exists, ACLs are set for further node creation. cc:/ [~subru] [~elgoiri] > RM zk-state-store.parent-path acls has been changed since HADOOP-14773 > -- > > Key: YARN-8405 > URL: https://issues.apache.org/jira/browse/YARN-8405 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Major > > HADOOP-14773 changes the ACL for > yarn.resourcemanager.zk-state-store.parent-path. Prior to HADOOP-14773, > /rmstore used to be created with the ACLs from the yarn.resourcemanager.zk-acl value. Now the > behavior has changed and ACLs are no longer set on the parent node. As a result, the parent node > /rmstore is set to the default ACL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
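The ACL divergence shown in the `getAcl` output above can be sketched as follows. This is a toy model of a recursive create, not the actual ZooKeeper/Curator API: when parents are created implicitly, only the leaf znode receives the caller's ACL, and intermediate nodes like `/rmstore` fall back to the open default ACL. The class and ACL strings here are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of the bug described above: a recursive create applies the
 * caller's ACL only to the leaf znode, so implicitly created parents
 * (e.g. /rmstore) end up with the default "world,anyone" ACL.
 * Illustrative sketch only, not the real ZooKeeper client API.
 */
public class ZnodeAclSketch {
    static final String DEFAULT_ACL = "world,anyone:cdrwa";
    final Map<String, String> aclByPath = new HashMap<>();

    /** Mimics create-with-parents: parents silently get the default ACL. */
    void createRecursive(String path, String leafAcl) {
        String[] parts = path.substring(1).split("/");
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            cur.append('/').append(parts[i]);
            boolean isLeaf = (i == parts.length - 1);
            // Only the final node gets the ACL the caller asked for.
            aclByPath.putIfAbsent(cur.toString(), isLeaf ? leafAcl : DEFAULT_ACL);
        }
    }
}
```

Creating `/rmstore/ZKRMStateRoot` in this model leaves `/rmstore` on the default ACL, matching the *After* output in the comment.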
[jira] [Created] (YARN-8405) RM zk-state-store.parent-path acls has been changed since HADOOP-14773
Rohith Sharma K S created YARN-8405: --- Summary: RM zk-state-store.parent-path acls has been changed since HADOOP-14773 Key: YARN-8405 URL: https://issues.apache.org/jira/browse/YARN-8405 Project: Hadoop YARN Issue Type: Bug Reporter: Rohith Sharma K S HADOOP-14773 changes the ACL for yarn.resourcemanager.zk-state-store.parent-path. Prior to HADOOP-14773, /rmstore used to be created with the ACLs from the yarn.resourcemanager.zk-acl value. Now the behavior has changed and ACLs are no longer set on the parent node. As a result, the parent node /rmstore is set to the default ACL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8404) RM Event dispatcher is blocked if ATS1/1.5 server is not running.
[ https://issues.apache.org/jira/browse/YARN-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504401#comment-16504401 ] Rohith Sharma K S commented on YARN-8404: - cc :/ [~naganarasimha...@apache.org] [~jlowe] [~leftnoteasy] [~vinodkv] [~sjlee0] [~vrushalic] do you see any critical issues with making the appFinished flow asynchronous? > RM Event dispatcher is blocked if ATS1/1.5 server is not running. > -- > > Key: YARN-8404 > URL: https://issues.apache.org/jira/browse/YARN-8404 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 3.0.2 >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Blocker > Attachments: YARN-8404.01.patch > > > It is observed that if the ATS1/1.5 daemon is not running, RM recovery is delayed > until the timeline client times out for each application. By default, the > timeout takes around 5 minutes. If there are many completed applications, the > amount of time the RM will wait is *(number of completed applications in the > cluster * 5 minutes)*, which effectively hangs the RM. > The primary reason for this behavior is YARN-3044 and YARN-4129, which refactored the > existing system metrics publisher. This refactoring made the appFinished event > synchronous, whereas it was asynchronous earlier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8404) RM Event dispatcher is blocked if ATS1/1.5 server is not running.
[ https://issues.apache.org/jira/browse/YARN-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504313#comment-16504313 ] Rohith Sharma K S commented on YARN-8404: - Attached a patch that retains the original behavior, i.e. appFinished event processing is asynchronous. [~sunilg] Please review the patch. > RM Event dispatcher is blocked if ATS1/1.5 server is not running. > -- > > Key: YARN-8404 > URL: https://issues.apache.org/jira/browse/YARN-8404 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Critical > Attachments: YARN-8404.01.patch > > > It is observed that if the ATS1/1.5 daemon is not running, RM recovery is delayed > until the timeline client times out for each application. By default, the > timeout takes around 5 minutes. If there are many completed applications, the > amount of time the RM will wait is *(number of completed applications in the > cluster * 5 minutes)*, which effectively hangs the RM. > The primary reason for this behavior is YARN-3044 and YARN-4129, which refactored the > existing system metrics publisher. This refactoring made the appFinished event > synchronous, whereas it was asynchronous earlier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8404) RM Event dispatcher is blocked if ATS1/1.5 server is not running.
[ https://issues.apache.org/jira/browse/YARN-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8404: Attachment: YARN-8404.01.patch > RM Event dispatcher is blocked if ATS1/1.5 server is not running. > -- > > Key: YARN-8404 > URL: https://issues.apache.org/jira/browse/YARN-8404 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Critical > Attachments: YARN-8404.01.patch > > > It is observed that if ATS1/1.5 daemon is not running, RM recovery is delayed > as long as timeline client get timed out for each applications. By default, > timed out will take around 5 mins. If completed applications are more then > amount of time RM will wait is *(number of completed applications in a > cluster * 5 minutes)* which is kind of hanged. > Primary reason for this behavior is YARN-3044 YARN-4129 which refactor > existing system metric publisher. This refactoring made appFinished event as > synchronous which was asynchronous earlier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8404) RM Event dispatcher is blocked if ATS1/1.5 server is not running.
[ https://issues.apache.org/jira/browse/YARN-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8404: Summary: RM Event dispatcher is blocked if ATS1/1.5 server is not running. (was: RM Recovery is delayed much if ATS1/1.5 server is not running. ) > RM Event dispatcher is blocked if ATS1/1.5 server is not running. > -- > > Key: YARN-8404 > URL: https://issues.apache.org/jira/browse/YARN-8404 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Critical > > It is observed that if ATS1/1.5 daemon is not running, RM recovery is delayed > as long as timeline client get timed out for each applications. By default, > timed out will take around 5 mins. If completed applications are more then > amount of time RM will wait is *(number of completed applications in a > cluster * 5 minutes)* which is kind of hanged. > Primary reason for this behavior is YARN-3044 YARN-4129 which refactor > existing system metric publisher. This refactoring made appFinished event as > synchronous which was asynchronous earlier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8404) RM Recovery is delayed much if ATS1/1.5 server is not running.
[ https://issues.apache.org/jira/browse/YARN-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504271#comment-16504271 ] Rohith Sharma K S commented on YARN-8404: - Making TimelineServiceV1Publisher#appFinished synchronous is very dangerous: if the ATS1/1.5 daemon is down, the primary AsyncDispatcher thread is blocked, which makes the primary dispatcher event queue grow. > RM Recovery is delayed much if ATS1/1.5 server is not running. > --- > > Key: YARN-8404 > URL: https://issues.apache.org/jira/browse/YARN-8404 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Critical > > It is observed that if the ATS1/1.5 daemon is not running, RM recovery is delayed > until the timeline client times out for each application. By default, the > timeout takes around 5 minutes. If there are many completed applications, the > amount of time the RM will wait is *(number of completed applications in the > cluster * 5 minutes)*, which effectively hangs the RM. > The primary reason for this behavior is YARN-3044 and YARN-4129, which refactored the > existing system metrics publisher. This refactoring made the appFinished event > synchronous, whereas it was asynchronous earlier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
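The blocking behavior described in the comment above can be sketched as follows. This is a simplified model, not the actual Hadoop AsyncDispatcher or SystemMetricsPublisher code: a single dispatcher thread drains an event queue, so a synchronous publish that hangs on a timeline timeout stalls every queued event, while handing the publish off to a separate pool keeps the dispatcher draining. All class and method names here are illustrative.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Simplified stand-in for the RM's single-threaded event dispatcher.
 * Not the real AsyncDispatcher; just enough to show why a slow
 * synchronous publish blocks the whole queue.
 */
public class DispatcherSketch {
    // The one dispatcher thread draining the RM event queue.
    final ExecutorService dispatcher = Executors.newSingleThreadExecutor();
    // Separate pool so slow timeline publishes never hold the dispatcher.
    final ExecutorService publisherPool = Executors.newFixedThreadPool(4);

    /** Synchronous style: the publish runs on the dispatcher thread itself. */
    void dispatchSync(Runnable slowPublish) {
        dispatcher.execute(slowPublish); // a 5-minute timeout stalls everything behind it
    }

    /** Asynchronous style: the dispatcher only hands the work off and moves on. */
    void dispatchAsync(Runnable slowPublish) {
        dispatcher.execute(() -> publisherPool.execute(slowPublish));
    }
}
```

With `dispatchAsync`, events queued behind a slow publish still run promptly, which is the behavior the attached patch restores.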
[jira] [Moved] (YARN-8404) RM Recovery is delayed much if ATS1/1.5 server is not running.
[ https://issues.apache.org/jira/browse/YARN-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S moved MAPREDUCE-7106 to YARN-8404: Key: YARN-8404 (was: MAPREDUCE-7106) Project: Hadoop YARN (was: Hadoop Map/Reduce) > RM Recovery is delayed much if ATS1/1.5 server is not running. > --- > > Key: YARN-8404 > URL: https://issues.apache.org/jira/browse/YARN-8404 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Critical > > It is observed that if ATS1/1.5 daemon is not running, RM recovery is delayed > as long as timeline client get timed out for each applications. By default, > timed out will take around 5 mins. If completed applications are more then > amount of time RM will wait is *(number of completed applications in a > cluster * 5 minutes)* which is kind of hanged. > Primary reason for this behavior is YARN-3044 YARN-4129 which refactor > existing system metric publisher. This refactoring made appFinished event as > synchronous which was asynchronous earlier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
[ https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8399: Fix Version/s: (was: 3.0.3) 3.2.0 > NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode > -- > > Key: YARN-8399 > URL: https://issues.apache.org/jira/browse/YARN-8399 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineservice >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8399.001.patch, YARN-8399.002.patch, > YARN-8399.003.patch > > > Getting 403 GSS exception while accessing NM http port via curl. > {code:java} > curl -k -i --negotiate -u: https://:/node > HTTP/1.1 401 Authentication required > Date: Tue, 05 Jun 2018 17:59:00 GMT > Date: Tue, 05 Jun 2018 17:59:00 GMT > Pragma: no-cache > WWW-Authenticate: Negotiate > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 264 > HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism > level: Request is a replay (34)){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
[ https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8399: Target Version/s: 2.10.0, 3.2.0, 3.1.1, 3.0.3 (was: 3.1.1) > NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode > -- > > Key: YARN-8399 > URL: https://issues.apache.org/jira/browse/YARN-8399 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineservice >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8399.001.patch, YARN-8399.002.patch, > YARN-8399.003.patch > > > Getting 403 GSS exception while accessing NM http port via curl. > {code:java} > curl -k -i --negotiate -u: https://:/node > HTTP/1.1 401 Authentication required > Date: Tue, 05 Jun 2018 17:59:00 GMT > Date: Tue, 05 Jun 2018 17:59:00 GMT > Pragma: no-cache > WWW-Authenticate: Negotiate > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 264 > HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism > level: Request is a replay (34)){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
[ https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8399: Fix Version/s: 3.0.3 3.1.1 > NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode > -- > > Key: YARN-8399 > URL: https://issues.apache.org/jira/browse/YARN-8399 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineservice >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8399.001.patch, YARN-8399.002.patch, > YARN-8399.003.patch > > > Getting 403 GSS exception while accessing NM http port via curl. > {code:java} > curl -k -i --negotiate -u: https://:/node > HTTP/1.1 401 Authentication required > Date: Tue, 05 Jun 2018 17:59:00 GMT > Date: Tue, 05 Jun 2018 17:59:00 GMT > Pragma: no-cache > WWW-Authenticate: Negotiate > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 264 > HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism > level: Request is a replay (34)){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
[ https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504259#comment-16504259 ] Rohith Sharma K S commented on YARN-8399: - I committed to trunk/branch-3.1.. the patch fails to apply to branch-3.0/branch-2. I tried to resolve the conflicts, but the test compilation then failed. Since this impacts branch-3.0/branch-2 as well, I am keeping the Jira open. [~sunilg] can you provide a patch for branch-3.0 and branch-2? > NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode > -- > > Key: YARN-8399 > URL: https://issues.apache.org/jira/browse/YARN-8399 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineservice >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Attachments: YARN-8399.001.patch, YARN-8399.002.patch, > YARN-8399.003.patch > > > Getting 403 GSS exception while accessing NM http port via curl. > {code:java} > curl -k -i --negotiate -u: https://:/node > HTTP/1.1 401 Authentication required > Date: Tue, 05 Jun 2018 17:59:00 GMT > Date: Tue, 05 Jun 2018 17:59:00 GMT > Pragma: no-cache > WWW-Authenticate: Negotiate > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 264 > HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism > level: Request is a replay (34)){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
[ https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504228#comment-16504228 ] Rohith Sharma K S commented on YARN-8399: - forgot to commit yesterday, committing it now > NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode > -- > > Key: YARN-8399 > URL: https://issues.apache.org/jira/browse/YARN-8399 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineservice >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Attachments: YARN-8399.001.patch, YARN-8399.002.patch, > YARN-8399.003.patch > > > Getting 403 GSS exception while accessing NM http port via curl. > {code:java} > curl -k -i --negotiate -u: https://:/node > HTTP/1.1 401 Authentication required > Date: Tue, 05 Jun 2018 17:59:00 GMT > Date: Tue, 05 Jun 2018 17:59:00 GMT > Pragma: no-cache > WWW-Authenticate: Negotiate > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 264 > HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism > level: Request is a replay (34)){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
[ https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503617#comment-16503617 ] Rohith Sharma K S commented on YARN-8399: - +1 committing shortly > NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode > -- > > Key: YARN-8399 > URL: https://issues.apache.org/jira/browse/YARN-8399 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineservice >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Attachments: YARN-8399.001.patch, YARN-8399.002.patch, > YARN-8399.003.patch > > > Getting 403 GSS exception while accessing NM http port via curl. > {code:java} > curl -k -i --negotiate -u: https://:/node > HTTP/1.1 401 Authentication required > Date: Tue, 05 Jun 2018 17:59:00 GMT > Date: Tue, 05 Jun 2018 17:59:00 GMT > Pragma: no-cache > WWW-Authenticate: Negotiate > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 264 > HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism > level: Request is a replay (34)){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8401) Yarnui2 not working with out internet connection
[ https://issues.apache.org/jira/browse/YARN-8401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503604#comment-16503604 ] Rohith Sharma K S commented on YARN-8401: - [~sunilg] I remember similar issue we were faced for normal http2 server start up for RM also because of jetty server used to download from java.sun.com. I guess this was issue with dns entry? > Yarnui2 not working with out internet connection > > > Key: YARN-8401 > URL: https://issues.apache.org/jira/browse/YARN-8401 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Blocker > Attachments: YARN-8401.001.patch > > > {code} > 2018-06-06 21:10:58,611 WARN org.eclipse.jetty.webapp.WebAppContext: Failed > startup of context > o.e.j.w.WebAppContext@108a46d6{/ui2,file:///opt/HA/310/install/hadoop/resourcemanager/share/hadoop/yarn/webapps/ui2/,null} > java.net.UnknownHostException: java.sun.com > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at java.net.Socket.connect(Socket.java:538) > at sun.net.NetworkClient.doConnect(NetworkClient.java:180) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) > at sun.net.www.http.HttpClient.(HttpClient.java:211) > at sun.net.www.http.HttpClient.New(HttpClient.java:308) > at sun.net.www.http.HttpClient.New(HttpClient.java:326) > at > sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998) > at > sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932) > at > 
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1512) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440) > at > com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:646) > at > com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1300) > at > com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1267) > at > com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:263) > at > com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1164) > at > com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1050) > at > com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:964) > at > com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606) > at > com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117) > at > com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510) > at > com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848) > at > com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777) > at > com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141) > at > com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213) > at > com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649) > at > com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:333) > at org.eclipse.jetty.xml.XmlParser.parse(XmlParser.java:255) > at 
org.eclipse.jetty.webapp.Descriptor.parse(Descriptor.java:54) > at > org.eclipse.jetty.webapp.WebDescriptor.parse(WebDescriptor.java:207) > at org.eclipse.jetty.webapp.MetaData.setWebXml(MetaData.java:189) > at > org.eclipse.jetty.webapp.WebXmlConfiguration.preConfigure(WebXmlConfiguration.java:60) > at > org.eclipse.jetty.webapp.WebAppContext.preConfigure(WebAppContext.java:485) > at > org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:521) > at >
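The `UnknownHostException: java.sun.com` above comes from the XML parser trying to fetch the `web.xml` DTD over the network while reading the UI2 descriptor. A common offline-safe pattern (an illustrative sketch, not the actual Jetty fix applied in the patch) is to install an `EntityResolver` that short-circuits external DTD lookups so parsing never leaves the host:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class OfflineXmlParse {
    /** Parses XML without fetching external DTDs (e.g. from java.sun.com). */
    public static boolean parseOffline(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    // Short-circuit every external entity with an empty stream,
                    // so no network connection is ever attempted for the DTD.
                    return new InputSource(new StringReader(""));
                }
            };
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

With this resolver in place, a descriptor whose DOCTYPE points at `http://java.sun.com/...` parses successfully even with no internet connection.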
[jira] [Commented] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
[ https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503346#comment-16503346 ] Rohith Sharma K S commented on YARN-8399: - +1 > NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode > -- > > Key: YARN-8399 > URL: https://issues.apache.org/jira/browse/YARN-8399 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineservice >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Attachments: YARN-8399.001.patch, YARN-8399.002.patch > > > Getting 403 GSS exception while accessing NM http port via curl. > {code:java} > curl -k -i --negotiate -u: https://:/node > HTTP/1.1 401 Authentication required > Date: Tue, 05 Jun 2018 17:59:00 GMT > Date: Tue, 05 Jun 2018 17:59:00 GMT > Pragma: no-cache > WWW-Authenticate: Negotiate > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 264 > HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism > level: Request is a replay (34)){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6904) [ATSv2] Fix findbugs warnings
[ https://issues.apache.org/jira/browse/YARN-6904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503179#comment-16503179 ] Rohith Sharma K S commented on YARN-6904: - We can close this JIRA, as the warnings were happening in the YARN-5355 branch, which is quite old. All those changes were rebased against trunk and corrected before the branch was merged into trunk. > [ATSv2] Fix findbugs warnings > - > > Key: YARN-6904 > URL: https://issues.apache.org/jira/browse/YARN-6904 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-5355 >Reporter: Rohith Sharma K S >Assignee: Abhishek Modi >Priority: Major > > Many extant findbugs warnings are reported against branch YARN-5355 > [Jenkins|https://issues.apache.org/jira/browse/YARN-6130?focusedCommentId=16105786=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16105786] > These need to be investigated and fixed one by one. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8155) Improve the logging in NMTimelinePublisher and TimelineCollectorWebService
[ https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502937#comment-16502937 ] Rohith Sharma K S commented on YARN-8155: - Looks fine to me. [~vrushalic] [~haibochen] would you take a look at the patch? I will commit it later today if there are no further objections. > Improve the logging in NMTimelinePublisher and TimelineCollectorWebService > -- > > Key: YARN-8155 > URL: https://issues.apache.org/jira/browse/YARN-8155 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-8155.001.patch, YARN-8155.002.patch, > YARN-8155.003.patch, YARN-8155.004.patch > > > We see that NM logs are filled with large stack traces of NotFoundException > if the collector is removed from one of the NMs while other NMs are still publishing > the entities. > > This Jira is to improve the logging in the NM so that we log an informative > message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
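The kind of log-noise reduction this JIRA targets can be sketched as follows. This is an illustrative pattern, not the actual NMTimelinePublisher change: log a one-line WARN with the exception message when the collector has gone away, and emit the full stack trace only when DEBUG is enabled. The `QuietLogger` class below is a self-contained stand-in for a real logging framework.

```java
/**
 * Stand-in for a real logger, capturing output in a StringBuilder so the
 * behavior can be demonstrated without any logging dependency.
 * Illustrative sketch only; the real fix uses the NM's own logger.
 */
public class QuietLogger {
    final boolean debugEnabled;
    final StringBuilder out = new StringBuilder(); // captured log output

    QuietLogger(boolean debugEnabled) { this.debugEnabled = debugEnabled; }

    /** Concise WARN on publish failure; the full stack trace only at DEBUG. */
    void warnOnPublishFailure(String entityId, Exception e) {
        out.append("WARN: failed to publish entity ").append(entityId)
           .append(": ").append(e.getMessage()).append('\n');
        if (debugEnabled) {
            for (StackTraceElement el : e.getStackTrace()) {
                out.append("\tat ").append(el).append('\n');
            }
        }
    }
}
```

At the default log level this keeps the NM log to one informative line per failed publish instead of a full NotFoundException stack trace.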
[jira] [Commented] (YARN-8399) NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode
[ https://issues.apache.org/jira/browse/YARN-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502364#comment-16502364 ] Rohith Sharma K S commented on YARN-8399: - Should we make this change at auxiliary service level? > NodeManager is giving 403 GSS exception post upgrade to 3.1 in secure mode > -- > > Key: YARN-8399 > URL: https://issues.apache.org/jira/browse/YARN-8399 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineservice >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Attachments: YARN-8399.001.patch > > > Getting 403 GSS exception while accessing NM http port via curl. > {code:java} > curl -k -i --negotiate -u: https://:/node > HTTP/1.1 401 Authentication required > Date: Tue, 05 Jun 2018 17:59:00 GMT > Date: Tue, 05 Jun 2018 17:59:00 GMT > Pragma: no-cache > WWW-Authenticate: Negotiate > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 264 > HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism > level: Request is a replay (34)){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8396) Click on an individual container continuously spins and doesn't load the page
[ https://issues.apache.org/jira/browse/YARN-8396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501840#comment-16501840 ] Rohith Sharma K S commented on YARN-8396: - thanks to [~sunilg] for the patch! > Click on an individual container continuously spins and doesn't load the page > - > > Key: YARN-8396 > URL: https://issues.apache.org/jira/browse/YARN-8396 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Charan Hebri >Assignee: Sunil Govindan >Priority: Blocker > Fix For: 3.2.0, 3.1.1 > > Attachments: Screen Shot 2018-05-31 at 3.24.09 PM.png, > YARN-8396.001.patch > > > For a running application, a click on an individual container leads to an > infinite spinner which doesn't load the corresponding page. To reproduce, > with a running application click: > Nodes -> \{Node_HTTP_Address} -> List of Containers on this Node -> > \{Container_id} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8397) ActivitiesManager thread doesn't handles InterruptedException
[ https://issues.apache.org/jira/browse/YARN-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501737#comment-16501737 ] Rohith Sharma K S commented on YARN-8397: - Findbug error is from fair scheduler, not related to patch. > ActivitiesManager thread doesn't handles InterruptedException > -- > > Key: YARN-8397 > URL: https://issues.apache.org/jira/browse/YARN-8397 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8397.01.patch > > > It is observed while using MiniYARNCluster, MiniYARNCluster#stop doesn't stop > JVM. > Thread dump shows that ActivitiesManager is in timed_waiting state. > {code} > "Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting > on condition [0x76f1] >java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8155) Improve the logging in NMTimelinePublisher and TimelineCollectorWebService
[ https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501603#comment-16501603 ] Rohith Sharma K S commented on YARN-8155: - TimelineServiceV2Publisher prints the exception message, which could fill up the logs very quickly. We should probably log the full exception only at debug level, not at info level. > Improve the logging in NMTimelinePublisher and TimelineCollectorWebService > -- > > Key: YARN-8155 > URL: https://issues.apache.org/jira/browse/YARN-8155 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-8155.001.patch, YARN-8155.002.patch, > YARN-8155.003.patch > > > We see that NM logs are filled with larger stack trace of NotFoundException > if collector is removed from one of the NM and other NMs are still publishing > the entities. > > This Jira is to improve the logging in NM so that we log with informative > message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
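The logging policy being suggested — an informative one-line message at info level, with the full stack trace reserved for debug — can be sketched roughly as follows. The class, method, and message below are illustrative assumptions, not the actual NMTimelinePublisher or TimelineServiceV2Publisher code; java.util.logging stands in for Hadoop's logger so the sketch is self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch of the suggested logging policy: a short, informative
// line at info level, with the full stack trace only when debug (FINE)
// logging is enabled. Names here are hypothetical, not the real publisher API.
public class PublisherLogging {
  private static final Logger LOG =
      Logger.getLogger(PublisherLogging.class.getName());

  // Builds the one-line summary that would replace the full stack trace.
  static String summarize(Throwable t) {
    return "Failed to publish timeline entity: " + t.getMessage();
  }

  static void logPublishError(Throwable t) {
    if (LOG.isLoggable(Level.FINE)) {
      // Debug logging: keep the complete stack trace for troubleshooting.
      LOG.log(Level.FINE, "Failed to publish timeline entity", t);
    } else {
      // Info logging: a single informative line, so logs do not fill up
      // when one collector goes away and NMs keep publishing entities.
      LOG.info(summarize(t));
    }
  }
}
```

The key point is that the exception object is only passed to the logger on the debug path, so the stack trace never reaches info-level logs.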
[jira] [Assigned] (YARN-8397) ActivitiesManager thread doesn't handles InterruptedException
[ https://issues.apache.org/jira/browse/YARN-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S reassigned YARN-8397: --- Assignee: Rohith Sharma K S > ActivitiesManager thread doesn't handles InterruptedException > -- > > Key: YARN-8397 > URL: https://issues.apache.org/jira/browse/YARN-8397 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8397.01.patch > > > It is observed while using MiniYARNCluster, MiniYARNCluster#stop doesn't stop > JVM. > Thread dump shows that ActivitiesManager is in timed_waiting state. > {code} > "Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting > on condition [0x76f1] >java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8397) ActivitiesManager thread doesn't handles InterruptedException
[ https://issues.apache.org/jira/browse/YARN-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501582#comment-16501582 ] Rohith Sharma K S commented on YARN-8397: - There were 2 issues fixed in this patch. # ActivitiesManager doesn't handle InterruptedException. # Since ActivitiesManager was not stopped anywhere, in an RM HA scenario there was one thread leak on every RM switch. > ActivitiesManager thread doesn't handles InterruptedException > -- > > Key: YARN-8397 > URL: https://issues.apache.org/jira/browse/YARN-8397 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Major > > It is observed while using MiniYARNCluster, MiniYARNCluster#stop doesn't stop > JVM. > Thread dump shows that ActivitiesManager is in timed_waiting state. > {code} > "Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting > on condition [0x76f1] >java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8397) ActivitiesManager thread doesn't handles InterruptedException
[ https://issues.apache.org/jira/browse/YARN-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8397: Attachment: YARN-8397.01.patch > ActivitiesManager thread doesn't handles InterruptedException > -- > > Key: YARN-8397 > URL: https://issues.apache.org/jira/browse/YARN-8397 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Major > Attachments: YARN-8397.01.patch > > > It is observed while using MiniYARNCluster, MiniYARNCluster#stop doesn't stop > JVM. > Thread dump shows that ActivitiesManager is in timed_waiting state. > {code} > "Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting > on condition [0x76f1] >java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8397) ActivitiesManager thread doesn't handles InterruptedException
Rohith Sharma K S created YARN-8397: --- Summary: ActivitiesManager thread doesn't handles InterruptedException Key: YARN-8397 URL: https://issues.apache.org/jira/browse/YARN-8397 Project: Hadoop YARN Issue Type: Bug Reporter: Rohith Sharma K S It is observed while using MiniYARNCluster, MiniYARNCluster#stop doesn't stop JVM. Thread dump shows that ActivitiesManager is in timed_waiting state. {code} "Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting on condition [0x76f1] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8397) ActivitiesManager thread doesn't handles InterruptedException
[ https://issues.apache.org/jira/browse/YARN-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501568#comment-16501568 ] Rohith Sharma K S commented on YARN-8397: - This is because ActivitiesManager#serviceStop interrupts the thread, expecting it to exit. But the exception handling in the run method ignores the exception and continues the loop. {code} cleanUpThread = new Thread(new Runnable() { @Override public void run() { while (true) { // some code try { Thread.sleep(5000); } catch (Exception e) { // ignore } } } }); {code} > ActivitiesManager thread doesn't handles InterruptedException > -- > > Key: YARN-8397 > URL: https://issues.apache.org/jira/browse/YARN-8397 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Major > > It is observed while using MiniYARNCluster, MiniYARNCluster#stop doesn't stop > JVM. > Thread dump shows that ActivitiesManager is in timed_waiting state. > {code} > "Thread-43" #66 prio=5 os_prio=31 tid=0x7ffea09fd000 nid=0xa103 waiting > on condition [0x76f1] >java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:142) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
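A minimal sketch of the kind of fix being described: the cleanup loop treats the interrupt as a stop signal instead of swallowing it. The Runnable below is illustrative only, not the actual ActivitiesManager code or its YARN-8397 patch.

```java
// Illustrative sketch, not the actual ActivitiesManager code: the periodic
// cleanup loop exits when serviceStop interrupts it, instead of catching
// and ignoring the InterruptedException as in the snippet quoted above.
public class InterruptAwareLoop {

  static Runnable cleanupTask(long sleepMillis) {
    return () -> {
      while (!Thread.currentThread().isInterrupted()) {
        // ... periodic cleanup work would go here ...
        try {
          Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
          // Restore the interrupt status and exit the loop, so the thread
          // (and the JVM in the MiniYARNCluster case) can shut down.
          Thread.currentThread().interrupt();
          return;
        }
      }
    };
  }

  // Demo helper: start the loop, interrupt it, and report whether it stopped.
  static boolean stopsOnInterrupt() {
    Thread t = new Thread(cleanupTask(10));
    t.start();
    t.interrupt();
    try {
      t.join(5000);
    } catch (InterruptedException e) {
      return false;
    }
    return !t.isAlive();
  }
}
```

With the quoted original code, `stopsOnInterrupt` would hang until the join timeout, because the `catch (Exception e) { // ignore }` block discards the interrupt and re-enters the loop.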
[jira] [Commented] (YARN-8155) Improve the logging in NMTimelinePublisher and TimelineCollectorWebService
[ https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499958#comment-16499958 ] Rohith Sharma K S commented on YARN-8155: - thanks Abhishek Modi for the patch. It looks reasonable to me. Would you add a similar change in TimelineServiceV2Publisher as well? In TimelineCollectorWebService: # catching NotFoundException and converting it into a WebApplicationException changes the return code. We should still retain the not-found return code, right? [~vrushalic] [~haibo.chen] would you take a look at this patch please? > Improve the logging in NMTimelinePublisher and TimelineCollectorWebService > -- > > Key: YARN-8155 > URL: https://issues.apache.org/jira/browse/YARN-8155 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-8155.001.patch, YARN-8155.002.patch > > > We see that NM logs are filled with larger stack trace of NotFoundException > if collector is removed from one of the NM and other NMs are still publishing > the entities. > > This Jira is to improve the logging in NM so that we log with informative > message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8319) More YARN pages need to honor yarn.resourcemanager.display.per-user-apps
[ https://issues.apache.org/jira/browse/YARN-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498857#comment-16498857 ] Rohith Sharma K S commented on YARN-8319: - [~sunilg] if it is applicable for branch-2, please feel free to backport to branch-2 as well. > More YARN pages need to honor yarn.resourcemanager.display.per-user-apps > > > Key: YARN-8319 > URL: https://issues.apache.org/jira/browse/YARN-8319 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Reporter: Vinod Kumar Vavilapalli >Assignee: Sunil Govindan >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8319.001.patch, YARN-8319.002.patch, > YARN-8319.003.patch, YARN-8319.addendum.001.patch > > > When this config is on > - Per queue page on UI2 should filter app list by user > -- TODO: Verify the same with UI1 Per-queue page > - ATSv2 with UI2 should filter list of all users' flows and flow activities > - Per Node pages > -- Listing of apps and containers on a per-node basis should filter apps and > containers by user. > To this end, because this is no longer just for resourcemanager, we should > also deprecate {{yarn.resourcemanager.display.per-user-apps}} in favor of > {{yarn.webapp.filter-app-list-by-user}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8372) ApplicationAttemptNotFoundException should be handled correctly by Distributed Shell App Master
[ https://issues.apache.org/jira/browse/YARN-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497823#comment-16497823 ] Rohith Sharma K S commented on YARN-8372: - Thanks [~suma.shivaprasad] for the patch. Approach looks good to me. Would you look at test and checkstyle errors? > ApplicationAttemptNotFoundException should be handled correctly by > Distributed Shell App Master > --- > > Key: YARN-8372 > URL: https://issues.apache.org/jira/browse/YARN-8372 > Project: Hadoop YARN > Issue Type: Bug > Components: distributed-shell >Reporter: Charan Hebri >Assignee: Suma Shivaprasad >Priority: Major > Attachments: YARN-8372.1.patch, YARN-8372.2.patch > > > {noformat} > try { > response = client.allocate(progress); > } catch (ApplicationAttemptNotFoundException e) { > handler.onShutdownRequest(); > LOG.info("Shutdown requested. Stopping callback."); > return;{noformat} > is a code snippet from AMRMClientAsyncImpl. The corresponding > onShutdownRequest call for the Distributed Shell App master, > {noformat} > @Override > public void onShutdownRequest() { > done = true; > }{noformat} > Due to the above change, the current behavior is that whenever an application > attempt fails due to a NM restart (NM where the DS AM is running), an > ApplicationAttemptNotFoundException is thrown and all containers for that > attempt including the ones that are running on other NMs are killed by the AM > and marked as COMPLETE. The subsequent attempt spawns new containers just > like a new attempt. This behavior is different to a Map Reduce application > where the containers are not killed. > cc [~rohithsharma] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/YARN-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8383: Target Version/s: 2.8.5 I see this behavior only in 2.8.4, not in the 2.9.x release line. It occurs every time: just download 2.8.4 and configure the 1.5 properties. The JsonFactory class comes from jackson-core-x.x.x.jar. In hadoop-2.8.4, this jar is not found, but in hadoop-2.9.x I see it available in the hdfs lib. *In hadoop-2.8.4:* Though tools/lib has jackson-core-2.2.3.jar, it won't be loaded for daemon start. {code:java} HW12723:hadoop-2.8.4 rsharmaks$ find ./ -iname "jackson-core-*.jar" .//share/hadoop/common/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/tools/lib/jackson-core-2.2.3.jar .//share/hadoop/tools/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar {code} *In hadoop-2.9.x:* Observe that jackson-core-2.7.8.jar is there in hdfs/lib, which will be loaded for the timelineserver. Though tools/lib also has this jar, it won't be loaded for daemon start.
{code:java} HW12723:hadoop-2.9.0 rsharmaks$ find ./ -iname "jackson-core-*.jar" .//share/hadoop/common/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/hdfs/lib/jackson-core-2.7.8.jar .//share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-2.7.8.jar .//share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/tools/lib/jackson-core-2.7.8.jar .//share/hadoop/tools/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar HW12723:hadoop-2.9.0 rsharmaks$ cd ../hadoop-2.9.1 HW12723:hadoop-2.9.1 rsharmaks$ find ./ -iname "jackson-core-*.jar" .//share/hadoop/common/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/hdfs/lib/jackson-core-2.7.8.jar .//share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-2.7.8.jar .//share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/tools/lib/jackson-core-2.7.8.jar .//share/hadoop/tools/lib/jackson-core-asl-1.9.13.jar .//share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar {code} I couldn't get 2.8.3 release artifact to verify it. cc :/ [~jlowe] > TimelineServer 1.5 start fails with NoClassDefFoundError > > > Key: YARN-8383 > URL: https://issues.apache.org/jira/browse/YARN-8383 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.4 >Reporter: Rohith Sharma K S >Priority: Blocker > > TimelineServer 1.5 start fails with NoClassDefFoundError. 
> {noformat} > 2018-05-31 22:10:58,548 FATAL > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: > Error starting ApplicationHistoryServer > java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory > at > org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306) > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180)
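One way to reproduce this failure mode without a full daemon start is to probe whether the class is resolvable on the current classpath, which is effectively what RollingLevelDBTimelineStore's static initializer requires. The helper below is a generic sketch for illustration, not part of the timeline server code.

```java
// Generic sketch (not timeline-server code): check whether a class such as
// com.fasterxml.jackson.core.JsonFactory can be resolved on the classpath.
// On hadoop-2.8.4 with only tools/lib carrying jackson-core, this probe
// would return false for the daemon's classpath, matching the
// NoClassDefFoundError seen at ApplicationHistoryServer startup.
public class ClasspathProbe {
  static boolean classOnClasspath(String className) {
    try {
      Class.forName(className);
      return true;
    } catch (ClassNotFoundException | NoClassDefFoundError e) {
      return false;
    }
  }
}
```

For example, `classOnClasspath("com.fasterxml.jackson.core.JsonFactory")` run inside the timeline server JVM would distinguish the 2.8.4 and 2.9.x packaging before the store is even constructed.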
[jira] [Created] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError
Rohith Sharma K S created YARN-8383: --- Summary: TimelineServer 1.5 start fails with NoClassDefFoundError Key: YARN-8383 URL: https://issues.apache.org/jira/browse/YARN-8383 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.4 Reporter: Rohith Sharma K S TimelineServer 1.5 start fails with NoClassDefFoundError. {noformat} 2018-05-31 22:10:58,548 FATAL org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: Error starting ApplicationHistoryServer java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306) at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271) at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367) at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239) at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:190) Caused by: java.lang.ClassNotFoundException: 
com.fasterxml.jackson.core.JsonFactory at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 15 more {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8368) yarn app start cli should print applicationId
[ https://issues.apache.org/jira/browse/YARN-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496188#comment-16496188 ] Rohith Sharma K S commented on YARN-8368: - Thanks [~billie.rinaldi] for review and committing the patch.. > yarn app start cli should print applicationId > - > > Key: YARN-8368 > URL: https://issues.apache.org/jira/browse/YARN-8368 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8368.01.patch, YARN-8368.02.patch > > > yarn app start cli should print the application Id similar to yarn launch cmd. > {code:java} > bash-4.2$ yarn app -start hbase-app-test > WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of > YARN_LOGFILE. > WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of > YARN_PID_DIR. > 18/05/24 15:15:53 INFO client.RMProxy: Connecting to ResourceManager at > xxx/xxx:8050 > 18/05/24 15:15:54 INFO client.RMProxy: Connecting to ResourceManager at > xxx/xxx:8050 > 18/05/24 15:15:55 INFO client.ApiServiceClient: Service hbase-app-test is > successfully started.{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8380) Support shared mounts in docker runtime
[ https://issues.apache.org/jira/browse/YARN-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496185#comment-16496185 ] Rohith Sharma K S commented on YARN-8380: - Just fyi.. Whenever a docker volume is mounted as shared, I see an error like *_docker: Error response from daemon: linux mounts: Could not find source mount of /var/lib/kubelet_*. Do you know any reason for this? > Support shared mounts in docker runtime > --- > > Key: YARN-8380 > URL: https://issues.apache.org/jira/browse/YARN-8380 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Billie Rinaldi >Assignee: Billie Rinaldi >Priority: Major > > The docker run command supports the mount type shared, but currently we are > only supporting ro and rw mount types in the docker runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8372) ApplicationAttemptNotFoundException should be handled correctly by Distributed Shell App Master
[ https://issues.apache.org/jira/browse/YARN-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494690#comment-16494690 ] Rohith Sharma K S commented on YARN-8372: - The DS app master should handle the shutdown request properly, deciding whether or not to clean up based on an attempt number check. The current behavior cleans up all the running containers! > ApplicationAttemptNotFoundException should be handled correctly by > Distributed Shell App Master > --- > > Key: YARN-8372 > URL: https://issues.apache.org/jira/browse/YARN-8372 > Project: Hadoop YARN > Issue Type: Bug > Components: distributed-shell >Reporter: Charan Hebri >Priority: Major > > {noformat} > try { > response = client.allocate(progress); > } catch (ApplicationAttemptNotFoundException e) { > handler.onShutdownRequest(); > LOG.info("Shutdown requested. Stopping callback."); > return;{noformat} > is a code snippet from AMRMClientAsyncImpl. The corresponding > onShutdownRequest call for the Distributed Shell App master, > {noformat} > @Override > public void onShutdownRequest() { > done = true; > }{noformat} > Due to the above change, the current behavior is that whenever an application > attempt fails due to a NM restart (NM where the DS AM is running), an > ApplicationAttemptNotFoundException is thrown and all containers for that > attempt including the ones that are running on other NMs are killed by the AM > and marked as COMPLETE. The subsequent attempt spawns new containers just > like a new attempt. This behavior is different to a Map Reduce application > where the containers are not killed. > cc [~rohithsharma] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
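The missing decision can be sketched as a small predicate. The names and the max-attempts check below are assumptions for illustration only, not the actual Distributed Shell ApplicationMaster code or the YARN-8372 patch.

```java
// Hypothetical sketch of the decision discussed above, not the real DS AM:
// on ApplicationAttemptNotFoundException, only clean up running containers
// when no further attempt can reuse them.
public class ShutdownPolicy {
  static boolean shouldCleanUpContainers(int attemptNumber,
                                         int maxAttempts,
                                         boolean keepContainersAcrossAttempts) {
    // If containers are not kept across attempts, cleanup is always safe.
    // Otherwise, clean up only when this was the final allowed attempt,
    // so containers on healthy NMs survive for the next attempt.
    return !keepContainersAcrossAttempts || attemptNumber >= maxAttempts;
  }
}
```

With such a check, a mid-run attempt failure caused by an NM restart would leave containers on other NMs running, matching the MapReduce AM behavior described in the issue.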
[jira] [Commented] (YARN-8369) Javadoc build failed due to "bad use of '>'"
[ https://issues.apache.org/jira/browse/YARN-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492584#comment-16492584 ] Rohith Sharma K S commented on YARN-8369: - On further compiling java doc, it fails at CapacitySchedulerPreemptionUtils. We also need to fix this. {code:java} [ERROR] /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/CapacitySchedulerPreemptionUtils.java:139: error: malformed HTML [ERROR]*stop preempt container when any major resource type <= 0 for to- [ERROR] ^ [ERROR] /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/CapacitySchedulerPreemptionUtils.java:143: error: malformed HTML [ERROR]*stop preempt container when: all major resource type <= 0 for {code} > Javadoc build failed due to "bad use of '>'" > > > Key: YARN-8369 > URL: https://issues.apache.org/jira/browse/YARN-8369 > Project: Hadoop YARN > Issue Type: Bug > Components: build, docs >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma >Priority: Major > Attachments: YARN-8369.1.patch > > > {noformat} > $ mvn javadoc:javadoc --projects > hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common > ... > [ERROR] > /hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:263: > error: bad use of '>' > [ERROR]* included) has a >0 value. 
> [ERROR] ^ > [ERROR] > /hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:266: > error: bad use of '>' > [ERROR]* @return returns true if any resource is >0 > [ERROR] ^ > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
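The usual fix for this class of javadoc error is to escape the comparison operator rather than reword the sentence. A before/after sketch (the surrounding comment text is taken from the errors above; the placeholder method is illustrative):

```java
// Before: javadoc treats a bare "<=" as the start of a malformed HTML tag.
/**
 * stop preempt container when any major resource type <= 0 for to-preempt
 */

// After: either the {@literal} inline tag or an HTML entity is accepted.
/**
 * stop preempt container when any major resource type {@literal <=} 0
 * for to-preempt
 */
/**
 * stop preempt container when any major resource type &lt;= 0 for to-preempt
 */
```

The same escape applies to the `>0` occurrences in ResourceCalculator flagged by YARN-8369.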
[jira] [Resolved] (YARN-8371) Javadoc error in ResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S resolved YARN-8371. - Resolution: Duplicate > Javadoc error in ResourceCalculator > --- > > Key: YARN-8371 > URL: https://issues.apache.org/jira/browse/YARN-8371 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > > Hadoop package build fails with java doc error > {code} > [ERROR] > /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:263: > error: bad use of '>' > [ERROR]* included) has a >0 value. > [ERROR] ^ > [ERROR] > /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:266: > error: bad use of '>' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8292) Fix the dominant resource preemption cannot happen when some of the resource vector becomes negative
[ https://issues.apache.org/jira/browse/YARN-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492565#comment-16492565 ] Rohith Sharma K S commented on YARN-8292: - This causes java doc errors in trunk and branch-3.1. See YARN-8371 > Fix the dominant resource preemption cannot happen when some of the resource > vector becomes negative > > > Key: YARN-8292 > URL: https://issues.apache.org/jira/browse/YARN-8292 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Sumana Sathish >Assignee: Wangda Tan >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8292.001.patch, YARN-8292.002.patch, > YARN-8292.003.patch, YARN-8292.004.patch, YARN-8292.005.patch, > YARN-8292.006.patch, YARN-8292.007.patch, YARN-8292.008.patch, > YARN-8292.009.patch > > > This is an example of the problem: > > {code} > // guaranteed, max,used, pending > "root(=[30:18:6 30:18:6 12:12:6 1:1:1]);" + //root > "-a(=[10:6:2 10:6:2 6:6:3 0:0:0]);" + // a > "-b(=[10:6:2 10:6:2 6:6:3 0:0:0]);" + // b > "-c(=[10:6:2 10:6:2 0:0:0 1:1:1])"; // c > {code} > There're 3 resource types. Total resource of the cluster is 30:18:6 > For both of a/b, there're 3 containers running, each of container is 2:2:1. > Queue c uses 0 resource, and have 1:1:1 pending resource. > Under existing logic, preemption cannot happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
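The check at the heart of this fix — keep preempting only while the resource amount still has some component above zero — can be sketched over a plain resource vector. The array layout and names are illustrative; the real code works on YARN's Resource and ResourceCalculator types.

```java
// Illustrative sketch over a plain long[] resource vector (e.g. {memory,
// vcores, gpus} as in the 30:18:6 example above), not the real YARN types.
// Returns true if any component is still above zero, i.e. there is still
// something left worth preempting even when other components went negative.
public class ResourceVectors {
  static boolean anyComponentAboveZero(long[] resource) {
    for (long v : resource) {
      if (v > 0) {
        return true;
      }
    }
    return false;
  }
}
```

Under the old all-components logic, queue c's 1:1:1 pending demand could be starved once any component of the computed to-preempt amount went to zero or below; a per-component "any above zero" test keeps dominant-resource preemption alive in that case.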
[jira] [Updated] (YARN-8371) Javadoc error in ResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8371: Target Version/s: 3.2.0, 3.1.1 > Javadoc error in ResourceCalculator > --- > > Key: YARN-8371 > URL: https://issues.apache.org/jira/browse/YARN-8371 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > > Hadoop package build fails with java doc error > {code} > [ERROR] > /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:263: > error: bad use of '>' > [ERROR]* included) has a >0 value. > [ERROR] ^ > [ERROR] > /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:266: > error: bad use of '>' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8371) Javadoc error in ResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8371: Priority: Critical (was: Major) > Javadoc error in ResourceCalculator > --- > > Key: YARN-8371 > URL: https://issues.apache.org/jira/browse/YARN-8371 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > > Hadoop package build fails with java doc error > {code} > [ERROR] > /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:263: > error: bad use of '>' > [ERROR]* included) has a >0 value. > [ERROR] ^ > [ERROR] > /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:266: > error: bad use of '>' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8371) Javadoc error in ResourceCalculator
Rohith Sharma K S created YARN-8371: --- Summary: Javadoc error in ResourceCalculator Key: YARN-8371 URL: https://issues.apache.org/jira/browse/YARN-8371 Project: Hadoop YARN Issue Type: Bug Reporter: Rohith Sharma K S Hadoop package build fails with java doc error {code} [ERROR] /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:263: error: bad use of '>' [ERROR]* included) has a >0 value. [ERROR] ^ [ERROR] /opt/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java:266: error: bad use of '>' {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8068) Application Priority field causes NPE in app timeline publish when Hadoop 2.7 based clients to 2.8+
[ https://issues.apache.org/jira/browse/YARN-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8068: Issue Type: Sub-task (was: Bug) Parent: YARN-8347 > Application Priority field causes NPE in app timeline publish when Hadoop 2.7 > based clients to 2.8+ > --- > > Key: YARN-8068 > URL: https://issues.apache.org/jira/browse/YARN-8068 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 2.8.3 >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Blocker > Fix For: 3.1.0, 2.10.0, 2.9.2, 3.0.3 > > Attachments: YARN-8068.001.patch > > > [TimelineServiceV1Publisher|eclipse-javadoc:%E2%98%82=hadoop-yarn-server-resourcemanager/src%5C/main%5C/java%3Corg.apache.hadoop.yarn.server.resourcemanager.metrics%7BTimelineServiceV1Publisher.java%E2%98%83TimelineServiceV1Publisher].appCreated > will cause NPE as we use like below > {code:java} > entityInfo.put(ApplicationMetricsConstants.APPLICATION_PRIORITY_INFO, > app.getApplicationPriority().getPriority());{code} > We have to handle this case while recovery. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8368) yarn app start cli should print applicationId
[ https://issues.apache.org/jira/browse/YARN-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8368: Attachment: YARN-8368.02.patch > yarn app start cli should print applicationId > - > > Key: YARN-8368 > URL: https://issues.apache.org/jira/browse/YARN-8368 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Critical > Attachments: YARN-8368.01.patch, YARN-8368.02.patch > > > yarn app start cli should print the application Id similar to yarn launch cmd. > {code:java} > bash-4.2$ yarn app -start hbase-app-test > WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of > YARN_LOGFILE. > WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of > YARN_PID_DIR. > 18/05/24 15:15:53 INFO client.RMProxy: Connecting to ResourceManager at > xxx/xxx:8050 > 18/05/24 15:15:54 INFO client.RMProxy: Connecting to ResourceManager at > xxx/xxx:8050 > 18/05/24 15:15:55 INFO client.ApiServiceClient: Service hbase-app-test is > successfully started.{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8368) yarn app start cli should print applicationId
[ https://issues.apache.org/jira/browse/YARN-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8368: Attachment: YARN-8368.01.patch > yarn app start cli should print applicationId > - > > Key: YARN-8368 > URL: https://issues.apache.org/jira/browse/YARN-8368 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Critical > Attachments: YARN-8368.01.patch > > > yarn app start cli should print the application Id similar to yarn launch cmd. > {code:java} > bash-4.2$ yarn app -start hbase-app-test > WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of > YARN_LOGFILE. > WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of > YARN_PID_DIR. > 18/05/24 15:15:53 INFO client.RMProxy: Connecting to ResourceManager at > xxx/xxx:8050 > 18/05/24 15:15:54 INFO client.RMProxy: Connecting to ResourceManager at > xxx/xxx:8050 > 18/05/24 15:15:55 INFO client.ApiServiceClient: Service hbase-app-test is > successfully started.{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-8368) yarn app start cli should print applicationId
[ https://issues.apache.org/jira/browse/YARN-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S reassigned YARN-8368: --- Assignee: Rohith Sharma K S > yarn app start cli should print applicationId > - > > Key: YARN-8368 > URL: https://issues.apache.org/jira/browse/YARN-8368 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Critical > Attachments: YARN-8368.01.patch > > > yarn app start cli should print the application Id similar to yarn launch cmd. > {code:java} > bash-4.2$ yarn app -start hbase-app-test > WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of > YARN_LOGFILE. > WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of > YARN_PID_DIR. > 18/05/24 15:15:53 INFO client.RMProxy: Connecting to ResourceManager at > xxx/xxx:8050 > 18/05/24 15:15:54 INFO client.RMProxy: Connecting to ResourceManager at > xxx/xxx:8050 > 18/05/24 15:15:55 INFO client.ApiServiceClient: Service hbase-app-test is > successfully started.{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8346) Upgrading to 3.1 kills running containers with error "Opportunistic container queue is full"
[ https://issues.apache.org/jira/browse/YARN-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489047#comment-16489047 ] Rohith Sharma K S commented on YARN-8346: - back ported to 2.9 as well. > Upgrading to 3.1 kills running containers with error "Opportunistic container > queue is full" > > > Key: YARN-8346 > URL: https://issues.apache.org/jira/browse/YARN-8346 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.0, 3.0.2 >Reporter: Rohith Sharma K S >Assignee: Jason Lowe >Priority: Blocker > Fix For: 3.1.0, 2.10.0, 3.2.0, 2.9.2, 3.0.3 > > Attachments: YARN-8346.001.patch > > > It is observed while rolling upgrade from 2.8.4 to 3.1 release, all the > running containers are killed and second attempt is launched for that > application. The diagnostics message is "Opportunistic container queue is > full" which is the reason for container killed. > In NM log, I see below logs for after container is recovered. > {noformat} > 2018-05-23 17:18:50,655 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: > Opportunistic container [container_e06_1527075664705_0001_01_01] will > not be queued at the NMsince max queue length [0] has been reached > {noformat} > Following steps are executed for rolling upgrade > # Install 2.8.4 cluster and launch a MR job with distributed cache enabled. > # Stop 2.8.4 RM. Start 3.1.0 RM with same configuration. > # Stop 2.8.4 NM batch by batch. Start 3.1.0 NM batch by batch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8346) Upgrading to 3.1 kills running containers with error "Opportunistic container queue is full"
[ https://issues.apache.org/jira/browse/YARN-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8346: Fix Version/s: 2.9.2 > Upgrading to 3.1 kills running containers with error "Opportunistic container > queue is full" > > > Key: YARN-8346 > URL: https://issues.apache.org/jira/browse/YARN-8346 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.0, 3.0.2 >Reporter: Rohith Sharma K S >Assignee: Jason Lowe >Priority: Blocker > Fix For: 3.1.0, 2.10.0, 3.2.0, 2.9.2, 3.0.3 > > Attachments: YARN-8346.001.patch > > > It is observed while rolling upgrade from 2.8.4 to 3.1 release, all the > running containers are killed and second attempt is launched for that > application. The diagnostics message is "Opportunistic container queue is > full" which is the reason for container killed. > In NM log, I see below logs for after container is recovered. > {noformat} > 2018-05-23 17:18:50,655 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: > Opportunistic container [container_e06_1527075664705_0001_01_01] will > not be queued at the NMsince max queue length [0] has been reached > {noformat} > Following steps are executed for rolling upgrade > # Install 2.8.4 cluster and launch a MR job with distributed cache enabled. > # Stop 2.8.4 RM. Start 3.1.0 RM with same configuration. > # Stop 2.8.4 NM batch by batch. Start 3.1.0 NM batch by batch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8319) More YARN pages need to honor yarn.resourcemanager.display.per-user-apps
[ https://issues.apache.org/jira/browse/YARN-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488667#comment-16488667 ] Rohith Sharma K S commented on YARN-8319: - +1, will commit it shortly > More YARN pages need to honor yarn.resourcemanager.display.per-user-apps > > > Key: YARN-8319 > URL: https://issues.apache.org/jira/browse/YARN-8319 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Reporter: Vinod Kumar Vavilapalli >Assignee: Sunil Govindan >Priority: Major > Attachments: YARN-8319.001.patch, YARN-8319.002.patch, > YARN-8319.003.patch > > > When this config is on > - Per queue page on UI2 should filter app list by user > -- TODO: Verify the same with UI1 Per-queue page > - ATSv2 with UI2 should filter list of all users' flows and flow activities > - Per Node pages > -- Listing of apps and containers on a per-node basis should filter apps and > containers by user. > To this end, because this is no longer just for resourcemanager, we should > also deprecate {{yarn.resourcemanager.display.per-user-apps}} in favor of > {{yarn.webapp.filter-app-list-by-user}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8346) Upgrading to 3.1 kills running containers with error "Opportunistic container queue is full"
[ https://issues.apache.org/jira/browse/YARN-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488526#comment-16488526 ] Rohith Sharma K S commented on YARN-8346: - committing shortly > Upgrading to 3.1 kills running containers with error "Opportunistic container > queue is full" > > > Key: YARN-8346 > URL: https://issues.apache.org/jira/browse/YARN-8346 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.0, 3.0.2 >Reporter: Rohith Sharma K S >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-8346.001.patch > > > It is observed while rolling upgrade from 2.8.4 to 3.1 release, all the > running containers are killed and second attempt is launched for that > application. The diagnostics message is "Opportunistic container queue is > full" which is the reason for container killed. > In NM log, I see below logs for after container is recovered. > {noformat} > 2018-05-23 17:18:50,655 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: > Opportunistic container [container_e06_1527075664705_0001_01_01] will > not be queued at the NMsince max queue length [0] has been reached > {noformat} > Following steps are executed for rolling upgrade > # Install 2.8.4 cluster and launch a MR job with distributed cache enabled. > # Stop 2.8.4 RM. Start 3.1.0 RM with same configuration. > # Stop 2.8.4 NM batch by batch. Start 3.1.0 NM batch by batch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8346) Upgrading to 3.1 kills running containers with error "Opportunistic container queue is full"
[ https://issues.apache.org/jira/browse/YARN-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488439#comment-16488439 ] Rohith Sharma K S commented on YARN-8346: - Thanks [~jlowe] for quick turnaround. I verified the patch in cluster and working fine as expected. I am +1 for the patch. > Upgrading to 3.1 kills running containers with error "Opportunistic container > queue is full" > > > Key: YARN-8346 > URL: https://issues.apache.org/jira/browse/YARN-8346 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.0, 3.0.2 >Reporter: Rohith Sharma K S >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-8346.001.patch > > > It is observed while rolling upgrade from 2.8.4 to 3.1 release, all the > running containers are killed and second attempt is launched for that > application. The diagnostics message is "Opportunistic container queue is > full" which is the reason for container killed. > In NM log, I see below logs for after container is recovered. > {noformat} > 2018-05-23 17:18:50,655 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: > Opportunistic container [container_e06_1527075664705_0001_01_01] will > not be queued at the NMsince max queue length [0] has been reached > {noformat} > Following steps are executed for rolling upgrade > # Install 2.8.4 cluster and launch a MR job with distributed cache enabled. > # Stop 2.8.4 RM. Start 3.1.0 RM with same configuration. > # Stop 2.8.4 NM batch by batch. Start 3.1.0 NM batch by batch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8346) Upgrading to 3.1 kills running containers with error "Opportunistic container queue is full"
[ https://issues.apache.org/jira/browse/YARN-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487175#comment-16487175 ] Rohith Sharma K S commented on YARN-8346: - In ContainerScheduler#enqueueContainer, the execution type is not set for a container recovered from 2.8.4, which results in the else branch with a zero queue length. This sends a kill event for the container, so running containers are killed.
{code}
private boolean enqueueContainer(Container container) {
  boolean isGuaranteedContainer = container.getContainerTokenIdentifier()
      .getExecutionType() == ExecutionType.GUARANTEED;
  boolean isQueued;
  if (isGuaranteedContainer) {
    queuedGuaranteedContainers.put(container.getContainerId(), container);
    isQueued = true;
  } else {
    if (queuedOpportunisticContainers.size() < maxOppQueueLength) {
      LOG.info("Opportunistic container {} will be queued at the NM.",
          container.getContainerId());
      queuedOpportunisticContainers.put(
          container.getContainerId(), container);
      isQueued = true;
    } else {
      LOG.info("Opportunistic container [{}] will not be queued at the NM"
          + "since max queue length [{}] has been reached",
          container.getContainerId(), maxOppQueueLength);
      container.sendKillEvent(
          ContainerExitStatus.KILLED_BY_CONTAINER_SCHEDULER,
          "Opportunistic container queue is full.");
      isQueued = false;
    }
  }
{code}
Since the opportunistic container feature also exists in 2.9, I think this would be an issue when upgrading to 2.9 as well. cc:/ [~jlowe] [~arun.sur...@gmail.com] > Upgrading to 3.1 kills running containers with error "Opportunistic container > queue is full" > > > Key: YARN-8346 > URL: https://issues.apache.org/jira/browse/YARN-8346 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Major > > It is observed while rolling upgrade from 2.8.4 to 3.1 release, all the > running containers are killed and second attempt is launched for that > application.
The diagnostics message is "Opportunistic container queue is > full" which is the reason for container killed. > In NM log, I see below logs for after container is recovered. > {noformat} > 2018-05-23 17:18:50,655 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: > Opportunistic container [container_e06_1527075664705_0001_01_01] will > not be queued at the NMsince max queue length [0] has been reached > {noformat} > Following steps are executed for rolling upgrade > # Install 2.8.4 cluster and launch a MR job with distributed cache enabled. > # Stop 2.8.4 RM. Start 3.1.0 RM with same configuration. > # Stop 2.8.4 NM batch by batch. Start 3.1.0 NM batch by batch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
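One way to picture the kind of guard that addresses this (a hypothetical sketch, not the committed YARN-8346 patch; `effectiveType` is an invented helper): a recovered container whose token carries no execution type predates opportunistic scheduling, so it should default to GUARANTEED rather than fall into the opportunistic branch and be killed:

```java
// ExecutionType mirrors the YARN enum; everything else is illustrative.
enum ExecutionType { GUARANTEED, OPPORTUNISTIC }

public class RecoveredTypeSketch {
    /**
     * A null execution type means the container token was minted by a
     * pre-opportunistic NM (e.g. 2.8.4); treat it as GUARANTEED so the
     * recovered container is not killed against a zero-length queue.
     */
    static ExecutionType effectiveType(ExecutionType fromToken) {
        return fromToken == null ? ExecutionType.GUARANTEED : fromToken;
    }

    public static void main(String[] args) {
        System.out.println(effectiveType(null));
        System.out.println(effectiveType(ExecutionType.OPPORTUNISTIC));
    }
}
```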
[jira] [Created] (YARN-8346) Upgrading to 3.1 kills running containers with error "Opportunistic container queue is full"
Rohith Sharma K S created YARN-8346: --- Summary: Upgrading to 3.1 kills running containers with error "Opportunistic container queue is full" Key: YARN-8346 URL: https://issues.apache.org/jira/browse/YARN-8346 Project: Hadoop YARN Issue Type: Bug Reporter: Rohith Sharma K S It is observed while rolling upgrade from 2.8.4 to 3.1 release, all the running containers are killed and second attempt is launched for that application. The diagnostics message is "Opportunistic container queue is full" which is the reason for container killed. In NM log, I see below logs for after container is recovered. {noformat} 2018-05-23 17:18:50,655 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Opportunistic container [container_e06_1527075664705_0001_01_01] will not be queued at the NMsince max queue length [0] has been reached {noformat} Following steps are executed for rolling upgrade # Install 2.8.4 cluster and launch a MR job with distributed cache enabled. # Stop 2.8.4 RM. Start 3.1.0 RM with same configuration. # Stop 2.8.4 NM batch by batch. Start 3.1.0 NM batch by batch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8155) Improve the logging in NMTimelinePublisher and TimelineCollectorWebService
[ https://issues.apache.org/jira/browse/YARN-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478908#comment-16478908 ] Rohith Sharma K S commented on YARN-8155: - [~abmodi] do you have time to update the patch? > Improve the logging in NMTimelinePublisher and TimelineCollectorWebService > -- > > Key: YARN-8155 > URL: https://issues.apache.org/jira/browse/YARN-8155 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Abhishek Modi >Priority: Major > > We see that NM logs are filled with larger stack trace of NotFoundException > if collector is removed from one of the NM and other NMs are still publishing > the entities. > > This Jira is to improve the logging in NM so that we log with informative > message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8297) Incorrect ATS Url used for Wire encrypted cluster
[ https://issues.apache.org/jira/browse/YARN-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8297: Attachment: (was: Ambari.txt) > Incorrect ATS Url used for Wire encrypted cluster > - > > Key: YARN-8297 > URL: https://issues.apache.org/jira/browse/YARN-8297 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Affects Versions: 3.1.0 >Reporter: Yesha Vora >Assignee: Sunil G >Priority: Blocker > Attachments: YARN-8297.001.patch > > > "Service" page uses incorrect web url for ATS in wire encrypted env. For ATS > urls, it uses https protocol with http port. > This issue causes all ATS call to fail and UI does not display component > details. > url used: > https://xxx:8198/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320 > expected url : > https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8297) Incorrect ATS Url used for Wire encrypted cluster
[ https://issues.apache.org/jira/browse/YARN-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8297: Attachment: Ambari.txt > Incorrect ATS Url used for Wire encrypted cluster > - > > Key: YARN-8297 > URL: https://issues.apache.org/jira/browse/YARN-8297 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Affects Versions: 3.1.0 >Reporter: Yesha Vora >Assignee: Sunil G >Priority: Blocker > Attachments: YARN-8297.001.patch > > > "Service" page uses incorrect web url for ATS in wire encrypted env. For ATS > urls, it uses https protocol with http port. > This issue causes all ATS call to fail and UI does not display component > details. > url used: > https://xxx:8198/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320 > expected url : > https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8297) Incorrect ATS Url used for Wire encrypted cluster
[ https://issues.apache.org/jira/browse/YARN-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478880#comment-16478880 ] Rohith Sharma K S commented on YARN-8297: - OK, +1 committing shortly > Incorrect ATS Url used for Wire encrypted cluster > - > > Key: YARN-8297 > URL: https://issues.apache.org/jira/browse/YARN-8297 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Affects Versions: 3.1.0 >Reporter: Yesha Vora >Assignee: Sunil G >Priority: Blocker > Attachments: YARN-8297.001.patch > > > "Service" page uses incorrect web url for ATS in wire encrypted env. For ATS > urls, it uses https protocol with http port. > This issue causes all ATS call to fail and UI does not display component > details. > url used: > https://xxx:8198/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320 > expected url : > https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8293) In YARN Services UI, "User Name for service" should be completely removed in secure clusters
[ https://issues.apache.org/jira/browse/YARN-8293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478844#comment-16478844 ] Rohith Sharma K S commented on YARN-8293: - thanks [~sunilg] for the patch. I tested the patch and looks fine to me. [~eyang] Would you be able to commit this patch today? > In YARN Services UI, "User Name for service" should be completely removed in > secure clusters > > > Key: YARN-8293 > URL: https://issues.apache.org/jira/browse/YARN-8293 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Sunil G >Assignee: Sunil G >Priority: Major > Attachments: YARN-8293.001.patch > > > "User Name for service" should be completely removed in secure clusters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8297) Incorrect ATS Url used for Wire encrypted cluster
[ https://issues.apache.org/jira/browse/YARN-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478841#comment-16478841 ] Rohith Sharma K S commented on YARN-8297: - One doubt in patch, Ajax call is made with async true. This could lead to an issue? It should be false right? > Incorrect ATS Url used for Wire encrypted cluster > - > > Key: YARN-8297 > URL: https://issues.apache.org/jira/browse/YARN-8297 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Affects Versions: 3.1.0 >Reporter: Yesha Vora >Assignee: Sunil G >Priority: Blocker > Attachments: YARN-8297.001.patch > > > "Service" page uses incorrect web url for ATS in wire encrypted env. For ATS > urls, it uses https protocol with http port. > This issue causes all ATS call to fail and UI does not display component > details. > url used: > https://xxx:8198/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320 > expected url : > https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/SERVICE_ATTEMPT?fields=ALL&_=1526415938320 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-8302) ATS v2 should handle HBase connection issue properly
[ https://issues.apache.org/jira/browse/YARN-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S resolved YARN-8302. - Resolution: Won't Fix Closing the JIRA as Won't Fix since it is a configuration issue. The HBase client timeout can be decreased by tuning the configurations noted in the comments above. > ATS v2 should handle HBase connection issue properly > > > Key: YARN-8302 > URL: https://issues.apache.org/jira/browse/YARN-8302 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2 >Affects Versions: 3.1.0 >Reporter: Yesha Vora >Priority: Major > > ATS v2 call times out with below error when it can't connect to HBase > instance. > {code} > bash-4.2$ curl -i -k -s -1 -H 'Content-Type: application/json' -H 'Accept: > application/json' --max-time 5 --negotiate -u : > 'https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/YARN_CONTAINER?fields=ALL&_=1526425686092' > curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received > {code} > {code:title=ATS log} > 2018-05-15 23:10:03,623 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=7, > retries=7, started=8165 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1 > 2018-05-15 23:10:13,651 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=8, > retries=8, started=18192 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row >
'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1 > 2018-05-15 23:10:23,730 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=9, > retries=9, started=28272 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1 > 2018-05-15 23:10:33,788 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=10, > retries=10, started=38330 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1{code} > There are two issues here. > 1) Check why ATS can't connect to HBase > 2) In case of connection error, ATS call should not get timeout. It should > fail with proper error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
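The configuration tuning referenced in the resolution above can be expressed in the hbase-site.xml visible to the ATSv2 timeline reader; the value 7 is the one discussed in the comments (a starting point under the assumptions there, not a general recommendation):

```xml
<!-- hbase-site.xml on the ATSv2 reader/collector classpath.
     Lowering client retries makes a down HBase fail in roughly
     1.5 minutes instead of retrying for about 20 minutes with
     the default of 15. -->
<property>
  <name>hbase.client.retries.number</name>
  <value>7</value>
</property>
```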
[jira] [Commented] (YARN-8302) ATS v2 should handle HBase connection issue properly
[ https://issues.apache.org/jira/browse/YARN-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478532#comment-16478532 ] Rohith Sharma K S commented on YARN-8302: - If HBase is down for any reason, the HBase client will retry for about 20 minutes with the default configuration loaded. Reducing *hbase.client.retries.number* from its default of 15 to 7 cut the retry window drastically, from 20 minutes to 1.5 minutes. > ATS v2 should handle HBase connection issue properly > > > Key: YARN-8302 > URL: https://issues.apache.org/jira/browse/YARN-8302 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2 >Affects Versions: 3.1.0 >Reporter: Yesha Vora >Priority: Major > > ATS v2 call times out with below error when it can't connect to HBase > instance. > {code} > bash-4.2$ curl -i -k -s -1 -H 'Content-Type: application/json' -H 'Accept: > application/json' --max-time 5 --negotiate -u : > 'https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/YARN_CONTAINER?fields=ALL&_=1526425686092' > curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received > {code} > {code:title=ATS log} > 2018-05-15 23:10:03,623 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=7, > retries=7, started=8165 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1 > 2018-05-15 23:10:13,651 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=8, > retries=8, started=18192 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1 > 2018-05-15 23:10:23,730 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=9, > retries=9, started=28272 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1 > 2018-05-15 23:10:33,788 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=10, > retries=10, started=38330 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1{code} > There are two issues here. > 1) Check why ATS can't connect to HBase > 2) In case of connection error, ATS call should not get timeout. It should > fail with proper error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
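The retry window discussed above is driven by the HBase client's exponential backoff between attempts. As a rough sketch (not taken from this thread), the snippet below sums the pauses using the backoff multiplier table from HBase's HConstants.RETRY_BACKOFF; real elapsed time also includes each failed RPC's own connect/timeout cost, so treat these numbers as a lower bound rather than an exact reproduction of the 20-minute figure.

```python
# Rough estimate of how long the HBase client keeps retrying before giving up.
# Assumes HBase's backoff multiplier table (HConstants.RETRY_BACKOFF); actual
# wall-clock time also includes the time each failed RPC attempt itself takes.
RETRY_BACKOFF = [1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200]

def total_retry_pause_ms(retries, pause_ms=100):
    """Sum the pauses between attempts for hbase.client.retries.number=retries."""
    total = 0
    for attempt in range(retries):
        # The multiplier is capped at the last entry of the backoff table.
        mult = RETRY_BACKOFF[min(attempt, len(RETRY_BACKOFF) - 1)]
        total += pause_ms * mult
    return total

# Fewer retries shrink the worst-case wait dramatically.
for n in (7, 15):
    print(f"retries={n}: ~{total_retry_pause_ms(n) / 1000:.1f}s of pause")
```

This illustrates why cutting *hbase.client.retries.number* shrinks the window so sharply: the later attempts carry the largest multipliers, so dropping them removes most of the total wait.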
[jira] [Comment Edited] (YARN-5742) Serve aggregated logs of historical apps from timeline service
[ https://issues.apache.org/jira/browse/YARN-5742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477177#comment-16477177 ] Rohith Sharma K S edited comment on YARN-5742 at 5/16/18 9:46 AM: -- I have created separate JIRAs: YARN-8304 for pulling the log servlet out of AHSWebService, and YARN-8303 for converting TimelineEntity to ApplicationReport/ApplicationAttemptReport/ContainerReport. We can keep this JIRA only for plugging the log servlet into TimelineReader. was (Author: rohithsharma): I have created a separate JIRA for pulling put log servlet from AHSWebService i.e YARN-8304 and for converting TimelineEntity to ApplicationReport/ApplicationAttemptReport/ContainerReport. This JIRA we can keep it only plugging log servlet into TimelineReader. > Serve aggregated logs of historical apps from timeline service > -- > > Key: YARN-5742 > URL: https://issues.apache.org/jira/browse/YARN-5742 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Varun Saxena >Assignee: Rohith Sharma K S >Priority: Critical > Attachments: YARN-5742-POC-v0.patch > > > ATSv1.5 daemon has servlet to serve aggregated logs. But enabling only ATSv2, > does not serve logs from CLI and UI for completed application. Log serving > story has completely broken in ATSv2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5742) Serve aggregated logs of historical apps from timeline service
[ https://issues.apache.org/jira/browse/YARN-5742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477177#comment-16477177 ] Rohith Sharma K S commented on YARN-5742: - I have created a separate JIRA, YARN-8304, for pulling the log servlet out of AHSWebService and for converting TimelineEntity to ApplicationReport/ApplicationAttemptReport/ContainerReport. We can keep this JIRA only for plugging the log servlet into TimelineReader. > Serve aggregated logs of historical apps from timeline service > -- > > Key: YARN-5742 > URL: https://issues.apache.org/jira/browse/YARN-5742 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Varun Saxena >Assignee: Rohith Sharma K S >Priority: Critical > Attachments: YARN-5742-POC-v0.patch > > > ATSv1.5 daemon has servlet to serve aggregated logs. But enabling only ATSv2, > does not serve logs from CLI and UI for completed application. Log serving > story has completely broken in ATSv2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8304) Provide generic log servlet for serving logs
Rohith Sharma K S created YARN-8304: --- Summary: Provide generic log servlet for serving logs Key: YARN-8304 URL: https://issues.apache.org/jira/browse/YARN-8304 Project: Hadoop YARN Issue Type: Sub-task Reporter: Rohith Sharma K S AHSWebService has log-serving REST APIs, i.e. getContainerLog* and getLogs, which are used to view container logs from the UI. They are tightly coupled with ApplicationBaseProtocol, and these APIs exist only in AHS, whereas ATSv2 is designed purely around REST APIs. The proposal is to add a generic log servlet which could be plugged into the ATSv1.5 or ATSv2.0 reader. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
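To make the "generic" part of the YARN-8304 proposal concrete, here is a minimal sketch of the idea; all class and method names are hypothetical illustrations, not the actual patch. The point is that the servlet depends only on a narrow log-source interface, which either the ATSv1.5 store or the ATSv2 TimelineReader could implement:

```python
# Hypothetical sketch: a log servlet decoupled from ApplicationBaseProtocol.
# Either timeline backend plugs in by implementing the small LogSource interface.
from abc import ABC, abstractmethod

class LogSource(ABC):
    """The only thing the servlet needs: resolve a container's log files."""
    @abstractmethod
    def container_log_files(self, container_id):
        ...

class GenericLogServlet:
    def __init__(self, source: LogSource):
        self.source = source  # supplied by the ATSv1.5 or ATSv2 reader

    def get_container_logs(self, container_id):
        # Return (status, payload) in the spirit of an HTTP handler.
        files = self.source.container_log_files(container_id)
        if not files:
            return (404, f"no aggregated logs found for {container_id}")
        return (200, files)
```

Wiring the servlet to a backend then reduces to constructing it with that backend's LogSource implementation, which is what "pluggable" would mean in practice.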
[jira] [Created] (YARN-8303) YarnClient should contact TimelineReader for application/attempt/container report
Rohith Sharma K S created YARN-8303: --- Summary: YarnClient should contact TimelineReader for application/attempt/container report Key: YARN-8303 URL: https://issues.apache.org/jira/browse/YARN-8303 Project: Hadoop YARN Issue Type: Sub-task Reporter: Rohith Sharma K S YarnClient gets app/attempt/container information from the RM; if the RM doesn't have it, the AHS client is queried. When only ATSv2 is enabled, YarnClient returns empty results. YarnClient is used by many users, so app/attempt/container reports come back empty. The proposal is to add an adapter in YarnClient so that app/attempt/container reports can be generated by an AHSv2Client, which issues REST calls to the TimelineReader, fetches the entity, and converts it into an app/attempt/container report. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
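A minimal sketch of the proposed adapter, with hypothetical names, an assumed reader address, and assumed entity field names: it fetches an application entity over the reader's /ws/v2/timeline REST path (the same path family as the curl examples in this thread) and maps it onto a stripped-down report dict, standing in for the real ApplicationReport conversion.

```python
# Hypothetical AHSv2Client-style adapter: TimelineEntity JSON -> app report.
import json
from urllib.request import urlopen

READER_URL = "http://timeline-reader:8188"  # assumed TimelineReader address

def fetch_app_entity(app_id):
    # GET /ws/v2/timeline/apps/{appid} returns the application's entity JSON.
    with urlopen(f"{READER_URL}/ws/v2/timeline/apps/{app_id}") as resp:
        return json.load(resp)

def to_app_report(entity):
    """Map a TimelineEntity dict onto a minimal, illustrative report dict.
    The info keys here are assumptions about the published entity fields."""
    info = entity.get("info", {})
    return {
        "applicationId": entity.get("id"),
        "name": info.get("YARN_APPLICATION_NAME"),
        "state": info.get("YARN_APPLICATION_STATE"),
    }
```

The same two-step shape (REST fetch, then conversion) would repeat for attempt and container reports, which is why a single adapter layer is attractive.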
[jira] [Updated] (YARN-7933) [atsv2 read acls] Add TimelineWriter#writeDomain
[ https://issues.apache.org/jira/browse/YARN-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-7933: Attachment: YARN-7933.06.patch > [atsv2 read acls] Add TimelineWriter#writeDomain > - > > Key: YARN-7933 > URL: https://issues.apache.org/jira/browse/YARN-7933 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vrushali C >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-7933.01.patch, YARN-7933.02.patch, > YARN-7933.03.patch, YARN-7933.04.patch, YARN-7933.05.patch, YARN-7933.06.patch > > > > Add an API TimelineWriter#writeDomain for writing the domain info -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8302) ATS v2 should handle HBase connection issue properly
[ https://issues.apache.org/jira/browse/YARN-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476872#comment-16476872 ] Rohith Sharma K S commented on YARN-8302: - Thanks [~yeshavora] for creating the issue. I am able to reproduce the issue, but the query doesn't hang: with the default HBase timeout configuration, the HBase client retries for 20 minutes and then exits with error code 500. I feel this is the right behavior. If you want to reduce the timeout, you need to tune the HBase configurations below # ZK session timeout (*zookeeper.session.timeout*) # RPC timeout (*hbase.rpc.timeout*) # RecoverableZookeeper retry count and retry wait (*zookeeper.recovery.retry*, *zookeeper.recovery.retry.intervalmill*) # Client retry count and wait (*hbase.client.retries.number*, *hbase.client.pause*) Note that setting the timeout too low will cause errors during temporary network glitches. Twenty minutes of retries should be feasible. Do you want to reduce the retry timeout? > ATS v2 should handle HBase connection issue properly > > > Key: YARN-8302 > URL: https://issues.apache.org/jira/browse/YARN-8302 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2 >Affects Versions: 3.1.0 >Reporter: Yesha Vora >Priority: Major > > ATS v2 call times out with below error when it can't connect to HBase > instance. 
> {code} > bash-4.2$ curl -i -k -s -1 -H 'Content-Type: application/json' -H 'Accept: > application/json' --max-time 5 --negotiate -u : > 'https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/YARN_CONTAINER?fields=ALL&_=1526425686092' > curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received > {code} > {code:title=ATS log} > 2018-05-15 23:10:03,623 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=7, > retries=7, started=8165 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1 > 2018-05-15 23:10:13,651 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=8, > retries=8, started=18192 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1 > 2018-05-15 23:10:23,730 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=9, > retries=9, started=28272 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1 > 2018-05-15 23:10:33,788 
INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=10, > retries=10, started=38330 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1{code} > There are two issues here. > 1) Check why ATS can't connect to HBase > 2) In case of connection error, ATS call should not get timeout. It should > fail with proper error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
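For reference, the four groups of knobs listed in the comment above would be set in hbase-site.xml on the HBase-client side (the timeline reader/collector). The values below are illustrative placeholders only, not recommendations from this thread:

```xml
<!-- Illustrative hbase-site.xml fragment; values are examples, not advice. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>30000</value>
</property>
<property>
  <name>zookeeper.recovery.retry</name>
  <value>3</value>
</property>
<property>
  <name>zookeeper.recovery.retry.intervalmill</name>
  <value>1000</value>
</property>
<property>
  <name>hbase.client.retries.number</name>
  <value>7</value>
</property>
<property>
  <name>hbase.client.pause</name>
  <value>100</value>
</property>
```

As the comment warns, values that are too aggressive will surface errors during transient network glitches, so any reduction should be checked against expected ZooKeeper/RegionServer failover times.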
[jira] [Commented] (YARN-8302) ATS v2 should handle HBase connection issue properly
[ https://issues.apache.org/jira/browse/YARN-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476873#comment-16476873 ] Rohith Sharma K S commented on YARN-8302: - cc:/ [~vrushalic] [~haibo.chen] > ATS v2 should handle HBase connection issue properly > > > Key: YARN-8302 > URL: https://issues.apache.org/jira/browse/YARN-8302 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2 >Affects Versions: 3.1.0 >Reporter: Yesha Vora >Priority: Major > > ATS v2 call times out with below error when it can't connect to HBase > instance. > {code} > bash-4.2$ curl -i -k -s -1 -H 'Content-Type: application/json' -H 'Accept: > application/json' --max-time 5 --negotiate -u : > 'https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/YARN_CONTAINER?fields=ALL&_=1526425686092' > curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received > {code} > {code:title=ATS log} > 2018-05-15 23:10:03,623 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=7, > retries=7, started=8165 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1 > 2018-05-15 23:10:13,651 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=8, > retries=8, started=18192 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, 
seqNum=-1 > 2018-05-15 23:10:23,730 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=9, > retries=9, started=28272 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1 > 2018-05-15 23:10:33,788 INFO client.RpcRetryingCallerImpl > (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=10, > retries=10, started=38330 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 > failed on connection exception: > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: xxx/xxx:17020, details=row > 'prod.timelineservice.app_flow, > ,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, > hostname=xxx,17020,1526348294182, seqNum=-1{code} > There are two issues here. > 1) Check why ATS can't connect to HBase > 2) In case of connection error, ATS call should not get timeout. It should > fail with proper error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8130) Race condition when container events are published for KILLED applications
[ https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475340#comment-16475340 ] Rohith Sharma K S commented on YARN-8130: - thanks to [~haibochen] and [~vrushalic] for the review. I back-ported to branch-3.1/branch-3.0/branch-2 as well. > Race condition when container events are published for KILLED applications > -- > > Key: YARN-8130 > URL: https://issues.apache.org/jira/browse/YARN-8130 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Reporter: Charan Hebri >Assignee: Rohith Sharma K S >Priority: Major > Fix For: 2.10.0, 3.2.0, 3.1.1, 3.0.3 > > Attachments: YARN-8130.01.patch, YARN-8130.02.patch, > YARN-8130.03.patch > > > There seems to be a race condition happening when an application is KILLED > and the corresponding container event information is being published. For > completed containers, a YARN_CONTAINER_FINISHED event is generated but for > some containers in a KILLED application this information is missing. 
Below is > a node manager log snippet, > {code:java} > 2018-04-09 08:44:54,474 INFO shuffle.ExternalShuffleBlockResolver > (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application > application_1523259757659_0003 removed, cleanupLocalDirs = false > 2018-04-09 08:44:54,478 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(632)) - Application > application_1523259757659_0003 transitioned from > APPLICATION_RESOURCES_CLEANINGUP to FINISHED > 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher > (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been > removed before the entity could be published for > TimelineEntity[type='YARN_CONTAINER', > id='container_1523259757659_0003_01_02'] > 2018-04-09 08:44:54,478 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just > finished : application_1523259757659_0003 > 2018-04-09 08:44:54,488 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs > for container container_1523259757659_0003_01_01. Current good log dirs > are /grid/0/hadoop/yarn/log > 2018-04-09 08:44:54,492 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs > for container container_1523259757659_0003_01_02. Current good log dirs > are /grid/0/hadoop/yarn/log > 2018-04-09 08:44:55,470 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(192)) - The collector service for > application_1523259757659_0003 was removed > 2018-04-09 08:44:55,472 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:handle(1572)) - couldn't find application > application_1523259757659_0003 while processing FINISH_APPS event. 
The > ResourceManager allocated resources for this application to the NodeManager > but no active containers were found to process{code} > The container id specified in the log, > *container_1523259757659_0003_01_02* is the one that has the finished > event missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8130) Race condition when container events are published for KILLED applications
[ https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8130: Fix Version/s: 3.0.3 3.1.1 2.10.0 > Race condition when container events are published for KILLED applications > -- > > Key: YARN-8130 > URL: https://issues.apache.org/jira/browse/YARN-8130 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Reporter: Charan Hebri >Assignee: Rohith Sharma K S >Priority: Major > Fix For: 2.10.0, 3.2.0, 3.1.1, 3.0.3 > > Attachments: YARN-8130.01.patch, YARN-8130.02.patch, > YARN-8130.03.patch > > > There seems to be a race condition happening when an application is KILLED > and the corresponding container event information is being published. For > completed containers, a YARN_CONTAINER_FINISHED event is generated but for > some containers in a KILLED application this information is missing. Below is > a node manager log snippet, > {code:java} > 2018-04-09 08:44:54,474 INFO shuffle.ExternalShuffleBlockResolver > (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application > application_1523259757659_0003 removed, cleanupLocalDirs = false > 2018-04-09 08:44:54,478 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(632)) - Application > application_1523259757659_0003 transitioned from > APPLICATION_RESOURCES_CLEANINGUP to FINISHED > 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher > (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been > removed before the entity could be published for > TimelineEntity[type='YARN_CONTAINER', > id='container_1523259757659_0003_01_02'] > 2018-04-09 08:44:54,478 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just > finished : application_1523259757659_0003 > 2018-04-09 08:44:54,488 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs > for container 
container_1523259757659_0003_01_01. Current good log dirs > are /grid/0/hadoop/yarn/log > 2018-04-09 08:44:54,492 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs > for container container_1523259757659_0003_01_02. Current good log dirs > are /grid/0/hadoop/yarn/log > 2018-04-09 08:44:55,470 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(192)) - The collector service for > application_1523259757659_0003 was removed > 2018-04-09 08:44:55,472 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:handle(1572)) - couldn't find application > application_1523259757659_0003 while processing FINISH_APPS event. The > ResourceManager allocated resources for this application to the NodeManager > but no active containers were found to process{code} > The container id specified in the log, > *container_1523259757659_0003_01_02* is the one that has the finished > event missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7933) [atsv2 read acls] Add TimelineWriter#writeDomain
[ https://issues.apache.org/jira/browse/YARN-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474013#comment-16474013 ] Rohith Sharma K S commented on YARN-7933: - bq. Isn't it the case that a TimelineClient must be able to authenticate with the TimelineCollector first before it can post data to that TimelineCollector? The intention I added here follows the ATSv1.5 approach, i.e. if the same or a different client publishes the same domain id, the collector needs to check the domain ACLs, i.e. the owner. In our design I was not sure whether we should check the owner for a domain id, so I added a TODO. If this turns out not to be our design in the future, we can remove it at any point. bq. Where do we check the delegation token inside PerNodeTimelineCollectorService? Timeline token verification happens at the filter layer when the HTTP connection is established, i.e. even before the request reaches the servlets. See the classes NodeTimelineCollectorManager#startWebApp and TimelineAuthenticationFilter > [atsv2 read acls] Add TimelineWriter#writeDomain > - > > Key: YARN-7933 > URL: https://issues.apache.org/jira/browse/YARN-7933 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vrushali C >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-7933.01.patch, YARN-7933.02.patch, > YARN-7933.03.patch, YARN-7933.04.patch, YARN-7933.05.patch > > > > Add an API TimelineWriter#writeDomain for writing the domain info -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8130) Race condition when container events are published for KILLED applications
[ https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472786#comment-16472786 ] Rohith Sharma K S commented on YARN-8130: - Test failures are unrelated to this patch. [~haibochen]/[~vrushalic] could you please help to commit this. > Race condition when container events are published for KILLED applications > -- > > Key: YARN-8130 > URL: https://issues.apache.org/jira/browse/YARN-8130 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Reporter: Charan Hebri >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8130.01.patch, YARN-8130.02.patch, > YARN-8130.03.patch > > > There seems to be a race condition happening when an application is KILLED > and the corresponding container event information is being published. For > completed containers, a YARN_CONTAINER_FINISHED event is generated but for > some containers in a KILLED application this information is missing. Below is > a node manager log snippet, > {code:java} > 2018-04-09 08:44:54,474 INFO shuffle.ExternalShuffleBlockResolver > (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application > application_1523259757659_0003 removed, cleanupLocalDirs = false > 2018-04-09 08:44:54,478 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(632)) - Application > application_1523259757659_0003 transitioned from > APPLICATION_RESOURCES_CLEANINGUP to FINISHED > 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher > (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been > removed before the entity could be published for > TimelineEntity[type='YARN_CONTAINER', > id='container_1523259757659_0003_01_02'] > 2018-04-09 08:44:54,478 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just > finished : application_1523259757659_0003 > 2018-04-09 08:44:54,488 INFO logaggregation.AppLogAggregatorImpl > 
(AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs > for container container_1523259757659_0003_01_01. Current good log dirs > are /grid/0/hadoop/yarn/log > 2018-04-09 08:44:54,492 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs > for container container_1523259757659_0003_01_02. Current good log dirs > are /grid/0/hadoop/yarn/log > 2018-04-09 08:44:55,470 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(192)) - The collector service for > application_1523259757659_0003 was removed > 2018-04-09 08:44:55,472 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:handle(1572)) - couldn't find application > application_1523259757659_0003 while processing FINISH_APPS event. The > ResourceManager allocated resources for this application to the NodeManager > but no active containers were found to process{code} > The container id specified in the log, > *container_1523259757659_0003_01_02* is the one that has the finished > event missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8130) Race condition when container events are published for KILLED applications
[ https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472274#comment-16472274 ] Rohith Sharma K S commented on YARN-8130: - make sense.. updated the patch as per comments. [~haibochen] could you take a look at attached patch? > Race condition when container events are published for KILLED applications > -- > > Key: YARN-8130 > URL: https://issues.apache.org/jira/browse/YARN-8130 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Reporter: Charan Hebri >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8130.01.patch, YARN-8130.02.patch, > YARN-8130.03.patch > > > There seems to be a race condition happening when an application is KILLED > and the corresponding container event information is being published. For > completed containers, a YARN_CONTAINER_FINISHED event is generated but for > some containers in a KILLED application this information is missing. Below is > a node manager log snippet, > {code:java} > 2018-04-09 08:44:54,474 INFO shuffle.ExternalShuffleBlockResolver > (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application > application_1523259757659_0003 removed, cleanupLocalDirs = false > 2018-04-09 08:44:54,478 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(632)) - Application > application_1523259757659_0003 transitioned from > APPLICATION_RESOURCES_CLEANINGUP to FINISHED > 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher > (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been > removed before the entity could be published for > TimelineEntity[type='YARN_CONTAINER', > id='container_1523259757659_0003_01_02'] > 2018-04-09 08:44:54,478 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just > finished : application_1523259757659_0003 > 2018-04-09 08:44:54,488 INFO logaggregation.AppLogAggregatorImpl > 
(AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs > for container container_1523259757659_0003_01_01. Current good log dirs > are /grid/0/hadoop/yarn/log > 2018-04-09 08:44:54,492 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs > for container container_1523259757659_0003_01_02. Current good log dirs > are /grid/0/hadoop/yarn/log > 2018-04-09 08:44:55,470 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(192)) - The collector service for > application_1523259757659_0003 was removed > 2018-04-09 08:44:55,472 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:handle(1572)) - couldn't find application > application_1523259757659_0003 while processing FINISH_APPS event. The > ResourceManager allocated resources for this application to the NodeManager > but no active containers were found to process{code} > The container id specified in the log, > *container_1523259757659_0003_01_02* is the one that has the finished > event missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8130) Race condition when container events are published for KILLED applications
[ https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-8130:
------------------------------------
    Attachment: YARN-8130.03.patch
[jira] [Commented] (YARN-8130) Race condition when container events are published for KILLED applications
[ https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472189#comment-16472189 ]

Rohith Sharma K S commented on YARN-8130:
-----------------------------------------

Do you mean we register another event type in NMTimelinePublisher that removes appId while processing it?
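If the proposal in the comment above is to serialize collector removal through NMTimelinePublisher's own dispatcher, the ordering fix could look roughly like this sketch. All class and method names here are illustrative assumptions, not code from the patch: the point is that routing removal through the same single-threaded event queue guarantees container-finished events already enqueued are handled before the collector goes away.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative sketch: serialize "remove collector" behind pending publish
// events so earlier YARN_CONTAINER_FINISHED events win the race.
public class SerializedDispatcherSketch {
    interface Event { void handle(); }

    private final Queue<Event> queue = new ArrayDeque<>();
    private final List<String> published = new ArrayList<>();
    private boolean collectorAlive = true;

    void publishContainerFinished(String containerId) {
        // Publish is only effective while the collector is still registered.
        queue.add(() -> {
            if (collectorAlive) {
                published.add(containerId);
            }
        });
    }

    void removeCollector() {
        // Enqueued rather than applied immediately: events already in the
        // queue are processed first, preserving FIFO ordering.
        queue.add(() -> { collectorAlive = false; });
    }

    // Simulates the single dispatcher thread draining the queue in order.
    List<String> drain() {
        while (!queue.isEmpty()) {
            queue.poll().handle();
        }
        return published;
    }
}
```

With this ordering, only events published after the removal event is enqueued are dropped, which matches the intent of handling application removal as just another event type inside the publisher.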
[jira] [Commented] (YARN-8130) Race condition when container events are published for KILLED applications
[ https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472177#comment-16472177 ]

Rohith Sharma K S commented on YARN-8130:
-----------------------------------------

bq. we just generate a new event upon Applicaton_FINISHED that is handled by the dispatcher inside NMTimelinePublisher?

Sorry, I didn't get it. Could you explain in more detail?
[jira] [Updated] (YARN-7933) [atsv2 read acls] Add TimelineWriter#writeDomain
[ https://issues.apache.org/jira/browse/YARN-7933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-7933:
------------------------------------
    Attachment: YARN-7933.05.patch

> [atsv2 read acls] Add TimelineWriter#writeDomain
> ------------------------------------------------
>
>                 Key: YARN-7933
>                 URL: https://issues.apache.org/jira/browse/YARN-7933
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vrushali C
>            Assignee: Rohith Sharma K S
>            Priority: Major
>         Attachments: YARN-7933.01.patch, YARN-7933.02.patch, YARN-7933.03.patch, YARN-7933.04.patch, YARN-7933.05.patch
>
> Add an API TimelineWriter#writeDomain for writing the domain info.
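A minimal in-memory sketch of what a TimelineWriter#writeDomain API might look like. The Domain fields, the cluster-scoped key, and the map-backed storage are all assumptions for illustration; real ATSv2 writers persist to backends such as HBase or the local filesystem.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory sketch of the proposed writeDomain API.
public class DomainWriterSketch {
    // A domain carries the read/write ACL scope for timeline entities.
    public static final class Domain {
        public final String id;
        public final String readers;  // e.g. comma-separated users/groups
        public final String writers;

        public Domain(String id, String readers, String writers) {
            this.id = id;
            this.readers = readers;
            this.writers = writers;
        }
    }

    private final Map<String, Domain> store = new HashMap<>();

    // Keyed by cluster + domain id so domains are scoped per cluster.
    public void writeDomain(String clusterId, Domain domain) {
        store.put(clusterId + "/" + domain.id, domain);
    }

    public Domain getDomain(String clusterId, String domainId) {
        return store.get(clusterId + "/" + domainId);
    }
}
```

The shape mirrors the Jira summary: a single writer entry point that takes the domain info and persists it, so readers can later enforce the ACLs it declares.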