[jira] [Updated] (YARN-10735) Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster
[ https://issues.apache.org/jira/browse/YARN-10735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wang, Xinglong updated YARN-10735:
----------------------------------
    Attachment:     (was: YARN-10735.001.patch)

> Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster
> ----------------------------------------------------------------------------
>
>                 Key: YARN-10735
>                 URL: https://issues.apache.org/jira/browse/YARN-10735
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Wang, Xinglong
>            Assignee: Wang, Xinglong
>            Priority: Minor
>
> With Kerberos enabled, an NPE is reported when launching UnmanagedAMLauncher.
> This happens because no AMRMToken is returned in the ApplicationReport. After
> some investigation, it turns out that RMAppImpl has a faulty if condition
> inside createAndGetApplicationReport.
> {code:java}
> 21/04/14 02:46:01 INFO unmanagedamlauncher.UnmanagedAMLauncher: Initializing Client
> 21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Starting Client
> 21/04/14 02:46:02 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
> 21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting up application submission context for ASM
> 21/04/14 02:46:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting unmanaged AM
> 21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Submitting application to ASM
> 21/04/14 02:46:03 INFO impl.YarnClientImpl: Submitted application application_1618393442264_0002
> 21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Got application report from ASM for, appId=2, appAttemptId=appattempt_1618393442264_0002_01, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service: }, appDiagnostics=AM container is launched, waiting for AM container to Register with RM, appMasterHost=N/A, appQueue=abc, appMasterRpcPort=-1, appStartTime=1618393562917, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=N/A, appUser=abc
> 21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Launching AM with application attempt id appattempt_1618393442264_0002_01
> 21/04/14 02:46:04 FATAL unmanagedamlauncher.UnmanagedAMLauncher: Error running Client
> java.lang.NullPointerException
> 	at org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.launchAM(UnmanagedAMLauncher.java:186)
> 	at org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.run(UnmanagedAMLauncher.java:354)
> 	at org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.main(UnmanagedAMLauncher.java:111)
> {code}
>
> {code:java}
> public ApplicationReport createAndGetApplicationReport(String clientUserName,
>     boolean allowAccess) {
>   ..
>   if (currentAttempt != null &&
>       currentAttempt.getAppAttemptState() == RMAppAttemptState.LAUNCHED) {
>     if (getApplicationSubmissionContext().getUnmanagedAM() &&
>         clientUserName != null && getUser().equals(clientUserName)) {
>       Token token = currentAttempt.getAMRMToken();
>       if (token != null) {
>         amrmToken = BuilderUtils.newAMRMToken(token.getIdentifier(),
>             token.getKind().toString(), token.getPassword(),
>             token.getService().toString());
>       }
>     }
>   }
> {code}
> clientUserName is the full name of a Kerberos principal, like a...@domain.com,
> whereas getUser() returns the username recorded in RMAppImpl, which is the
> short name.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
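The mismatch described above can be reduced to a one-line demo. The sketch below is illustrative stdlib-only code with hypothetical names, not the actual RMAppImpl code or patch: it shows why comparing the caller's full Kerberos principal against the short user name stored for the app never matches, so the AMRMToken branch is silently skipped. Real Hadoop derives short names via the configured auth_to_local rules rather than this naive realm-stripping.

```java
// Illustrative sketch (hypothetical helper, not RM code): compare the caller's
// Kerberos short name, not the full principal, with the app's recorded user.
public class PrincipalShortName {

    // Simplified short-name derivation: strip the @REALM suffix. Hadoop itself
    // applies the configured auth_to_local rules instead of this naive split.
    static String shortName(String principal) {
        int at = principal.indexOf('@');
        return at < 0 ? principal : principal.substring(0, at);
    }

    public static void main(String[] args) {
        String clientUserName = "alice@EXAMPLE.COM"; // full principal seen on a Kerberos RPC (invented)
        String appUser = "alice";                    // short name recorded for the app

        // Buggy comparison: full principal vs. short name -> never equal,
        // so the AMRMToken is never populated in a secure cluster.
        System.out.println(appUser.equals(clientUserName));
        // Comparison on short names -> matches, token gets populated.
        System.out.println(appUser.equals(shortName(clientUserName)));
    }
}
```

Running this prints `false` for the buggy comparison and `true` for the short-name comparison, which is exactly the gap the report points at.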
[jira] [Updated] (YARN-10735) Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster
[ https://issues.apache.org/jira/browse/YARN-10735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wang, Xinglong updated YARN-10735:
----------------------------------
    Attachment: YARN-10735.001.patch

> Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster
> ----------------------------------------------------------------------------
>
>                 Key: YARN-10735
>                 URL: https://issues.apache.org/jira/browse/YARN-10735
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Wang, Xinglong
>            Assignee: Wang, Xinglong
>            Priority: Minor
>         Attachments: YARN-10735.001.patch
>
[jira] [Created] (YARN-10735) Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster
Wang, Xinglong created YARN-10735:
-------------------------------------
             Summary: Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster
                 Key: YARN-10735
                 URL: https://issues.apache.org/jira/browse/YARN-10735
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Wang, Xinglong
            Assignee: Wang, Xinglong

With Kerberos enabled, an NPE is reported when launching UnmanagedAMLauncher.
This happens because no AMRMToken is returned in the ApplicationReport. After
some investigation, it turns out that RMAppImpl has a faulty if condition
inside createAndGetApplicationReport.

{code:java}
21/04/14 02:46:01 INFO unmanagedamlauncher.UnmanagedAMLauncher: Initializing Client
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Starting Client
21/04/14 02:46:02 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting up application submission context for ASM
21/04/14 02:46:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting unmanaged AM
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Submitting application to ASM
21/04/14 02:46:03 INFO impl.YarnClientImpl: Submitted application application_1618393442264_0002
21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Got application report from ASM for, appId=2, appAttemptId=appattempt_1618393442264_0002_01, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service: }, appDiagnostics=AM container is launched, waiting for AM container to Register with RM, appMasterHost=N/A, appQueue=hdmi-default, appMasterRpcPort=-1, appStartTime=1618393562917, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=N/A, appUser=b_carmel
21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Launching AM with application attempt id appattempt_1618393442264_0002_01
21/04/14 02:46:04 FATAL unmanagedamlauncher.UnmanagedAMLauncher: Error running Client
java.lang.NullPointerException
	at org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.launchAM(UnmanagedAMLauncher.java:186)
	at org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.run(UnmanagedAMLauncher.java:354)
	at org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.main(UnmanagedAMLauncher.java:111)
{code}

{code:java}
public ApplicationReport createAndGetApplicationReport(String clientUserName,
    boolean allowAccess) {
  ..
  if (currentAttempt != null &&
      currentAttempt.getAppAttemptState() == RMAppAttemptState.LAUNCHED) {
    if (getApplicationSubmissionContext().getUnmanagedAM() &&
        clientUserName != null && getUser().equals(clientUserName)) {
      Token token = currentAttempt.getAMRMToken();
      if (token != null) {
        amrmToken = BuilderUtils.newAMRMToken(token.getIdentifier(),
            token.getKind().toString(), token.getPassword(),
            token.getService().toString());
      }
    }
  }
{code}

clientUserName is the full name of a Kerberos principal, like a...@domain.com,
whereas getUser() returns the username recorded in RMAppImpl, which is the
short name.
[jira] [Updated] (YARN-9980) App hangs in accepted when moved from DEFAULT_PARTITION queue to an exclusive partition queue
[ https://issues.apache.org/jira/browse/YARN-9980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wang, Xinglong updated YARN-9980:
---------------------------------
    Attachment: YARN-9980.001.patch

> App hangs in accepted when moved from DEFAULT_PARTITION queue to an exclusive partition queue
> ---------------------------------------------------------------------------------------------
>
>                 Key: YARN-9980
>                 URL: https://issues.apache.org/jira/browse/YARN-9980
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Wang, Xinglong
>            Assignee: Wang, Xinglong
>            Priority: Minor
>         Attachments: Screen Shot 2019-11-14 at 5.11.39 PM.png, YARN-9980.001.patch
>
> An app hangs in ACCEPTED when moved from a DEFAULT_PARTITION queue to an
> exclusive partition queue.
> queue_root
>   queue_a - default partition
>   queue_b - exclusive partition x; its default partition is x
> When an app is submitted to queue_a with AM_LABEL_EXPRESSION unset, RM assigns
> the default partition as the AM_LABEL_EXPRESSION for this app; the app then
> gets am1 and runs. If the app is later moved to queue_b and am1 is
> preempted/killed/failed, RM schedules another attempt, am2, if the AM retry
> limit allows. But the resource request for am2 still carries
> AM_LABEL_EXPRESSION = default partition, and queue_b has no resources in the
> default partition, so the app stays in the ACCEPTED state forever in the RM UI.
> My understanding is that, since the app was submitted with no
> AM_LABEL_EXPRESSION, and the code base already allows such an app to run in
> its current queue's default partition, the move-queue scenario should also let
> the app run successfully. That means am2 should get resources from queue_b's
> default partition x instead of pending forever.
> In our production, we have a landing queue in the default partition and a
> routing mechanism that moves apps from this queue to other queues, including
> queues with exclusive partitions.
> !Screen Shot 2019-11-14 at 5.11.39 PM.png!
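The fix direction described above can be sketched in a few lines. This is a hedged, stdlib-only illustration with invented helper names, not actual CapacityScheduler code: an unset AM node-label expression is resolved against the app's *current* queue at each attempt, instead of being pinned to the partition captured at submission time.

```java
// Hypothetical helper (illustrative, not scheduler code): resolve the AM
// node-label expression lazily against the app's current queue so a moved app
// follows the new queue's default partition.
public class AmLabelResolution {

    // If the app never asked for a label, follow the current queue's default
    // partition; otherwise honor the explicit expression.
    static String resolveAmLabel(String requestedExpression,
                                 String currentQueueDefaultPartition) {
        if (requestedExpression == null || requestedExpression.isEmpty()) {
            return currentQueueDefaultPartition;
        }
        return requestedExpression;
    }

    public static void main(String[] args) {
        // App submitted to queue_a (default partition ""), then moved to
        // queue_b whose default partition is the exclusive label "x".
        String am1Partition = resolveAmLabel("", "");  // am1 in queue_a
        String am2Partition = resolveAmLabel("", "x"); // am2 after the move
        System.out.println("am1 partition: '" + am1Partition + "'");
        System.out.println("am2 partition: '" + am2Partition + "'");
    }
}
```

With resolution done per attempt, am2 lands in partition "x" and can actually be scheduled in queue_b, rather than requesting the default partition that queue_b does not serve.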
[jira] [Created] (YARN-9980) App hangs in accepted when moved from DEFAULT_PARTITION queue to an exclusive partition queue
Wang, Xinglong created YARN-9980:
------------------------------------
             Summary: App hangs in accepted when moved from DEFAULT_PARTITION queue to an exclusive partition queue
                 Key: YARN-9980
                 URL: https://issues.apache.org/jira/browse/YARN-9980
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Wang, Xinglong
            Assignee: Wang, Xinglong
         Attachments: Screen Shot 2019-11-14 at 5.11.39 PM.png
[jira] [Commented] (YARN-5718) TimelineClient (and other places in YARN) shouldn't over-write HDFS client retry settings which could cause unexpected behavior
[ https://issues.apache.org/jira/browse/YARN-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951623#comment-16951623 ]

Wang, Xinglong commented on YARN-5718:
--------------------------------------
I went through the hdfs code and also found that the issue only exists with a
non-HA hdfs setup. The original description is not correct.

As the following code shows, the retry config is used only in the non-HA case.
In the HA case, RetryPolicies.failoverOnNetworkException will be used.

{code:java}
public class NameNodeProxies {
  public static ProxyAndInfo createProxy(Configuration conf,
      URI nameNodeUri, Class xface, AtomicBoolean fallbackToSimpleAuth)
      throws IOException {
    AbstractNNFailoverProxyProvider failoverProxyProvider =
        createFailoverProxyProvider(conf, nameNodeUri, xface, true,
            fallbackToSimpleAuth);

    if (failoverProxyProvider == null) {
      // Non-HA case
      return createNonHAProxy(conf, NameNode.getAddress(conf, nameNodeUri),
          xface, UserGroupInformation.getCurrentUser(), true,
          fallbackToSimpleAuth);
    } else {
      // HA case
      Conf config = new Conf(conf);
      T proxy = (T) RetryProxy.create(xface, failoverProxyProvider,
          RetryPolicies.failoverOnNetworkException(
              RetryPolicies.TRY_ONCE_THEN_FAIL, config.maxFailoverAttempts,
              config.maxRetryAttempts, config.failoverSleepBaseMillis,
              config.failoverSleepMaxMillis));
{code}

> TimelineClient (and other places in YARN) shouldn't over-write HDFS client retry settings which could cause unexpected behavior
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-5718
>                 URL: https://issues.apache.org/jira/browse/YARN-5718
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, timelineclient
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Major
>             Fix For: 3.0.0-alpha2
>
>         Attachments: YARN-5718-v2.1.patch, YARN-5718-v2.patch, YARN-5718.patch
>
> In one HA cluster, after the NN failed over, we noticed that jobs were failing
> because TimelineClient could not retry the connection to the proper NN. This
> is because we overwrite hdfs client settings, hard-coding the retry policy to
> be enabled, which conflicts with the NN failover case - the hdfs client should
> fail fast so it can retry on another NN.
> We shouldn't assume any retry policy for the hdfs client anywhere in YARN.
> This should stay consistent with the HDFS settings, which use different retry
> policies in different deployment cases. Thus, we should clean up these
> hard-coded settings in YARN, including: FileSystemTimelineWriter,
> FileSystemRMStateStore and FileSystemNodeLabelsStore.
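The conflict between a hard-coded retry policy and failover can be shown with a tiny stdlib-only simulation. All names here are invented for the demo (no Hadoop classes): a policy that keeps retrying a fixed target never succeeds once that target is down, while failing fast per target lets a failover layer move on, mirroring TRY_ONCE_THEN_FAIL wrapped in failoverOnNetworkException.

```java
import java.util.List;
import java.util.function.IntPredicate;

// Stdlib-only illustration of retry-vs-failover: endpoint 0 is the NN that
// just failed over; only endpoint 1 answers.
public class FailoverDemo {

    static final IntPredicate CALL = endpoint -> endpoint == 1;

    // Hard-coded retry against a fixed target: never succeeds while it's down.
    static boolean retrySameTarget(int endpoint, int attempts) {
        for (int i = 0; i < attempts; i++) {
            if (CALL.test(endpoint)) {
                return true;
            }
        }
        return false;
    }

    // Fail fast per target (one try each) and let the failover layer advance
    // to the next endpoint.
    static boolean failFastWithFailover(List<Integer> endpoints) {
        for (int endpoint : endpoints) {
            if (CALL.test(endpoint)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stuck retrying the dead NN -> false.
        System.out.println(retrySameTarget(0, 10));
        // Fail fast, fail over, reach the active NN -> true.
        System.out.println(failFastWithFailover(List.of(0, 1)));
    }
}
```

This is the behavior the comment describes: the per-target policy must stay fail-fast so the failover wrapper, not the inner retry loop, decides where the next attempt goes.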
[jira] [Commented] (YARN-5748) Backport YARN-5718 to branch-2
[ https://issues.apache.org/jira/browse/YARN-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951621#comment-16951621 ]

Wang, Xinglong commented on YARN-5748:
--------------------------------------
I went through the hdfs code and also found that the issue only exists with a
non-HA hdfs setup. The original description is not correct.

As the following code shows, the retry config is used only in the non-HA case.
In the HA case, RetryPolicies.failoverOnNetworkException will be used.

{code:java}
public static ProxyAndInfo createProxy(Configuration conf,
    URI nameNodeUri, Class xface, AtomicBoolean fallbackToSimpleAuth)
    throws IOException {
  AbstractNNFailoverProxyProvider failoverProxyProvider =
      createFailoverProxyProvider(conf, nameNodeUri, xface, true,
          fallbackToSimpleAuth);

  if (failoverProxyProvider == null) {
    // Non-HA case
    return createNonHAProxy(conf, NameNode.getAddress(conf, nameNodeUri),
        xface, UserGroupInformation.getCurrentUser(), true,
        fallbackToSimpleAuth);
  } else {
    // HA case
    Conf config = new Conf(conf);
    T proxy = (T) RetryProxy.create(xface, failoverProxyProvider,
        RetryPolicies.failoverOnNetworkException(
            RetryPolicies.TRY_ONCE_THEN_FAIL, config.maxFailoverAttempts,
            config.maxRetryAttempts, config.failoverSleepBaseMillis,
            config.failoverSleepMaxMillis));
{code}

> Backport YARN-5718 to branch-2
> ------------------------------
>
>                 Key: YARN-5748
>                 URL: https://issues.apache.org/jira/browse/YARN-5748
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Masatake Iwasaki
>            Priority: Major
>         Attachments: YARN-5748-branch-2.001.patch, YARN-5748-branch-2.002.patch
>
> In YARN-5718, we identified several unnecessary configs that overwrite HDFS
> client behavior in several components of YARN (FSRMStore, TimelineClient,
> NodeLabelStore, etc.) and cause job failures in some cases (NN HA, etc.) -
> that definitely counts as a bug.
> In YARN-5718, we proposed to remove the config since it isn't supposed to
> work; that change is committed to trunk already, as the alpha stage has more
> flexibility for incompatible changes. In branch-2, we want to play it a bit
> safer and have more discussion.
> Obviously, there are several options here:
> 1. Don't fix anything and let the bug exist.
> 2. Fix the bug but keep the configuration, or mark it deprecated and add an
> explanation that this configuration is not supposed to work any more.
> 3. Exactly like YARN-5718, fix the bug and remove the unnecessary
> configuration.
> This ticket is filed for more discussion.
[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode
[ https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940752#comment-16940752 ]

Wang, Xinglong commented on YARN-9847:
--------------------------------------
I was not aware of YARN-6967. It is good enough for this issue. We can close this one.

> ZKRMStateStore will cause zk connection loss when writing huge data into znode
> ------------------------------------------------------------------------------
>
>                 Key: YARN-9847
>                 URL: https://issues.apache.org/jira/browse/YARN-9847
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Wang, Xinglong
>            Assignee: Wang, Xinglong
>            Priority: Minor
>         Attachments: YARN-9847.001.patch, YARN-9847.002.patch
>
> Recently, we encountered an RM ZK connection issue because RM was trying to
> write huge data into a znode. This behavior makes zk report a Len error,
> which then causes zk session connection loss; eventually RM crashes due to
> the zk connection issue.
> *The fix*
> In order to protect the ResourceManager from crashing due to this, the fix
> limits the size of the data stored per attempt by truncating the diagnostic
> info when writing ApplicationAttemptStateData into the znode. The size is
> regulated by -Djute.maxbuffer set in yarn-env.sh; the same value is also used
> by the zookeeper server.
> *The story*
> ResourceManager Log
> {code:java}
> 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
> 	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> 	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> 	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> 	at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> 	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> 	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> 	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> 	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2019-07-29 04:27:35,459 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> 	at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
> 	at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> 	at
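The mitigation described above (capping the diagnostics so the serialized attempt record fits under the jute.maxbuffer-derived znode limit) can be sketched in stdlib-only Java. This is a hedged illustration, not the actual YARN-9847 patch: the marker text and limit value are invented, and the byte-level cut is simplified.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch (not the actual patch): cap a diagnostics string so the
// serialized attempt record stays under the znode size limit.
public class DiagnosticsTruncation {

    static final String MARKER = "...[truncated]"; // invented marker text

    static String truncate(String diagnostics, int limitBytes) {
        byte[] bytes = diagnostics.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= limitBytes) {
            return diagnostics; // already within the budget, keep as-is
        }
        // Note: a raw byte cut can split a multi-byte UTF-8 character; a real
        // implementation should trim back to a character boundary.
        int keep = Math.max(0, limitBytes - MARKER.getBytes(StandardCharsets.UTF_8).length);
        return new String(bytes, 0, keep, StandardCharsets.UTF_8) + MARKER;
    }

    public static void main(String[] args) {
        String huge = "x".repeat(200_000);       // e.g. a giant container stack trace
        String capped = truncate(huge, 100_000); // pretend the znode budget is ~100 KB
        System.out.println(capped.length());     // well under the original 200000
        System.out.println(capped.endsWith(MARKER));
    }
}
```

Only the diagnostics field is cut, which matches the comparison later in this thread: after truncation the whole serialized record fits the limit while every other field is preserved intact.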
[jira] [Created] (YARN-9854) RM jetty hang due to WebAppProxyServlet lack of timeout while doing proxyLink

Wang, Xinglong created YARN-9854:
------------------------------------
             Summary: RM jetty hang due to WebAppProxyServlet lack of timeout while doing proxyLink
                 Key: YARN-9854
                 URL: https://issues.apache.org/jira/browse/YARN-9854
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: amrmproxy, resourcemanager, webapp
            Reporter: Wang, Xinglong
            Assignee: Wang, Xinglong

RM will proxy url requests to [http://rm:port/proxy/application_x] to the AM or
the related history server.

Recently we met an issue, https://issues.apache.org/jira/browse/SPARK-26961,
which can cause a Spark AM to hang forever. And we have a monitor tool that
accesses [http://rm:port/proxy/application_x] periodically. Thus all proxied
connections to the hung Spark AM also hang forever, because WebAppProxyServlet
lacks a socket connection timeout setting when it initializes the httpclient
towards this Spark AM.

The jetty server holding the RM servlets has a limited number of threads. In
this case, each such thread hangs waiting for the Spark AM response.
Eventually all jetty threads serving http traffic hang, leaving all RM web
links unresponsive. If we give the httpclient a timeout config, we will be
free of this issue.
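The proposed guard can be shown with the JDK's own HTTP client plumbing. This is a minimal stdlib illustration, not the servlet's actual code (WebAppProxyServlet uses Apache HttpClient, and the timeout values here are invented): give every proxied request a connect and read timeout so a hung AM cannot pin a jetty worker thread forever.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Stdlib sketch of the timeout guard (hypothetical values): bound both the
// connection setup and the wait for a response from the proxied AM.
public class ProxyTimeoutSketch {

    static HttpURLConnection openWithTimeouts(String trackingUrl) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(trackingUrl).openConnection();
        conn.setConnectTimeout(5_000);  // fail if the AM never accepts the connection
        conn.setReadTimeout(60_000);    // fail if the AM accepts but never responds
        return conn;
    }

    public static void main(String[] args) throws Exception {
        // openConnection() does not touch the network, so this is safe to run.
        HttpURLConnection conn =
                openWithTimeouts("http://example.com/proxy/application_x");
        System.out.println(conn.getConnectTimeout());
        System.out.println(conn.getReadTimeout());
    }
}
```

With a bounded read timeout, a hung AM costs a jetty worker at most that timeout instead of blocking it indefinitely, so the RM web UI stays responsive even while the monitor keeps polling the proxy URL.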
[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode
[ https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936506#comment-16936506 ] Wang, Xinglong commented on YARN-9847: -- [~tangzhankun], an instance of ApplicationAttemptStateData is fully serialized into the app attempt znode, and besides the diagnostics info the instance contains several other fields, including startTime, finalTrackingUrl, exitStatus etc. So if diagnostics is 100KB, the serialized ApplicationAttemptStateData will be larger than 100KB.
{code:java}
public abstract class ApplicationAttemptStateData {
  public static ApplicationAttemptStateData newInstance(
      ApplicationAttemptId attemptId, Container container,
      Credentials attemptTokens, long startTime, RMAppAttemptState finalState,
      String finalTrackingUrl, String diagnostics,
      FinalApplicationStatus amUnregisteredFinalStatus, int exitStatus,
      long finishTime, Map<String, Long> resourceSecondsMap,
      Map<String, Long> preemptedResourceSecondsMap) {
    ApplicationAttemptStateData attemptStateData =
        Records.newRecord(ApplicationAttemptStateData.class);
    attemptStateData.setAttemptId(attemptId);
    attemptStateData.setMasterContainer(container);
    attemptStateData.setAppAttemptTokens(attemptTokens);
    attemptStateData.setState(finalState);
    attemptStateData.setFinalTrackingUrl(finalTrackingUrl);
    attemptStateData.setDiagnostics(diagnostics == null ? "" : diagnostics);
    attemptStateData.setStartTime(startTime);
    attemptStateData.setFinalApplicationStatus(amUnregisteredFinalStatus);
    attemptStateData.setAMContainerExitStatus(exitStatus);
    attemptStateData.setFinishTime(finishTime);
    attemptStateData.setMemorySeconds(RMServerUtils
        .getOrDefault(resourceSecondsMap,
            ResourceInformation.MEMORY_MB.getName(), 0L));
    attemptStateData.setVcoreSeconds(RMServerUtils
        .getOrDefault(resourceSecondsMap,
            ResourceInformation.VCORES.getName(), 0L));
    attemptStateData.setPreemptedMemorySeconds(RMServerUtils
        .getOrDefault(preemptedResourceSecondsMap,
            ResourceInformation.MEMORY_MB.getName(), 0L));
    attemptStateData.setPreemptedVcoreSeconds(RMServerUtils
        .getOrDefault(preemptedResourceSecondsMap,
            ResourceInformation.VCORES.getName(), 0L));
    attemptStateData.setResourceSecondsMap(resourceSecondsMap);
    attemptStateData
        .setPreemptedResourceSecondsMap(preemptedResourceSecondsMap);
    return attemptStateData;
  }
{code}
In the test, I limited the znode size to 100KB, which means the serialized ApplicationAttemptStateData should be at or below 100KB, and I generated 100KB of diagnostics data within ApplicationAttemptStateData so that the fully serialized ApplicationAttemptStateData would be bigger than 100KB, which triggers the truncate logic.
*Original* ApplicationAttemptStateData serialized size > 100KB; ApplicationAttemptStateData.diagnostics serialized size = 100KB
*Truncated* ApplicationAttemptStateData serialized size = 100KB; ApplicationAttemptStateData.diagnostics serialized size < 100KB, because truncation happened on this field only.
This is why this assert stands. 
{code:java} assertNotEquals("", attempt1.getDiagnostics(), attemptStateData1.getDiagnostics()); {code} > ZKRMStateStore will cause zk connection loss when writing huge data into znode > -- > > Key: YARN-9847 > URL: https://issues.apache.org/jira/browse/YARN-9847 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wang, Xinglong >Assignee: Wang, Xinglong >Priority: Minor > Attachments: YARN-9847.001.patch, YARN-9847.002.patch > > > Recently, we encountered an RM ZK connection issue caused by the RM trying to > write huge data into a znode. This makes zk report a Len error and then > causes zk session connection loss; eventually the RM crashes due to the zk > connection issue. > *The fix* > To protect the ResourceManager from crashing because of this, the fix limits > the size of the attempt data by truncating the diagnostic info when writing > ApplicationAttemptStateData into the znode. The size is regulated by > -Djute.maxbuffer set in yarn-env.sh; the same value is also used by the > zookeeper server. > *The story* > ResourceManager Log > {code:java} > 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session > 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, > unexpected error, closing socket connection and attempting reconnect > java.io.IOException: Broken pipe > at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > at sun.nio.ch.IOUtil.write(IOUtil.java:65) > at
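The truncate logic discussed in this comment can be illustrated with a small stand-alone sketch. This is a hypothetical example of the idea, not the patch's code (the class and method names are made up, and a real implementation must also budget for the non-diagnostics fields of ApplicationAttemptStateData and avoid cutting a multi-byte UTF-8 character in half):

```java
import java.nio.charset.StandardCharsets;

public class DiagnosticsTruncatorSketch {
    /** Trim diagnostics so its UTF-8 byte size is at most maxBytes. */
    static String truncate(String diagnostics, int maxBytes) {
        byte[] bytes = diagnostics.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return diagnostics; // already fits, store unchanged
        }
        // Naive byte cut; fine for ASCII diagnostics, unsafe mid-codepoint.
        return new String(bytes, 0, maxBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String big = "x".repeat(200 * 1024);    // 200 KB of diagnostics
        String cut = truncate(big, 100 * 1024); // 100 KB limit, as in the test
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

This mirrors the test setup above: only the diagnostics field shrinks, so the truncated ApplicationAttemptStateData fits the znode while every other field survives intact.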
[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode
[ https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936353#comment-16936353 ] Wang, Xinglong commented on YARN-9847: -- [~tangzhankun] https://issues.apache.org/jira/browse/YARN-5006 fixed the znode size issue hit when a user submits a new application with huge applicationData. It concerns the following znode, which stores ApplicationStateData.java: {code:java} /rmstore/ZKRMStateRoot/RMAppRoot/application_ {code} This ticket, in contrast, solves the znode size issue hit when an app attempt sends huge diagnostic info from the AM to the RM as its failure diagnostics. It concerns the following znode, which stores ApplicationAttemptStateData.java: {code:java} /rmstore/ZKRMStateRoot/RMAppRoot/application_/appattempt_xxx_01 {code} In both cases, the RM loses its connection to zk and eventually quits. https://issues.apache.org/jira/browse/YARN-5006 reports an exception and rejects the problematic application during submission, while this ticket truncates the over-sized diagnostics info so that attemptStateData fits into the znode. They are two different cases and need different handling. > ZKRMStateStore will cause zk connection loss when writing huge data into znode > -- > > Key: YARN-9847 > URL: https://issues.apache.org/jira/browse/YARN-9847 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wang, Xinglong >Assignee: Wang, Xinglong >Priority: Minor > Attachments: YARN-9847.001.patch, YARN-9847.002.patch
[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode
[ https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935747#comment-16935747 ] Wang, Xinglong commented on YARN-9847: -- I see https://issues.apache.org/jira/browse/YARN-5006 introduced a configuration for this, so I'm in favor of using yarn.resourcemanager.zk-max-znode-size.bytes rather than -Djute.maxbuffer. > ZKRMStateStore will cause zk connection loss when writing huge data into znode > -- > > Key: YARN-9847 > URL: https://issues.apache.org/jira/browse/YARN-9847 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wang, Xinglong >Assignee: Wang, Xinglong >Priority: Minor > Attachments: YARN-9847.001.patch
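The configuration preference from the comment above can be sketched as a lookup order. This is hypothetical stand-in code: a plain Map plays the role of Hadoop's Configuration object, and the fallback default is an assumption modeled on ZooKeeper's roughly 1 MB jute.maxbuffer default; only the property names come from the tickets:

```java
import java.util.HashMap;
import java.util.Map;

public class ZnodeLimitSketch {
    static final String YARN_KEY =
        "yarn.resourcemanager.zk-max-znode-size.bytes";
    // Assumed fallback, modeled on ZooKeeper's ~1 MB jute.maxbuffer default.
    static final int DEFAULT_LIMIT = 1024 * 1024;

    /** Prefer the YARN setting; fall back to -Djute.maxbuffer, then default. */
    static int resolveLimit(Map<String, String> conf) {
        String v = conf.get(YARN_KEY);
        if (v != null) {
            return Integer.parseInt(v);
        }
        // Integer.getInteger reads the system property set via -Djute.maxbuffer.
        return Integer.getInteger("jute.maxbuffer", DEFAULT_LIMIT);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(YARN_KEY, "102400"); // e.g. a 100 KB limit
        System.out.println(resolveLimit(conf));
    }
}
```

Keeping the dedicated YARN property first means operators tune one documented knob instead of overloading a ZooKeeper client JVM flag.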
[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode
[ https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935745#comment-16935745 ] Wang, Xinglong commented on YARN-9847: -- [~tangzhankun] It will only truncate the diagnostic string to make the serialized bytes smaller. This will not affect app attempt recovery from zk because the. > ZKRMStateStore will cause zk connection loss when writing huge data into znode > -- > > Key: YARN-9847 > URL: https://issues.apache.org/jira/browse/YARN-9847 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wang, Xinglong >Assignee: Wang, Xinglong >Priority: Minor > Attachments: YARN-9847.001.patch
[jira] [Created] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode
Wang, Xinglong created YARN-9847: Summary: ZKRMStateStore will cause zk connection loss when writing huge data into znode Key: YARN-9847 URL: https://issues.apache.org/jira/browse/YARN-9847 Project: Hadoop YARN Issue Type: Improvement Reporter: Wang, Xinglong Assignee: Wang, Xinglong Recently, we encountered an RM ZK connection issue caused by the RM trying to write huge data into a znode. This makes zk report a Len error and then causes zk session connection loss; eventually the RM crashes due to the zk connection issue. *The fix* To protect the ResourceManager from crashing because of this, the fix limits the size of the attempt data by truncating the diagnostic info when writing ApplicationAttemptStateData into the znode. The size is regulated by -Djute.maxbuffer set in yarn-env.sh; the same value is also used by the zookeeper server. *The story* ResourceManager Log {code:java} 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, unexpected error, closing socket connection and attempting reconnect java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:65) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) 2019-07-29 04:27:35,459 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) at java.lang.Thread.run(Thread.java:745) {code} The ResourceManager retries connecting to zookeeper until it exhausts its retry count, then gives up. {code:java} 2019-07-29 02:25:06,404 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 999 2019-07-29 02:25:06,718 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism. 2019-07-29 02:25:06,718 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 2019-07-29 02:25:06,404 INFO
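The retry behavior visible in the log above (retry a ZK operation up to a fixed budget, then give up) can be sketched as a generic bounded-retry loop. This is purely illustrative; the names and the retry budget are assumptions, not ZKRMStateStore's actual implementation:

```java
import java.util.concurrent.Callable;

public class BoundedRetrySketch {
    /** Run op up to maxRetries times; rethrow the last failure if all fail. */
    static <T> T runWithRetries(Callable<T> op, int maxRetries)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // e.g. a ConnectionLossException from ZooKeeper
                System.out.println("Retrying operation on ZK. Retry no. " + attempt);
            }
        }
        throw last; // retry budget exhausted: give up, as the RM does
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Succeeds on the third call, so two retries are logged first.
        String r = runWithRetries(() -> {
            if (++calls[0] < 3) throw new Exception("ConnectionLoss");
            return "ok";
        }, 5);
        System.out.println(r + " after " + calls[0] + " calls");
    }
}
```

The point of the ticket is that when the failure is caused by an over-sized write, no amount of retrying helps: every one of the 1000 attempts fails the same way, which is why truncating the payload is the right fix.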
[jira] [Updated] (YARN-9494) ApplicationHistoryServer endpoint access wrongly requested
[ https://issues.apache.org/jira/browse/YARN-9494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wang, Xinglong updated YARN-9494: - Attachment: YARN-9494.002.patch > ApplicationHistoryServer endpoint access wrongly requested > -- > > Key: YARN-9494 > URL: https://issues.apache.org/jira/browse/YARN-9494 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2 >Reporter: Wang, Xinglong >Priority: Minor > Attachments: YARN-9494.001.patch, YARN-9494.002.patch > > > With the following configuration, the resource manager redirects > https://resourcemanager.hadoop.com:50030/proxy/application_1553677175329_47053/ > to 0.0.0.0:10200 when it can't find > application_1553677175329_47053 in the applicationManager. > {code:java} > yarn.timeline-service.enabled = false > yarn.timeline-service.generic-application-history.enabled = true > {code} > However, in this case no timeline service is enabled, thus no > yarn.timeline-service.address is defined, and 0.0.0.0:10200 is used as the > timeline server access point. > This combination of configuration is valid, because we have an in-house tool > to analyze the generic-application-history files generated by the resource > manager while we don't enable the timeline service. > {code:java} > HTTP ERROR 500 > Problem accessing /proxy/application_1553677175329_47053/. 
Reason: > Call From x/10.22.59.23 to 0.0.0.0:10200 failed on connection > exception: java.net.ConnectException: Connection refused; For more details > see: http://wiki.apache.org/hadoop/ConnectionRefused > Caused by: > java.net.ConnectException: Call From x/10.22.59.23 to 0.0.0.0:10200 > failed on connection exception: java.net.ConnectException: Connection > refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused > at sun.reflect.GeneratedConstructorAccessor240.newInstance(Unknown > Source) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732) > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558) > at org.apache.hadoop.ipc.Client.call(Client.java:1498) > at org.apache.hadoop.ipc.Client.call(Client.java:1398) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationHistoryProtocolPBClientImpl.getApplicationReport(ApplicationHistoryProtocolPBClientImpl.java:108) > at > org.apache.hadoop.yarn.server.webproxy.AppReportFetcher.getApplicationReport(AppReportFetcher.java:137) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getApplicationReport(WebAppProxyServlet.java:251) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getFetchedAppReport(WebAppProxyServlet.java:491) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:329) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) > at > 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57) > at >
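The failure mode in this ticket (the proxy falling back to an unconfigured 0.0.0.0:10200 history endpoint) suggests guarding the fallback on the timeline service actually being usable. A hypothetical sketch of such a guard, with made-up class and method names standing in for the real AppReportFetcher/WebAppProxyServlet logic; only the address and configuration semantics come from the ticket:

```java
public class AhsFallbackGuardSketch {
    // The all-zeros default the ticket shows when no address is configured.
    static final String UNCONFIGURED_AHS = "0.0.0.0:10200";

    /**
     * Decide whether it makes sense to ask the ApplicationHistoryServer
     * for a report once the RM itself no longer knows the application.
     */
    static boolean shouldQueryHistoryServer(boolean timelineServiceEnabled,
                                            String timelineServiceAddress) {
        if (!timelineServiceEnabled) {
            return false; // YARN-9494's case: no AHS endpoint was ever set up
        }
        // An all-zeros address means no real endpoint was configured either.
        return timelineServiceAddress != null
            && !timelineServiceAddress.equals(UNCONFIGURED_AHS);
    }

    public static void main(String[] args) {
        System.out.println(shouldQueryHistoryServer(false, UNCONFIGURED_AHS));
        System.out.println(shouldQueryHistoryServer(true, "ahs.example.com:10200"));
    }
}
```

With a guard like this, an unknown application would get a clean "not found" response instead of the HTTP 500 from a doomed RPC to 0.0.0.0:10200.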
[jira] [Commented] (YARN-9494) ApplicationHistoryServer endpoint access wrongly requested
[ https://issues.apache.org/jira/browse/YARN-9494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819857#comment-16819857 ] Wang, Xinglong commented on YARN-9494: -- Attached an initial patch to demonstrate the idea. Test-related changes will come later. > ApplicationHistoryServer endpoint access wrongly requested > -- > > Key: YARN-9494 > URL: https://issues.apache.org/jira/browse/YARN-9494 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2 >Reporter: Wang, Xinglong >Priority: Minor > Attachments: YARN-9494.001.patch
[jira] [Updated] (YARN-9494) ApplicationHistoryServer endpoint access wrongly requested
[ https://issues.apache.org/jira/browse/YARN-9494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wang, Xinglong updated YARN-9494: - Attachment: YARN-9494.001.patch > ApplicationHistoryServer endpoint access wrongly requested > -- > > Key: YARN-9494 > URL: https://issues.apache.org/jira/browse/YARN-9494 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2 >Reporter: Wang, Xinglong >Priority: Minor > Attachments: YARN-9494.001.patch > > > With the following configuration, resource manager will redirect > https://resourcemanager.hadoop.com:50030/proxy/application_1553677175329_47053/ > to 0.0.0.0:10200 when resource manager can't find > application_1553677175329_47053 in applicationManager. > {code:java} > yarn.timeline-service.enabled = false > yarn.timeline-service.generic-application-history.enabled = true > {code} > However, in this case, there is no timeline service enabled, thus no > yarn.timeline-service.address defined, and 0.0.0.0:10200 will be used as > timelineserver access point. > This combination of configuration is a valid configuration, due to we have in > house tool to analyze the generic-applicaiton-history files generated by > resource manager. While we don't enable timeline service. > {code:java} > HTTP ERROR 500 > Problem accessing /proxy/application_1553677175329_47053/. 
> Reason:
>     Call From x/10.22.59.23 to 0.0.0.0:10200 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> Caused by: java.net.ConnectException: Call From x/10.22.59.23 to 0.0.0.0:10200 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> 	at sun.reflect.GeneratedConstructorAccessor240.newInstance(Unknown Source)
> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> 	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
> 	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
> 	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1498)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1398)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> 	at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source)
> 	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationHistoryProtocolPBClientImpl.getApplicationReport(ApplicationHistoryProtocolPBClientImpl.java:108)
> 	at org.apache.hadoop.yarn.server.webproxy.AppReportFetcher.getApplicationReport(AppReportFetcher.java:137)
> 	at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getApplicationReport(WebAppProxyServlet.java:251)
> 	at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getFetchedAppReport(WebAppProxyServlet.java:491)
> 	at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:329)
> 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
> 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> 	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> 	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
> 	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
> 	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
> 	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
> 	at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178)
> 	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
> 	at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
> 	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> 	at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
> 	at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
> 	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> 	at org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
> 	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
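The fallback described above is an ordinary get-with-default lookup: when yarn.timeline-service.address is never set, the shipped default of 0.0.0.0:10200 is returned regardless of whether the timeline service is enabled. The sketch below illustrates that behavior with a plain Map in place of Hadoop's Configuration class; the class and key names are illustrative, not the real YarnConfiguration API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for Hadoop's Configuration lookup of the
// timeline server address; not the actual YarnConfiguration code.
public class TimelineAddressFallback {

    // Mirrors the shipped default for yarn.timeline-service.address.
    static final String DEFAULT_TIMELINE_ADDRESS = "0.0.0.0:10200";

    static String timelineAddress(Map<String, String> conf) {
        // get-with-default: an unset address silently falls back to
        // 0.0.0.0:10200, even when yarn.timeline-service.enabled = false.
        return conf.getOrDefault("yarn.timeline-service.address",
                                 DEFAULT_TIMELINE_ADDRESS);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("yarn.timeline-service.enabled", "false");
        conf.put("yarn.timeline-service.generic-application-history.enabled", "true");
        // No yarn.timeline-service.address configured, so the web proxy
        // would attempt the unroutable default and get Connection refused.
        System.out.println(timelineAddress(conf));
    }
}
```

Guarding the AHS call on yarn.timeline-service.enabled (or on an explicitly configured address), rather than relying on the default, would avoid dialing 0.0.0.0:10200 in this configuration.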
[jira] [Created] (YARN-9494) ApplicationHistoryServer endpoint access wrongly requested
Wang, Xinglong created YARN-9494:
------------------------------------

             Summary: ApplicationHistoryServer endpoint access wrongly requested
                 Key: YARN-9494
                 URL: https://issues.apache.org/jira/browse/YARN-9494
             Project: Hadoop YARN
          Issue Type: Bug
          Components: ATSv2
            Reporter: Wang, Xinglong

With the following configuration, the resource manager redirects
https://resourcemanager.hadoop.com:50030/proxy/application_1553677175329_47053/
to 0.0.0.0:10200 when it cannot find application_1553677175329_47053 in the
ApplicationManager.

{code:java}
yarn.timeline-service.enabled = false
yarn.timeline-service.generic-application-history.enabled = true
{code}

However, in this case the timeline service is disabled, so
yarn.timeline-service.address is not defined and the default 0.0.0.0:10200 is
used as the timeline server endpoint.

This combination of settings is valid: we have an in-house tool that analyzes
the generic-application-history files generated by the resource manager, while
the timeline service itself stays disabled.

{code:java}
HTTP ERROR 500
Problem accessing /proxy/application_1553677175329_47053/.
Reason:
    Call From x/10.22.59.23 to 0.0.0.0:10200 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Caused by: java.net.ConnectException: Call From x/10.22.59.23 to 0.0.0.0:10200 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.GeneratedConstructorAccessor240.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558)
	at org.apache.hadoop.ipc.Client.call(Client.java:1498)
	at org.apache.hadoop.ipc.Client.call(Client.java:1398)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
	at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationHistoryProtocolPBClientImpl.getApplicationReport(ApplicationHistoryProtocolPBClientImpl.java:108)
	at org.apache.hadoop.yarn.server.webproxy.AppReportFetcher.getApplicationReport(AppReportFetcher.java:137)
	at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getApplicationReport(WebAppProxyServlet.java:251)
	at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getFetchedAppReport(WebAppProxyServlet.java:491)
	at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:329)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
	at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
	at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
	at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
	at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:617)
	at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:576)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)