[jira] [Updated] (YARN-10735) Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster

2021-04-14 Thread Wang, Xinglong (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wang, Xinglong updated YARN-10735:
--
Attachment: (was: YARN-10735.001.patch)

> Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster
> ---
>
> Key: YARN-10735
> URL: https://issues.apache.org/jira/browse/YARN-10735
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
>
> With Kerberos enabled, an NPE is reported when launching UnmanagedAMLauncher.
> This is because no AMRMToken is returned in the ApplicationReport. After some
> investigation, it turns out that RMAppImpl has a bad if condition inside
> createAndGetApplicationReport.
> {code:java}
> 21/04/14 02:46:01 INFO unmanagedamlauncher.UnmanagedAMLauncher: Initializing 
> Client
> 21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Starting 
> Client
> 21/04/14 02:46:02 INFO client.AHSProxy: Connecting to Application History 
> server at /0.0.0.0:10200
> 21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting up 
> application submission context for ASM
> 21/04/14 02:46:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> 21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting 
> unmanaged AM
> 21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Submitting 
> application to ASM
> 21/04/14 02:46:03 INFO impl.YarnClientImpl: Submitted application 
> application_1618393442264_0002
> 21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Got 
> application report from ASM for, appId=2, 
> appAttemptId=appattempt_1618393442264_0002_01, clientToAMToken=Token { 
> kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=AM container is 
> launched, waiting for AM container to Register with RM, appMasterHost=N/A, 
> appQueue=abc, appMasterRpcPort=-1, appStartTime=1618393562917, 
> yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=N/A, 
> appUser=abc
> 21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Launching AM 
> with application attempt id appattempt_1618393442264_0002_01
> 21/04/14 02:46:04 FATAL unmanagedamlauncher.UnmanagedAMLauncher: Error 
> running Client
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.launchAM(UnmanagedAMLauncher.java:186)
>   at 
> org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.run(UnmanagedAMLauncher.java:354)
>   at 
> org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.main(UnmanagedAMLauncher.java:111)
> {code}
>  
> {code:java}
>  public ApplicationReport createAndGetApplicationReport(String clientUserName,
>   boolean allowAccess) {
> ..
> if (currentAttempt != null && 
> currentAttempt.getAppAttemptState() == 
> RMAppAttemptState.LAUNCHED) {
>   if (getApplicationSubmissionContext().getUnmanagedAM() &&
>   clientUserName != null && getUser().equals(clientUserName)) {
> Token token = currentAttempt.getAMRMToken();
> if (token != null) {
>   amrmToken = BuilderUtils.newAMRMToken(token.getIdentifier(),
>   token.getKind().toString(), token.getPassword(),
>   token.getService().toString());
> }
>   }
> }
> {code}
> clientUserName is the full name of a Kerberos principal, e.g. a...@domain.com,
> whereas getUser() returns the username recorded in RMAppImpl, which is the
> short name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10735) Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster

2021-04-14 Thread Wang, Xinglong (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wang, Xinglong updated YARN-10735:
--
Description: 
With Kerberos enabled, an NPE is reported when launching UnmanagedAMLauncher.
This is because no AMRMToken is returned in the ApplicationReport. After some
investigation, it turns out that RMAppImpl has a bad if condition inside
createAndGetApplicationReport.

{code:java}
21/04/14 02:46:01 INFO unmanagedamlauncher.UnmanagedAMLauncher: Initializing 
Client
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Starting Client
21/04/14 02:46:02 INFO client.AHSProxy: Connecting to Application History 
server at /0.0.0.0:10200
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting up 
application submission context for ASM
21/04/14 02:46:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
to rm2
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting 
unmanaged AM
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Submitting 
application to ASM
21/04/14 02:46:03 INFO impl.YarnClientImpl: Submitted application 
application_1618393442264_0002
21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Got application 
report from ASM for, appId=2, 
appAttemptId=appattempt_1618393442264_0002_01, clientToAMToken=Token { 
kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=AM container is launched, 
waiting for AM container to Register with RM, appMasterHost=N/A, appQueue=abc, 
appMasterRpcPort=-1, appStartTime=1618393562917, yarnAppState=ACCEPTED, 
distributedFinalState=UNDEFINED, appTrackingUrl=N/A, appUser=abc
21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Launching AM 
with application attempt id appattempt_1618393442264_0002_01
21/04/14 02:46:04 FATAL unmanagedamlauncher.UnmanagedAMLauncher: Error running 
Client
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.launchAM(UnmanagedAMLauncher.java:186)
at 
org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.run(UnmanagedAMLauncher.java:354)
at 
org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.main(UnmanagedAMLauncher.java:111)
{code}

 

{code:java}
 public ApplicationReport createAndGetApplicationReport(String clientUserName,
  boolean allowAccess) {
..
if (currentAttempt != null && 
currentAttempt.getAppAttemptState() == RMAppAttemptState.LAUNCHED) {
  if (getApplicationSubmissionContext().getUnmanagedAM() &&
  clientUserName != null && getUser().equals(clientUserName)) {
Token token = currentAttempt.getAMRMToken();
if (token != null) {
  amrmToken = BuilderUtils.newAMRMToken(token.getIdentifier(),
  token.getKind().toString(), token.getPassword(),
  token.getService().toString());
}
  }
}
{code}

clientUserName is the full name of a Kerberos principal, e.g. a...@domain.com,
whereas getUser() returns the username recorded in RMAppImpl, which is the short name.

  was:
With Kerberos enabled, an NPE is reported when launching UnmanagedAMLauncher.
This is because no AMRMToken is returned in the ApplicationReport. After some
investigation, it turns out that RMAppImpl has a bad if condition inside
createAndGetApplicationReport.

{code:java}
21/04/14 02:46:01 INFO unmanagedamlauncher.UnmanagedAMLauncher: Initializing 
Client
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Starting Client
21/04/14 02:46:02 INFO client.AHSProxy: Connecting to Application History 
server at /0.0.0.0:10200
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting up 
application submission context for ASM
21/04/14 02:46:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
to rm2
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting 
unmanaged AM
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Submitting 
application to ASM
21/04/14 02:46:03 INFO impl.YarnClientImpl: Submitted application 
application_1618393442264_0002
21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Got application 
report from ASM for, appId=2, 
appAttemptId=appattempt_1618393442264_0002_01, clientToAMToken=Token { 
kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=AM container is launched, 
waiting for AM container to Register with RM, appMasterHost=N/A, 
appQueue=hdmi-default, appMasterRpcPort=-1, appStartTime=1618393562917, 
yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=N/A, 
appUser=b_carmel
21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Launching AM 
with application attempt id appattempt_1618393442264_0002_01
21/04/14 02:46:04 FATAL 

[jira] [Updated] (YARN-10735) Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster

2021-04-14 Thread Wang, Xinglong (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wang, Xinglong updated YARN-10735:
--
Attachment: YARN-10735.001.patch

> Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster
> ---
>
> Key: YARN-10735
> URL: https://issues.apache.org/jira/browse/YARN-10735
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-10735.001.patch
>
>
> With Kerberos enabled, an NPE is reported when launching UnmanagedAMLauncher.
> This is because no AMRMToken is returned in the ApplicationReport. After some
> investigation, it turns out that RMAppImpl has a bad if condition inside
> createAndGetApplicationReport.
> {code:java}
> 21/04/14 02:46:01 INFO unmanagedamlauncher.UnmanagedAMLauncher: Initializing 
> Client
> 21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Starting 
> Client
> 21/04/14 02:46:02 INFO client.AHSProxy: Connecting to Application History 
> server at /0.0.0.0:10200
> 21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting up 
> application submission context for ASM
> 21/04/14 02:46:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> 21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting 
> unmanaged AM
> 21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Submitting 
> application to ASM
> 21/04/14 02:46:03 INFO impl.YarnClientImpl: Submitted application 
> application_1618393442264_0002
> 21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Got 
> application report from ASM for, appId=2, 
> appAttemptId=appattempt_1618393442264_0002_01, clientToAMToken=Token { 
> kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=AM container is 
> launched, waiting for AM container to Register with RM, appMasterHost=N/A, 
> appQueue=hdmi-default, appMasterRpcPort=-1, appStartTime=1618393562917, 
> yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=N/A, 
> appUser=b_carmel
> 21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Launching AM 
> with application attempt id appattempt_1618393442264_0002_01
> 21/04/14 02:46:04 FATAL unmanagedamlauncher.UnmanagedAMLauncher: Error 
> running Client
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.launchAM(UnmanagedAMLauncher.java:186)
>   at 
> org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.run(UnmanagedAMLauncher.java:354)
>   at 
> org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.main(UnmanagedAMLauncher.java:111)
> {code}
>  
> {code:java}
>  public ApplicationReport createAndGetApplicationReport(String clientUserName,
>   boolean allowAccess) {
> ..
> if (currentAttempt != null && 
> currentAttempt.getAppAttemptState() == 
> RMAppAttemptState.LAUNCHED) {
>   if (getApplicationSubmissionContext().getUnmanagedAM() &&
>   clientUserName != null && getUser().equals(clientUserName)) {
> Token token = currentAttempt.getAMRMToken();
> if (token != null) {
>   amrmToken = BuilderUtils.newAMRMToken(token.getIdentifier(),
>   token.getKind().toString(), token.getPassword(),
>   token.getService().toString());
> }
>   }
> }
> {code}
> clientUserName is the full name of a Kerberos principal, e.g. a...@domain.com,
> whereas getUser() returns the username recorded in RMAppImpl, which is the
> short name.






[jira] [Created] (YARN-10735) Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster

2021-04-14 Thread Wang, Xinglong (Jira)
Wang, Xinglong created YARN-10735:
-

 Summary: Unmanaged AM won't populate AMRMToken to ApplicationReport in secure cluster
 Key: YARN-10735
 URL: https://issues.apache.org/jira/browse/YARN-10735
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong


With Kerberos enabled, an NPE is reported when launching UnmanagedAMLauncher.
This is because no AMRMToken is returned in the ApplicationReport. After some
investigation, it turns out that RMAppImpl has a bad if condition inside
createAndGetApplicationReport.

{code:java}
21/04/14 02:46:01 INFO unmanagedamlauncher.UnmanagedAMLauncher: Initializing 
Client
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Starting Client
21/04/14 02:46:02 INFO client.AHSProxy: Connecting to Application History 
server at /0.0.0.0:10200
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting up 
application submission context for ASM
21/04/14 02:46:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
to rm2
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Setting 
unmanaged AM
21/04/14 02:46:02 INFO unmanagedamlauncher.UnmanagedAMLauncher: Submitting 
application to ASM
21/04/14 02:46:03 INFO impl.YarnClientImpl: Submitted application 
application_1618393442264_0002
21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Got application 
report from ASM for, appId=2, 
appAttemptId=appattempt_1618393442264_0002_01, clientToAMToken=Token { 
kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=AM container is launched, 
waiting for AM container to Register with RM, appMasterHost=N/A, 
appQueue=hdmi-default, appMasterRpcPort=-1, appStartTime=1618393562917, 
yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=N/A, 
appUser=b_carmel
21/04/14 02:46:04 INFO unmanagedamlauncher.UnmanagedAMLauncher: Launching AM 
with application attempt id appattempt_1618393442264_0002_01
21/04/14 02:46:04 FATAL unmanagedamlauncher.UnmanagedAMLauncher: Error running 
Client
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.launchAM(UnmanagedAMLauncher.java:186)
at 
org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.run(UnmanagedAMLauncher.java:354)
at 
org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher.main(UnmanagedAMLauncher.java:111)
{code}

 

{code:java}
 public ApplicationReport createAndGetApplicationReport(String clientUserName,
  boolean allowAccess) {
..
if (currentAttempt != null && 
currentAttempt.getAppAttemptState() == RMAppAttemptState.LAUNCHED) {
  if (getApplicationSubmissionContext().getUnmanagedAM() &&
  clientUserName != null && getUser().equals(clientUserName)) {
Token token = currentAttempt.getAMRMToken();
if (token != null) {
  amrmToken = BuilderUtils.newAMRMToken(token.getIdentifier(),
  token.getKind().toString(), token.getPassword(),
  token.getService().toString());
}
  }
}
{code}

clientUserName is the full name of a Kerberos principal, e.g. a...@domain.com,
whereas getUser() returns the username recorded in RMAppImpl, which is the short
name (a sketch of a principal-aware comparison follows).
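
As a rough illustration of the direction of a fix (a sketch only, not the attached patch), clientUserName could be normalized to its short name before comparing it with the name stored in RMAppImpl; UserGroupInformation applies the configured auth_to_local rules for that.

{code:java}
// Sketch only: compare the caller's (possibly full) Kerberos principal with the
// short user name recorded in RMAppImpl.
import org.apache.hadoop.security.UserGroupInformation;

public class UserNameMatch {
  public static boolean sameUser(String clientUserName, String appUser) {
    if (clientUserName == null || appUser == null) {
      return false;
    }
    // createRemoteUser accepts either a short name or user@REALM; the resulting
    // short user name is derived via the configured auth_to_local rules.
    String shortName =
        UserGroupInformation.createRemoteUser(clientUserName).getShortUserName();
    return appUser.equals(shortName);
  }
}
{code}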






[jira] [Updated] (YARN-9980) App hangs in accepted when moved from DEFAULT_PARTITION queue to an exclusive partition queue

2019-11-14 Thread Wang, Xinglong (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wang, Xinglong updated YARN-9980:
-
Attachment: YARN-9980.001.patch

> App hangs in accepted when moved from DEFAULT_PARTITION queue to an exclusive 
> partition queue
> -
>
> Key: YARN-9980
> URL: https://issues.apache.org/jira/browse/YARN-9980
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
> Attachments: Screen Shot 2019-11-14 at 5.11.39 PM.png, 
> YARN-9980.001.patch
>
>
> App hangs in accepted when moved from DEFAULT_PARTITION queue to an exclusive
> partition queue.
> queue_root
> queue_a   - default_partition
> queue_b   - exclusive partition x, default partition is x
> When an app is submitted to queue_a with AM_LABEL_EXPRESSION unset, the RM
> assigns default_partition as the AM_LABEL_EXPRESSION for the app, which then
> gets am1 and runs. If the app is later moved to queue_b and am1 is
> preempted/killed/failed, another attempt am2 is scheduled if the AM retry limit
> allows. But the resource request for am2 still carries AM_LABEL_EXPRESSION =
> default_partition, and queue_b has no resources in default_partition, so the
> app stays in the ACCEPTED state forever in the RM UI.
> My understanding is that, since the app was submitted with no
> AM_LABEL_EXPRESSION, the code base already allows such an app to run with the
> current queue's default partition.
> For the move-queue scenario we should also let the app run successfully: am2
> should get resources from queue_b's default partition x instead of pending
> forever.
> In our production we have a landing queue in default_partition, plus a routing
> mechanism that moves apps from this queue to other queues, including queues
> with exclusive partitions.
>  !Screen Shot 2019-11-14 at 5.11.39 PM.png! 






[jira] [Created] (YARN-9980) App hangs in accepted when moved from DEFAULT_PARTITION queue to an exclusive partition queue

2019-11-14 Thread Wang, Xinglong (Jira)
Wang, Xinglong created YARN-9980:


 Summary: App hangs in accepted when moved from DEFAULT_PARTITION 
queue to an exclusive partition queue
 Key: YARN-9980
 URL: https://issues.apache.org/jira/browse/YARN-9980
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong
 Attachments: Screen Shot 2019-11-14 at 5.11.39 PM.png

App hangs in accepted when moved from DEFAULT_PARTITION queue to an exclusive
partition queue.

queue_root
queue_a   - default_partition
queue_b   - exclusive partition x, default partition is x

When an app is submitted to queue_a with AM_LABEL_EXPRESSION unset, the RM
assigns default_partition as the AM_LABEL_EXPRESSION for the app, which then gets
am1 and runs. If the app is later moved to queue_b and am1 is
preempted/killed/failed, another attempt am2 is scheduled if the AM retry limit
allows. But the resource request for am2 still carries AM_LABEL_EXPRESSION =
default_partition, and queue_b has no resources in default_partition, so the app
stays in the ACCEPTED state forever in the RM UI.

My understanding is that, since the app was submitted with no
AM_LABEL_EXPRESSION, the code base already allows such an app to run with the
current queue's default partition.
For the move-queue scenario we should also let the app run successfully: am2
should get resources from queue_b's default partition x instead of pending
forever (see the sketch below).

In our production we have a landing queue in default_partition, plus a routing
mechanism that moves apps from this queue to other queues, including queues with
exclusive partitions.

 !Screen Shot 2019-11-14 at 5.11.39 PM.png! 
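
A minimal, self-contained sketch of the fallback argued for above (hypothetical helper, not the attached patch): when the app was submitted without an AM label expression, the retried AM should resolve its partition against the queue it currently lives in, so after the move to queue_b it lands on partition x.

{code:java}
public class AmLabelFallback {
  static String resolveAmPartition(String submittedAmLabelExpression,
      String currentQueueDefaultPartition) {
    if (submittedAmLabelExpression == null
        || submittedAmLabelExpression.isEmpty()) {
      // No explicit AM label: follow whatever the current queue defaults to.
      return currentQueueDefaultPartition;
    }
    return submittedAmLabelExpression;
  }

  public static void main(String[] args) {
    // Before the move: queue_a defaults to the empty DEFAULT_PARTITION label.
    System.out.println(resolveAmPartition("", ""));   // -> "" (default partition)
    // After the move: queue_b defaults to partition "x".
    System.out.println(resolveAmPartition("", "x"));  // -> "x"
  }
}
{code}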






[jira] [Commented] (YARN-5718) TimelineClient (and other places in YARN) shouldn't over-write HDFS client retry settings which could cause unexpected behavior

2019-10-14 Thread Wang, Xinglong (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951623#comment-16951623
 ] 

Wang, Xinglong commented on YARN-5718:
--

I went through the HDFS code and found that the issue only occurs with a non-HA
HDFS setup; the original description is not correct.

As the following code shows, the retry config is only used in the non-HA case. In
the HA case, RetryPolicies.failoverOnNetworkException is used.


{code:java}
public class NameNodeProxies {
  public static <T> ProxyAndInfo<T> createProxy(Configuration conf,
      URI nameNodeUri, Class<T> xface, AtomicBoolean fallbackToSimpleAuth)
      throws IOException {
    AbstractNNFailoverProxyProvider<T> failoverProxyProvider =
        createFailoverProxyProvider(conf, nameNodeUri, xface, true,
            fallbackToSimpleAuth);

    if (failoverProxyProvider == null) {
      // Non-HA case
      return createNonHAProxy(conf, NameNode.getAddress(conf, nameNodeUri),
          xface, UserGroupInformation.getCurrentUser(), true,
          fallbackToSimpleAuth);
    } else {
      // HA case
      Conf config = new Conf(conf);
      T proxy = (T) RetryProxy.create(xface, failoverProxyProvider,
          RetryPolicies.failoverOnNetworkException(
              RetryPolicies.TRY_ONCE_THEN_FAIL, config.maxFailoverAttempts,
              config.maxRetryAttempts, config.failoverSleepBaseMillis,
              config.failoverSleepMaxMillis));
{code}


> TimelineClient (and other places in YARN) shouldn't over-write HDFS client 
> retry settings which could cause unexpected behavior
> ---
>
> Key: YARN-5718
> URL: https://issues.apache.org/jira/browse/YARN-5718
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineclient
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Major
> Fix For: 3.0.0-alpha2
>
> Attachments: YARN-5718-v2.1.patch, YARN-5718-v2.patch, YARN-5718.patch
>
>
> In one HA cluster, after the NN failed over, we noticed that jobs were failing
> because TimelineClient failed to retry the connection to the proper NN. This is
> because we overwrite HDFS client settings and hard-code the retry policy to be
> enabled, which conflicts with the NN failover case - the HDFS client should fail
> fast so it can retry on another NN.
> We shouldn't assume any retry policy for the HDFS client anywhere in YARN. This
> should stay consistent with the HDFS settings, which use different retry
> policies in different deployment cases. Thus, we should clean up these
> hard-coded settings in YARN, including FileSystemTimelineWriter,
> FileSystemRMStateStore and FileSystemNodeLabelsStore.
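
For context, the kind of hard-coded override the description refers to looks roughly like the snippet below (an illustrative sketch, not a verbatim excerpt from FileSystemTimelineWriter or the state stores): the YARN component copies the configuration and forces the HDFS client retry policy on, which defeats the fail-fast behavior the HA failover proxy relies on.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HardCodedRetryExample {
  public static FileSystem openWithForcedRetries(Configuration conf, Path path)
      throws IOException {
    Configuration fsConf = new Configuration(conf);
    // Hard-coding these values overrides whatever the HDFS deployment chose;
    // in an HA setup the client should fail fast so the failover proxy can
    // retry against the other NameNode.
    fsConf.setBoolean("dfs.client.retry.policy.enabled", true);
    fsConf.set("dfs.client.retry.policy.spec", "2000, 500");
    return path.getFileSystem(fsConf);
  }
}
{code}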






[jira] [Commented] (YARN-5748) Backport YARN-5718 to branch-2

2019-10-14 Thread Wang, Xinglong (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951621#comment-16951621
 ] 

Wang, Xinglong commented on YARN-5748:
--

I went through the HDFS code and found that the issue only occurs with a non-HA
HDFS setup; the original description is not correct.

As the following code shows, the retry config is only used in the non-HA case. In
the HA case, RetryPolicies.failoverOnNetworkException is used.
 
{code:java}
public static <T> ProxyAndInfo<T> createProxy(Configuration conf,
    URI nameNodeUri, Class<T> xface, AtomicBoolean fallbackToSimpleAuth)
    throws IOException {
  AbstractNNFailoverProxyProvider<T> failoverProxyProvider =
      createFailoverProxyProvider(conf, nameNodeUri, xface, true,
          fallbackToSimpleAuth);

  if (failoverProxyProvider == null) {
    // Non-HA case
    return createNonHAProxy(conf, NameNode.getAddress(conf, nameNodeUri),
        xface, UserGroupInformation.getCurrentUser(), true,
        fallbackToSimpleAuth);
  } else {
    // HA case
    Conf config = new Conf(conf);
    T proxy = (T) RetryProxy.create(xface, failoverProxyProvider,
        RetryPolicies.failoverOnNetworkException(
            RetryPolicies.TRY_ONCE_THEN_FAIL, config.maxFailoverAttempts,
            config.maxRetryAttempts, config.failoverSleepBaseMillis,
            config.failoverSleepMaxMillis));

{code}


> Backport YARN-5718 to branch-2
> --
>
> Key: YARN-5748
> URL: https://issues.apache.org/jira/browse/YARN-5748
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Masatake Iwasaki
>Priority: Major
> Attachments: YARN-5748-branch-2.001.patch, 
> YARN-5748-branch-2.002.patch
>
>
> In YARN-5718 we identified several unnecessary configs that overwrite HDFS
> client behavior in several components of YARN (FSRMStore, TimelineClient,
> NodeLabelStore, etc.) and cause job failures in some cases (NN HA, etc.) - that
> is definitely a bug. In YARN-5718 we proposed removing the config since it was
> never supposed to work, and that change is already committed to trunk, as the
> alpha stage has more flexibility for incompatible changes. In branch-2 we want
> to play it a bit safer and have more discussion.
> Obviously, there are several options here:
> 1. Don't fix anything and let the bug exist.
> 2. Fix the bug but keep the configuration, or mark it deprecated and explain
> that this configuration is not supposed to work any more.
> 3. Exactly like YARN-5718, fix the bug and remove the unnecessary
> configuration.
> This ticket is filed for more discussion.






[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-30 Thread Wang, Xinglong (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940752#comment-16940752
 ] 

Wang, Xinglong commented on YARN-9847:
--

I was not aware of YARN-6967. It is good enough for this issue; we can close
this one.

> ZKRMStateStore will cause zk connection loss when writing huge data into znode
> --
>
> Key: YARN-9847
> URL: https://issues.apache.org/jira/browse/YARN-9847
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9847.001.patch, YARN-9847.002.patch
>
>
> Recently, we encountered an RM ZK connection issue because the RM was trying to
> write huge data into a znode. This makes ZK report a Len error, which then
> causes ZK session connection loss, and eventually the RM crashes because of the
> ZK connection issue.
> *The fix*
> To protect the ResourceManager from crashing because of this, the fix limits
> the size of the data stored per attempt by limiting the diagnostic info when
> writing ApplicationAttemptStateData into the znode. The size is regulated by
> -Djute.maxbuffer set in yarn-env.sh; the same value is also used by the
> ZooKeeper server.
> *The story*
> ResourceManager Log
> {code:java}
> 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, 
> unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2019-07-29 04:27:35,459 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at 

[jira] [Created] (YARN-9854) RM jetty hangs due to WebAppProxyServlet lacking a timeout while doing proxyLink

2019-09-24 Thread Wang, Xinglong (Jira)
Wang, Xinglong created YARN-9854:


 Summary: RM jetty hangs due to WebAppProxyServlet lacking a timeout
while doing proxyLink
 Key: YARN-9854
 URL: https://issues.apache.org/jira/browse/YARN-9854
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: amrmproxy, resourcemanager, webapp
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong


The RM proxies URL requests to [http://rm:port/proxy/application_x] on to the AM
or the related history server.

Recently we hit https://issues.apache.org/jira/browse/SPARK-26961, which can
cause a Spark AM to hang forever.

We also have a monitoring tool that accesses [http://rm:port/proxy/application_x]
periodically. As a result, every proxied connection to the hung Spark AM also
hangs forever, because WebAppProxyServlet does not set a socket connection
timeout when it initializes the httpclient towards that Spark AM.

 

The jetty server hosting the RM servlets has a limited number of threads. Each
such request leaves one of those threads hanging while it waits for the Spark
AM's response. Eventually all jetty threads serving HTTP traffic hang, leaving
every RM web link unresponsive.

 

If we give the httpclient a timeout config, we avoid this issue; a sketch of such
a config follows.
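
A minimal sketch of such a timeout config, assuming the Apache HttpClient 4.x API (the 60-second values are illustrative, not what the patch uses):

{code:java}
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ProxyHttpClientFactory {
  // Build a client whose requests cannot hang forever on an unresponsive AM.
  public static CloseableHttpClient newClient() {
    RequestConfig requestConfig = RequestConfig.custom()
        .setConnectTimeout(60_000)            // TCP connect timeout
        .setConnectionRequestTimeout(60_000)  // wait for a pooled connection
        .setSocketTimeout(60_000)             // max gap between response packets
        .build();
    return HttpClients.custom()
        .setDefaultRequestConfig(requestConfig)
        .build();
  }
}
{code}

With such timeouts, a proxied request to a hung AM fails after the configured interval and the jetty thread is returned to the pool instead of blocking forever.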






[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-24 Thread Wang, Xinglong (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936506#comment-16936506
 ] 

Wang, Xinglong commented on YARN-9847:
--

[~tangzhankun], the ApplicationAttemptStateData instance is fully serialized into
the app attempt znode, and it contains several fields besides the diagnostics
info, as shown below: startTime, finalTrackingUrl, diagnostics, exitStatus, etc.
So if diagnostics is 100KB, the serialized ApplicationAttemptStateData will be
bigger than 100KB.

{code:java}
public abstract class ApplicationAttemptStateData {

  public static ApplicationAttemptStateData newInstance(
  ApplicationAttemptId attemptId, Container container,
  Credentials attemptTokens, long startTime, RMAppAttemptState finalState,
  String finalTrackingUrl, String diagnostics,
  FinalApplicationStatus amUnregisteredFinalStatus, int exitStatus,
  long finishTime, Map<String, Long> resourceSecondsMap,
  Map<String, Long> preemptedResourceSecondsMap) {
ApplicationAttemptStateData attemptStateData =
Records.newRecord(ApplicationAttemptStateData.class);
attemptStateData.setAttemptId(attemptId);
attemptStateData.setMasterContainer(container);
attemptStateData.setAppAttemptTokens(attemptTokens);
attemptStateData.setState(finalState);
attemptStateData.setFinalTrackingUrl(finalTrackingUrl);
attemptStateData.setDiagnostics(diagnostics == null ? "" : diagnostics);
attemptStateData.setStartTime(startTime);
attemptStateData.setFinalApplicationStatus(amUnregisteredFinalStatus);
attemptStateData.setAMContainerExitStatus(exitStatus);
attemptStateData.setFinishTime(finishTime);
attemptStateData.setMemorySeconds(RMServerUtils
.getOrDefault(resourceSecondsMap,
ResourceInformation.MEMORY_MB.getName(), 0L));
attemptStateData.setVcoreSeconds(RMServerUtils
.getOrDefault(resourceSecondsMap, ResourceInformation.VCORES.getName(),
0L));
attemptStateData.setPreemptedMemorySeconds(RMServerUtils
.getOrDefault(preemptedResourceSecondsMap,
ResourceInformation.MEMORY_MB.getName(), 0L));
attemptStateData.setPreemptedVcoreSeconds(RMServerUtils
.getOrDefault(preemptedResourceSecondsMap,
ResourceInformation.VCORES.getName(), 0L));
attemptStateData.setResourceSecondsMap(resourceSecondsMap);
attemptStateData
.setPreemptedResourceSecondsMap(preemptedResourceSecondsMap);
return attemptStateData;
  }
{code}

In the test, I limited the znode size to 100KB, which means the serialized
ApplicationAttemptStateData should be at most 100KB. And I generated 100KB of
diagnostics data inside ApplicationAttemptStateData to make sure the fully
serialized ApplicationAttemptStateData is bigger than 100KB, which triggers the
truncate logic.

*Original*
ApplicationAttemptStateData serialized size > 100KB
ApplicationAttemptStateData.diagnostics serialized size = 100KB

*Truncated*
ApplicationAttemptStateData serialized size = 100KB
ApplicationAttemptStateData.diagnostics serialized size < 100KB, because the
truncation happens on this field only.

This is why this assert holds.

{code:java}
assertNotEquals("", attempt1.getDiagnostics(),
   attemptStateData1.getDiagnostics()); 
{code}


> ZKRMStateStore will cause zk connection loss when writing huge data into znode
> --
>
> Key: YARN-9847
> URL: https://issues.apache.org/jira/browse/YARN-9847
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9847.001.patch, YARN-9847.002.patch
>
>
> Recently, we encountered an RM ZK connection issue because the RM was trying to
> write huge data into a znode. This makes ZK report a Len error, which then
> causes ZK session connection loss, and eventually the RM crashes because of the
> ZK connection issue.
> *The fix*
> To protect the ResourceManager from crashing because of this, the fix limits
> the size of the data stored per attempt by limiting the diagnostic info when
> writing ApplicationAttemptStateData into the znode. The size is regulated by
> -Djute.maxbuffer set in yarn-env.sh; the same value is also used by the
> ZooKeeper server.
> *The story*
> ResourceManager Log
> {code:java}
> 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, 
> unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at 

[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-23 Thread Wang, Xinglong (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936353#comment-16936353
 ] 

Wang, Xinglong commented on YARN-9847:
--

[~tangzhankun] 
https://issues.apache.org/jira/browse/YARN-5006 fixed the znode size issue when a
user tried to submit a new application with huge applicationData. It concerns the
following znode, which stores ApplicationStateData.java:
{code:java}
/rmstore/ZKRMStateRoot/RMAppRoot/application_
{code}
This ticket, in contrast, solves the znode size issue in the case where an app
attempt sends huge diagnostic info from the AM to the RM as its failure
diagnostics. It concerns the following znode, which stores
ApplicationAttemptStateData.java:
{code:java}
/rmstore/ZKRMStateRoot/RMAppRoot/application_/appattempt_xxx_01
{code}

In both cases, the RM loses its connection to ZK and eventually quits.
https://issues.apache.org/jira/browse/YARN-5006 reports an exception and rejects
the problematic application during submission, while this ticket truncates the
over-sized diagnostics info so that the attemptStateData fits into the znode and
the issue is prevented.

They are two different cases and need different logic to handle them.
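
A minimal sketch of the truncation idea (hypothetical helper, not the attached patch): trim only the diagnostics field so the attempt record stays under the configured limit; a real implementation would measure the size of the serialized ApplicationAttemptStateData, with the limit taken from yarn.resourcemanager.zk-max-znode-size.bytes or -Djute.maxbuffer.

{code:java}
import java.nio.charset.StandardCharsets;

public class DiagnosticsTruncator {
  // Keep only the tail of an over-sized diagnostics string, since the last
  // lines usually carry the final error.
  public static String truncate(String diagnostics, int maxBytes) {
    if (diagnostics == null || maxBytes <= 0) {
      return "";
    }
    byte[] utf8 = diagnostics.getBytes(StandardCharsets.UTF_8);
    if (utf8.length <= maxBytes) {
      return diagnostics;
    }
    // Keep the last maxBytes bytes; a multi-byte character split at the cut
    // point is replaced with U+FFFD, which is acceptable for diagnostics.
    return new String(utf8, utf8.length - maxBytes, maxBytes,
        StandardCharsets.UTF_8);
  }
}
{code}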

> ZKRMStateStore will cause zk connection loss when writing huge data into znode
> --
>
> Key: YARN-9847
> URL: https://issues.apache.org/jira/browse/YARN-9847
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9847.001.patch, YARN-9847.002.patch
>
>
> Recently, we encountered an RM ZK connection issue because the RM was trying to
> write huge data into a znode. This makes ZK report a Len error, which then
> causes ZK session connection loss, and eventually the RM crashes because of the
> ZK connection issue.
> *The fix*
> To protect the ResourceManager from crashing because of this, the fix limits
> the size of the data stored per attempt by limiting the diagnostic info when
> writing ApplicationAttemptStateData into the znode. The size is regulated by
> -Djute.maxbuffer set in yarn-env.sh; the same value is also used by the
> ZooKeeper server.
> *The story*
> ResourceManager Log
> {code:java}
> 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, 
> unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2019-07-29 04:27:35,459 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
> at 
> 

[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-23 Thread Wang, Xinglong (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935747#comment-16935747
 ] 

Wang, Xinglong commented on YARN-9847:
--

I see https://issues.apache.org/jira/browse/YARN-5006 introduced a configuration,
so I'm in favor of using yarn.resourcemanager.zk-max-znode-size.bytes rather than
-Djute.maxbuffer.


> ZKRMStateStore will cause zk connection loss when writing huge data into znode
> --
>
> Key: YARN-9847
> URL: https://issues.apache.org/jira/browse/YARN-9847
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9847.001.patch
>
>
> Recently, we encountered an RM ZK connection issue because the RM was trying to
> write huge data into a znode. This makes ZK report a Len error, which then
> causes ZK session connection loss, and eventually the RM crashes because of the
> ZK connection issue.
> *The fix*
> To protect the ResourceManager from crashing because of this, the fix limits
> the size of the data stored per attempt by limiting the diagnostic info when
> writing ApplicationAttemptStateData into the znode. The size is regulated by
> -Djute.maxbuffer set in yarn-env.sh; the same value is also used by the
> ZooKeeper server.
> *The story*
> ResourceManager Log
> {code:java}
> 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, 
> unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2019-07-29 04:27:35,459 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> 

[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-23 Thread Wang, Xinglong (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935745#comment-16935745
 ] 

Wang, Xinglong commented on YARN-9847:
--

[~tangzhankun] It will only truncate the diagnostic string to make the serialized
bytes smaller. This will not affect app attempt recovery from ZK because the.


> ZKRMStateStore will cause zk connection loss when writing huge data into znode
> --
>
> Key: YARN-9847
> URL: https://issues.apache.org/jira/browse/YARN-9847
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9847.001.patch
>
>
> Recently, we encountered an RM ZK connection issue because the RM was trying to
> write huge data into a znode. This makes ZK report a Len error, which then
> causes ZK session connection loss, and eventually the RM crashes because of the
> ZK connection issue.
> *The fix*
> To protect the ResourceManager from crashing because of this, the fix limits
> the size of the data stored per attempt by limiting the diagnostic info when
> writing ApplicationAttemptStateData into the znode. The size is regulated by
> -Djute.maxbuffer set in yarn-env.sh; the same value is also used by the
> ZooKeeper server.
> *The story*
> ResourceManager Log
> {code:java}
> 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, 
> unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2019-07-29 04:27:35,459 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> 

[jira] [Created] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-19 Thread Wang, Xinglong (Jira)
Wang, Xinglong created YARN-9847:


 Summary: ZKRMStateStore will cause zk connection loss when writing 
huge data into znode
 Key: YARN-9847
 URL: https://issues.apache.org/jira/browse/YARN-9847
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong


Recently, we encountered an RM ZK connection issue because the RM was trying to
write huge data into a znode. This makes ZK report a Len error, which then causes
ZK session connection loss, and eventually the RM crashes because of the ZK
connection issue.

*The fix*

To protect the ResourceManager from crashing because of this, the fix limits the
size of the data stored per attempt by limiting the diagnostic info when writing
ApplicationAttemptStateData into the znode. The size is regulated by
-Djute.maxbuffer set in yarn-env.sh; the same value is also used by the ZooKeeper
server.

*The story*

ResourceManager Log
{code:java}
2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, unexpected 
error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

2019-07-29 04:27:35,459 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)
{code}


The ResourceManager retries the connection to ZooKeeper until it exhausts the
retry count and then gives up.

{code:java}
2019-07-29 02:25:06,404 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying 
operation on ZK. Retry no. 999


2019-07-29 02:25:06,718 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: 
Client will use GSSAPI as SASL mechanism.
2019-07-29 02:25:06,718 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 
{code}

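For reference, the retry behaviour above is governed by the ZK state-store 
settings in yarn-site.xml. A minimal sketch, assuming the stock property names 
from yarn-default.xml (the values shown are only illustrative defaults, not a 
recommendation for this cluster):

{code:java}
# Assumed property names from yarn-default.xml; values are illustrative.
# "Retry no. 999" in the log above is consistent with zk-num-retries = 1000.
yarn.resourcemanager.zk-num-retries = 1000
yarn.resourcemanager.zk-retry-interval-ms = 1000
yarn.resourcemanager.zk-timeout-ms = 10000
{code}
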
[jira] [Updated] (YARN-9494) ApplicationHistoryServer endpoint access wrongly requested

2019-05-05 Thread Wang, Xinglong (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wang, Xinglong updated YARN-9494:
-
Attachment: YARN-9494.002.patch

> ApplicationHistoryServer endpoint access wrongly requested
> --
>
> Key: YARN-9494
> URL: https://issues.apache.org/jira/browse/YARN-9494
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Reporter: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9494.001.patch, YARN-9494.002.patch
>
>
> With the following configuration, resource manager will redirect 
> https://resourcemanager.hadoop.com:50030/proxy/application_1553677175329_47053/
>  to  0.0.0.0:10200 when resource manager can't find 
> application_1553677175329_47053 in applicationManager.
> {code:java}
> yarn.timeline-service.enabled = false
> yarn.timeline-service.generic-application-history.enabled = true
> {code}
> However, in this case, the timeline service is not enabled, so 
> yarn.timeline-service.address is not defined and 0.0.0.0:10200 is used as 
> the timeline server access point.
> This combination of configuration is valid because we have an in-house tool 
> to analyze the generic-application-history files generated by the resource 
> manager while the timeline service stays disabled.
> {code:java}
> HTTP ERROR 500
> Problem accessing /proxy/application_1553677175329_47053/. Reason:
> Call From x/10.22.59.23 to 0.0.0.0:10200 failed on connection 
> exception: java.net.ConnectException: Connection refused; For more details 
> see:  http://wiki.apache.org/hadoop/ConnectionRefused
> Caused by:
> java.net.ConnectException: Call From x/10.22.59.23 to 0.0.0.0:10200 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor240.newInstance(Unknown 
> Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1498)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1398)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationHistoryProtocolPBClientImpl.getApplicationReport(ApplicationHistoryProtocolPBClientImpl.java:108)
>   at 
> org.apache.hadoop.yarn.server.webproxy.AppReportFetcher.getApplicationReport(AppReportFetcher.java:137)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getApplicationReport(WebAppProxyServlet.java:251)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getFetchedAppReport(WebAppProxyServlet.java:491)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:329)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
>   at 
> 

[jira] [Commented] (YARN-9494) ApplicationHistoryServer endpoint access wrongly requested

2019-04-17 Thread Wang, Xinglong (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819857#comment-16819857
 ] 

Wang, Xinglong commented on YARN-9494:
--

Attached an initial patch to demonstrate the idea. Test-related changes will 
come later.

> ApplicationHistoryServer endpoint access wrongly requested
> --
>
> Key: YARN-9494
> URL: https://issues.apache.org/jira/browse/YARN-9494
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Reporter: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9494.001.patch
>
>
> With the following configuration, resource manager will redirect 
> https://resourcemanager.hadoop.com:50030/proxy/application_1553677175329_47053/
>  to  0.0.0.0:10200 when resource manager can't find 
> application_1553677175329_47053 in applicationManager.
> {code:java}
> yarn.timeline-service.enabled = false
> yarn.timeline-service.generic-application-history.enabled = true
> {code}
> However, in this case, the timeline service is not enabled, so 
> yarn.timeline-service.address is not defined and 0.0.0.0:10200 is used as 
> the timeline server access point.
> This combination of configuration is valid because we have an in-house tool 
> to analyze the generic-application-history files generated by the resource 
> manager while the timeline service stays disabled.
> {code:java}
> HTTP ERROR 500
> Problem accessing /proxy/application_1553677175329_47053/. Reason:
> Call From x/10.22.59.23 to 0.0.0.0:10200 failed on connection 
> exception: java.net.ConnectException: Connection refused; For more details 
> see:  http://wiki.apache.org/hadoop/ConnectionRefused
> Caused by:
> java.net.ConnectException: Call From x/10.22.59.23 to 0.0.0.0:10200 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor240.newInstance(Unknown 
> Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1498)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1398)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationHistoryProtocolPBClientImpl.getApplicationReport(ApplicationHistoryProtocolPBClientImpl.java:108)
>   at 
> org.apache.hadoop.yarn.server.webproxy.AppReportFetcher.getApplicationReport(AppReportFetcher.java:137)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getApplicationReport(WebAppProxyServlet.java:251)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getFetchedAppReport(WebAppProxyServlet.java:491)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:329)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
>   at 
> 

[jira] [Updated] (YARN-9494) ApplicationHistoryServer endpoint access wrongly requested

2019-04-17 Thread Wang, Xinglong (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wang, Xinglong updated YARN-9494:
-
Attachment: YARN-9494.001.patch

> ApplicationHistoryServer endpoint access wrongly requested
> --
>
> Key: YARN-9494
> URL: https://issues.apache.org/jira/browse/YARN-9494
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Reporter: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9494.001.patch
>
>
> With the following configuration, resource manager will redirect 
> https://resourcemanager.hadoop.com:50030/proxy/application_1553677175329_47053/
>  to  0.0.0.0:10200 when resource manager can't find 
> application_1553677175329_47053 in applicationManager.
> {code:java}
> yarn.timeline-service.enabled = false
> yarn.timeline-service.generic-application-history.enabled = true
> {code}
> However, in this case, the timeline service is not enabled, so 
> yarn.timeline-service.address is not defined and 0.0.0.0:10200 is used as 
> the timeline server access point.
> This combination of configuration is valid because we have an in-house tool 
> to analyze the generic-application-history files generated by the resource 
> manager while the timeline service stays disabled.
> {code:java}
> HTTP ERROR 500
> Problem accessing /proxy/application_1553677175329_47053/. Reason:
> Call From x/10.22.59.23 to 0.0.0.0:10200 failed on connection 
> exception: java.net.ConnectException: Connection refused; For more details 
> see:  http://wiki.apache.org/hadoop/ConnectionRefused
> Caused by:
> java.net.ConnectException: Call From x/10.22.59.23 to 0.0.0.0:10200 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor240.newInstance(Unknown 
> Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1498)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1398)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationHistoryProtocolPBClientImpl.getApplicationReport(ApplicationHistoryProtocolPBClientImpl.java:108)
>   at 
> org.apache.hadoop.yarn.server.webproxy.AppReportFetcher.getApplicationReport(AppReportFetcher.java:137)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getApplicationReport(WebAppProxyServlet.java:251)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getFetchedAppReport(WebAppProxyServlet.java:491)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:329)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> 

[jira] [Created] (YARN-9494) ApplicationHistoryServer endpoint access wrongly requested

2019-04-17 Thread Wang, Xinglong (JIRA)
Wang, Xinglong created YARN-9494:


 Summary: ApplicationHistoryServer endpoint access wrongly requested
 Key: YARN-9494
 URL: https://issues.apache.org/jira/browse/YARN-9494
 Project: Hadoop YARN
  Issue Type: Bug
  Components: ATSv2
Reporter: Wang, Xinglong


With the following configuration, resource manager will redirect 
https://resourcemanager.hadoop.com:50030/proxy/application_1553677175329_47053/ 
to  0.0.0.0:10200 when resource manager can't find 
application_1553677175329_47053 in applicationManager.

{code:java}
yarn.timeline-service.enabled = false
yarn.timeline-service.generic-application-history.enabled = true
{code}

However, in this case, the timeline service is not enabled, so 
yarn.timeline-service.address is not defined and 0.0.0.0:10200 is used as 
the timeline server access point.

This combination of configuration is valid because we have an in-house tool 
to analyze the generic-application-history files generated by the resource 
manager while the timeline service stays disabled.
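
As a minimal sketch of why the wildcard endpoint shows up (illustrative only, 
not the actual AppReportFetcher code; the class name below is made up for the 
demo), resolving the history-server address with the shipped default yields 
0.0.0.0:10200 whenever yarn.timeline-service.address is left unset:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class AhsAddressDemo {
  public static void main(String[] args) {
    // Simulate the configuration from this report: generic application
    // history enabled, timeline service disabled, address left undefined.
    Configuration conf = new Configuration(false);
    conf.setBoolean("yarn.timeline-service.enabled", false);
    conf.setBoolean(
        "yarn.timeline-service.generic-application-history.enabled", true);

    // With no explicit address configured, falling back to the shipped
    // default produces the wildcard endpoint, which is what the RM web
    // proxy then tries to contact.
    String ahsAddress =
        conf.get("yarn.timeline-service.address", "0.0.0.0:10200");
    System.out.println("AHS endpoint used by the proxy: " + ahsAddress);
    // Prints: AHS endpoint used by the proxy: 0.0.0.0:10200
  }
}
{code}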

{code:java}
HTTP ERROR 500

Problem accessing /proxy/application_1553677175329_47053/. Reason:

Call From x/10.22.59.23 to 0.0.0.0:10200 failed on connection 
exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused

Caused by:

java.net.ConnectException: Call From x/10.22.59.23 to 0.0.0.0:10200 failed 
on connection exception: java.net.ConnectException: Connection refused; For 
more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor240.newInstance(Unknown 
Source)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558)
at org.apache.hadoop.ipc.Client.call(Client.java:1498)
at org.apache.hadoop.ipc.Client.call(Client.java:1398)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationHistoryProtocolPBClientImpl.getApplicationReport(ApplicationHistoryProtocolPBClientImpl.java:108)
at 
org.apache.hadoop.yarn.server.webproxy.AppReportFetcher.getApplicationReport(AppReportFetcher.java:137)
at 
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getApplicationReport(WebAppProxyServlet.java:251)
at 
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.getFetchedAppReport(WebAppProxyServlet.java:491)
at 
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:329)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
at 
com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:617)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:576)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at