[jira] [Commented] (YARN-8865) RMStateStore contains large number of expired RMDelegationToken

2018-11-06 Thread Daryn Sharp (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676839#comment-16676839
 ] 

Daryn Sharp commented on YARN-8865:
---

+1 looks good!

> RMStateStore contains large number of expired RMDelegationToken
> ---
>
> Key: YARN-8865
> URL: https://issues.apache.org/jira/browse/YARN-8865
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-8865.001.patch, YARN-8865.002.patch, 
> YARN-8865.003.patch, YARN-8865.004.patch, YARN-8865.005.patch, 
> YARN-8865.006.patch
>
>
> When the RM state store is restored expired delegation tokens are restored 
> and added to the system. These expired tokens do not get cleaned up or 
> removed. The exact reason why the tokens are still in the store is not clear. 
> We have seen as many as 250,000 tokens in the store some of which were 2 
> years old.
> This has two side effects:
> * for the zookeeper store this leads to a jute buffer exhaustion issue and 
> prevents the RM from becoming active.
> * restore takes longer than needed and heap usage is higher than it should be
> We should not restore already expired tokens since they cannot be renewed or 
> used.






[jira] [Commented] (YARN-8865) RMStateStore contains large number of expired RMDelegationToken

2018-10-23 Thread Daryn Sharp (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661222#comment-16661222
 ] 

Daryn Sharp commented on YARN-8865:
---

You shouldn't modify the token identifier, i.e. change the max date, because an 
identifier is and must be immutable.  I think a very similar and safe change is, 
when the secret key doesn't exist, to artificially expire the token by creating 
the {{DelegationTokenInformation}} with a {{renewDate}} in the past.
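
For illustration only, a minimal sketch of that suggestion against the recovery 
path in {{AbstractDelegationTokenSecretManager#addPersistedDelegationToken}}; the 
field and constructor usage follow the existing class, but the exact shape of the 
real patch may differ:

{code:java}
// Sketch: when the master key for a recovered identifier is gone, store the
// token with a renew date in the past instead of mutating the identifier.
// The existing expiration thread will then purge it on its next sweep.
DelegationKey dKey = allKeys.get(identifier.getMasterKeyId());
if (dKey == null) {
  LOG.warn("No KEY found for persisted identifier, expiring " + identifier);
  currentTokens.put(identifier,
      new DelegationTokenInformation(0L /* renewDate already in the past */, null));
  return;
}
{code}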

> RMStateStore contains large number of expired RMDelegationToken
> ---
>
> Key: YARN-8865
> URL: https://issues.apache.org/jira/browse/YARN-8865
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-8865.001.patch, YARN-8865.002.patch, 
> YARN-8865.003.patch, YARN-8865.004.patch, YARN-8865.005.patch
>
>
> When the RM state store is restored expired delegation tokens are restored 
> and added to the system. These expired tokens do not get cleaned up or 
> removed. The exact reason why the tokens are still in the store is not clear. 
> We have seen as many as 250,000 tokens in the store some of which were 2 
> years old.
> This has two side effects:
> * for the zookeeper store this leads to a jute buffer exhaustion issue and 
> prevents the RM from becoming active.
> * restore takes longer than needed and heap usage is higher than it should be
> We should not restore already expired tokens since they cannot be renewed or 
> used.






[jira] [Commented] (YARN-8865) RMStateStore contains large number of expired RMDelegationToken

2018-10-11 Thread Daryn Sharp (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646515#comment-16646515
 ] 

Daryn Sharp commented on YARN-8865:
---

Good job.  That explains why the secret manager doesn't remove them.  What's 
interesting is that secret keys are supposed to outlive their tokens.  Were secret 
keys manually deleted?  Regardless, the secret manager should be able to recover 
its state.

The patch is a high-risk change for a common class.  Not all secret managers are 
equipped to handle mutation during loading.  Case in point: the NN generates an 
edit to remove tokens, yet edits cannot be generated while replaying edits 
(restoring state).  Fundamentally, an HA standby cannot modify state.  
Similar issues probably exist for other secret managers.

Perhaps the lowest-risk change is to add tokens with an invalid key anyway and 
set the password to null.  Authentication will fail, and the expiration thread 
should still be able to correctly remove the tokens.

Alternatively, the lowest-risk change is to modify the RMDTSM to handle removal 
while restoring state.

> RMStateStore contains large number of expired RMDelegationToken
> ---
>
> Key: YARN-8865
> URL: https://issues.apache.org/jira/browse/YARN-8865
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-8865.001.patch, YARN-8865.002.patch
>
>
> When the RM state store is restored expired delegation tokens are restored 
> and added to the system. These expired tokens do not get cleaned up or 
> removed. The exact reason why the tokens are still in the store is not clear. 
> We have seen as many as 250,000 tokens in the store some of which were 2 
> years old.
> This has two side effects:
> * for the zookeeper store this leads to a jute buffer exhaustion issue and 
> prevents the RM from becoming active.
> * restore takes longer than needed and heap usage is higher than it should be
> We should not restore already expired tokens since they cannot be renewed or 
> used.






[jira] [Commented] (YARN-8865) RMStateStore contains large number of expired RMDelegationToken

2018-10-10 Thread Daryn Sharp (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645520#comment-16645520
 ] 

Daryn Sharp commented on YARN-8865:
---

The RMDelegationTokenSecretManager is an AbstractDelegationTokenSecretManager.  
The ADTSM uses a thread to periodically roll secret keys and purge expired 
tokens.  We checked some clusters that use the leveldb state store and they are 
not leaking tokens, which implies the problem is likely specific to the 
ZKRMStateStore.

Given it's the ADTSM's job to expunge expired tokens, no state store impl 
should be burdened with duplicated code to explicitly purge tokens just 
because one state store impl is buggy.
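
For reference, the purge the ADTSM already performs boils down to roughly the 
following (a simplified sketch of 
{{AbstractDelegationTokenSecretManager#removeExpiredToken}}; the real method 
synchronizes on the secret manager and batches the store removals):

{code:java}
// Simplified sketch of the periodic expiry sweep referenced above.
long now = Time.now();
Iterator<Map.Entry<TokenIdent, DelegationTokenInformation>> it =
    currentTokens.entrySet().iterator();
while (it.hasNext()) {
  Map.Entry<TokenIdent, DelegationTokenInformation> entry = it.next();
  if (entry.getValue().getRenewDate() < now) {
    it.remove();                        // expired: drop from memory
    removeStoredToken(entry.getKey());  // and delete from the state store
  }
}
{code}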

> RMStateStore contains large number of expired RMDelegationToken
> ---
>
> Key: YARN-8865
> URL: https://issues.apache.org/jira/browse/YARN-8865
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-8865.001.patch
>
>
> When the RM state store is restored expired delegation tokens are restored 
> and added to the system. These expired tokens do not get cleaned up or 
> removed. The exact reason why the tokens are still in the store is not clear. 
> We have seen as many as 250,000 tokens in the store some of which were 2 
> years old.
> This has two side effects:
> * for the zookeeper store this leads to a jute buffer exhaustion issue and 
> prevents the RM from becoming active.
> * restore takes longer than needed and heap usage is higher than it should be
> We should not restore already expired tokens since they cannot be renewed or 
> used.






[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment

2018-05-10 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470662#comment-16470662
 ] 

Daryn Sharp commented on YARN-8108:
---

bq. I took a look into the issue and am feeling okay about the conservative fix 
of making RMAuthenticationFilter global whenever it is enabled.

While that would "work", isn't it a regression?  An admin that specifically 
configured those filters, perhaps with different principals as Eric previously 
mentioned, would be quite surprised to discover that the configuration is now 
silently ignored.

Per earlier comments, the issue is apparently not present through at least 
2.7.5.  Most of the referenced jiras are up to 5 years old.  We still need to 
identify which (recent-ish) jira caused the regression to understand the 
problem.

> RM metrics rest API throws GSSException in kerberized environment
> -
>
> Key: YARN-8108
> URL: https://issues.apache.org/jira/browse/YARN-8108
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Kshitij Badani
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-8108.001.patch
>
>
> Test is trying to pull up metrics data from SHS after kiniting as 'test_user'
> It is throwing GSSException as follows
> {code:java}
> b2b460b80713|RUNNING: curl --silent -k -X GET -D 
> /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : 
> http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15
>  07:15:48,757|INFO|MainThread|machine.py:194 - 
> run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0
> 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - 
> getMetricsJsonData()|metrics:
> 
> 
> 
> Error 403 GSSException: Failure unspecified at GSS-API level 
> (Mechanism level: Request is a replay (34))
> 
> HTTP ERROR 403
> Problem accessing /proxy/application_1518674952153_0070/metrics/json. 
> Reason:
>  GSSException: Failure unspecified at GSS-API level (Mechanism level: 
> Request is a replay (34))
> 
> 
> {code}
> Rootcausing : proxyserver on RM can't be supported for Kerberos enabled 
> cluster because AuthenticationFilter is applied twice in Hadoop code (once in 
> httpServer2 for RM, and another instance from AmFilterInitializer for proxy 
> server). This will require code changes to hadoop-yarn-server-web-proxy 
> project






[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment

2018-04-19 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444097#comment-16444097
 ] 

Daryn Sharp commented on YARN-8108:
---

The TGS issues are purely caused by the double registration of the 
RMAuthenticationFilter for the /proxy path, so I don't think the SpnegoFilter 
init is involved.  Please clarify the relevance?

Silently ignoring the explicit configuration for the proxyserver when it's 
internal may have security ramifications.  An admin may want more or less 
restrictive auth for the two services.

I'm a bit uneasy about rationalizing a not-well-understood fix for an issue 
whose root cause is unknown.  Please track down the Jira that introduced the 
regression/incompatibility so we can correctly assess the problem.

> RM metrics rest API throws GSSException in kerberized environment
> -
>
> Key: YARN-8108
> URL: https://issues.apache.org/jira/browse/YARN-8108
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Kshitij Badani
>Priority: Major
> Attachments: YARN-8108.001.patch
>
>
> Test is trying to pull up metrics data from SHS after kiniting as 'test_user'
> It is throwing GSSException as follows
> {code:java}
> b2b460b80713|RUNNING: curl --silent -k -X GET -D 
> /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : 
> http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15
>  07:15:48,757|INFO|MainThread|machine.py:194 - 
> run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0
> 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - 
> getMetricsJsonData()|metrics:
> 
> 
> 
> Error 403 GSSException: Failure unspecified at GSS-API level 
> (Mechanism level: Request is a replay (34))
> 
> HTTP ERROR 403
> Problem accessing /proxy/application_1518674952153_0070/metrics/json. 
> Reason:
>  GSSException: Failure unspecified at GSS-API level (Mechanism level: 
> Request is a replay (34))
> 
> 
> {code}
> Rootcausing : proxyserver on RM can't be supported for Kerberos enabled 
> cluster because AuthenticationFilter is applied twice in Hadoop code (once in 
> httpServer2 for RM, and another instance from AmFilterInitializer for proxy 
> server). This will require code changes to hadoop-yarn-server-web-proxy 
> project






[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment

2018-04-18 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443180#comment-16443180
 ] 

Daryn Sharp commented on YARN-8108:
---

bq. This seems to work and not trigger code path registered by proxyserver.

Please elaborate:
# Why do we want to bypass the code registered by the proxyserver?
# Should the proxy service even be using the RM's auth filter?
# How/why does changing addFilter to addGlobalFilter fix the problem?  Adding 
the filter to every context (even those explicitly registered to not be 
filtered) seems counterintuitive.

I think we also need to root cause exactly what change caused the RM auth 
filter to be double registered so we can ensure we've correctly fixed the bug.
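
For context on item 3, the two registration calls come from 
{{org.apache.hadoop.http.FilterContainer}}; a minimal sketch of the difference, 
assuming {{container}} is the {{FilterContainer}} passed to a filter initializer 
(the filter name and parameter values here are illustrative, not the actual 
RM/proxy configuration):

{code:java}
Map<String, String> params = new HashMap<>();
params.put("kerberos.principal", "HTTP/_HOST@EXAMPLE.COM");  // example value only

// addFilter: the filter is mapped onto the declared servlet paths only.
container.addFilter("RMAuthenticationFilter",
    RMAuthenticationFilter.class.getName(), params);

// addGlobalFilter: the filter is mapped onto every context, including ones
// registered expecting to be unfiltered - the behavior questioned above.
container.addGlobalFilter("RMAuthenticationFilter",
    RMAuthenticationFilter.class.getName(), params);
{code}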

> RM metrics rest API throws GSSException in kerberized environment
> -
>
> Key: YARN-8108
> URL: https://issues.apache.org/jira/browse/YARN-8108
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Kshitij Badani
>Priority: Major
> Attachments: YARN-8108.001.patch
>
>
> Test is trying to pull up metrics data from SHS after kiniting as 'test_user'
> It is throwing GSSException as follows
> {code:java}
> b2b460b80713|RUNNING: curl --silent -k -X GET -D 
> /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : 
> http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15
>  07:15:48,757|INFO|MainThread|machine.py:194 - 
> run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0
> 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - 
> getMetricsJsonData()|metrics:
> 
> 
> 
> Error 403 GSSException: Failure unspecified at GSS-API level 
> (Mechanism level: Request is a replay (34))
> 
> HTTP ERROR 403
> Problem accessing /proxy/application_1518674952153_0070/metrics/json. 
> Reason:
>  GSSException: Failure unspecified at GSS-API level (Mechanism level: 
> Request is a replay (34))
> 
> 
> {code}
> Rootcausing : proxyserver on RM can't be supported for Kerberos enabled 
> cluster because AuthenticationFilter is applied twice in Hadoop code (once in 
> httpServer2 for RM, and another instance from AmFilterInitializer for proxy 
> server). This will require code changes to hadoop-yarn-server-web-proxy 
> project






[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment

2018-04-17 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440940#comment-16440940
 ] 

Daryn Sharp commented on YARN-8108:
---

Analysis looks sound.  Agree each servlet should scope filters for itself, not 
globally.

I'm surprised this hasn't been found before.  Is this specific to 3.x?  Or does 
it exist in 2.x?  (I guess we haven't seen this bug due to an alternate auth for 
the RM.)

> RM metrics rest API throws GSSException in kerberized environment
> -
>
> Key: YARN-8108
> URL: https://issues.apache.org/jira/browse/YARN-8108
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Kshitij Badani
>Priority: Major
>
> Test is trying to pull up metrics data from SHS after kiniting as 'test_user'
> It is throwing GSSException as follows
> {code:java}
> b2b460b80713|RUNNING: curl --silent -k -X GET -D 
> /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : 
> http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15
>  07:15:48,757|INFO|MainThread|machine.py:194 - 
> run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0
> 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - 
> getMetricsJsonData()|metrics:
> 
> 
> 
> Error 403 GSSException: Failure unspecified at GSS-API level 
> (Mechanism level: Request is a replay (34))
> 
> HTTP ERROR 403
> Problem accessing /proxy/application_1518674952153_0070/metrics/json. 
> Reason:
>  GSSException: Failure unspecified at GSS-API level (Mechanism level: 
> Request is a replay (34))
> 
> 
> {code}
> Rootcausing : proxyserver on RM can't be supported for Kerberos enabled 
> cluster because AuthenticationFilter is applied twice in Hadoop code (once in 
> httpServer2 for RM, and another instance from AmFilterInitializer for proxy 
> server). This will require code changes to hadoop-yarn-server-web-proxy 
> project






[jira] [Commented] (YARN-7922) Yarn dont resolve rm/_HOST to hostname

2018-02-12 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361017#comment-16361017
 ] 

Daryn Sharp commented on YARN-7922:
---

This shouldn't be able to happen.  Distributed shell gets the renewer from 
{{YarnClientUtils.getRmPrincipal}}, which calls 
{{SecurityUtil.getServerPrincipal}} to substitute _HOST.  Yet somehow the 
substitution did not occur.

The most conceivable, yet unlikely, way I see this failing is if the principal 
has more than 3 components, i.e. it contains another / or @, which would cause 
the substitution to short-circuit.
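
For reference, a minimal sketch of the substitution path (the principal and 
hostname below are examples, not values from the reporter's cluster):

{code:java}
// _HOST in the configured principal is replaced with the canonical name of the
// given host, e.g. yielding "rm/rm1_host.fulldomain@EXAMPLE.COM".
String renewer = SecurityUtil.getServerPrincipal("rm/_HOST@EXAMPLE.COM",
    "rm1_host.fulldomain");
// If the principal doesn't split into exactly three components (an extra '/'
// or '@'), getServerPrincipal returns it unchanged and _HOST survives - the
// failure mode speculated on above.
{code}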

> Yarn dont resolve rm/_HOST to hostname
> --
>
> Key: YARN-7922
> URL: https://issues.apache.org/jira/browse/YARN-7922
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.3
>Reporter: Berry Österlund
>Priority: Minor
>
> The normal auth_to_local usually removes everything after the / in the 
> username of the Kerberos principle. That, together with the _HOST setting in 
> the configuration files specifying the Kerberos principles is usually what is 
> required to convert rm/_HOST@ to user yarn.
> In our environment, we cant use the default rules in auth_to_local. We have 
> to specify each and every host and only convert those specifically. In other 
> words, we don’t have the DEFAULT rule in auth_to_local. Ideally, the config 
> for us would be the following
> {code:java}
> RULE:[1:$1@$0](rm@)s/.*/invalid_user/
> RULE:[2:$1/$2@$0](rm/rm1_host.fulldomain@)s/.*/yarn/
> RULE:[2:$1/$2@$0](rm/rm2_host.fulldomain@)s/.*/yarn/
> {code}
> But if we use only that configuration, the servicecheck in Ambari failes with 
> the following exception.
> {code:java}
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit 
> application_1518422080198_0002 to YARN : Failed to renew token: Kind: 
> HDFS_DELEGATION_TOKEN, Service: ha-hdfs:devhadoop, Ident: 
> (HDFS_DELEGATION_TOKEN token 11096 for ambari-qa)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:272)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:708)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.Client.main(Client.java:215)
> {code}
>  
> Inside the RM’s logfile, I can find the following.
> {code:java}
> Caused by: org.apache.hadoop.security.AccessControlException: yarn tries to 
> renew a token with renewer rm/_HOST@
> {code}
> Adding the following rule to auth_to_local solves the problem
>  RULE:[2:$1/$2@$0](rm/_HOST@)s/.*/yarn/
> The client used to test this is executed with the following command
>  yarn org.apache.hadoop.yarn.applications.distributedshell.Client 
> -shell_command ls -num_containers 1 -jar 
> /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar
>  -timeout 30 --queue 






[jira] [Commented] (YARN-7319) java.net.UnknownHostException when trying contact node by hostname

2017-10-12 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202708#comment-16202708
 ] 

Daryn Sharp commented on YARN-7319:
---

bq. java.lang.IllegalArgumentException: java.net.UnknownHostException: 
hadoop-slave-743067341-hqrbk

I'm a bit confused.  Why is the node resolving itself as 
"hadoop-slave-743067341-hqrbk"?  I believe that's the hostname self-reported 
during registration.  If this is truly an ip-only environment, presumably that 
means the junk hostname is only in that node's /etc/hosts, but not in 
/etc/hosts of the other nodes?  I understand not having reverse dns.  However, 
not having forward dns while assigning a private hostname is a bit obtuse; might 
as well not let the host resolve itself if nobody else can resolve it...

Did you try setting {{hadoop.security.token.service.use_ip=false}} per the 
javadocs on buildTokenService?  That will get you past the exception while 
generating the container token.  It's likely the client won't be able to locate 
the token though – ie. token will have a host, but if the env is ip-only, the 
client must use an ip to connect and won't be able to match the ip with the 
hostname in the token.
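
For reference, a minimal sketch of that workaround (normally the property is set 
in core-site.xml; setting it programmatically here is only to illustrate the 
effect):

{code:java}
Configuration conf = new Configuration();
conf.setBoolean("hadoop.security.token.service.use_ip", false);
SecurityUtil.setConfiguration(conf);
// With use_ip=false, SecurityUtil.buildTokenService(addr) puts the hostname
// from the InetSocketAddress into the token's service field instead of
// resolving it to an IP, which avoids the UnknownHostException above.  The
// caveat still applies: an ip-only client may not be able to match the
// hostname-based service when it later looks up the token.
{code}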



> java.net.UnknownHostException when trying contact node by hostname
> --
>
> Key: YARN-7319
> URL: https://issues.apache.org/jira/browse/YARN-7319
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Evgeny Makarov
>
> I'm trying to setup Hadoop on Kubernetes cluster with following setup:
> Hadoop master is k8s pod
> Each hadoop slave is additional k8s pod
> All communication is being processed on IP based manned. In HDFS I have 
> setting of dfs.namenode.datanode.registration.ip-hostname-check set to false 
> and all works fine, however same option missing for YARN manager. 
> Here part of hadoop-master log when trying to submit simple word-count job:
> 2017-10-12 09:00:25,005 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  Error trying to assign container token and NM token to an allocated 
> container container_1507798393049_0001_01_01
> java.lang.IllegalArgumentException: java.net.UnknownHostException: 
> hadoop-slave-743067341-hqrbk
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
> at 
> org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:258)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:220)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.pullNewlyAllocatedContainersAndNMTokens(SchedulerApplicationAttempt.java:454)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.getAllocation(FiCaSchedulerApp.java:269)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocate(CapacityScheduler.java:988)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:971)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:964)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:789)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:795)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:776)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.UnknownHostException: hadoop-slave-743067341-hqrbk
> ... 19 more
> As can be seen, host hadoop-slave-743067341-hqrbk is unreachable. Adding 

[jira] [Created] (YARN-7083) Log aggregation deletes/renames while file is open

2017-08-23 Thread Daryn Sharp (JIRA)
Daryn Sharp created YARN-7083:
-

 Summary: Log aggregation deletes/renames while file is open
 Key: YARN-7083
 URL: https://issues.apache.org/jira/browse/YARN-7083
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.8.2
Reporter: Daryn Sharp
Priority: Critical


YARN-6288 changes the log aggregation writer to be an AutoCloseable.  
Unfortunately, the try-with-resources block for the writer will either rename or 
delete the log while it is still open.

Assuming the NM's behavior is correct, deleting open files only results in 
ominous WARNs in the nodemanager log and increases the rate of logging in the 
NN when the implicit try-with-resource close fails.  These red herrings 
complicate debugging efforts.
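
For illustration, a self-contained sketch of the ordering hazard (names are 
generic stand-ins, not the NM's actual writer classes):

{code:java}
import java.io.IOException;

// A try-with-resources body that deletes/renames the underlying file before
// the implicit close() runs - the shape of the bug described above.
public class AggregationCloseSketch {
  static class Writer implements AutoCloseable {
    private boolean fileGone;
    void writeLogs() { /* aggregate logs into the open file */ }
    void deleteOrRename() { fileGone = true; }   // stands in for fs.delete/rename
    @Override
    public void close() throws IOException {
      if (fileGone) {
        // the real close() fails here, producing the ominous WARNs in the NM
        // log and the extra logging on the NN mentioned above
        throw new IOException("log was deleted/renamed while still open");
      }
    }
  }

  public static void main(String[] args) {
    try (Writer w = new Writer()) {
      w.writeLogs();
      w.deleteOrRename();     // happens inside the try block...
    } catch (IOException e) { // ...so the implicit close() throws
      System.out.println("close failed: " + e.getMessage());
    }
  }
}
{code}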






[jira] [Commented] (YARN-7048) Fix tests faking kerberos to explicitly set ugi auth type

2017-08-22 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137000#comment-16137000
 ] 

Daryn Sharp commented on YARN-7048:
---

Note this patch contains no functional change outside of the two test files it 
updated.

Neither failing test is associated with the test files in this patch.

> Fix tests faking kerberos to explicitly set ugi auth type
> -
>
> Key: YARN-7048
> URL: https://issues.apache.org/jira/browse/YARN-7048
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-7048.patch
>
>
> TestTokenClientRMService and TestRMDelegationTokens are faking kerberos 
> authentication.  The remote user ugis are explicitly created as kerberos but 
> not the login user's ugi.  Prior to  HADOOP-9747 new ugi instances defaulted 
> to kerberos even if not kerberos.  Now ugis default to kerberos only if 
> actually kerberos based which causes the login user based tests to fail.






[jira] [Updated] (YARN-7048) Fix tests faking kerberos to explicitly set ugi auth type

2017-08-18 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-7048:
--
Attachment: YARN-7048.patch

> Fix tests faking kerberos to explicitly set ugi auth type
> -
>
> Key: YARN-7048
> URL: https://issues.apache.org/jira/browse/YARN-7048
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Daryn Sharp
> Attachments: YARN-7048.patch
>
>
> TestTokenClientRMService and TestRMDelegationTokens are faking kerberos 
> authentication.  The remote user ugis are explicitly created as kerberos but 
> not the login user's ugi.  Prior to  HADOOP-9747 new ugi instances defaulted 
> to kerberos even if not kerberos.  Now ugis default to kerberos only if 
> actually kerberos based which causes the login user based tests to fail.






[jira] [Assigned] (YARN-7048) Fix tests faking kerberos to explicitly set ugi auth type

2017-08-18 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp reassigned YARN-7048:
-

Assignee: Daryn Sharp

> Fix tests faking kerberos to explicitly set ugi auth type
> -
>
> Key: YARN-7048
> URL: https://issues.apache.org/jira/browse/YARN-7048
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-7048.patch
>
>
> TestTokenClientRMService and TestRMDelegationTokens are faking kerberos 
> authentication.  The remote user ugis are explicitly created as kerberos but 
> not the login user's ugi.  Prior to  HADOOP-9747 new ugi instances defaulted 
> to kerberos even if not kerberos.  Now ugis default to kerberos only if 
> actually kerberos based which causes the login user based tests to fail.






[jira] [Created] (YARN-7048) Fix tests faking kerberos to explicitly set ugi auth type

2017-08-18 Thread Daryn Sharp (JIRA)
Daryn Sharp created YARN-7048:
-

 Summary: Fix tests faking kerberos to explicitly set ugi auth type
 Key: YARN-7048
 URL: https://issues.apache.org/jira/browse/YARN-7048
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Daryn Sharp


TestTokenClientRMService and TestRMDelegationTokens are faking kerberos 
authentication.  The remote user ugis are explicitly created as kerberos but 
not the login user's ugi.  Prior to  HADOOP-9747 new ugi instances defaulted to 
kerberos even if not kerberos.  Now ugis default to kerberos only if actually 
kerberos based which causes the login user based tests to fail.
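
For illustration, the kind of explicit setup the fix amounts to (a sketch, not 
the attached patch; the principal and user names are made up):

{code:java}
// Make the login user explicitly kerberos instead of relying on the old
// pre-HADOOP-9747 default, and fake a kerberos-authenticated remote user.
UserGroupInformation.setLoginUser(
    UserGroupInformation.createUserForTesting("rm/host@EXAMPLE.COM", new String[0]));
UserGroupInformation.getLoginUser().setAuthenticationMethod(
    UserGroupInformation.AuthenticationMethod.KERBEROS);

UserGroupInformation remoteUgi = UserGroupInformation.createRemoteUser(
    "testuser", SaslRpcServer.AuthMethod.KERBEROS);
{code}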






[jira] [Commented] (YARN-6679) Reduce Resource instance overhead via non-PBImpl

2017-06-21 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16057718#comment-16057718
 ] 

Daryn Sharp commented on YARN-6679:
---

[~jlowe] or [~nroberts] may be able to comment on the allocation throughput.  I 
just reduced overhead found by a profiler.

SLS may not be exercising the RM in the same manner as in a real-world setting. 
 If you look at {{ResourcePBImpl}} it has to:
# instantiate a builder – wasted object
# builder and its parent class have unneeded instance variables – wasted memory
# call setters for memory and vcores, each updates a bit field, assigns 
instance variable, marks parent builder dirty – unnecessary computational 
overhead

By comparison, a simple object with 2 longs is clearly a win.  Even if you 
aren't stressing the scheduler to its maximum, you should see fewer gc/min due 
to slower heap growth.  I don't have the profile available but the cost of 
excessive Resource instantiations is still a non-trivial percent of the loop.
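
To make the comparison concrete, a sketch of what "a simple object with 2 longs" 
means here (illustrative only; it does not implement the full {{Resource}} 
contract or match the class in the attached patch):

{code:java}
// A plain value object: construction is two field writes, no protobuf builder,
// no dirty flags, and far less garbage per scheduling iteration.
final class LightweightResource {
  private long memorySize;
  private long virtualCores;

  LightweightResource(long memorySize, long virtualCores) {
    this.memorySize = memorySize;
    this.virtualCores = virtualCores;
  }
  long getMemorySize()   { return memorySize; }
  long getVirtualCores() { return virtualCores; }
}
{code}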

> Reduce Resource instance overhead via non-PBImpl
> 
>
> Key: YARN-6679
> URL: https://issues.apache.org/jira/browse/YARN-6679
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Fix For: 2.9.0, 3.0.0-alpha4
>
> Attachments: YARN-6679.2.branch-2.patch, YARN-6679.2.trunk.patch, 
> YARN-6679.3.branch-2.patch, YARN-6679.3.trunk.patch, 
> YARN-6679.branch-2.patch, YARN-6679.trunk.patch
>
>
> Creating and using transient PB-based Resource instances during scheduling is 
> very expensive.  The overhead can be transparently reduced by internally 
> using lightweight non-PB based instances.






[jira] [Commented] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue

2017-06-21 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16057664#comment-16057664
 ] 

Daryn Sharp commented on YARN-6681:
---

bq. is it not good enough that leaf queue returns false and parent queue 
returns true ?

I don't know.  I tried to make the absolute minimal no-risk change that 
preserves existing semantics, as dubious as they may be.

The parent queue currently returns false if it has no child queues, so always 
returning true changes the existing semantics.  Likewise, a leaf queue subclass 
currently can claim to have child queues, so always returning false changes the 
semantics.

I'd suggest integrating the current patch(es) and using another jira for further 
changes/optimizations that change semantics?

> Eliminate double-copy of child queues in canAssignToThisQueue
> -
>
> Key: YARN-6681
> URL: https://issues.apache.org/jira/browse/YARN-6681
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6681.2.branch-2.8.patch, 
> YARN-6681.2.branch-2.patch, YARN-6681.2.trunk.patch, 
> YARN-6681.branch-2.8.patch, YARN-6681.branch-2.patch, YARN-6681.trunk.patch
>
>
> 20% of the time in {{AbstractCSQueue#canAssignToThisQueue}} is spent 
> performing two duplications a treemap of child queues into a list - once to 
> test for null, second to see if it's empty.  Eliminating the dups reduces the 
> overhead to 2%.






[jira] [Commented] (YARN-6682) Improve performance of AssignmentInformation datastructures

2017-06-08 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042953#comment-16042953
 ] 

Daryn Sharp commented on YARN-6682:
---

This is a simple/self-contained change.  Does anyone have time to review?

> Improve performance of AssignmentInformation datastructures
> ---
>
> Key: YARN-6682
> URL: https://issues.apache.org/jira/browse/YARN-6682
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6682.branch-2.8.patch, YARN-6682.branch-2.patch, 
> YARN-6682.trunk.patch
>
>
> {{AssignmentInformation}} is inefficient and creates lots of garbage that 
> increase gc pressure.  It creates 3 hashmaps that each contain only 2 
> enum-based keys. This requires wrapper node objects, boxing/unboxing of ints, 
> and more expensive lookups than simply using primitive arrays indexed by enum 
> ordinal.






[jira] [Commented] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue

2017-06-08 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042950#comment-16042950
 ] 

Daryn Sharp commented on YARN-6681:
---

[~Ying Zhang], are you ok with the current patch?

> Eliminate double-copy of child queues in canAssignToThisQueue
> -
>
> Key: YARN-6681
> URL: https://issues.apache.org/jira/browse/YARN-6681
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6681.2.branch-2.8.patch, 
> YARN-6681.2.branch-2.patch, YARN-6681.2.trunk.patch, 
> YARN-6681.branch-2.8.patch, YARN-6681.branch-2.patch, YARN-6681.trunk.patch
>
>
> 20% of the time in {{AbstractCSQueue#canAssignToThisQueue}} is spent 
> performing two duplications a treemap of child queues into a list - once to 
> test for null, second to see if it's empty.  Eliminating the dups reduces the 
> overhead to 2%.






[jira] [Commented] (YARN-6680) Avoid locking overhead for NO_LABEL lookups

2017-06-08 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042949#comment-16042949
 ] 

Daryn Sharp commented on YARN-6680:
---

Any feedback?

> Avoid locking overhead for NO_LABEL lookups
> ---
>
> Key: YARN-6680
> URL: https://issues.apache.org/jira/browse/YARN-6680
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6680.patch
>
>
> Labels are managed via a hash that is protected with a read lock.  The lock 
> acquire and release are each just as expensive as the hash lookup itself - 
> resulting in a 3X slowdown.






[jira] [Commented] (YARN-6679) Reduce Resource instance overhead via non-PBImpl

2017-06-08 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042946#comment-16042946
 ] 

Daryn Sharp commented on YARN-6679:
---

[~dan...@cloudera.com], have I addressed your concerns?

> Reduce Resource instance overhead via non-PBImpl
> 
>
> Key: YARN-6679
> URL: https://issues.apache.org/jira/browse/YARN-6679
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6679.2.branch-2.patch, YARN-6679.2.trunk.patch, 
> YARN-6679.3.branch-2.patch, YARN-6679.3.trunk.patch, 
> YARN-6679.branch-2.patch, YARN-6679.trunk.patch
>
>
> Creating and using transient PB-based Resource instances during scheduling is 
> very expensive.  The overhead can be transparently reduced by internally 
> using lightweight non-PB based instances.






[jira] [Commented] (YARN-6682) Improve performance of AssignmentInformation datastructures

2017-06-05 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16037513#comment-16037513
 ] 

Daryn Sharp commented on YARN-6682:
---

Test failures are completely unrelated.  Ex. RPC clients unable to connect to 
hexstrings...

> Improve performance of AssignmentInformation datastructures
> ---
>
> Key: YARN-6682
> URL: https://issues.apache.org/jira/browse/YARN-6682
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6682.branch-2.8.patch, YARN-6682.branch-2.patch, 
> YARN-6682.trunk.patch
>
>
> {{AssignmentInformation}} is inefficient and creates lots of garbage that 
> increase gc pressure.  It creates 3 hashmaps that each contain only 2 
> enum-based keys. This requires wrapper node objects, boxing/unboxing of ints, 
> and more expensive lookups than simply using primitive arrays indexed by enum 
> ordinal.






[jira] [Updated] (YARN-6679) Reduce Resource instance overhead via non-PBImpl

2017-06-05 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-6679:
--
Attachment: YARN-6679.3.branch-2.patch
YARN-6679.3.trunk.patch

Findbugs warnings are not related to this patch. 
Test failures other than {{TestPBImplRecords}} (caused by me reducing the 
visibility of {{getProto}}) are not related.  I reverted the visibility change.
Fixed up style issues.

> Reduce Resource instance overhead via non-PBImpl
> 
>
> Key: YARN-6679
> URL: https://issues.apache.org/jira/browse/YARN-6679
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6679.2.branch-2.patch, YARN-6679.2.trunk.patch, 
> YARN-6679.3.branch-2.patch, YARN-6679.3.trunk.patch, 
> YARN-6679.branch-2.patch, YARN-6679.trunk.patch
>
>
> Creating and using transient PB-based Resource instances during scheduling is 
> very expensive.  The overhead can be transparently reduced by internally 
> using lightweight non-PB based instances.






[jira] [Updated] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue

2017-06-05 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-6681:
--
Attachment: YARN-6681.2.trunk.patch
YARN-6681.2.branch-2.patch
YARN-6681.2.branch-2.8.patch

Updated {{ParentQueue#hasChildQueues}} to return true so as to avoid unnecessary 
synchronization or locks.

> Eliminate double-copy of child queues in canAssignToThisQueue
> -
>
> Key: YARN-6681
> URL: https://issues.apache.org/jira/browse/YARN-6681
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6681.2.branch-2.8.patch, 
> YARN-6681.2.branch-2.patch, YARN-6681.2.trunk.patch, 
> YARN-6681.branch-2.8.patch, YARN-6681.branch-2.patch, YARN-6681.trunk.patch
>
>
> 20% of the time in {{AbstractCSQueue#canAssignToThisQueue}} is spent 
> performing two duplications a treemap of child queues into a list - once to 
> test for null, second to see if it's empty.  Eliminating the dups reduces the 
> overhead to 2%.






[jira] [Updated] (YARN-6679) Reduce Resource instance overhead via non-PBImpl

2017-06-05 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-6679:
--
Attachment: YARN-6679.2.trunk.patch
YARN-6679.2.branch-2.patch

[~dan...@cloudera.com], I'll save my get out of jail free card.  I added tests 
that new instances are not pb impls and the to/from conversion is correct.  
{{TestPBImplRecords}} also already does pb conversion tests so coverage should 
be good.

> Reduce Resource instance overhead via non-PBImpl
> 
>
> Key: YARN-6679
> URL: https://issues.apache.org/jira/browse/YARN-6679
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6679.2.branch-2.patch, YARN-6679.2.trunk.patch, 
> YARN-6679.branch-2.patch, YARN-6679.trunk.patch
>
>
> Creating and using transient PB-based Resource instances during scheduling is 
> very expensive.  The overhead can be transparently reduced by internally 
> using lightweight non-PB based instances.






[jira] [Commented] (YARN-6680) Avoid locking overhead for NO_LABEL lookups

2017-06-02 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035505#comment-16035505
 ] 

Daryn Sharp commented on YARN-6680:
---

# Maybe it was the same before, I just attacked the top few hot spots in a 
profile to get us unblocked.  I think Nathan measured a ~10% performance 
reduction but it was enough to push the RM off a cliff.  I overall achieved a 
2X performance increase from my patches under this umbrella so maybe this was 
low hanging fruit.
# Ah yes.  There are other (unnecessary) RW lock hotspots, but the label 
manager map uses locks to protect the concurrent map, essentially to make a 
consistent copy.  A concurrent map can be iterated w/o CME but isn't guaranteed 
to visit every entry, hence why I think the locks are there.
# Yes.  The goal is to optimize the common case.
# Cloning the resource would be very bad.  Resource object allocation overhead 
is already very high.

> Avoid locking overhead for NO_LABEL lookups
> ---
>
> Key: YARN-6680
> URL: https://issues.apache.org/jira/browse/YARN-6680
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6680.patch
>
>
> Labels are managed via a hash that is protected with a read lock.  The lock 
> acquire and release are each just as expensive as the hash lookup itself - 
> resulting in a 3X slowdown.






[jira] [Commented] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue

2017-06-02 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034902#comment-16034902
 ] 

Daryn Sharp commented on YARN-6681:
---

Should/could I just unconditionally return true for a parent queue?  A RW lock 
is ridiculously expensive just to fetch the size.  I tried to make minimal/low-risk
changes for an internal build to get us unblocked but it would seem to make 
sense.  My hesitation was the null check on childQueues, implying an iteration 
would NPE, but it's marked final so always returning true seems safe?
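
For what it's worth, the shape of the change under discussion (class names 
mirror the CapacityScheduler hierarchy, but the bodies are illustrative):

{code:java}
// Leaf queues have no children; ParentQueue's childQueues field is final and
// never null, so a constant answer avoids copying the TreeMap into a list
// (twice) and avoids the RW lock, at the cost of the semantic change above.
abstract class AbstractCSQueueSketch {
  boolean hasChildQueues() { return false; }
}

class ParentQueueSketch extends AbstractCSQueueSketch {
  @Override
  boolean hasChildQueues() { return true; }
}
{code}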

> Eliminate double-copy of child queues in canAssignToThisQueue
> -
>
> Key: YARN-6681
> URL: https://issues.apache.org/jira/browse/YARN-6681
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6681.branch-2.8.patch, YARN-6681.branch-2.patch, 
> YARN-6681.trunk.patch
>
>
> 20% of the time in {{AbstractCSQueue#canAssignToThisQueue}} is spent 
> performing two duplications a treemap of child queues into a list - once to 
> test for null, second to see if it's empty.  Eliminating the dups reduces the 
> overhead to 2%.






[jira] [Commented] (YARN-6680) Avoid locking overhead for NO_LABEL lookups

2017-06-02 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034834#comment-16034834
 ] 

Daryn Sharp commented on YARN-6680:
---

I use a profiler for performance work – hunches are inevitably wrong.  [~jlowe] 
and [~nroberts] verified the improvement.  2.8 is currently DOA, see 
[details|https://issues.apache.org/jira/browse/YARN-6679?focusedCommentId=16033655=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16033655].
  This patch, along with my others under the umbrella, increased overall 
performance by ~2X.  

The scheduler's fine-grained locking was a bad idea.  RW locks are not cheap, esp. 
for tiny critical sections.  Write barriers are extremely expensive - slower 
than a hash lookup.  Surprising but true.  Eventually these maps should be 
concurrent maps, which use no lock for read ops, and the memory read barriers 
are cheap.  The processor just sniffs the cache lines.

bq. i feel the locks atleast in few places are required to maintain 
consistency. [...] read lock is required as intermittently node's partition 
mapping could be changed, or node can be deactivated etc... all the ops where 
write lock is held ?

The locks currently do not provide guaranteed consistency.  Example:
# Consistent:
#* thread1 read locks, gets resource, unlocks
#* thread2 write locks, updates resource
#* thread1 accesses resource – won't see thread2 update immediately
# Inconsistent:
#* thread1 write locks, updates resource
#* thread2 read locks, gets resource, unlocks, accesses resource – will see 
thread1 update

With my patch, the reader won't see the update in either case (unless it was 
also the writer).  The question is does it matter?  It's already a race due to 
no coarse grain lock to provide a snapshot view in time.  Will it have 
detrimental impact to possibly see slightly stale data?  If it does then 
there's already a major bug in code.

In the end, this patch contributed to making 2.8 actually deployable.
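
For illustration, the kind of fast path this implies for the common case (a 
sketch; the field and method names approximate the label manager code and are 
not the exact patch):

{code:java}
public Resource getResourceByLabel(String label, Resource clusterResource) {
  // Common case: the NO_LABEL lookup returns a directly tracked value without
  // ever touching the read lock.
  if (label == null || label.isEmpty()) {
    return noNodeLabel.getResource();
  }
  // Rare labelled lookups keep the existing locked path.
  readLock.lock();
  try {
    RMNodeLabel info = labelCollections.get(label);
    return info == null ? Resources.none() : info.getResource();
  } finally {
    readLock.unlock();
  }
}
{code}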

> Avoid locking overhead for NO_LABEL lookups
> ---
>
> Key: YARN-6680
> URL: https://issues.apache.org/jira/browse/YARN-6680
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6680.patch
>
>
> Labels are managed via a hash that is protected with a read lock.  The lock 
> acquire and release are each just as expensive as the hash lookup itself - 
> resulting in a 3X slowdown.






[jira] [Updated] (YARN-6682) Improve performance of AssignmentInformation datastructures

2017-06-01 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-6682:
--
Attachment: YARN-6682.branch-2.8.patch
YARN-6682.branch-2.patch
YARN-6682.trunk.patch

Simply uses primitive arrays indexed on ordinal.  Differences between branches 
are trivial: e.g. generics removal, containerId vs rmContainer.

> Improve performance of AssignmentInformation datastructures
> ---
>
> Key: YARN-6682
> URL: https://issues.apache.org/jira/browse/YARN-6682
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6682.branch-2.8.patch, YARN-6682.branch-2.patch, 
> YARN-6682.trunk.patch
>
>
> {{AssignmentInformation}} is inefficient and creates lots of garbage that 
> increase gc pressure.  It creates 3 hashmaps that each contain only 2 
> enum-based keys. This requires wrapper node objects, boxing/unboxing of ints, 
> and more expensive lookups than simply using primitive arrays indexed by enum 
> ordinal.






[jira] [Created] (YARN-6682) Improve performance of AssignmentInformation datastructures

2017-06-01 Thread Daryn Sharp (JIRA)
Daryn Sharp created YARN-6682:
-

 Summary: Improve performance of AssignmentInformation 
datastructures
 Key: YARN-6682
 URL: https://issues.apache.org/jira/browse/YARN-6682
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.8.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp


{{AssignmentInformation}} is inefficient and creates lots of garbage that 
increase gc pressure.  It creates 3 hashmaps that each contain only 2 
enum-based keys. This requires wrapper node objects, boxing/unboxing of ints, 
and more expensive lookups than simply using primitive arrays indexed by enum 
ordinal.
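
For illustration, the sort of replacement this implies (a sketch; the enum 
mirrors {{AssignmentInformation.Operation}} but the holder class is simplified):

{code:java}
// One primitive slot per enum constant: no HashMap entry objects and no
// Integer boxing/unboxing on the scheduler's hot path.
enum Operation { ALLOCATION, RESERVATION }

final class AssignmentCounters {
  private final int[] counts = new int[Operation.values().length];

  void increment(Operation op) { counts[op.ordinal()]++; }
  int get(Operation op)        { return counts[op.ordinal()]; }
}
{code}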






[jira] [Updated] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue

2017-06-01 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-6681:
--
Attachment: YARN-6681.branch-2.patch
YARN-6681.branch-2.8.patch

The branch-2 patch was really branch-2.8.  All 3 patches are just context.

> Eliminate double-copy of child queues in canAssignToThisQueue
> -
>
> Key: YARN-6681
> URL: https://issues.apache.org/jira/browse/YARN-6681
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6681.branch-2.8.patch, YARN-6681.branch-2.patch, 
> YARN-6681.trunk.patch
>
>
> 20% of the time in {{AbstractCSQueue#canAssignToThisQueue}} is spent 
> performing two duplications a treemap of child queues into a list - once to 
> test for null, second to see if it's empty.  Eliminating the dups reduces the 
> overhead to 2%.






[jira] [Updated] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue

2017-06-01 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-6681:
--
Attachment: (was: YARN-6681.branch-2.patch)

> Eliminate double-copy of child queues in canAssignToThisQueue
> -
>
> Key: YARN-6681
> URL: https://issues.apache.org/jira/browse/YARN-6681
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6681.trunk.patch
>
>
> 20% of the time in {{AbstractCSQueue#canAssignToThisQueue}} is spent 
> performing two duplications a treemap of child queues into a list - once to 
> test for null, second to see if it's empty.  Eliminating the dups reduces the 
> overhead to 2%.






[jira] [Commented] (YARN-6680) Avoid locking overhead for NO_LABEL lookups

2017-06-01 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033665#comment-16033665
 ] 

Daryn Sharp commented on YARN-6680:
---

Findbugs warning is from {{AggregatedLogFormat}} which is not part of this 
patch.

> Avoid locking overhead for NO_LABEL lookups
> ---
>
> Key: YARN-6680
> URL: https://issues.apache.org/jira/browse/YARN-6680
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6680.patch
>
>
> Labels are managed via a hash that is protected with a read lock.  The lock 
> acquire and release are each just as expensive as the hash lookup itself - 
> resulting in a 3X slowdown.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6679) Reduce Resource instance overhead via non-PBImpl

2017-06-01 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033655#comment-16033655
 ] 

Daryn Sharp commented on YARN-6679:
---

Thanks Daniel!

Using signum to handle a NaN (impossible, I hope?) may not be worth the cost.
I only moved the pre-existing method from the pb impl to the base class, so I'd 
suggest another jira if you feel strongly that it should be changed.

bq. How thoroughly has this been tested?
I'd say very.  The first large/busy pre-production cluster was crippled by 2.8. 
 The scheduler thread was constantly pegging a cpu and falling behind.  Deploys 
were halted.  We deployed my collection of patches about 1.5w ago.  Cpu 
fluctuates a lot, but doesn't stay pegged anymore.

bq. Wanna add some unit tests to confirm the newInstance() methods and the PB 
conversion work as expected?
I would certainly hope the existing tests provide coverage! :)  I didn't expose 
any new methods to test but I'll concoct some rudimentary tests if need be.

> Reduce Resource instance overhead via non-PBImpl
> 
>
> Key: YARN-6679
> URL: https://issues.apache.org/jira/browse/YARN-6679
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6679.branch-2.patch, YARN-6679.trunk.patch
>
>
> Creating and using transient PB-based Resource instances during scheduling is 
> very expensive.  The overhead can be transparently reduced by internally 
> using lightweight non-PB based instances.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue

2017-06-01 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-6681:
--
Attachment: YARN-6681.branch-2.patch
YARN-6681.trunk.patch

Add a {{hasChildQueues}} method that is overridden by {{ParentQueue}} to avoid 
the tree-to-list duplications.
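
Roughly the following shape (hand-written illustration with simplified names, not the patch itself):
{code}
// Let the parent queue answer "do I have children?" directly instead of
// copying its child collection into a temporary list first.
class AbstractCSQueueSketch {
  // Leaf queues have no children.
  boolean hasChildQueues() {
    return false;
  }
}

class ParentQueueSketch extends AbstractCSQueueSketch {
  private final java.util.TreeMap<String, AbstractCSQueueSketch> childQueues =
      new java.util.TreeMap<>();

  @Override
  boolean hasChildQueues() {
    // O(1) check on the existing map; no copy, no duplication.
    return !childQueues.isEmpty();
  }
}
{code}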

> Eliminate double-copy of child queues in canAssignToThisQueue
> -
>
> Key: YARN-6681
> URL: https://issues.apache.org/jira/browse/YARN-6681
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6681.branch-2.patch, YARN-6681.trunk.patch
>
>
> 20% of the time in {{AbstractCSQueue#canAssignToThisQueue}} is spent 
> performing two duplications of a treemap of child queues into a list - once to 
> test for null, a second time to see if it's empty.  Eliminating the dups reduces 
> the overhead to 2%.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue

2017-06-01 Thread Daryn Sharp (JIRA)
Daryn Sharp created YARN-6681:
-

 Summary: Eliminate double-copy of child queues in 
canAssignToThisQueue
 Key: YARN-6681
 URL: https://issues.apache.org/jira/browse/YARN-6681
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.8.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp


20% of the time in {{AbstractCSQueue#canAssignToThisQueue}} is spent performing 
two duplications of a treemap of child queues into a list - once to test for null, 
a second time to see if it's empty.  Eliminating the dups reduces the overhead to 2%.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6680) Avoid locking overhead for NO_LABEL lookups

2017-06-01 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-6680:
--
Attachment: YARN-6680.patch

Simply maintains a reference to the no-label instances that are already being 
seeded into maps.  Lookups of the no-label key will use the reference in lieu 
of the lock and hash.

Not a perfect solution but provides a significant performance boost until the 
maps can be changed to concurrent.
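
Conceptually it is the pattern below (rough sketch with invented names, not the actual label manager code):
{code}
// Keep a direct reference to the NO_LABEL entry so the common-case lookup
// skips both the read lock and the hash lookup.
class NoLabelLookupSketch {
  static final String NO_LABEL = "";

  static class LabelInfo { /* per-label bookkeeping */ }

  private final java.util.Map<String, LabelInfo> labels = new java.util.HashMap<>();
  private final java.util.concurrent.locks.ReadWriteLock lock =
      new java.util.concurrent.locks.ReentrantReadWriteLock();

  // Seeded once and never replaced.
  private final LabelInfo noLabelInfo = new LabelInfo();

  NoLabelLookupSketch() {
    labels.put(NO_LABEL, noLabelInfo);
  }

  LabelInfo get(String label) {
    if (label == null || label.isEmpty()) {
      return noLabelInfo;                 // fast path: no lock, no hash
    }
    lock.readLock().lock();
    try {
      return labels.get(label);
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}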

> Avoid locking overhead for NO_LABEL lookups
> ---
>
> Key: YARN-6680
> URL: https://issues.apache.org/jira/browse/YARN-6680
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6680.patch
>
>
> Labels are managed via a hash that is protected with a read lock.  The lock 
> acquire and release are each just as expensive as the hash lookup itself - 
> resulting in a 3X slowdown.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6680) Avoid locking overhead for NO_LABEL lookups

2017-06-01 Thread Daryn Sharp (JIRA)
Daryn Sharp created YARN-6680:
-

 Summary: Avoid locking overhead for NO_LABEL lookups
 Key: YARN-6680
 URL: https://issues.apache.org/jira/browse/YARN-6680
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.8.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp


Labels are managed via a hash that is protected with a read lock.  The lock 
acquire and release are each just as expensive as the hash lookup itself - 
resulting in a 3X slowdown.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6679) Reduce Resource instance overhead via non-PBImpl

2017-06-01 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-6679:
--
Attachment: YARN-6679.branch-2.patch

Same as trunk patch, plus 1-line changes to the deprecated and removed resource 
increase/decrease PBs and request PB.

> Reduce Resource instance overhead via non-PBImpl
> 
>
> Key: YARN-6679
> URL: https://issues.apache.org/jira/browse/YARN-6679
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6679.branch-2.patch, YARN-6679.trunk.patch
>
>
> Creating and using transient PB-based Resource instances during scheduling is 
> very expensive.  The overhead can be transparently reduced by internally 
> using lightweight non-PB based instances.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (YARN-6245) Add FinalResource object to reduce overhead of Resource class instancing

2017-06-01 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-6245:
--
Comment: was deleted

(was: Same as trunk patch, plus 1-line changes to the deprecated and removed 
resource increase/decrease PBs and request PB.)

> Add FinalResource object to reduce overhead of Resource class instancing
> 
>
> Key: YARN-6245
> URL: https://issues.apache.org/jira/browse/YARN-6245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
> Attachments: observable-resource.patch, 
> YARN-6245.preliminary-staled.1.patch
>
>
> There's a lot of Resource object creation in the YARN Scheduler. Since the 
> Resource object is backed by protobuf, creation of such objects is expensive 
> and becomes a bottleneck.
> To address the problem, we can introduce a FinalResource (is it better to 
> call it ImmutableResource?) object, which is not backed by PBImpl. We can use 
> this object in frequently invoked paths in the scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6245) Add FinalResource object to reduce overhead of Resource class instancing

2017-06-01 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-6245:
--
Attachment: YARN-6679.branch-2.patch

Same as trunk patch, plus 1-line changes to the deprecated and removed resource 
increase/decrease PBs and request PB.

> Add FinalResource object to reduce overhead of Resource class instancing
> 
>
> Key: YARN-6245
> URL: https://issues.apache.org/jira/browse/YARN-6245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
> Attachments: observable-resource.patch, 
> YARN-6245.preliminary-staled.1.patch
>
>
> There's a lot of Resource object creation in the YARN Scheduler. Since the 
> Resource object is backed by protobuf, creation of such objects is expensive 
> and becomes a bottleneck.
> To address the problem, we can introduce a FinalResource (is it better to 
> call it ImmutableResource?) object, which is not backed by PBImpl. We can use 
> this object in frequently invoked paths in the scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6245) Add FinalResource object to reduce overhead of Resource class instancing

2017-06-01 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-6245:
--
Attachment: (was: YARN-6679.branch-2.patch)

> Add FinalResource object to reduce overhead of Resource class instancing
> 
>
> Key: YARN-6245
> URL: https://issues.apache.org/jira/browse/YARN-6245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
> Attachments: observable-resource.patch, 
> YARN-6245.preliminary-staled.1.patch
>
>
> There's a lot of Resource object creation in the YARN Scheduler. Since the 
> Resource object is backed by protobuf, creation of such objects is expensive 
> and becomes a bottleneck.
> To address the problem, we can introduce a FinalResource (is it better to 
> call it ImmutableResource?) object, which is not backed by PBImpl. We can use 
> this object in frequently invoked paths in the scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6245) Add FinalResource object to reduce overhead of Resource class instancing

2017-06-01 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033233#comment-16033233
 ] 

Daryn Sharp commented on YARN-6245:
---

Posted to new YARN-6679 in case you wish to pursue the immutable resources, 
although the scheduler really should try to reuse instances when possible.

> Add FinalResource object to reduce overhead of Resource class instancing
> 
>
> Key: YARN-6245
> URL: https://issues.apache.org/jira/browse/YARN-6245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
> Attachments: observable-resource.patch, 
> YARN-6245.preliminary-staled.1.patch
>
>
> There's a lot of Resource object creation in the YARN Scheduler. Since the 
> Resource object is backed by protobuf, creation of such objects is expensive 
> and becomes a bottleneck.
> To address the problem, we can introduce a FinalResource (is it better to 
> call it ImmutableResource?) object, which is not backed by PBImpl. We can use 
> this object in frequently invoked paths in the scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6679) Reduce Resource instance overhead via non-PBImpl

2017-06-01 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-6679:
--
Attachment: YARN-6679.trunk.patch

# Add private {{SimpleResource}} object with longs for memory/vcores.
# {{ResourcePBImpl#getProto(Resource)}} converts to a PBImpl as necessary to 
create a message
# Bulk of patch is a regexp change of {{((ResourcePBImpl) r).getProto()}} to 
{{ProtoUtils.convertToProtoFormat(r)}} or the local class conversion method.
# In a few places, just set the PB instead of creating and checking equality 
before setting.  Otherwise simple resources will be double converted.

The overall effect is that resources received via PBs remain PB-based.  The myriad 
of internal resource instances used during calculations become lightweight and are 
converted to a PB only if necessary.
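
For illustration, the lightweight object is essentially the following (sketch only with simplified names; the real class has more to it):
{code}
// A non-PB Resource that just holds two primitives; a protobuf message is
// built only if the value actually has to go over the wire.
class SimpleResourceSketch {
  private final long memorySize;
  private final int virtualCores;

  SimpleResourceSketch(long memorySize, int virtualCores) {
    this.memorySize = memorySize;
    this.virtualCores = virtualCores;
  }

  long getMemorySize()   { return memorySize; }
  int  getVirtualCores() { return virtualCores; }
}
{code}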

> Reduce Resource instance overhead via non-PBImpl
> 
>
> Key: YARN-6679
> URL: https://issues.apache.org/jira/browse/YARN-6679
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: YARN-6679.trunk.patch
>
>
> Creating and using transient PB-based Resource instances during scheduling is 
> very expensive.  The overhead can be transparently reduced by internally 
> using lightweight non-PB based instances.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6679) Reduce Resource instance overhead via non-PBImpl

2017-06-01 Thread Daryn Sharp (JIRA)
Daryn Sharp created YARN-6679:
-

 Summary: Reduce Resource instance overhead via non-PBImpl
 Key: YARN-6679
 URL: https://issues.apache.org/jira/browse/YARN-6679
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.8.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp


Creating and using transient PB-based Resource instances during scheduling is 
very expensive.  The overhead can be transparently reduced by internally using 
lightweight non-PB based instances.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6245) Add FinalResource object to reduce overhead of Resource class instancing

2017-06-01 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033054#comment-16033054
 ] 

Daryn Sharp commented on YARN-6245:
---

I've been OOO.  I'll be posting my collection of patches today for review.

> Add FinalResource object to reduce overhead of Resource class instancing
> 
>
> Key: YARN-6245
> URL: https://issues.apache.org/jira/browse/YARN-6245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
> Attachments: observable-resource.patch, 
> YARN-6245.preliminary-staled.1.patch
>
>
> There's a lot of Resource object creation in the YARN Scheduler. Since the 
> Resource object is backed by protobuf, creation of such objects is expensive 
> and becomes a bottleneck.
> To address the problem, we can introduce a FinalResource (is it better to 
> call it ImmutableResource?) object, which is not backed by PBImpl. We can use 
> this object in frequently invoked paths in the scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6245) Add FinalResource object to reduce overhead of Resource class instancing

2017-05-22 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16019701#comment-16019701
 ] 

Daryn Sharp commented on YARN-6245:
---

[~jlowe] asked me to comment since we're running into 2.8 scheduler performance 
issues we believe are (in part) due to pb impl based objects.  I think I've 
designed a means for resources via RPC to remain {{ResourcePBImpl}} while 
internally created resources are lightweight and only converted to a PB if it 
will be sent over the wire.

At least as a start, it's a very simple patch that substitutes in a lightweight 
object via {{Resource.newInstance}} that simply contains 2 longs.  Replaced 
usages of {{((ResourcePBImpl)r)#getProto()}} with 
{{ProtoUtils.convertToProtoFormat(Resource)}} which converts the lightweight to 
a pb impl as required.  That's it.

We're testing today.  Will post a sample patch if it looks promising.
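
In outline, the conversion helper does something like this (sketch only; types and calls simplified, not the actual ProtoUtils code):
{code}
// Convert a Resource to its protobuf form only when required, reusing the
// proto already held by a ResourcePBImpl when one exists.
static YarnProtos.ResourceProto convertToProtoFormat(Resource r) {
  if (r instanceof ResourcePBImpl) {
    return ((ResourcePBImpl) r).getProto();   // already PB-backed
  }
  // lightweight instance: build a proto from the primitive fields
  return YarnProtos.ResourceProto.newBuilder()
      .setMemory(r.getMemorySize())
      .setVirtualCores(r.getVirtualCores())
      .build();
}
{code}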


> Add FinalResource object to reduce overhead of Resource class instancing
> 
>
> Key: YARN-6245
> URL: https://issues.apache.org/jira/browse/YARN-6245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
> Attachments: observable-resource.patch, 
> YARN-6245.preliminary-staled.1.patch
>
>
> There's a lot of Resource object creation in the YARN Scheduler. Since the 
> Resource object is backed by protobuf, creation of such objects is expensive 
> and becomes a bottleneck.
> To address the problem, we can introduce a FinalResource (is it better to 
> call it ImmutableResource?) object, which is not backed by PBImpl. We can use 
> this object in frequently invoked paths in the scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6603) NPE in RMAppsBlock

2017-05-16 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16012435#comment-16012435
 ] 

Daryn Sharp commented on YARN-6603:
---

+1 I think no test is fine due to difficulty of forcing the race condition and 
the patch essentially amounts to a null check.  Failed tests appear unrelated.

> NPE in RMAppsBlock
> --
>
> Key: YARN-6603
> URL: https://issues.apache.org/jira/browse/YARN-6603
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-6603.001.patch, YARN-6603.002.patch
>
>
> We are seeing an intermittent NPE when the RM is trying to render the 
> /cluster URI.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6603) NPE in RMAppsBlock

2017-05-15 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011414#comment-16011414
 ] 

Daryn Sharp commented on YARN-6603:
---

After getting the rmApp, you should replace:
{code}
RMAppAttempt appAttempt = rmApp.getAppAttempts().get(appAttemptId);
{code}
with:
{code}
RMAppAttempt appAttempt = rmApp.getAppAttempt(appAttemptId);
{code}

The current getAppAttempts() returns an unmodifiable collection of a 
non-threadsafe map, which isn't useful at all.  The latter uses proper 
synchronization to look up the attempt.

You may also be saddened to learn that a synchronized copy of the blacklist 
hashset is created just to get the size.  Bonus points for fixing that, but not 
necessary.
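
For the blacklist size, the cheaper shape would be roughly the following (sketch only with invented names; the real attempt class is more involved):
{code}
// Return the size under the lock instead of building a synchronized copy of
// the whole set just to call size() on it.
class BlacklistSketch {
  private final java.util.Set<String> blacklistedNodes = new java.util.HashSet<>();

  // wasteful: copies every element, then throws the copy away
  synchronized int sizeViaCopy() {
    return new java.util.HashSet<>(blacklistedNodes).size();
  }

  // cheaper: just read the size while holding the lock
  synchronized int size() {
    return blacklistedNodes.size();
  }
}
{code}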

> NPE in RMAppsBlock
> --
>
> Key: YARN-6603
> URL: https://issues.apache.org/jira/browse/YARN-6603
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-6603.001.patch
>
>
> We are seeing an intermittent NPE when the RM is trying to render the 
> /cluster URI.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-3760) Log aggregation failures

2017-03-29 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp reopened YARN-3760:
---

Line numbers are from an old release but the error is evident.
{code}
java.lang.IllegalStateException: Cannot close TFile in the middle of key-value 
insertion.
at org.apache.hadoop.io.file.tfile.TFile$Writer.close(TFile.java:310)
at 
org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogWriter.close(AggregatedLogFormat.java:456)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:326)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:429)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:388)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$2.run(LogAggregationService.java:387)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
{code}

_AggregatedLogFormat.LogWriter_
{code}
public void close() {
  try {
this.writer.close();
  } catch (IOException e) {
LOG.warn("Exception closing writer", e);
  }
  IOUtils.closeStream(fsDataOStream);
}
{code}
The TFile writer's close may throw {{IllegalStateException}} if the 
underlying fs data stream failed.  Unfortunately the close() above only catches IOE, 
so the ISE rips out without closing the fsdata stream.

Additionally, the ctor creates the fs data stream then a TFile.Writer w/o a 
try/catch.  If the TFile.Writer ctor throws an exception, it's impossible to 
close the stream.

I haven't checked whether there are further issues with closing the writer 
higher up the stack.
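
A more defensive shape for the close path might be (sketch only, not the actual fix that went in):
{code}
// Ensure the underlying stream is closed even if the TFile writer throws an
// unchecked exception such as IllegalStateException.
public void close() {
  try {
    this.writer.close();
  } catch (Exception e) {            // not just IOException
    LOG.warn("Exception closing writer", e);
  } finally {
    IOUtils.closeStream(fsDataOStream);
  }
}
{code}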

> Log aggregation failures 
> -
>
> Key: YARN-3760
> URL: https://issues.apache.org/jira/browse/YARN-3760
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> The aggregated log file does not appear to be properly closed when writes 
> fail.  This leaves a lease renewer active in the NM that spams the NN with 
> lease renewals.  If the token is marked not to be cancelled, the renewals 
> appear to continue until the token expires.  If the token is cancelled, the 
> periodic renew spam turns into a flood of failed connections until the lease 
> renewer gives up.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4126) RM should not issue delegation tokens in unsecure mode

2016-10-25 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606587#comment-15606587
 ] 

Daryn Sharp commented on YARN-4126:
---

The general contract for servers is to return null when tokens are not 
applicable.  This violates that contract and throws an exception.  How is a 
generalized client supposed to pre-meditate fetching a token?  And how to 
handle a generic IOE?

I'd rather see this reverted from trunk and never integrated.  We've 
historically had lots of problems with all the security-enabled conditionals, 
which is why one of my multi-year-old endeavors is to have tokens always 
enabled and gut the security conditionals.  I've always admired the fact that 
yarn unconditionally used them...  This is a step backwards.
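
The contract being argued for is roughly the following (sketch; the token-creation call is a hypothetical placeholder):
{code}
// When delegation tokens are not applicable (e.g. security is off), return
// null instead of throwing, so a generic client can simply proceed without one.
public Token<?> getDelegationToken(String renewer) throws IOException {
  if (!UserGroupInformation.isSecurityEnabled()) {
    return null;                       // tokens not applicable here
  }
  return createAndStoreToken(renewer); // hypothetical helper for the real path
}
{code}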


> RM should not issue delegation tokens in unsecure mode
> --
>
> Key: YARN-4126
> URL: https://issues.apache.org/jira/browse/YARN-4126
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Bibin A Chundatt
> Fix For: 3.0.0-alpha1
>
> Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch, 
> 0003-YARN-4126.patch, 0004-YARN-4126.patch, 0005-YARN-4126.patch, 
> 0006-YARN-4126.patch
>
>
> ClientRMService#getDelegationToken is currently  returning a delegation token 
> in insecure mode. We should not return the token if it's in insecure mode. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4632) Replacing _HOST in RM_PRINCIPAL should not be the responsibility of the client code

2016-01-25 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115313#comment-15115313
 ] 

Daryn Sharp commented on YARN-4632:
---

How do you intend to make the change in yarn?  As you probably discovered, it's 
too late for the RM to make the substitution since the NN has already encoded 
the principal in the token.

> Replacing _HOST in RM_PRINCIPAL should not be the responsibility of the 
> client code
> ---
>
> Key: YARN-4632
> URL: https://issues.apache.org/jira/browse/YARN-4632
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
>
> It is currently the client's responsibility to call 
> {{SecurityUtil.getServerPrincipal()}} to replace the _HOST placeholder in any 
> principal name used for a delegation token.  This is a non-optional operation 
> and should not be pushed onto the client.
> All client apps that followed the distributed shell as the canonical example 
> failed to do the replacement because distributed shell fails to do the 
> replacement.  (See YARN-4629.)  Rather than fixing the whole world, since the 
> whole world uses distributed shell as a model, let's move the operation into 
> YARN where it belongs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3760) Log aggregation failures

2015-06-02 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569289#comment-14569289
 ] 

Daryn Sharp commented on YARN-3760:
---

Cancelled tokens trigger the retry proxy bug.

 Log aggregation failures 
 -

 Key: YARN-3760
 URL: https://issues.apache.org/jira/browse/YARN-3760
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Daryn Sharp
Priority: Critical

 The aggregated log file does not appear to be properly closed when writes 
 fail.  This leaves a lease renewer active in the NM that spams the NN with 
 lease renewals.  If the token is marked not to be cancelled, the renewals 
 appear to continue until the token expires.  If the token is cancelled, the 
 periodic renew spam turns into a flood of failed connections until the lease 
 renewer gives up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3760) Log aggregation failures

2015-06-02 Thread Daryn Sharp (JIRA)
Daryn Sharp created YARN-3760:
-

 Summary: Log aggregation failures 
 Key: YARN-3760
 URL: https://issues.apache.org/jira/browse/YARN-3760
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Daryn Sharp
Priority: Critical


The aggregated log file does not appear to be properly closed when writes fail. 
 This leaves a lease renewer active in the NM that spams the NN with lease 
renewals.  If the token is marked not to be cancelled, the renewals appear to 
continue until the token expires.  If the token is cancelled, the periodic 
renew spam turns into a flood of failed connections until the lease renewer 
gives up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer

2015-04-09 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487620#comment-14487620
 ] 

Daryn Sharp commented on YARN-3055:
---

Two apps could double renew tokens (completely benign) before this patch.  In 
practice the possibility is slim and it's harmless.

However, currently it's quite buggy. Both apps renewed and then stomped over 
each other's dttrs in allTokens.  Now both apps reference separate yet 
equivalent dttr instances, when the intention was only one app should reference 
a token.  A second/duplicate timer task was also scheduled.  Haven't bothered 
to check later fallout from the inconsistencies.

Patch: A double renew can still occur (unavoidable) but only one timer is 
scheduled.  All apps reference the same dttr instance.  Moving the logic down 
only creates 3 loops instead of 2, but I'll do it if you feel strongly.
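
The dedup can be sketched with a put-if-absent pattern (illustration only; assumes allTokens is a ConcurrentMap, and the scheduling helper is hypothetical):
{code}
// All apps sharing an equivalent token end up referencing the same tracked
// instance, and the renewal timer is scheduled only for the first insert.
DelegationTokenToRenew track(Token<?> token, DelegationTokenToRenew candidate) {
  DelegationTokenToRenew existing = allTokens.putIfAbsent(token, candidate);
  if (existing != null) {
    return existing;                  // reuse the shared instance, no new timer
  }
  scheduleRenewTimer(candidate);      // hypothetical helper; first insert only
  return candidate;
}
{code}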

 The token is not renewed properly if it's shared by jobs (oozie) in 
 DelegationTokenRenewer
 --

 Key: YARN-3055
 URL: https://issues.apache.org/jira/browse/YARN-3055
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Reporter: Yi Liu
Assignee: Daryn Sharp
Priority: Blocker
 Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch


 After YARN-2964, there is only one timer to renew the token if it's shared by 
 jobs. 
 In {{removeApplicationFromRenewal}}, when going to remove a token, and the 
 token is shared by other jobs, we will not cancel the token. 
 Meanwhile, we should not cancel the _timerTask_, and we should not remove it 
 from {{allTokens}}. Otherwise the existing submitted applications which 
 share this token will not get renewed any more, and for newly submitted 
 applications which share this token, the token will be renewed immediately.
 For example, we have 3 applications: app1, app2, app3. And they share 
 token1. See the following scenario:
 *1).* app1 is submitted first, then app2, and then app3. In this case, 
 there is only one token renewal timer for token1, and it is scheduled when app1 
 is submitted.
 *2).* app1 is finished, then the renewal timer is cancelled. token1 will not 
 be renewed any more, but app2 and app3 still use it, so there is a problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer

2015-04-09 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-3055:
--
Attachment: YARN-3055.patch

 The token is not renewed properly if it's shared by jobs (oozie) in 
 DelegationTokenRenewer
 --

 Key: YARN-3055
 URL: https://issues.apache.org/jira/browse/YARN-3055
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Reporter: Yi Liu
Assignee: Daryn Sharp
Priority: Blocker
 Attachments: YARN-3055.001.patch, YARN-3055.002.patch, 
 YARN-3055.patch, YARN-3055.patch


 After YARN-2964, there is only one timer to renew the token if it's shared by 
 jobs. 
 In {{removeApplicationFromRenewal}}, when going to remove a token, and the 
 token is shared by other jobs, we will not cancel the token. 
 Meanwhile, we should not cancel the _timerTask_, and we should not remove it 
 from {{allTokens}}. Otherwise the existing submitted applications which 
 share this token will not get renewed any more, and for newly submitted 
 applications which share this token, the token will be renewed immediately.
 For example, we have 3 applications: app1, app2, app3. And they share 
 token1. See the following scenario:
 *1).* app1 is submitted first, then app2, and then app3. In this case, 
 there is only one token renewal timer for token1, and it is scheduled when app1 
 is submitted.
 *2).* app1 is finished, then the renewal timer is cancelled. token1 will not 
 be renewed any more, but app2 and app3 still use it, so there is a problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer

2015-04-09 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487482#comment-14487482
 ] 

Daryn Sharp commented on YARN-3055:
---

Thanks Vinod, I'll revise this morning.  The ignores shouldn't be there.  I did 
that for our internal emergency fix because I didn't handle proxy refresh 
tokens, so I didn't care that the tests failed.

 The token is not renewed properly if it's shared by jobs (oozie) in 
 DelegationTokenRenewer
 --

 Key: YARN-3055
 URL: https://issues.apache.org/jira/browse/YARN-3055
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Reporter: Yi Liu
Assignee: Daryn Sharp
Priority: Blocker
 Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch


 After YARN-2964, there is only one timer to renew the token if it's shared by 
 jobs. 
 In {{removeApplicationFromRenewal}}, when going to remove a token, and the 
 token is shared by other jobs, we will not cancel the token. 
 Meanwhile, we should not cancel the _timerTask_, and we should not remove it 
 from {{allTokens}}. Otherwise the existing submitted applications which 
 share this token will not get renewed any more, and for newly submitted 
 applications which share this token, the token will be renewed immediately.
 For example, we have 3 applications: app1, app2, app3. And they share 
 token1. See the following scenario:
 *1).* app1 is submitted first, then app2, and then app3. In this case, 
 there is only one token renewal timer for token1, and it is scheduled when app1 
 is submitted.
 *2).* app1 is finished, then the renewal timer is cancelled. token1 will not 
 be renewed any more, but app2 and app3 still use it, so there is a problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer

2015-04-08 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-3055:
--
Attachment: YARN-3055.patch

Haven't had a chance to run findbugs.  It might grumble about synchronization on 
dttr.applicationIds.  Will check this afternoon.

 The token is not renewed properly if it's shared by jobs (oozie) in 
 DelegationTokenRenewer
 --

 Key: YARN-3055
 URL: https://issues.apache.org/jira/browse/YARN-3055
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Reporter: Yi Liu
Assignee: Yi Liu
Priority: Blocker
 Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch


 After YARN-2964, there is only one timer to renew the token if it's shared by 
 jobs. 
 In {{removeApplicationFromRenewal}}, when going to remove a token, and the 
 token is shared by other jobs, we will not cancel the token. 
 Meanwhile, we should not cancel the _timerTask_, and we should not remove it 
 from {{allTokens}}. Otherwise the existing submitted applications which 
 share this token will not get renewed any more, and for newly submitted 
 applications which share this token, the token will be renewed immediately.
 For example, we have 3 applications: app1, app2, app3. And they share 
 token1. See the following scenario:
 *1).* app1 is submitted first, then app2, and then app3. In this case, 
 there is only one token renewal timer for token1, and it is scheduled when app1 
 is submitted.
 *2).* app1 is finished, then the renewal timer is cancelled. token1 will not 
 be renewed any more, but app2 and app3 still use it, so there is a problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer

2015-04-08 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485461#comment-14485461
 ] 

Daryn Sharp commented on YARN-3055:
---

bq. It does seem odd to get the expiration date by renewing the token

The expiration is metadata associated with the token that is only known to the 
token issuer's secret manager.  The correct fix is for the renewer to not 
reschedule if the next expiration is the same as the last.  The bug wasn't a 
real priority when tokens weren't renewed forever.  If we regress to renewing 
forever, then it does become a problem.

bq.   I think currently the sub-job won't kill the overall workflow.

Correct, I misread in my haste.  It's rather the opposite:  sub-jobs can 
override the original job's request to cancel the tokens.

bq. I think overall the current patch will work, other than few comments I have.

It works but not in a desirable way.  Jason posted my patch (the one we use 
internally) on YARN-3439, which is duped to this jira.  I'm updating it to handle the proxy 
refresh cases and will post shortly.  The current semantics of the conf setting 
and the 2.x changes have been nothing but production blockers.  Ref counting 
will solve this once and for all.
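
Ref counting here means roughly the following (sketch with invented names, not the actual patch):
{code}
// Track which apps reference each token; only when the last referencing app
// finishes does the renewer stop renewing (and optionally cancel).
class TrackedTokenSketch {
  final java.util.Set<ApplicationId> referencingApps =
      java.util.concurrent.ConcurrentHashMap.newKeySet();
  volatile boolean cancelAtEnd;        // decided by the first submitting job
}

void appFinished(ApplicationId appId, TrackedTokenSketch t) {
  t.referencingApps.remove(appId);
  if (t.referencingApps.isEmpty()) {
    stopRenewTimer(t);                 // hypothetical helper
    if (t.cancelAtEnd) {
      cancelToken(t);                  // hypothetical helper
    }
  }
}
{code}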


 The token is not renewed properly if it's shared by jobs (oozie) in 
 DelegationTokenRenewer
 --

 Key: YARN-3055
 URL: https://issues.apache.org/jira/browse/YARN-3055
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Reporter: Yi Liu
Assignee: Yi Liu
Priority: Blocker
 Attachments: YARN-3055.001.patch, YARN-3055.002.patch


 After YARN-2964, there is only one timer to renew the token if it's shared by 
 jobs. 
 In {{removeApplicationFromRenewal}}, when going to remove a token, and the 
 token is shared by other jobs, we will not cancel the token. 
 Meanwhile, we should not cancel the _timerTask_, and we should not remove it 
 from {{allTokens}}. Otherwise the existing submitted applications which 
 share this token will not get renewed any more, and for newly submitted 
 applications which share this token, the token will be renewed immediately.
 For example, we have 3 applications: app1, app2, app3. And they share 
 token1. See the following scenario:
 *1).* app1 is submitted first, then app2, and then app3. In this case, 
 there is only one token renewal timer for token1, and it is scheduled when app1 
 is submitted.
 *2).* app1 is finished, then the renewal timer is cancelled. token1 will not 
 be renewed any more, but app2 and app3 still use it, so there is a problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer

2015-04-08 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485890#comment-14485890
 ] 

Daryn Sharp commented on YARN-3055:
---

The renew at job submission isn't the problem.  It's actually very desirable.  
Years back, a job submitted with bad tokens - that was destined to fail - would 
be launched anyway.  The tasks failed to connect, ipc level retries occurred, 
then higher level retries occurred, and yarn generally caught all exceptions 
and retried.  Tasks were retried, perhaps the app attempt was retried, etc.  In 
the end, a job that _clearly was going to fail_ might tie up cluster resources 
for 20+ minutes.  Why was it launched when a failed renew could have prevented 
the problem?  Not to mention the renewer was hardcoded to assume the expiration 
interval was 24h...  So much for being able to stress test the renewer with 1m 
expirations.

The potential DOS problem is when a token has reached end of life expiration.  
Let's say the token can be renewed twice.  The third and subsequent renews 
return the same expiration.
# t1 = submit + renew
# t2 = t1 + renew
# t3 = t2
# t4 = t2

The renew timers fire at 90% of the delta between now and the next expiration.  So 
as the end-of-life expiration approaches, the timer fires with increasing 
frequency.  50 threads doing that virtually non-stop would not be pretty.  The 
solution is to stop renewing when the next expiration equals the last expiration.  
That can be addressed in another jira; it's not a blocker because if tokens 
aren't renewed forever then it's a rare situation.
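
The suggested fix can be sketched as follows (illustration only; the scheduling helper is hypothetical):
{code}
// Schedule the next renew at 90% of the remaining lifetime, and stop
// rescheduling once the returned expiration no longer advances (the wall).
private long lastExpiration = 0L;

void onRenewed(long newExpiration, long now) {
  if (newExpiration <= lastExpiration) {
    return;                           // hit max lifetime; do not reschedule
  }
  lastExpiration = newExpiration;
  long delay = (long) ((newExpiration - now) * 0.90);
  scheduleRenewAfter(delay);          // hypothetical helper
}
{code}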

 The token is not renewed properly if it's shared by jobs (oozie) in 
 DelegationTokenRenewer
 --

 Key: YARN-3055
 URL: https://issues.apache.org/jira/browse/YARN-3055
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Reporter: Yi Liu
Assignee: Yi Liu
Priority: Blocker
 Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch


 After YARN-2964, there is only one timer to renew the token if it's shared by 
 jobs. 
 In {{removeApplicationFromRenewal}}, when going to remove a token, and the 
 token is shared by other jobs, we will not cancel the token. 
 Meanwhile, we should not cancel the _timerTask_, and we should not remove it 
 from {{allTokens}}. Otherwise the existing submitted applications which 
 share this token will not get renewed any more, and for newly submitted 
 applications which share this token, the token will be renewed immediately.
 For example, we have 3 applications: app1, app2, app3. And they share 
 token1. See the following scenario:
 *1).* app1 is submitted first, then app2, and then app3. In this case, 
 there is only one token renewal timer for token1, and it is scheduled when app1 
 is submitted.
 *2).* app1 is finished, then the renewal timer is cancelled. token1 will not 
 be renewed any more, but app2 and app3 still use it, so there is a problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer

2015-04-08 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486104#comment-14486104
 ] 

Daryn Sharp commented on YARN-3055:
---

I believe you are describing the behavior of 2.6's new proxy token refresh 
feature.  I won't digress into how broken it appears to be except in the 
simplest use case.  With it off, there is no fetching a new token.

 The token is not renewed properly if it's shared by jobs (oozie) in 
 DelegationTokenRenewer
 --

 Key: YARN-3055
 URL: https://issues.apache.org/jira/browse/YARN-3055
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Reporter: Yi Liu
Assignee: Yi Liu
Priority: Blocker
 Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch


 After YARN-2964, there is only one timer to renew the token if it's shared by 
 jobs. 
 In {{removeApplicationFromRenewal}}, when going to remove a token, and the 
 token is shared by other jobs, we will not cancel the token. 
 Meanwhile, we should not cancel the _timerTask_, and we should not remove it 
 from {{allTokens}}. Otherwise the existing submitted applications which 
 share this token will not get renewed any more, and for newly submitted 
 applications which share this token, the token will be renewed immediately.
 For example, we have 3 applications: app1, app2, app3. And they share 
 token1. See the following scenario:
 *1).* app1 is submitted first, then app2, and then app3. In this case, 
 there is only one token renewal timer for token1, and it is scheduled when app1 
 is submitted.
 *2).* app1 is finished, then the renewal timer is cancelled. token1 will not 
 be renewed any more, but app2 and app3 still use it, so there is a problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer

2015-04-07 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484117#comment-14484117
 ] 

Daryn Sharp commented on YARN-3055:
---

On cursory glance, are you sure this isn't going to leak tokens?  Ie. does it 
remove tokens from data structures in all cases or can a token get left in 
allTokens?

 The token is not renewed properly if it's shared by jobs (oozie) in 
 DelegationTokenRenewer
 --

 Key: YARN-3055
 URL: https://issues.apache.org/jira/browse/YARN-3055
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Reporter: Yi Liu
Assignee: Yi Liu
Priority: Blocker
 Attachments: YARN-3055.001.patch, YARN-3055.002.patch


 After YARN-2964, there is only one timer to renew the token if it's shared by 
 jobs. 
 In {{removeApplicationFromRenewal}}, when going to remove a token, and the 
 token is shared by other jobs, we will not cancel the token. 
 Meanwhile, we should not cancel the _timerTask_, and we should not remove it 
 from {{allTokens}}. Otherwise the existing submitted applications which 
 share this token will not get renewed any more, and for newly submitted 
 applications which share this token, the token will be renewed immediately.
 For example, we have 3 applications: app1, app2, app3. And they share 
 token1. See the following scenario:
 *1).* app1 is submitted first, then app2, and then app3. In this case, 
 there is only one token renewal timer for token1, and it is scheduled when app1 
 is submitted.
 *2).* app1 is finished, then the renewal timer is cancelled. token1 will not 
 be renewed any more, but app2 and app3 still use it, so there is a problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer

2015-04-07 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484201#comment-14484201
 ] 

Daryn Sharp commented on YARN-3055:
---

This appears to go back to the really old days of renewing the token for its 
entire lifetime.  Most unfortunate.

The renewer looks like it may turn into a DOS weapon.  Renewing a token returns 
the next expiration.  The renewer uses a timer to renew at 90% of the time remaining 
before expiration.  After the last renewal, the same expiration (the wall) will be 
returned as before.  Scheduling at 90% of the shrinking delta to the wall eventually 
becomes rapid-fire renewal.  There's an army of 50 threads prepared to fire concurrently.

My other concern is that it used to be the first job submitted with a given 
token that determined if the token is to be cancelled.  Now any job can 
influence the cancelling.  This patch didn't specifically break that behavior, 
but the original YARN-2704 did, which precipitated YARN-2964 to break it 
differently, and now this jira.

The ramification is we used to tell users to make sure the first job set the 
conf correctly, and essentially don't worry after that.  Now they do have to 
worry.  Any sub-job with the default of canceling tokens will kill the overall 
workflow.  Sub-jobs should not have jurisdiction over the tokens.

 The token is not renewed properly if it's shared by jobs (oozie) in 
 DelegationTokenRenewer
 --

 Key: YARN-3055
 URL: https://issues.apache.org/jira/browse/YARN-3055
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Reporter: Yi Liu
Assignee: Yi Liu
Priority: Blocker
 Attachments: YARN-3055.001.patch, YARN-3055.002.patch


 After YARN-2964, there is only one timer to renew the token if it's shared by 
 jobs. 
 In {{removeApplicationFromRenewal}}, when going to remove a token, and the 
 token is shared by other jobs, we will not cancel the token. 
 Meanwhile, we should not cancel the _timerTask_, and we should not remove it 
 from {{allTokens}}. Otherwise the existing submitted applications which 
 share this token will not get renewed any more, and for newly submitted 
 applications which share this token, the token will be renewed immediately.
 For example, we have 3 applications: app1, app2, app3. And they share 
 token1. See the following scenario:
 *1).* app1 is submitted first, then app2, and then app3. In this case, 
 there is only one token renewal timer for token1, and it is scheduled when app1 
 is submitted.
 *2).* app1 is finished, then the renewal timer is cancelled. token1 will not 
 be renewed any more, but app2 and app3 still use it, so there is a problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer

2015-04-06 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481225#comment-14481225
 ] 

Daryn Sharp commented on YARN-3055:
---

Correctly handling the don't-cancel setting for jobs that submit jobs has been 
a recurring issue.  We're internally testing a small patch to continue renewing 
until all jobs using the token(s) have finished.  Handling the auto-fetch of 
proxy tokens proved a bit more difficult so I need to complete the internal 
patch.  I can take this over or post a partial patch if [~hitliuyi] would like 
to finish it.

 The token is not renewed properly if it's shared by jobs (oozie) in 
 DelegationTokenRenewer
 --

 Key: YARN-3055
 URL: https://issues.apache.org/jira/browse/YARN-3055
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Reporter: Yi Liu
Assignee: Yi Liu
Priority: Blocker
 Attachments: YARN-3055.001.patch, YARN-3055.002.patch


 After YARN-2964, there is only one timer to renew the token if it's shared by 
 jobs. 
 In {{removeApplicationFromRenewal}}, when going to remove a token, and the 
 token is shared by other jobs, we will not cancel the token. 
 Meanwhile, we should not cancel the _timerTask_, and we should not remove it 
 from {{allTokens}}. Otherwise the existing submitted applications which 
 share this token will not get renewed any more, and for newly submitted 
 applications which share this token, the token will be renewed immediately.
 For example, we have 3 applications: app1, app2, app3. And they share 
 token1. See the following scenario:
 *1).* app1 is submitted first, then app2, and then app3. In this case, 
 there is only one token renewal timer for token1, and it is scheduled when app1 
 is submitted.
 *2).* app1 is finished, then the renewal timer is cancelled. token1 will not 
 be renewed any more, but app2 and app3 still use it, so there is a problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2971) RM uses conf instead of token service address to renew timeline delegation tokens

2015-02-09 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313090#comment-14313090
 ] 

Daryn Sharp commented on YARN-2971:
---

+1 Looks good

 RM uses conf instead of token service address to renew timeline delegation 
 tokens
 -

 Key: YARN-2971
 URL: https://issues.apache.org/jira/browse/YARN-2971
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Attachments: YARN-2971-v1.patch, YARN-2971-v2.patch


 The TimelineClientImpl renewDelegationToken uses the incorrect webaddress to 
 renew Timeline DelegationTokens. It should read the service address out of 
 the token to renew the delegation token.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2971) RM uses conf instead of token service address to renew timeline delegation tokens

2015-01-27 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293559#comment-14293559
 ] 

Daryn Sharp commented on YARN-2971:
---

I think it's because the job cannot assume that the timeline server matches the 
cluster's config.  I think the patch looks fine other than it should be using 
{{SecurityUtil.getTokenServiceAddr}} instead of directly accessing the token 
service.
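
That is, derive the renewal address from the token itself, something like the following (sketch; {{timelineToken}} stands in for whatever token is being renewed):
{code}
// Read the service address out of the token instead of trusting the local
// config to know where the timeline server lives.
InetSocketAddress addr = SecurityUtil.getTokenServiceAddr(timelineToken);
String renewalAddress = addr.getHostName() + ":" + addr.getPort();
{code}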

 RM uses conf instead of token service address to renew timeline delegation 
 tokens
 -

 Key: YARN-2971
 URL: https://issues.apache.org/jira/browse/YARN-2971
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Attachments: YARN-2971-v1.patch


 The TimelineClientImpl renewDelegationToken uses the incorrect webaddress to 
 renew Timeline DelegationTokens. It should read the service address out of 
 the token to renew the delegation token.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)

2014-12-15 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247400#comment-14247400
 ] 

Daryn Sharp commented on YARN-2964:
---

[~vinodkv], can you take a look at this?

 RM prematurely cancels tokens for jobs that submit jobs (oozie)
 ---

 Key: YARN-2964
 URL: https://issues.apache.org/jira/browse/YARN-2964
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Daryn Sharp
Priority: Critical

 The RM used to globally track the unique set of tokens for all apps.  It 
 remembered the first job that was submitted with the token.  The first job 
 controlled the cancellation of the token.  This prevented completion of 
 sub-jobs from canceling tokens used by the main job.
 As of YARN-2704, the RM now tracks tokens on a per-app basis.  There is no 
 notion of the first/main job.  This results in sub-jobs canceling tokens and 
 failing the main job and other sub-jobs.  It also appears to schedule 
 multiple redundant renewals.
 The issue is not immediately obvious because the RM will cancel tokens ~10 
 min (NM livelyness interval) after log aggregation completes.  The result is 
 an oozie job, ex. pig, that will launch many sub-jobs over time will fail if 
 any sub-jobs are launched 10 min after any sub-job completes.  If all other 
 sub-jobs complete within that 10 min window, then the issue goes unnoticed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1915) ClientToAMTokenMasterKey should be provided to AM at launch time

2014-08-13 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095471#comment-14095471
 ] 

Daryn Sharp commented on YARN-1915:
---

+1 But since I had a hand in the design, we should get a 2nd vote.

 ClientToAMTokenMasterKey should be provided to AM at launch time
 

 Key: YARN-1915
 URL: https://issues.apache.org/jira/browse/YARN-1915
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Hitesh Shah
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-1915.patch, YARN-1915v2.patch


 Currently, the AM receives the key as part of registration. This introduces a 
 race where a client can connect to the AM when the AM has not received the 
 key. 
 Current Flow:
 1) AM needs to start the client listening service in order to get host:port 
 and send it to the RM as part of registration
 2) RM gets the port info in register() and transitions the app to RUNNING. 
 Responds back with client secret to AM.
 3) User asks RM for client token. Gets it and pings the AM. AM hasn't 
 received client secret from RM and so RPC itself rejects the request.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1915) ClientToAMTokenMasterKey should be provided to AM at launch time

2014-08-13 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095820#comment-14095820
 ] 

Daryn Sharp commented on YARN-1915:
---

I suspect it's because it removes the burden from the AM to strip the secret 
from the credentials so it doesn't leak to other processes.

 ClientToAMTokenMasterKey should be provided to AM at launch time
 

 Key: YARN-1915
 URL: https://issues.apache.org/jira/browse/YARN-1915
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Hitesh Shah
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-1915.patch, YARN-1915v2.patch


 Currently, the AM receives the key as part of registration. This introduces a 
 race where a client can connect to the AM when the AM has not received the 
 key. 
 Current Flow:
 1) AM needs to start the client listening service in order to get host:port 
 and send it to the RM as part of registration
 2) RM gets the port info in register() and transitions the app to RUNNING. 
 Responds back with client secret to AM.
 3) User asks RM for client token. Gets it and pings the AM. AM hasn't 
 received client secret from RM and so RPC itself rejects the request.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1915) ClientToAMTokenMasterKey should be provided to AM at launch time

2014-08-13 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096319#comment-14096319
 ] 

Daryn Sharp commented on YARN-1915:
---

Yes, I thought the ugi mangling was gone, but the AMRMToken is indeed manually 
removed.  I'm assuming there was a valid reason why the secret is passed in the 
registration response, perhaps for future functionality.

Rather than second guess how/why it's done this way, I'd prefer to focus on a 
small immediate fix for this very tight race condition.  The AM should 
generally receive the registration response before a client can ask the RM 
where the AM is and try to connect.  Could we file another jira to contemplate 
an incompatible change?

 ClientToAMTokenMasterKey should be provided to AM at launch time
 

 Key: YARN-1915
 URL: https://issues.apache.org/jira/browse/YARN-1915
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Hitesh Shah
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-1915.patch, YARN-1915v2.patch


 Currently, the AM receives the key as part of registration. This introduces a 
 race where a client can connect to the AM when the AM has not received the 
 key. 
 Current Flow:
 1) AM needs to start the client listening service in order to get host:port 
 and send it to the RM as part of registration
 2) RM gets the port info in register() and transitions the app to RUNNING. 
 Responds back with client secret to AM.
 3) User asks RM for client token. Gets it and pings the AM. AM hasn't 
 received client secret from RM and so RPC itself rejects the request.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2147) client lacks delegation token exception details when application submit fails

2014-06-30 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047946#comment-14047946
 ] 

Daryn Sharp commented on YARN-2147:
---

Code looks fine.  Currently the test verifies the stringified token is in the 
exception's message.  However, since the mock is throwing an exception that 
explicitly contains the stringified token, we don't know whether the code 
change is actually catching the failure and adding the token.  The mock should 
throw a generic message, say "boom", then check the caught exception against 
something like "Failed to renew token: <token>: boom".
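
Something along these lines would do it (a self-contained sketch of the assertion shape only; the stand-in method below substitutes for the real submit path and mock):

{code}
import static org.junit.Assert.assertTrue;
import static org.junit.Assert.fail;

import java.io.IOException;
import org.junit.Test;

public class TokenRenewFailureMessageSketch {

  // Stand-in for the code under test: wraps a renewal failure with context.
  private void renewOrWrap() throws IOException {
    try {
      throw new IOException("boom");   // what the mock renewal would throw
    } catch (IOException e) {
      throw new IOException("Failed to renew token: <token>: " + e.getMessage(), e);
    }
  }

  @Test
  public void failureMessageCarriesTokenContext() {
    try {
      renewOrWrap();
      fail("renewal should have failed");
    } catch (IOException e) {
      // The generic "boom" can only appear together with the prefix if the
      // wrapping code added the token context itself.
      assertTrue(e.getMessage().contains("Failed to renew token"));
      assertTrue(e.getMessage().contains("boom"));
    }
  }
}
{code}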

 client lacks delegation token exception details when application submit fails
 -

 Key: YARN-2147
 URL: https://issues.apache.org/jira/browse/YARN-2147
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Jason Lowe
Assignee: Chen He
Priority: Minor
 Attachments: YARN-2147-v2.patch, YARN-2147-v3.patch, 
 YARN-2147-v4.patch, YARN-2147.patch


 When an client submits an application and the delegation token process fails 
 the client can lack critical details needed to understand the nature of the 
 error.  Only the message of the error exception is conveyed to the client, 
 which sometimes isn't enough to debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2147) client lacks delegation token exception details when application submit fails

2014-06-17 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034612#comment-14034612
 ] 

Daryn Sharp commented on YARN-2147:
---

I don't think the patch handles the use case it's designed for.  If job 
submission failed with a bland "Read timed out", then logging all the tokens in 
the RM log doesn't help the end user, nor does the RM log even answer the 
question of which token timed out.

What you really want to do is change 
{{DelegationTokenRenewer#handleAppSubmitEvent}} to trap exceptions from 
{{renewToken}}.  Wrap the exception with a more descriptive exception that 
stringifies to the user as "Can't renew token blah: Read timed out".
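
Roughly this shape (a sketch only; it assumes the renewal ultimately calls {{Token#renew}}, and the class/method names below are placeholders, not the actual {{DelegationTokenRenewer}} code):

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.token.Token;

// Placeholder sketch: trap the renewal failure and rethrow it with the
// offending token in the message so the submitter sees which token failed.
class RenewWithContext {
  static long renew(Token<?> token, Configuration conf)
      throws IOException, InterruptedException {
    try {
      return token.renew(conf);
    } catch (IOException e) {
      throw new IOException("Can't renew token " + token + ": " + e.getMessage(), e);
    }
  }
}
{code}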

 client lacks delegation token exception details when application submit fails
 -

 Key: YARN-2147
 URL: https://issues.apache.org/jira/browse/YARN-2147
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Jason Lowe
Assignee: Chen He
Priority: Minor
 Attachments: YARN-2147-v2.patch, YARN-2147.patch


 When an client submits an application and the delegation token process fails 
 the client can lack critical details needed to understand the nature of the 
 error.  Only the message of the error exception is conveyed to the client, 
 which sometimes isn't enough to debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2156) ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN as security configuration

2014-06-13 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14030639#comment-14030639
 ] 

Daryn Sharp commented on YARN-2156:
---

A warning doesn't make sense because it implies there is something you should 
change.  There's not.  The config setting, whether explicitly set or not, is 
entirely irrelevant.  By design, yarn always uses tokens and these tokens carry 
essential information that is not otherwise obtainable for non-token 
authenticated connections.  That's why token authentication is explicitly set.

 ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN 
 as security configuration
 ---

 Key: YARN-2156
 URL: https://issues.apache.org/jira/browse/YARN-2156
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Svetozar Ivanov

 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService#serviceStart()
  method has mistakenly hardcoded AuthMethod.TOKEN as Hadoop security 
 authentication. 
 It looks like that:
 {code}
 @Override
 protected void serviceStart() throws Exception {
   Configuration conf = getConfig();
   YarnRPC rpc = YarnRPC.create(conf);
   InetSocketAddress masterServiceAddress = conf.getSocketAddr(
       YarnConfiguration.RM_SCHEDULER_ADDRESS,
       YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS,
       YarnConfiguration.DEFAULT_RM_SCHEDULER_PORT);
   Configuration serverConf = conf;
   // If the auth is not-simple, enforce it to be token-based.
   serverConf = new Configuration(conf);
   serverConf.set(
       CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION,
       SaslRpcServer.AuthMethod.TOKEN.toString());

   ...
 }
 {code}
 Obviously such code makes sense only if 
 CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION config setting 
 is missing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2156) ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN as security configuration

2014-06-12 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14029913#comment-14029913
 ] 

Daryn Sharp commented on YARN-2156:
---

Yes, this is by design.  Yarn uses tokens regardless of your security setting.

 ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN 
 as security configuration
 ---

 Key: YARN-2156
 URL: https://issues.apache.org/jira/browse/YARN-2156
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Svetozar Ivanov

 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService#serviceStart()
  method has mistakenly hardcoded AuthMethod.TOKEN as Hadoop security 
 authentication. 
 It looks like that:
 {code}
 @Override
 protected void serviceStart() throws Exception {
   Configuration conf = getConfig();
   YarnRPC rpc = YarnRPC.create(conf);
   InetSocketAddress masterServiceAddress = conf.getSocketAddr(
       YarnConfiguration.RM_SCHEDULER_ADDRESS,
       YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS,
       YarnConfiguration.DEFAULT_RM_SCHEDULER_PORT);
   Configuration serverConf = conf;
   // If the auth is not-simple, enforce it to be token-based.
   serverConf = new Configuration(conf);
   serverConf.set(
       CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION,
       SaslRpcServer.AuthMethod.TOKEN.toString());

   ...
 }
 {code}
 Obviously such code makes sense only if 
 CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION config setting 
 is missing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-2156) ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN as security configuration

2014-06-12 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp resolved YARN-2156.
---

Resolution: Not a Problem

 ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN 
 as security configuration
 ---

 Key: YARN-2156
 URL: https://issues.apache.org/jira/browse/YARN-2156
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Svetozar Ivanov

 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService#serviceStart()
  method has mistakenly hardcoded AuthMethod.TOKEN as Hadoop security 
 authentication. 
 It looks like that:
 {code}
 @Override
 protected void serviceStart() throws Exception {
   Configuration conf = getConfig();
   YarnRPC rpc = YarnRPC.create(conf);
   InetSocketAddress masterServiceAddress = conf.getSocketAddr(
       YarnConfiguration.RM_SCHEDULER_ADDRESS,
       YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS,
       YarnConfiguration.DEFAULT_RM_SCHEDULER_PORT);
   Configuration serverConf = conf;
   // If the auth is not-simple, enforce it to be token-based.
   serverConf = new Configuration(conf);
   serverConf.set(
       CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION,
       SaslRpcServer.AuthMethod.TOKEN.toString());

   ...
 }
 {code}
 Obviously such code makes sense only if 
 CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION config setting 
 is missing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1841) YARN ignores/overrides explicit security settings

2014-03-18 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13939272#comment-13939272
 ] 

Daryn Sharp commented on YARN-1841:
---

bq. The fact that it behaves differently once invoked from the AM vs. just a 
simple API call to a remote cluster is what I am questioning.

This should work.  What issue are you encountering when talking to a remote 
service?

bq.  ... I understand that this really should not work (I don't even have an 
app deployed at the time of invocation of this code) ...

You answered your own question, but the good news is it's possible.  You are 
trying to emulate an unmanaged AM.  It's not possible to just register an AM 
w/o first requesting an app & app attempt id from the RM.  The subsequent 
registration will use an AMRM token that is issued by the RM.

 YARN ignores/overrides explicit security settings
 -

 Key: YARN-1841
 URL: https://issues.apache.org/jira/browse/YARN-1841
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Oleg Zhurakousky

 core-site.xml explicitly sets authentication as SIMPLE
 {code}
 <property>
   <name>hadoop.security.authentication</name>
   <value>simple</value>
   <description>Simple authentication</description>
 </property>
 {code}
 However any attempt to register ApplicationMaster on the remote YARN cluster 
 results in 
 {code}
 org.apache.hadoop.security.AccessControlException: SIMPLE authentication is 
 not enabled.  Available:[TOKEN]
 . . .
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1841) YARN ignores/overrides explicit security settings

2014-03-18 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13939353#comment-13939353
 ] 

Daryn Sharp commented on YARN-1841:
---

I thought you were having issues talking to other services like a NN.  As 
noted, trying to communicate directly with the AMRM service is an invalid use 
case.  By design you cannot talk to this service w/o a token issued by the RM.  
The RM must create the app id and app attempt for the AM prior to the AM 
registering.  I'd suggest leveraging the unmanaged AM.

 YARN ignores/overrides explicit security settings
 -

 Key: YARN-1841
 URL: https://issues.apache.org/jira/browse/YARN-1841
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Oleg Zhurakousky

 core-site.xml explicitly sets authentication as SIMPLE
 {code}
 <property>
   <name>hadoop.security.authentication</name>
   <value>simple</value>
   <description>Simple authentication</description>
 </property>
 {code}
 However any attempt to register ApplicationMaster on the remote YARN cluster 
 results in 
 {code}
 org.apache.hadoop.security.AccessControlException: SIMPLE authentication is 
 not enabled.  Available:[TOKEN]
 . . .
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1841) YARN ignores/overrides explicit security settings

2014-03-17 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13937933#comment-13937933
 ] 

Daryn Sharp commented on YARN-1841:
---

The reason the custom AM in the related user@hadoop thread is failing is likely 
because it's coded incorrectly.  I suspect the RM supplied tokens were not 
added to the AM's ugi.

In general, tokens are just a lightweight alternate authentication method that 
removes the need for hard authentication, e.g. kerberos, which a task cannot 
perform.  Tokens within yarn are used to encode app/task identity and other 
information.  Note that this identity is not the job's user identity, so tokens 
cannot be disabled.

This jira should be marked invalid if Vinod agrees.

 YARN ignores/overrides explicit security settings
 -

 Key: YARN-1841
 URL: https://issues.apache.org/jira/browse/YARN-1841
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Oleg Zhurakousky

 core-site.xml explicitly sets authentication as SIMPLE
 {code}
 <property>
   <name>hadoop.security.authentication</name>
   <value>simple</value>
   <description>Simple authentication</description>
 </property>
 {code}
 However any attempt to register ApplicationMaster on the remote YARN cluster 
 results in 
 {code}
 org.apache.hadoop.security.AccessControlException: SIMPLE authentication is 
 not enabled.  Available:[TOKEN]
 . . .
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-1841) YARN ignores/overrides explicit security settings

2014-03-17 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp resolved YARN-1841.
---

Resolution: Not A Problem

Oleg, the authentication config setting specifies the _external authentication_ 
for client-visible services, i.e. the NN, RM, etc.  The _internal 
authentication_ within the yarn framework is an implementation detail 
independent of the config auth method.  Yarn does not need to log a warning or 
exception for its internal design.

I think you are naively looking at this from the viewpoint of simple auth.  
Consider kerberos auth.  The AM, NM, tasks, etc. cannot use kerberos to 
authenticate.  Even if they could, the token is used to securely sign and 
transport tamper-resistant values.  Always using tokens prevents the dreaded 
"why does this AM/etc. break with security enabled?" question.  After using 
the configured auth for job submission, the code path within yarn is common 
and the internal auth is of no concern to the user.

There is no design problem; the api transparently relies on the token and rpc 
layers meshing to securely transport the identity and resource requirements 
between processes, whether simple or kerberos auth is configured.

Feel free to ask Vinod or me questions offline to come up to speed on hadoop & 
yarn's security.

 YARN ignores/overrides explicit security settings
 -

 Key: YARN-1841
 URL: https://issues.apache.org/jira/browse/YARN-1841
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Oleg Zhurakousky

 core-site.xml explicitly sets authentication as SIMPLE
 {code}
 <property>
   <name>hadoop.security.authentication</name>
   <value>simple</value>
   <description>Simple authentication</description>
 </property>
 {code}
 However any attempt to register ApplicationMaster on the remote YARN cluster 
 results in 
 {code}
 org.apache.hadoop.security.AccessControlException: SIMPLE authentication is 
 not enabled.  Available:[TOKEN]
 . . .
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1628) TestContainerManagerSecurity fails on trunk

2014-01-29 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13885677#comment-13885677
 ] 

Daryn Sharp commented on YARN-1628:
---

+1.  Will check in later today.  Thanks!

 TestContainerManagerSecurity fails on trunk
 ---

 Key: YARN-1628
 URL: https://issues.apache.org/jira/browse/YARN-1628
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.2.0
Reporter: Mit Desai
Assignee: Mit Desai
 Attachments: YARN-1628.patch


 The Test fails with the following error
 {noformat}
 java.lang.IllegalArgumentException: java.net.UnknownHostException: InvalidHost
   at 
 org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
   at 
 org.apache.hadoop.yarn.server.security.BaseNMTokenSecretManager.newInstance(BaseNMTokenSecretManager.java:145)
   at 
 org.apache.hadoop.yarn.server.security.BaseNMTokenSecretManager.createNMToken(BaseNMTokenSecretManager.java:136)
   at 
 org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:253)
   at 
 org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:144)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Assigned] (YARN-691) Invalid NaN values in Hadoop REST API JSON response

2013-11-07 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp reassigned YARN-691:


Assignee: Daryn Sharp

 Invalid NaN values in Hadoop REST API JSON response
 ---

 Key: YARN-691
 URL: https://issues.apache.org/jira/browse/YARN-691
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 0.23.6, 2.0.4-alpha
Reporter: Kendall Thrapp
Assignee: Daryn Sharp

 I've been occasionally coming across instances where Hadoop's Cluster 
 Applications REST API 
 (http://hadoop.apache.org/docs/r0.23.6/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API)
  has returned JSON that PHP's json_decode function failed to parse.  I've 
 tracked the syntax error down to the presence of the unquoted word NaN 
 appearing as a value in the JSON.  For example:
 "progress":NaN,
 NaN is not part of the JSON spec, so its presence renders the whole JSON 
 string invalid.  Hadoop needs to return something other than NaN in this case 
 -- perhaps an empty string or the quoted string "NaN".
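
One way to keep NaN out of the serialized JSON (illustrative only, not the committed fix) is to sanitize the value before it reaches the JSON layer:

{code}
// Illustrative guard: report 0 instead of NaN when progress is undefined.
class ProgressSanitizer {
  static float sanitize(float progress) {
    return Float.isNaN(progress) ? 0.0f : progress;
  }
}
{code}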



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-691) Invalid NaN values in Hadoop REST API JSON response

2013-11-07 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-691:
-

Assignee: (was: Daryn Sharp)

 Invalid NaN values in Hadoop REST API JSON response
 ---

 Key: YARN-691
 URL: https://issues.apache.org/jira/browse/YARN-691
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 0.23.6, 2.0.4-alpha
Reporter: Kendall Thrapp

 I've been occasionally coming across instances where Hadoop's Cluster 
 Applications REST API 
 (http://hadoop.apache.org/docs/r0.23.6/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API)
  has returned JSON that PHP's json_decode function failed to parse.  I've 
 tracked the syntax error down to the presence of the unquoted word NaN 
 appearing as a value in the JSON.  For example:
 "progress":NaN,
 NaN is not part of the JSON spec, so its presence renders the whole JSON 
 string invalid.  Hadoop needs to return something other than NaN in this case 
 -- perhaps an empty string or the quoted string "NaN".



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-986) YARN should have a ClusterId/ServiceId

2013-09-25 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777523#comment-13777523
 ] 

Daryn Sharp commented on YARN-986:
--

This sounds like NN HA tokens, which IMHO are rather hacky.  I've been intending 
to take advantage of my RPCv9 auth changes so the server can tell the client 
the token service (or perhaps another field) it needs, which would decouple 
tokens entirely from IP/hostname.  Thoughts on this approach?

 YARN should have a ClusterId/ServiceId
 --

 Key: YARN-986
 URL: https://issues.apache.org/jira/browse/YARN-986
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Karthik Kambatla

 This needs to be done to support non-ip based fail over of RM. Once the 
 server sets the token service address to be this generic ClusterId/ServiceId, 
 clients can translate it to appropriate final IP and then be able to select 
 tokens via TokenSelectors.
 Some workarounds for other related issues were put in place at YARN-945.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished

2013-09-13 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766738#comment-13766738
 ] 

Daryn Sharp commented on YARN-1189:
---

Oops, I thought the .1 patch was the latest so I didn't see the test.

 NMTokenSecretManagerInNM is not being told when applications have finished 
 ---

 Key: YARN-1189
 URL: https://issues.apache.org/jira/browse/YARN-1189
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta, 2.1.1-beta
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
Priority: Blocker
 Attachments: YARN-1189-20130912.1.patch, YARN-1189-20130913.txt


 The {{appFinished}} method is not being called when applications have 
 finished.  This causes a couple of leaks as {{oldMasterKeys}} and 
 {{appToAppAttemptMap}} are never being pruned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished

2013-09-13 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766728#comment-13766728
 ] 

Daryn Sharp commented on YARN-1189:
---

+1, but a test, even a mock that spies on appFinished, would be great to avoid 
a regression.
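
For example, a spy along these lines (package, constructor, and setup details are best-effort guesses, not the committed test):

{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM;
import org.junit.Test;
import org.mockito.Mockito;

// Best-effort sketch: spy the secret manager and verify that the app-finish
// path actually reaches appFinished(), so a regression would fail the test.
public class AppFinishedSpySketch {
  @Test
  public void appFinishedIsCalledOnCompletion() {
    NMTokenSecretManagerInNM secretMgr = Mockito.spy(new NMTokenSecretManagerInNM());
    ApplicationId appId = ApplicationId.newInstance(1234L, 1);

    secretMgr.appFinished(appId);   // stand-in for driving the real NM flow
    Mockito.verify(secretMgr).appFinished(appId);
  }
}
{code}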

 NMTokenSecretManagerInNM is not being told when applications have finished 
 ---

 Key: YARN-1189
 URL: https://issues.apache.org/jira/browse/YARN-1189
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta, 2.1.1-beta
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
Priority: Blocker
 Attachments: YARN-1189-20130912.1.patch, YARN-1189-20130913.txt


 The {{appFinished}} method is not being called when applications have 
 finished.  This causes a couple of leaks as {{oldMasterKeys}} and 
 {{appToAppAttemptMap}} are never being pruned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-707) Add user info in the YARN ClientToken

2013-09-04 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13757904#comment-13757904
 ] 

Daryn Sharp commented on YARN-707:
--

Ugh, the RM and AM are abusing the same secret manager impl.  The RM wants the 
secret key to be generated, whereas the AM really wants to verify it.  2.x 
fixed this.

 Add user info in the YARN ClientToken
 -

 Key: YARN-707
 URL: https://issues.apache.org/jira/browse/YARN-707
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bikas Saha
Assignee: Jason Lowe
Priority: Blocker
 Fix For: 3.0.0, 2.1.1-beta

 Attachments: YARN-707-20130822.txt, YARN-707-20130827.txt, 
 YARN-707-20130828-2.txt, YARN-707-20130828.txt, YARN-707-20130829.txt, 
 YARN-707-20130830.branch-0.23.txt


 If user info is present in the client token then it can be used to do limited 
 authz in the AM.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-707) Add user info in the YARN ClientToken

2013-09-04 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13757890#comment-13757890
 ] 

Daryn Sharp commented on YARN-707:
--

Still reviewing, but an initial observation is that 
{{ClientToAMSecretManager#getMasterKey}} is fabricating a new secret key if 
there is no pre-existing key for the appId.  This should be an error condition. 
 The secret manager knows the secret key for the specific app, so there's no 
need to ever generate one, right?  Otherwise I can flood the AM with invalid 
appIds and make it go OOM from generating secret keys for them.
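
A sketch of the error-condition behavior being suggested (class and field names are placeholders, not the actual {{ClientToAMSecretManager}}):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.SecretKey;
import org.apache.hadoop.yarn.api.records.ApplicationId;

// Placeholder sketch: an unknown appId is an error, not a reason to mint a
// new key, so bogus appIds cannot make the AM allocate keys until it OOMs.
class StrictClientToAMKeys {
  private final Map<ApplicationId, SecretKey> masterKeys = new ConcurrentHashMap<>();

  SecretKey getMasterKey(ApplicationId appId) {
    SecretKey key = masterKeys.get(appId);
    if (key == null) {
      throw new IllegalArgumentException("No client-to-AM secret key for " + appId);
    }
    return key;
  }
}
{code}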

 Add user info in the YARN ClientToken
 -

 Key: YARN-707
 URL: https://issues.apache.org/jira/browse/YARN-707
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bikas Saha
Assignee: Jason Lowe
Priority: Blocker
 Fix For: 3.0.0, 2.1.1-beta

 Attachments: YARN-707-20130822.txt, YARN-707-20130827.txt, 
 YARN-707-20130828-2.txt, YARN-707-20130828.txt, YARN-707-20130829.txt, 
 YARN-707-20130830.branch-0.23.txt


 If user info is present in the client token then it can be used to do limited 
 authz in the AM.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-707) Add user info in the YARN ClientToken

2013-09-04 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13757926#comment-13757926
 ] 

Daryn Sharp commented on YARN-707:
--

Minor:
# {{ClientToAMTokenIdentifier#getUser()}} doesn't do a null check on the client 
name (because it can't be null) but should perhaps still check isEmpty()?
# Is {{ResourceManager#clientToAMSecretManager}} still needed now that it's in 
the context?
# Now that the client token is generated in {{RMAppAttemptImpl}} - should it 
contain the attemptId, not the appId?

 Add user info in the YARN ClientToken
 -

 Key: YARN-707
 URL: https://issues.apache.org/jira/browse/YARN-707
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bikas Saha
Assignee: Jason Lowe
Priority: Blocker
 Fix For: 3.0.0, 2.1.1-beta

 Attachments: YARN-707-20130822.txt, YARN-707-20130827.txt, 
 YARN-707-20130828-2.txt, YARN-707-20130828.txt, YARN-707-20130829.txt, 
 YARN-707-20130830.branch-0.23.txt


 If user info is present in the client token then it can be used to do limited 
 authz in the AM.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1146) RM DTSM and RMStateStore mismanage sequence number

2013-09-04 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13757983#comment-13757983
 ] 

Daryn Sharp commented on YARN-1146:
---

Note that bug #2 will not self-correct if the following sequence occurs:
# Issue token 1, 2, 3, 4 (seq=4)
# Renew token 2 (seq=2)
# Cancel token 3, 4 (seq=2)
# Stop RM
# Start RM (seq=2) and will issue token 3 and 4 again

The issue is _probably_ benign given the current implementation, but is a bug 
if anything relies on sequence number.
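
A hypothetical guard on the state-store side (placeholder class, not the actual {{RMStateStore}} code) would be to never persist a smaller sequence number than has already been stored:

{code}
// Hypothetical guard: renewing an older token must not roll the persisted
// sequence number back, otherwise a restarted RM can re-issue numbers that
// were already handed out (tokens 3 and 4 in the sequence above).
class SequenceNumberGuard {
  private int storedSeqNum = 0;

  synchronized void onTokenStoredOrUpdated(int tokenSeqNum) {
    storedSeqNum = Math.max(storedSeqNum, tokenSeqNum);
  }

  synchronized int currentSequenceNumber() {
    return storedSeqNum;
  }
}
{code}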

 RM DTSM and RMStateStore mismanage sequence number
 --

 Key: YARN-1146
 URL: https://issues.apache.org/jira/browse/YARN-1146
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.0-alpha
Reporter: Daryn Sharp

 {{RMDelegationTokenSecretManager}} implements {{storeNewToken}} and 
 {{updateStoredToken}} (renew) to pass the token and its sequence number to 
 {{RMStateStore#storeRMDelegationTokenAndSequenceNumber}}.
 There are two problems:
 # The assumption is that new tokens will be synchronously stored in-order.  
 With an async secret manager this may not hold true and the state's sequence 
 number may be incorrect.
 # A token renewal will reset the state's sequence number to _that token's_ 
 sequence number.
 Bug #2 is generally masked.  Creating a new token (with the first caveat) 
 will bump the state's sequence number back up.  Restoring the dtsm will first 
 set the state's stored sequence number, then re-add all the tokens which will 
 update the sequence number if the token's sequence number is greater than the 
 dtsm's current sequence number.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-1146) RM DTSM and RMStateStore mismanage sequence number

2013-09-04 Thread Daryn Sharp (JIRA)
Daryn Sharp created YARN-1146:
-

 Summary: RM DTSM and RMStateStore mismanage sequence number
 Key: YARN-1146
 URL: https://issues.apache.org/jira/browse/YARN-1146
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.0-alpha
Reporter: Daryn Sharp


{{RMDelegationTokenSecretManager}} implements {{storeNewToken}} and 
{{updateStoredToken}} (renew) to pass the token and its sequence number to 
{{RMStateStore#storeRMDelegationTokenAndSequenceNumber}}.

There are two problems:
# The assumption is that new tokens will be synchronously stored in-order.  
With an async secret manager this may not hold true and the state's sequence 
number may be incorrect.
# A token renewal will reset the state's sequence number to _that token's_ 
sequence number.

Bug #2 is generally masked.  Creating a new token (with the first caveat) will 
bump the state's sequence number back up.  Restoring the dtsm will first set 
the state's stored sequence number, then re-add all the tokens which will 
update the sequence number if the token's sequence number is greater than the 
dtsm's current sequence number.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1146) RM DTSM and RMStateStore mismanage sequence number

2013-09-04 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13758184#comment-13758184
 ] 

Daryn Sharp commented on YARN-1146:
---

[~vinodkv] I'm desynch'ing the ADTSM on HADOOP-9930.  Is it ok for me to 
exacerbate this seq number handling?

 RM DTSM and RMStateStore mismanage sequence number
 --

 Key: YARN-1146
 URL: https://issues.apache.org/jira/browse/YARN-1146
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.0-alpha
Reporter: Daryn Sharp

 {{RMDelegationTokenSecretManager}} implements {{storeNewToken}} and 
 {{updateStoredToken}} (renew) to pass the token and its sequence number to 
 {{RMStateStore#storeRMDelegationTokenAndSequenceNumber}}.
 There are two problems:
 # The assumption is that new tokens will be synchronously stored in-order.  
 With an async secret manager this may not hold true and the state's sequence 
 number may be incorrect.
 # A token renewal will reset the state's sequence number to _that token's_ 
 sequence number.
 Bug #2 is generally masked.  Creating a new token (with the first caveat) 
 will bump the state's sequence number back up.  Restoring the dtsm will first 
 set the state's stored sequence number, then re-add all the tokens which will 
 update the sequence number if the token's sequence number is greater than the 
 dtsm's current sequence number.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-707) Add user info in the YARN ClientToken

2013-09-04 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13758375#comment-13758375
 ] 

Daryn Sharp commented on YARN-707:
--

+1 Looks good enough to me.

 Add user info in the YARN ClientToken
 -

 Key: YARN-707
 URL: https://issues.apache.org/jira/browse/YARN-707
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bikas Saha
Assignee: Jason Lowe
Priority: Blocker
 Fix For: 3.0.0, 2.1.1-beta

 Attachments: YARN-707-20130822.txt, YARN-707-20130827.txt, 
 YARN-707-20130828-2.txt, YARN-707-20130828.txt, YARN-707-20130829.txt, 
 YARN-707-20130830.branch-0.23.txt, YARN-707-20130904.branch-0.23.txt


 If user info is present in the client token then it can be used to do limited 
 authz in the AM.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-707) Add user info in the YARN ClientToken

2013-08-28 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13752443#comment-13752443
 ] 

Daryn Sharp commented on YARN-707:
--

Technically you should be bumping the token ident's version number and using 
that to determine if the app submitter is in the ident.  Otherwise, decoding of 
prior tokens will attempt to read the missing app submitter from the next 
serialized object and eventually fail spectacularly.
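
Something like the following illustrates the versioning idea (field and constant names are invented for illustration, not the actual {{ClientToAMTokenIdentifier}}):

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;

// Invented names, illustration only: write a version marker and read the new
// field only when the identifier was serialized at or above that version, so
// identifiers minted before the change still deserialize cleanly.
class VersionedIdentSketch {
  private static final byte CURRENT_VERSION = 2;   // bumped when appSubmitter was added

  private byte version = CURRENT_VERSION;
  private final Text appAttemptId = new Text();
  private final Text appSubmitter = new Text();

  public void write(DataOutput out) throws IOException {
    out.writeByte(CURRENT_VERSION);
    appAttemptId.write(out);
    appSubmitter.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    version = in.readByte();
    appAttemptId.readFields(in);
    if (version >= 2) {
      appSubmitter.readFields(in);   // absent in identifiers written by older RMs
    }
  }
}
{code}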

{{RmAppImpl#createAndGetApplicationReport}}
Using checks on {{UserGroupInformation.isSecurityEnabled()}} here and elsewhere 
will cause future incompatibility to require tokens w/o security which is the 
direction yarn has been moving in.  It would be better to check if the secret 
manager is not null.

It's just logging if it cannot create a token?  This _shouldn't_ happen, but 
_if/when_ it does it's going to lead to harder-to-diagnose, after-the-fact 
errors in the client.  It's unfortunate you cannot throw the checked exception 
{{IOException}}, so I think you need to change the method signature or throw 
whatever you can, like a {{YarnException}}, to fail the request.

App attempt storing/restoring appears asymmetric.  Storing saves off the 
whole credentials in the attempt, whereas restoring appears to just pluck out 
the amrm token and the new persisted secret?

Minor:
Methods using the term Token, e.g. {{recoverAppAttemptTokens}} and 
{{getTokensFromAppAttempt}}, are misleading since what they handle is 
Credentials.  Vinod had me make a similar change to the method names in the AM.

{{AM_CLIENT_TOKEN_MASTER_KEY_NAME}} is better defined in {{RMAppAttempt}}, 
rather than in the {{RMStateStore}}.  Otherwise the import dependency seems 
backwards.

 Add user info in the YARN ClientToken
 -

 Key: YARN-707
 URL: https://issues.apache.org/jira/browse/YARN-707
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bikas Saha
Assignee: Jason Lowe
Priority: Blocker
 Fix For: 3.0.0, 2.1.1-beta

 Attachments: YARN-707-20130822.txt, YARN-707-20130827.txt


 If user info is present in the client token then it can be used to do limited 
 authz in the AM.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-707) Add user info in the YARN ClientToken

2013-08-23 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13748999#comment-13748999
 ] 

Daryn Sharp commented on YARN-707:
--

It almost seems like it would be better to invert the approach to be more 
consistent with other tokens - the owner of the token is the user (not the app 
attempt) and there's a new field for the app attempt (instead of a new field 
for the user).

Another thought would be to leverage the existing real/effective user in the 
token.  One is the submitter, the other is the app attempt.  Logging that 
includes the UGI will show appAttempt (auth:...) via daryn (auth:...), or 
vice-versa for the users.

Thoughts?

 Add user info in the YARN ClientToken
 -

 Key: YARN-707
 URL: https://issues.apache.org/jira/browse/YARN-707
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bikas Saha
Assignee: Vinod Kumar Vavilapalli
 Attachments: YARN-707-20130822.txt


 If user info is present in the client token then it can be used to do limited 
 authz in the AM.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-960) TestMRCredentials and TestBinaryTokenFile are failing on trunk

2013-07-24 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-960:
-

Attachment: YARN-960.patch

All tokens but the AMRM token are being lost if security is disabled.  
The AMLauncher is using {{UGI.isSecurityEnabled()}} to decide if it should 
decode the existing container tokens before adding the AMRM token and 
re-encoding the container tokens.  This is completely wrong.  Tokens need to be 
unconditionally passed.

This removes the security check.
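
The shape of the change, paraphrased (not the literal diff; the class and method names below are placeholders):

{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.io.DataInputByteBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

// Paraphrased sketch: always decode the existing tokens, add the AMRM token,
// and re-encode, with no UGI.isSecurityEnabled() gate anywhere in the path.
class TokenPassThroughSketch {
  static void addAMRMToken(ContainerLaunchContext container, Token<?> amrmToken)
      throws IOException {
    Credentials credentials = new Credentials();
    ByteBuffer tokens = container.getTokens();
    if (tokens != null) {
      DataInputByteBuffer dibb = new DataInputByteBuffer();
      dibb.reset(tokens);
      credentials.readTokenStorageStream(dibb);   // decode unconditionally
    }
    credentials.addToken(amrmToken.getService(), amrmToken);
    DataOutputBuffer dob = new DataOutputBuffer();
    credentials.writeTokenStorageToStream(dob);
    container.setTokens(ByteBuffer.wrap(dob.getData(), 0, dob.getLength()));
  }
}
{code}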

 TestMRCredentials and  TestBinaryTokenFile are failing on trunk
 ---

 Key: YARN-960
 URL: https://issues.apache.org/jira/browse/YARN-960
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Alejandro Abdelnur
Assignee: Daryn Sharp
Priority: Blocker
 Fix For: 2.1.0-beta

 Attachments: YARN-960.patch


 Not sure, but this may be a fallout from YARN-701 and/or related to YARN-945.
 Making it a blocker until full impact of the issue is scoped.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-945) AM register failing after AMRMToken

2013-07-24 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13718975#comment-13718975
 ] 

Daryn Sharp commented on YARN-945:
--

Please be sure I get a chance to look at the patch.

 AM register failing after AMRMToken
 ---

 Key: YARN-945
 URL: https://issues.apache.org/jira/browse/YARN-945
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Bikas Saha
Assignee: Vinod Kumar Vavilapalli
Priority: Blocker
 Fix For: 2.1.0-beta

 Attachments: nm.log, rm.log, yarn-site.xml


 509 2013-07-19 15:53:55,569 INFO org.apache.hadoop.ipc.Server: IPC Server 
 listener on 54313: readAndProcess from client 127.0.0.1   threw exception 
 [org.apache.hadoop.security.AccessControlException: SIMPLE authentication is 
 not enabled.  Available:[TOKEN]]
 510 org.apache.hadoop.security.AccessControlException: SIMPLE authentication 
 is not enabled.  Available:[TOKEN]
 511   at 
 org.apache.hadoop.ipc.Server$Connection.initializeAuthContext(Server.java:1531)
 512   at 
 org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1482)
 513   at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:788)
 514   at 
 org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:587)
 515   at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:562)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-874) Tracking YARN/MR test failures after HADOOP-9421 and YARN-827

2013-06-25 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13693186#comment-13693186
 ] 

Daryn Sharp commented on YARN-874:
--

+1!

 Tracking YARN/MR test failures after HADOOP-9421 and YARN-827
 -

 Key: YARN-874
 URL: https://issues.apache.org/jira/browse/YARN-874
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Blocker
 Attachments: YARN-874.1.txt, YARN-874.2.txt, YARN-874.txt


 HADOOP-9421 and YARN-827 broke some YARN/MR tests. Tracking those..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-690) RM exits on token cancel/renew problems

2013-05-16 Thread Daryn Sharp (JIRA)
Daryn Sharp created YARN-690:


 Summary: RM exits on token cancel/renew problems
 Key: YARN-690
 URL: https://issues.apache.org/jira/browse/YARN-690
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 0.23.7, 3.0.0, 2.0.5-beta
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Blocker


The DelegationTokenRenewer thread is critical to the RM.  When a 
non-IOException occurs, the thread calls System.exit to prevent the RM from 
running w/o the thread.  It should be exiting only on non-RuntimeExceptions.

The problem is especially bad in 23 because the yarn protobuf layer converts 
IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which causes 
the renewer to abort the process.  An UnknownHostException takes down the RM...
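
For illustration of the intended behavior only (not the actual patch), the renewal path should look more like this: a single failed renewal, including a RuntimeException that merely wraps an IOException, is logged and skipped, and nothing short of a genuinely fatal error should end the renewer thread.

{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Illustration only, not the actual patch: a failed renewal is logged and
// skipped; the renewer thread, and therefore the RM, keeps running.
class RenewerLoopSketch {
  private static final Log LOG = LogFactory.getLog(RenewerLoopSketch.class);

  void renewSafely(Runnable renewal) {
    try {
      renewal.run();
    } catch (RuntimeException e) {
      LOG.error("Token renewal failed, continuing", e);   // no System.exit here
    }
  }
}
{code}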

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-690) RM exits on token cancel/renew problems

2013-05-16 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-690:
-

Attachment: YARN-690.patch

1-line change to a catch.  No test added due to difficulty of testing calls to 
System.exit.

 RM exits on token cancel/renew problems
 ---

 Key: YARN-690
 URL: https://issues.apache.org/jira/browse/YARN-690
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Blocker
 Attachments: YARN-690.patch


 The DelegationTokenRenewer thread is critical to the RM.  When a 
 non-IOException occurs, the thread calls System.exit to prevent the RM from 
 running w/o the thread.  It should be exiting only on non-RuntimeExceptions.
 The problem is especially bad in 23 because the yarn protobuf layer converts 
 IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which 
 causes the renewer to abort the process.  An UnknownHostException takes down 
 the RM...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-690) RM exits on token cancel/renew problems

2013-05-16 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated YARN-690:
-

Attachment: YARN-690.patch

Doh, you're right.  It was a test, and you passed!

 RM exits on token cancel/renew problems
 ---

 Key: YARN-690
 URL: https://issues.apache.org/jira/browse/YARN-690
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Blocker
 Attachments: YARN-690.patch, YARN-690.patch


 The DelegationTokenRenewer thread is critical to the RM.  When a 
 non-IOException occurs, the thread calls System.exit to prevent the RM from 
 running w/o the thread.  It should be exiting only on non-RuntimeExceptions.
 The problem is especially bad in 23 because the yarn protobuf layer converts 
 IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which 
 causes the renewer to abort the process.  An UnknownHostException takes down 
 the RM...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

