[jira] [Commented] (YARN-8865) RMStateStore contains large number of expired RMDelegationToken
[ https://issues.apache.org/jira/browse/YARN-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676839#comment-16676839 ]

Daryn Sharp commented on YARN-8865:
-----------------------------------

+1 looks good!

> RMStateStore contains large number of expired RMDelegationToken
> ---------------------------------------------------------------
>
>                 Key: YARN-8865
>                 URL: https://issues.apache.org/jira/browse/YARN-8865
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Major
>         Attachments: YARN-8865.001.patch, YARN-8865.002.patch, YARN-8865.003.patch, YARN-8865.004.patch, YARN-8865.005.patch, YARN-8865.006.patch
>
> When the RM state store is restored, expired delegation tokens are restored and added to the system. These expired tokens do not get cleaned up or removed. The exact reason why the tokens are still in the store is not clear. We have seen as many as 250,000 tokens in the store, some of which were 2 years old.
> This has two side effects:
> * for the zookeeper store this leads to a jute buffer exhaustion issue and prevents the RM from becoming active.
> * restore takes longer than needed and heap usage is higher than it should be
> We should not restore already expired tokens since they cannot be renewed or used.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (YARN-8865) RMStateStore contains large number of expired RMDelegationToken
[ https://issues.apache.org/jira/browse/YARN-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661222#comment-16661222 ]

Daryn Sharp commented on YARN-8865:
-----------------------------------

You shouldn't modify the token identifier, i.e. change the max date, because an identifier is, and must be, immutable. I think a very similar and safe change is: when the secret key doesn't exist, artificially expire the token by creating the {{DelegationTokenInformation}} with a {{renewDate}} in the past.
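A minimal sketch of the suggested approach, using hypothetical stand-in types rather than Hadoop's actual {{DelegationTokenInformation}} and secret-manager classes: a token whose master key is missing is stored with a renew date already in the past, so the normal expiration sweep purges it without ever touching the immutable identifier.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the suggested fix: when the master key for a recovered token is
// missing, store the token with a renew date already in the past so the
// normal expiration sweep removes it, instead of mutating the immutable
// token identifier. All names here are hypothetical stand-ins.
public class ExpireOnRestoreSketch {
    public record TokenInfo(long renewDate) {}

    private final Map<Integer, byte[]> masterKeys = new HashMap<>();
    private final Map<String, TokenInfo> currentTokens = new HashMap<>();

    public void addMasterKey(int keyId, byte[] key) {
        masterKeys.put(keyId, key);
    }

    // Restore a token from the state store.
    public void restoreToken(String id, int keyId, long storedRenewDate) {
        // Key is gone: record an epoch renew date so the sweep purges the token.
        long renewDate = masterKeys.containsKey(keyId) ? storedRenewDate : 0L;
        currentTokens.put(id, new TokenInfo(renewDate));
    }

    // What the periodic expiration sweep does.
    public void removeExpiredTokens(long now) {
        currentTokens.values().removeIf(t -> t.renewDate() < now);
    }

    public boolean contains(String id) {
        return currentTokens.containsKey(id);
    }
}
```

The key point is that only the bookkeeping around the token changes; the serialized identifier restored from the store is never rewritten.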
[jira] [Commented] (YARN-8865) RMStateStore contains large number of expired RMDelegationToken
[ https://issues.apache.org/jira/browse/YARN-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646515#comment-16646515 ]

Daryn Sharp commented on YARN-8865:
-----------------------------------

Good job, that explains why the secret manager doesn't remove them. What's interesting is that secret keys are supposed to outlive their tokens. Were secret keys manually deleted? Regardless, the secret manager should be able to recover its state.

The patch is a high-risk change to a common class. Not all secret managers are equipped to handle mutation during loading. Case in point: the NN generates an edit to remove tokens, but edits cannot be generated while replaying edits (restoring state); fundamentally, an HA standby cannot modify state. Similar issues probably exist for other secret managers.

Perhaps the lowest-risk change is to add tokens with an invalid key anyway, but set the password to null. Authentication will fail, and the expiration thread should still be able to correctly remove the tokens. Alternatively, modify the RMDTSM to handle removal while restoring state.
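The null-password alternative described above can be sketched with plain Java (illustrative names only, not the ADTSM API): the orphaned token is still loaded, so the expiry thread can see and remove it, but its stored password can never match, so authentication fails in the meantime.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of the "null password" idea: tokens whose secret key is gone are
// loaded with a null password. The authentication path rejects them (a null
// password matches nothing), while the expiry machinery can still purge
// them. Class and method names are hypothetical, not Hadoop's API.
public class NullPasswordSketch {
    private final Map<String, byte[]> tokenPasswords = new HashMap<>();

    // passwordOrNull is null when the token's secret key no longer exists.
    public void loadToken(String id, byte[] passwordOrNull) {
        tokenPasswords.put(id, passwordOrNull);
    }

    // Authentication path: a null stored password can never match.
    public boolean authenticate(String id, byte[] presented) {
        byte[] stored = tokenPasswords.get(id);
        return stored != null && Arrays.equals(stored, presented);
    }
}
```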
[jira] [Commented] (YARN-8865) RMStateStore contains large number of expired RMDelegationToken
[ https://issues.apache.org/jira/browse/YARN-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645520#comment-16645520 ]

Daryn Sharp commented on YARN-8865:
-----------------------------------

The RMDelegationTokenSecretManager is an AbstractDelegationTokenSecretManager. The ADTSM uses a thread to periodically roll secret keys and purge expired tokens. We checked some clusters that use the leveldb state store and they are not leaking tokens, which implies the problem is likely specific to the ZKRMStateStore. Given that it's the ADTSM's job to expunge expired tokens, every state store impl should not be burdened with duplicated code to explicitly purge tokens just because one state store impl is buggy.
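The background thread the comment attributes to the ADTSM can be modeled with a scheduled task. This is a rough sketch with illustrative names, not the actual AbstractDelegationTokenSecretManager: one periodic job rolls the current key and drops tokens whose renew date has passed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Rough model of the ADTSM's background duties: periodically roll the
// current secret key and purge tokens whose renew date has passed.
// Field and method names are illustrative, not Hadoop's.
public class ExpiryThreadSketch {
    private final Map<String, Long> tokenRenewDates = new ConcurrentHashMap<>();
    private volatile int currentKeyId = 0;

    public void addToken(String id, long renewDate) {
        tokenRenewDates.put(id, renewDate);
    }

    public int tokenCount() {
        return tokenRenewDates.size();
    }

    public void rollMasterKey() {
        currentKeyId++;  // a real impl would also generate key material
    }

    public void removeExpiredTokens() {
        long now = System.currentTimeMillis();
        tokenRenewDates.values().removeIf(renewDate -> renewDate < now);
    }

    // Start the periodic roller/purger thread.
    public ScheduledExecutorService start(long periodMs) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.scheduleAtFixedRate(() -> {
            rollMasterKey();
            removeExpiredTokens();
        }, periodMs, periodMs, TimeUnit.MILLISECONDS);
        return exec;
    }
}
```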
[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment
[ https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470662#comment-16470662 ]

Daryn Sharp commented on YARN-8108:
-----------------------------------

bq. I took a look into the issue and am feeling okay about the conservative fix of making RMAuthenticationFilter global whenever it is enabled.

While that would "work", wouldn't it be a regression? An admin who specifically configured those filters, perhaps with different principals as Eric previously mentioned, would be quite surprised to discover that the configuration is now silently ignored.

Per earlier comments, the issue is apparently not present through at least 2.7.5, and most of the referenced jiras are up to 5 years old. We still need to identify which (recent-ish) jira caused the regression to understand the problem.

> RM metrics rest API throws GSSException in kerberized environment
> -----------------------------------------------------------------
>
>                 Key: YARN-8108
>                 URL: https://issues.apache.org/jira/browse/YARN-8108
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Kshitij Badani
>            Assignee: Eric Yang
>            Priority: Major
>         Attachments: YARN-8108.001.patch
>
> Test is trying to pull up metrics data from SHS after kiniting as 'test_user'. It is throwing GSSException as follows:
> {code:java}
> b2b460b80713|RUNNING: curl --silent -k -X GET -D /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json
> 2018-02-15 07:15:48,757|INFO|MainThread|machine.py:194 - run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0
> 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - getMetricsJsonData()|metrics:
> Error 403 GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
> HTTP ERROR 403
> Problem accessing /proxy/application_1518674952153_0070/metrics/json.
> Reason: GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
> {code}
> Root cause: the proxyserver on the RM can't be supported in a Kerberos-enabled cluster because AuthenticationFilter is applied twice in Hadoop code (once in HttpServer2 for the RM, and another instance from AmFilterInitializer for the proxy server). This will require code changes to the hadoop-yarn-server-web-proxy project.
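The "applied twice" failure mode in the root cause above can be modeled without any servlet machinery: a Kerberos service ticket may only be presented once, so a second authentication filter on the same request chain trips the replay check. A toy sketch under that assumption; the class and its "filter chain" of strings are hypothetical stand-ins, not Hadoop or servlet classes.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the double-registration bug: the same auth filter appearing
// twice on one path means the second instance re-presents a ticket that was
// already consumed, which a replay cache rejects ("Request is a replay").
// Everything here is an illustrative stand-in.
public class DoubleFilterSketch {
    static final Set<String> replayCache = new HashSet<>();

    // A ticket may be used exactly once; a second use is a replay.
    static boolean authenticate(String ticket) {
        return replayCache.add(ticket);
    }

    // Walk the filter chain; each "auth" filter consumes the ticket.
    public static boolean serve(List<String> filterChain, String ticket) {
        for (String filter : filterChain) {
            if (filter.equals("auth") && !authenticate(ticket)) {
                return false;  // 403: replay detected
            }
        }
        return true;
    }
}
```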
[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment
[ https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444097#comment-16444097 ]

Daryn Sharp commented on YARN-8108:
-----------------------------------

The TGS issues are purely caused by the double registration of the RMAuthenticationFilter for the /proxy path, so I don't think the SpnegoFilter init is involved. Please clarify the relevance.

Silently ignoring the explicit configuration for the proxyserver when it's internal may have security ramifications: an admin may want more or less restrictive auth for the two services. I'm a bit uneasy with rationalizing a fix for an issue with an unknown root cause and a not-well-understood fix. Please track down the jira that introduced the regression/incompatibility so we can correctly assess the problem.
[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment
[ https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443180#comment-16443180 ]

Daryn Sharp commented on YARN-8108:
-----------------------------------

bq. This seems to work and not trigger code path registered by proxyserver.

Please elaborate:
# Why do we want to bypass the code registered by the proxyserver?
# Should the proxy service even be using the RM's auth filter?
# How/why does changing addFilter to addGlobalFilter fix the problem? Adding the filter to every context (even those explicitly registered not to be filtered) seems counterintuitive.

I think we also need to root-cause exactly what change caused the RM auth filter to be double registered so we can ensure we've correctly fixed the bug.
[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment
[ https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440940#comment-16440940 ]

Daryn Sharp commented on YARN-8108:
-----------------------------------

Analysis looks sound. Agreed, each servlet should scope filters for itself, not globally. I'm surprised this hasn't been found before. Is this specific to 3.x, or does it exist in 2.x? (I guess we haven't seen this bug due to an alternate auth for the RM.)
[jira] [Commented] (YARN-7922) Yarn dont resolve rm/_HOST to hostname
[ https://issues.apache.org/jira/browse/YARN-7922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361017#comment-16361017 ]

Daryn Sharp commented on YARN-7922:
-----------------------------------

This shouldn't be able to happen. Distributed shell gets the renewer from {{YarnClientUtils.getRmPrincipal}}, which calls {{SecurityUtil.getServerPrincipal}} to substitute _HOST, yet somehow the substitution did not occur. The most conceivable, yet unlikely, way I see this failing is if the principal has more than 3 components, i.e. contains another / or @, which would cause the substitution to short out.

> Yarn dont resolve rm/_HOST to hostname
> --------------------------------------
>
>                 Key: YARN-7922
>                 URL: https://issues.apache.org/jira/browse/YARN-7922
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.7.3
>            Reporter: Berry Österlund
>            Priority: Minor
>
> The normal auth_to_local usually removes everything after the / in the username of the Kerberos principal. That, together with the _HOST setting in the configuration files specifying the Kerberos principals, is usually what is required to convert rm/_HOST@ to user yarn.
> In our environment, we can't use the default rules in auth_to_local. We have to specify each and every host and only convert those specifically. In other words, we don't have the DEFAULT rule in auth_to_local. Ideally, the config for us would be the following:
> {code:java}
> RULE:[1:$1@$0](rm@)s/.*/invalid_user/
> RULE:[2:$1/$2@$0](rm/rm1_host.fulldomain@)s/.*/yarn/
> RULE:[2:$1/$2@$0](rm/rm2_host.fulldomain@)s/.*/yarn/
> {code}
> But if we use only that configuration, the servicecheck in Ambari fails with the following exception:
> {code:java}
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1518422080198_0002 to YARN : Failed to renew token: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:devhadoop, Ident: (HDFS_DELEGATION_TOKEN token 11096 for ambari-qa)
> at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:272)
> at org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:708)
> at org.apache.hadoop.yarn.applications.distributedshell.Client.main(Client.java:215)
> {code}
> Inside the RM's logfile, I can find the following:
> {code:java}
> Caused by: org.apache.hadoop.security.AccessControlException: yarn tries to renew a token with renewer rm/_HOST@
> {code}
> Adding the following rule to auth_to_local solves the problem:
> RULE:[2:$1/$2@$0](rm/_HOST@)s/.*/yarn/
> The client used to test this is executed with the following command:
> yarn org.apache.hadoop.yarn.applications.distributedshell.Client -shell_command ls -num_containers 1 -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -timeout 30 --queue
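The _HOST handling the comment describes can be sketched in plain Java. This models the substitution behavior of {{SecurityUtil.getServerPrincipal}} as the comment characterizes it (only a 3-component service/_HOST@REALM principal is rewritten, so an extra / or @ shorts the substitution out); the class and method names are illustrative, and the real implementation resolves the local canonical hostname itself rather than taking it as a parameter.

```java
// Models the _HOST substitution performed for principals like
// "rm/_HOST@REALM". A principal with more or fewer than 3 components
// (an extra '/' or '@') is returned unchanged, which matches the
// failure mode speculated about in the comment above.
public class HostSubstitutionSketch {
    public static String substituteHost(String principal, String fqdn) {
        String[] components = principal.split("[/@]");
        if (components.length != 3 || !components[1].equals("_HOST")) {
            return principal;  // not a 3-part service/_HOST@REALM principal
        }
        return components[0] + "/" + fqdn + "@" + components[2];
    }
}
```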
[jira] [Commented] (YARN-7319) java.net.UnknownHostException when trying contact node by hostname
[ https://issues.apache.org/jira/browse/YARN-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202708#comment-16202708 ]

Daryn Sharp commented on YARN-7319:
-----------------------------------

bq. java.lang.IllegalArgumentException: java.net.UnknownHostException: hadoop-slave-743067341-hqrbk

I'm a bit confused. Why is the node resolving itself as "hadoop-slave-743067341-hqrbk"? I believe that's the hostname self-reported during registration. If this is truly an ip-only environment, presumably that means the junk hostname is only in that node's /etc/hosts, but not in the /etc/hosts of the other nodes? I understand not having reverse dns; however, not having forward dns but assigning a private hostname is a bit obtuse. Might as well not let the host resolve itself if nobody else can resolve it.

Did you try setting {{hadoop.security.token.service.use_ip=false}} per the javadocs on buildTokenService? That will get you past the exception while generating the container token. It's likely the client won't be able to locate the token, though: the token will have a host, but if the env is ip-only, the client must use an ip to connect and won't be able to match the ip with the hostname in the token.

> java.net.UnknownHostException when trying contact node by hostname
> ------------------------------------------------------------------
>
>                 Key: YARN-7319
>                 URL: https://issues.apache.org/jira/browse/YARN-7319
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Evgeny Makarov
>
> I'm trying to set up Hadoop on a Kubernetes cluster with the following setup:
> Hadoop master is a k8s pod
> Each hadoop slave is an additional k8s pod
> All communication is done in an IP-based manner. In HDFS I have dfs.namenode.datanode.registration.ip-hostname-check set to false and all works fine, however the same option is missing for the YARN manager.
> Here part of hadoop-master log when trying to submit simple word-count job: > 2017-10-12 09:00:25,005 ERROR > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: > Error trying to assign container token and NM token to an allocated > container container_1507798393049_0001_01_01 > java.lang.IllegalArgumentException: java.net.UnknownHostException: > hadoop-slave-743067341-hqrbk > at > org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377) > at > org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:258) > at > org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:220) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.pullNewlyAllocatedContainersAndNMTokens(SchedulerApplicationAttempt.java:454) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.getAllocation(FiCaSchedulerApp.java:269) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocate(CapacityScheduler.java:988) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:971) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:964) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:789) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:776) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.net.UnknownHostException: hadoop-slave-743067341-hqrbk > ... 19 more > As can be seen, host hadoop-slave-743067341-hqrbk is unreachable. Adding
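For reference, the {{hadoop.security.token.service.use_ip}} workaround suggested in the comment above is a core-site.xml property. A sketch of the setting follows; the property name is real, but treat the snippet as illustrative and note the comment's caveat that an ip-only client may still fail to match the hostname-based token service.

```xml
<!-- core-site.xml: build token services from hostnames instead of IPs.
     Must be set consistently on both servers and clients, or token
     lookup can fail as described in the comment above. -->
<property>
  <name>hadoop.security.token.service.use_ip</name>
  <value>false</value>
</property>
```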
[jira] [Created] (YARN-7083) Log aggregation deletes/renames while file is open
Daryn Sharp created YARN-7083:
---------------------------------

             Summary: Log aggregation deletes/renames while file is open
                 Key: YARN-7083
                 URL: https://issues.apache.org/jira/browse/YARN-7083
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.8.2
            Reporter: Daryn Sharp
            Priority: Critical

YARN-6288 changed the log aggregation writer to be an autoclosable. Unfortunately, the try-with-resources block for the writer renames or deletes the log while it is still open. Assuming the NM's behavior is otherwise correct, deleting open files only results in ominous WARNs in the nodemanager log and increases the rate of logging in the NN when the implicit try-with-resources close fails. These red herrings complicate debugging efforts.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
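The bug pattern described above can be illustrated with a local-filesystem stand-in (not the NM's actual aggregation code): the rename executes inside the try-with-resources block, i.e. before the implicit close() runs, so the stream is still open when its file is moved away. On HDFS that implicit close then fails; the local sketch below only shows the ordering.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrates the ordering bug: in buggyAggregate the move happens before
// the implicit close(), so the writer is still open when its file is
// renamed away. fixedAggregate closes first, then renames. This is a
// local-filesystem sketch, not the NM's log aggregation code.
public class RenameWhileOpenSketch {
    public static void buggyAggregate(Path tmp, Path dest) throws IOException {
        try (Writer out = Files.newBufferedWriter(tmp)) {
            out.write("aggregated logs\n");
            Files.move(tmp, dest, StandardCopyOption.REPLACE_EXISTING);
        }   // implicit close() now targets a path that no longer exists
    }

    public static void fixedAggregate(Path tmp, Path dest) throws IOException {
        try (Writer out = Files.newBufferedWriter(tmp)) {
            out.write("aggregated logs\n");
        }   // writer fully closed first...
        Files.move(tmp, dest, StandardCopyOption.REPLACE_EXISTING);  // ...then rename
    }

    // Small self-check: the fixed ordering produces the destination file.
    public static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("rename-sketch");
            Path dest = dir.resolve("app.log");
            fixedAggregate(dir.resolve("app.tmp"), dest);
            return Files.exists(dest);
        } catch (IOException e) {
            return false;
        }
    }
}
```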
[jira] [Commented] (YARN-7048) Fix tests faking kerberos to explicitly set ugi auth type
[ https://issues.apache.org/jira/browse/YARN-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137000#comment-16137000 ]

Daryn Sharp commented on YARN-7048:
-----------------------------------

Note this patch contains no functional change outside of the two test files it updated. Neither failing test is associated with the test files in this patch.

> Fix tests faking kerberos to explicitly set ugi auth type
> ---------------------------------------------------------
>
>                 Key: YARN-7048
>                 URL: https://issues.apache.org/jira/browse/YARN-7048
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>         Attachments: YARN-7048.patch
>
> TestTokenClientRMService and TestRMDelegationTokens are faking kerberos authentication. The remote-user ugis are explicitly created as kerberos, but not the login user's ugi. Prior to HADOOP-9747, new ugi instances defaulted to kerberos even if not kerberos-based. Now ugis default to kerberos only if actually kerberos-based, which causes the login-user-based tests to fail.
[jira] [Updated] (YARN-7048) Fix tests faking kerberos to explicitly set ugi auth type
[ https://issues.apache.org/jira/browse/YARN-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp updated YARN-7048:
------------------------------
    Attachment: YARN-7048.patch
[jira] [Assigned] (YARN-7048) Fix tests faking kerberos to explicitly set ugi auth type
[ https://issues.apache.org/jira/browse/YARN-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp reassigned YARN-7048:
---------------------------------
    Assignee: Daryn Sharp
[jira] [Created] (YARN-7048) Fix tests faking kerberos to explicitly set ugi auth type
Daryn Sharp created YARN-7048:
---------------------------------

             Summary: Fix tests faking kerberos to explicitly set ugi auth type
                 Key: YARN-7048
                 URL: https://issues.apache.org/jira/browse/YARN-7048
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
            Reporter: Daryn Sharp

TestTokenClientRMService and TestRMDelegationTokens are faking kerberos authentication. The remote-user ugis are explicitly created as kerberos, but not the login user's ugi. Prior to HADOOP-9747, new ugi instances defaulted to kerberos even if not kerberos-based. Now ugis default to kerberos only if actually kerberos-based, which causes the login-user-based tests to fail.
[jira] [Commented] (YARN-6679) Reduce Resource instance overhead via non-PBImpl
[ https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16057718#comment-16057718 ]

Daryn Sharp commented on YARN-6679:
-----------------------------------

[~jlowe] or [~nroberts] may be able to comment on the allocation throughput; I just reduced overhead found by a profiler. SLS may not be exercising the RM in the same manner as a real-world setting.

If you look at {{ResourcePBImpl}}, it has to:
# instantiate a builder – wasted object
# carry unneeded instance variables in the builder and its parent class – wasted memory
# call setters for memory and vcores, each of which updates a bit field, assigns an instance variable, and marks the parent builder dirty – unnecessary computational overhead

By comparison, a simple object with 2 longs is clearly a win. Even if you aren't stressing the scheduler to its maximum, you should see fewer gc/min due to slower heap growth. I don't have the profile available, but the cost of excessive Resource instantiations is still a non-trivial percentage of the loop.

> Reduce Resource instance overhead via non-PBImpl
> ------------------------------------------------
>
>                 Key: YARN-6679
>                 URL: https://issues.apache.org/jira/browse/YARN-6679
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>             Fix For: 2.9.0, 3.0.0-alpha4
>
>         Attachments: YARN-6679.2.branch-2.patch, YARN-6679.2.trunk.patch, YARN-6679.3.branch-2.patch, YARN-6679.3.trunk.patch, YARN-6679.branch-2.patch, YARN-6679.trunk.patch
>
> Creating and using transient PB-based Resource instances during scheduling is very expensive. The overhead can be transparently reduced by internally using lightweight non-PB based instances.
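The "simple object with 2 longs" comparison can be sketched as a plain value object. This is illustrative, not the actual YARN Resource API; it just shows the shape of a non-PB resource with two primitive fields and none of the builder or dirty-flag bookkeeping the comment lists.

```java
// Sketch of the lightweight alternative described above: two primitive
// fields, direct setters, no builder allocation, no bit-field updates,
// no dirty marking. Names are illustrative, not the actual YARN classes.
public final class SimpleResource {
    private long memorySize;
    private long virtualCores;

    public SimpleResource(long memorySize, long virtualCores) {
        this.memorySize = memorySize;
        this.virtualCores = virtualCores;
    }

    public long getMemorySize()   { return memorySize; }
    public long getVirtualCores() { return virtualCores; }

    public void setMemorySize(long m)   { memorySize = m; }    // plain assignment
    public void setVirtualCores(long v) { virtualCores = v; }  // plain assignment
}
```

Since schedulers create such objects in hot loops, avoiding the per-instance builder allocation is where the gc relief the comment mentions comes from.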
[jira] [Commented] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue
[ https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16057664#comment-16057664 ] Daryn Sharp commented on YARN-6681: --- bq. is it not good enough that leaf queue returns false and parent queue returns true ? I don't know. I tried to make the absolute minimal no-risk change that preserves existing semantics, as dubious as they may be. The parent queue currently returns false if it has no child queues, so always returning true changes the existing semantics. Likewise, a leaf queue subclass currently can claim to have child queues, so always returning false changes the semantics. I'd suggest integrating the current patch(es) and using another jira for further changes/optimizations that change semantics? > Eliminate double-copy of child queues in canAssignToThisQueue > - > > Key: YARN-6681 > URL: https://issues.apache.org/jira/browse/YARN-6681 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Attachments: YARN-6681.2.branch-2.8.patch, > YARN-6681.2.branch-2.patch, YARN-6681.2.trunk.patch, > YARN-6681.branch-2.8.patch, YARN-6681.branch-2.patch, YARN-6681.trunk.patch > > > 20% of the time in {{AbstractCSQueue#canAssignToThisQueue}} is spent > performing two duplications of a treemap of child queues into a list - once to > test for null, a second to see if it's empty. Eliminating the dups reduces the > overhead to 2%.
[jira] [Commented] (YARN-6682) Improve performance of AssignmentInformation datastructures
[ https://issues.apache.org/jira/browse/YARN-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042953#comment-16042953 ] Daryn Sharp commented on YARN-6682: --- This is a simple/self-contained change. Does anyone have time to review? > Improve performance of AssignmentInformation datastructures > --- > > Key: YARN-6682 > URL: https://issues.apache.org/jira/browse/YARN-6682 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.8.0 >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Attachments: YARN-6682.branch-2.8.patch, YARN-6682.branch-2.patch, > YARN-6682.trunk.patch > > > {{AssignmentInformation}} is inefficient and creates lots of garbage that > increases gc pressure. It creates 3 hashmaps that each contain only 2 > enum-based keys. This requires wrapper node objects, boxing/unboxing of ints, > and more expensive lookups than simply using primitive arrays indexed by enum > ordinal.
[jira] [Commented] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue
[ https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042950#comment-16042950 ] Daryn Sharp commented on YARN-6681: --- [~Ying Zhang], are you ok with the current patch?
[jira] [Commented] (YARN-6680) Avoid locking overhead for NO_LABEL lookups
[ https://issues.apache.org/jira/browse/YARN-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042949#comment-16042949 ] Daryn Sharp commented on YARN-6680: --- Any feedback? > Avoid locking overhead for NO_LABEL lookups > --- > > Key: YARN-6680 > URL: https://issues.apache.org/jira/browse/YARN-6680 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Attachments: YARN-6680.patch > > > Labels are managed via a hash that is protected with a read lock. The lock > acquire and release are each just as expensive as the hash lookup itself - > resulting in a 3X slowdown.
[jira] [Commented] (YARN-6679) Reduce Resource instance overhead via non-PBImpl
[ https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042946#comment-16042946 ] Daryn Sharp commented on YARN-6679: --- [~dan...@cloudera.com], have I addressed your concerns?
[jira] [Commented] (YARN-6682) Improve performance of AssignmentInformation datastructures
[ https://issues.apache.org/jira/browse/YARN-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16037513#comment-16037513 ] Daryn Sharp commented on YARN-6682: --- Test failures are completely unrelated. Ex. RPC clients unable to connect to hexstrings...
[jira] [Updated] (YARN-6679) Reduce Resource instance overhead via non-PBImpl
[ https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-6679: -- Attachment: YARN-6679.3.branch-2.patch YARN-6679.3.trunk.patch Findbugs is not related to this patch. Test failures other than {{TestPBImplRecords}} (caused by me reducing the visibility of {{getProto}}) are not related. I reverted the visibility and fixed up the style issues.
[jira] [Updated] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue
[ https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-6681: -- Attachment: YARN-6681.2.trunk.patch YARN-6681.2.branch-2.patch YARN-6681.2.branch-2.8.patch Updated {{ParentQueue#hasChildQueues}} to return true so as to avoid unnecessary synchronization or locks.
[jira] [Updated] (YARN-6679) Reduce Resource instance overhead via non-PBImpl
[ https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-6679: -- Attachment: YARN-6679.2.trunk.patch YARN-6679.2.branch-2.patch [~dan...@cloudera.com], I'll save my get out of jail free card. I added tests that new instances are not pb impls and the to/from conversion is correct. {{TestPBImplRecords}} also already does pb conversion tests so coverage should be good.
[jira] [Commented] (YARN-6680) Avoid locking overhead for NO_LABEL lookups
[ https://issues.apache.org/jira/browse/YARN-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035505#comment-16035505 ] Daryn Sharp commented on YARN-6680: ---
# Maybe it was the same before; I just attacked the top few hot spots in a profile to get us unblocked. I think Nathan measured a ~10% performance reduction, but it was enough to push the RM off a cliff. I achieved a 2X overall performance increase from my patches under this umbrella, so maybe this was low hanging fruit.
# Ah yes. There are other (unnecessary) RW lock hotspots, but the label manager map uses locks to protect the concurrent map, essentially for making a consistent copy. A concurrent map can be iterated w/o CME but isn't guaranteed to visit every entry, hence why I think the locks are there.
# Yes. The goal is to optimize the common case.
# Cloning the resource would be very bad. Resource object allocation overhead is already very high.
[jira] [Commented] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue
[ https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034902#comment-16034902 ] Daryn Sharp commented on YARN-6681: --- Should/could I just unconditionally return true for a parent queue? An RW lock is ridiculously expensive just to fetch the size. I tried to make minimal/low-risk changes for an internal build to get us unblocked, but it would seem to make sense. My hesitation was the null check on childQueues, implying an iteration would NPE, but it's marked final so always returning true seems safe?
[jira] [Commented] (YARN-6680) Avoid locking overhead for NO_LABEL lookups
[ https://issues.apache.org/jira/browse/YARN-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034834#comment-16034834 ] Daryn Sharp commented on YARN-6680: --- I use a profiler for performance work – hunches are inevitably wrong. [~jlowe] and [~nroberts] verified the improvement. 2.8 is currently DOA, see [details|https://issues.apache.org/jira/browse/YARN-6679?focusedCommentId=16033655=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16033655]. This patch, along with my others under the umbrella, increased overall performance by ~2X. The scheduler's fine grain locking was a bad idea. RW locks are not cheap, esp. for tiny critical sections. Write barriers are extremely expensive - slower than a hash lookup. Surprising but true. Eventually these maps should be concurrent maps, which use no lock for read ops, and the memory read barriers are cheap. The processor just sniffs the cache lines. bq. i feel the locks atleast in few places are required to maintain consistency. [...] read lock is required as intermittently node's partition mapping could be changed, or node can be deactivated etc... all the ops where write lock is held ? The locks currently do not provide guaranteed consistency. Example:
# Consistent:
#* thread1 read locks, gets resource, unlocks
#* thread2 write locks, updates resource
#* thread1 accesses resource – won't see thread2's update immediately
# Inconsistent:
#* thread1 write locks, updates resource
#* thread2 read locks, gets resource, unlocks, accesses resource – will see thread1's update
With my patch, the reader won't see the update in either case (unless it was also the writer). The question is: does it matter? It's already a race due to no coarse grain lock to provide a snapshot view in time. Will it have a detrimental impact to possibly see slightly stale data? If it does, then there's already a major bug in the code. In the end, this patch contributed to making 2.8 actually deployable.
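The eventual direction Daryn describes (lock-free reads via a concurrent map instead of a read-lock-guarded hash) can be sketched as below. All names are hypothetical; this is not the RM's actual label-manager code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: reads on a ConcurrentHashMap acquire no lock,
// unlike a HashMap behind a ReadWriteLock where the acquire/release
// pair costs as much as the hash lookup itself.
public class LabelResources {
    private final Map<String, Long> resourcesByLabel = new ConcurrentHashMap<>();

    public void update(String label, long resource) {
        resourcesByLabel.put(label, resource); // writers synchronize internally
    }

    public long get(String label) {
        // Lock-free read; may observe slightly stale data, which is the
        // same race the existing fine-grained locking already permits.
        return resourcesByLabel.getOrDefault(label, 0L);
    }

    public static void main(String[] args) {
        LabelResources lr = new LabelResources();
        lr.update("", 4096L);
        System.out.println(lr.get(""));
    }
}
```

The trade-off matches the comment above: individual reads stay consistent with memory-visibility guarantees, but iteration over the map is weakly consistent, which is why a consistent-copy snapshot still needs its own mechanism.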
[jira] [Updated] (YARN-6682) Improve performance of AssignmentInformation datastructures
[ https://issues.apache.org/jira/browse/YARN-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-6682: -- Attachment: YARN-6682.branch-2.8.patch YARN-6682.branch-2.patch YARN-6682.trunk.patch Simply uses primitive arrays indexed on ordinal. Differences between branches are trivial: ex. generics removal, containerId vs rmContainer.
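The "primitive arrays indexed on ordinal" technique can be sketched as follows. The enum and class names are illustrative, not the actual {{AssignmentInformation}} code:

```java
// Hypothetical sketch of replacing enum-keyed HashMaps with primitive
// arrays indexed by Enum.ordinal(): no wrapper node objects, no
// boxing/unboxing of ints, and an array index beats a hash lookup.
public class AssignmentCounts {
    enum Operation { ALLOCATION, RESERVATION }

    // One primitive slot per enum constant.
    private final int[] counts = new int[Operation.values().length];

    public void increment(Operation op) {
        counts[op.ordinal()]++;   // direct array write, no autoboxing
    }

    public int get(Operation op) {
        return counts[op.ordinal()];
    }

    public static void main(String[] args) {
        AssignmentCounts c = new AssignmentCounts();
        c.increment(Operation.ALLOCATION);
        c.increment(Operation.ALLOCATION);
        System.out.println(c.get(Operation.ALLOCATION)); // prints 2
    }
}
```

With only two enum-based keys per map, three such arrays replace three HashMaps and all their per-entry garbage.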
[jira] [Created] (YARN-6682) Improve performance of AssignmentInformation datastructures
Daryn Sharp created YARN-6682: - Summary: Improve performance of AssignmentInformation datastructures Key: YARN-6682 URL: https://issues.apache.org/jira/browse/YARN-6682 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.8.0 Reporter: Daryn Sharp Assignee: Daryn Sharp {{AssignmentInformation}} is inefficient and creates lots of garbage that increases gc pressure. It creates 3 hashmaps that each contain only 2 enum-based keys. This requires wrapper node objects, boxing/unboxing of ints, and more expensive lookups than simply using primitive arrays indexed by enum ordinal.
[jira] [Updated] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue
[ https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-6681: -- Attachment: YARN-6681.branch-2.patch YARN-6681.branch-2.8.patch The branch-2 patch was really branch-2.8. All 3 patches are just context.
[jira] [Updated] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue
[ https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-6681: -- Attachment: (was: YARN-6681.branch-2.patch)
[jira] [Commented] (YARN-6680) Avoid locking overhead for NO_LABEL lookups
[ https://issues.apache.org/jira/browse/YARN-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033665#comment-16033665 ] Daryn Sharp commented on YARN-6680: --- Findbugs warning is from {{AggregatedLogFormat}} which is not part of this patch.
[jira] [Commented] (YARN-6679) Reduce Resource instance overhead via non-PBImpl
[ https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033655#comment-16033655 ] Daryn Sharp commented on YARN-6679: --- Thanks Daniel! Using signum to handle a NaN (impossible, I hope?) may not be worth the cost. I only moved the pre-existing method from the pb impl to the base class, so I'd suggest another jira if you feel strongly that it should be changed? bq. How thoroughly has this been tested? I'd say very. The first large/busy pre-production cluster was crippled by 2.8. The scheduler thread was constantly pegging a cpu and falling behind. Deploys were halted. We deployed my collection of patches about 1.5w ago. Cpu fluctuates a lot, but doesn't stay pegged anymore. bq. Wanna add some unit tests to confirm the newInstance() methods and the PB conversion work as expected? I would certainly hope the existing tests provide coverage! :) I didn't expose any new methods to test but I'll concoct some rudimentary tests if need be.
[jira] [Updated] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue
[ https://issues.apache.org/jira/browse/YARN-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-6681: -- Attachment: YARN-6681.branch-2.patch YARN-6681.trunk.patch Add a {{hasChildQueues}} method that is overridden by {{ParentQueue}} to avoid the tree-to-list duplications.
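The {{hasChildQueues}} approach can be sketched as below. The class shapes are hypothetical simplifications of the CapacityScheduler hierarchy, not the actual patch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: answer "do I have children?" without copying the
// child-queue treemap into a list just to test for null or emptiness.
abstract class AbstractQueue {
    // Default: no children unless a subclass says otherwise.
    public boolean hasChildQueues() {
        return false;
    }
}

class LeafQueue extends AbstractQueue { }

class ParentQueue extends AbstractQueue {
    private final List<AbstractQueue> childQueues = new ArrayList<>();

    @Override
    public boolean hasChildQueues() {
        // Always true: childQueues is final and non-null, so this avoids
        // synchronization/locks just to fetch a size, per the follow-up
        // discussed in the comments above.
        return true;
    }
}

public class QueueDemo {
    public static void main(String[] args) {
        System.out.println(new ParentQueue().hasChildQueues()); // true
        System.out.println(new LeafQueue().hasChildQueues());   // false
    }
}
```

Callers such as a canAssignToThisQueue-style check can then branch on this boolean instead of materializing two list copies per invocation.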
[jira] [Created] (YARN-6681) Eliminate double-copy of child queues in canAssignToThisQueue
Daryn Sharp created YARN-6681: - Summary: Eliminate double-copy of child queues in canAssignToThisQueue Key: YARN-6681 URL: https://issues.apache.org/jira/browse/YARN-6681 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.8.0 Reporter: Daryn Sharp Assignee: Daryn Sharp 20% of the time in {{AbstractCSQueue#canAssignToThisQueue}} is spent performing two duplications of a treemap of child queues into a list - once to test for null, a second to see if it's empty. Eliminating the dups reduces the overhead to 2%.
[jira] [Updated] (YARN-6680) Avoid locking overhead for NO_LABEL lookups
[ https://issues.apache.org/jira/browse/YARN-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-6680: -- Attachment: YARN-6680.patch Simply maintains a reference to the no-label instances that are already being seeded into maps. Lookups of the no-label key will use the reference in lieu of the lock and hash. Not a perfect solution but provides a significant performance boost until the maps can be changed to concurrent.
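The NO_LABEL fast path can be sketched as follows. The names are hypothetical, not the RM's actual label-manager classes; the point is the cached reference that skips both the read lock and the hash probe:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: keep a direct reference to the already-seeded
// no-label entry so the common lookup avoids the lock and the hash.
public class LabelStore {
    private static final String NO_LABEL = "";

    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, Long> labels = new HashMap<>();
    private volatile Long noLabelValue; // cached reference for the hot path

    public LabelStore() {
        noLabelValue = 0L;
        labels.put(NO_LABEL, noLabelValue);
    }

    public void put(String label, Long value) {
        lock.writeLock().lock();
        try {
            labels.put(label, value);
            if (NO_LABEL.equals(label)) {
                noLabelValue = value; // keep the fast-path reference current
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    public Long get(String label) {
        if (NO_LABEL.equals(label)) {
            return noLabelValue; // no lock acquire/release, no hash lookup
        }
        lock.readLock().lock();
        try {
            return labels.get(label);
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        LabelStore s = new LabelStore();
        s.put("", 4096L);
        System.out.println(s.get("")); // prints 4096
    }
}
```

Since most clusters run with everything in the default partition, this single cached reference covers the overwhelming majority of lookups while leaving the locked path for actual labels untouched.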
[jira] [Created] (YARN-6680) Avoid locking overhead for NO_LABEL lookups
Daryn Sharp created YARN-6680: - Summary: Avoid locking overhead for NO_LABEL lookups Key: YARN-6680 URL: https://issues.apache.org/jira/browse/YARN-6680 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.8.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Labels are managed via a hash that is protected with a read lock. The lock acquire and release are each just as expensive as the hash lookup itself - resulting in a 3X slowdown.
[jira] [Updated] (YARN-6679) Reduce Resource instance overhead via non-PBImpl
[ https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-6679: -- Attachment: YARN-6679.branch-2.patch Same as trunk patch, plus 1-line changes to the deprecated and removed resource increase/decrease PBs and request PB.
[jira] [Issue Comment Deleted] (YARN-6245) Add FinalResource object to reduce overhead of Resource class instancing
[ https://issues.apache.org/jira/browse/YARN-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-6245: -- Comment: was deleted (was: Same as trunk patch, plus 1-line changes to the deprecated and removed resource increase/decrease PBs and request PB.) > Add FinalResource object to reduce overhead of Resource class instancing > > > Key: YARN-6245 > URL: https://issues.apache.org/jira/browse/YARN-6245 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan > Attachments: observable-resource.patch, > YARN-6245.preliminary-staled.1.patch > > > There's lots of Resource object creation in the YARN Scheduler; since Resource > objects are backed by protobuf, creation of such objects is expensive and > becomes a bottleneck. > To address the problem, we can introduce a FinalResource (is it better to > call it ImmutableResource?) object, which is not backed by PBImpl. We can use > this object in frequent invoke paths in the scheduler.
[jira] [Updated] (YARN-6245) Add FinalResource object to reduce overhead of Resource class instancing
[ https://issues.apache.org/jira/browse/YARN-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-6245: -- Attachment: YARN-6679.branch-2.patch Same as trunk patch, plus 1-line changes to the deprecated and removed resource increase/decrease PBs and request PB.
[jira] [Updated] (YARN-6245) Add FinalResource object to reduce overhead of Resource class instancing
[ https://issues.apache.org/jira/browse/YARN-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-6245: -- Attachment: (was: YARN-6679.branch-2.patch) > Add FinalResource object to reduce overhead of Resource class instancing > > > Key: YARN-6245 > URL: https://issues.apache.org/jira/browse/YARN-6245 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan > Attachments: observable-resource.patch, > YARN-6245.preliminary-staled.1.patch > > > There're lots of Resource object creation in YARN Scheduler, since Resource > object is backed by protobuf, creation of such objects is expensive and > becomes bottleneck. > To address the problem, we can introduce a FinalResource (Is it better to > call it ImmutableResource?) object, which is not backed by PBImpl. We can use > this object in frequent invoke paths in the scheduler. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6245) Add FinalResource object to reduce overhead of Resource class instancing
[ https://issues.apache.org/jira/browse/YARN-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033233#comment-16033233 ] Daryn Sharp commented on YARN-6245: --- Posted to new YARN-6679 in case you wish to pursue the immutable resources, although the scheduler really should try to reuse instances when possible. > Add FinalResource object to reduce overhead of Resource class instancing > > > Key: YARN-6245 > URL: https://issues.apache.org/jira/browse/YARN-6245 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan > Attachments: observable-resource.patch, > YARN-6245.preliminary-staled.1.patch > > > There're lots of Resource object creation in YARN Scheduler, since Resource > object is backed by protobuf, creation of such objects is expensive and > becomes bottleneck. > To address the problem, we can introduce a FinalResource (Is it better to > call it ImmutableResource?) object, which is not backed by PBImpl. We can use > this object in frequent invoke paths in the scheduler. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6679) Reduce Resource instance overhead via non-PBImpl
[ https://issues.apache.org/jira/browse/YARN-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-6679: -- Attachment: YARN-6679.trunk.patch # Add a private {{SimpleResource}} object with longs for memory/vcores. # {{ResourcePBImpl#getProto(Resource)}} converts to a PBImpl as necessary to create a message. # Bulk of the patch is a regexp change of {{((ResourcePBImpl) r).getProto()}} to {{ProtoUtils.convertToProtoFormat(r)}} or the local class conversion method. # In a few places, just set the PB instead of creating and checking equality before setting. Otherwise simple resources will be double converted. Overall effect is that resources arriving via PBs remain PB-based. The myriad of internal resource instances used during calculations become lightweight and are converted to a PB iff necessary. > Reduce Resource instance overhead via non-PBImpl > > > Key: YARN-6679 > URL: https://issues.apache.org/jira/browse/YARN-6679 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Attachments: YARN-6679.trunk.patch > > > Creating and using transient PB-based Resource instances during scheduling is > very expensive. The overhead can be transparently reduced by internally > using lightweight non-PB based instances. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
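The shape of change #1 above — a plain-field resource used on hot paths, with protobuf conversion deferred until a message must actually be built — might look roughly like this (class and method names are illustrative, not the actual patch):

```java
// Sketch of the approach described above: a resource backed by two plain
// longs for use in scheduler calculations. No protobuf builder is touched
// unless the value crosses the wire. The name SimpleResource mirrors the
// patch description; the rest is a stand-in, not the real YARN class.
public class SimpleResource {
    private final long memory;
    private final long vcores;

    public SimpleResource(long memory, long vcores) {
        this.memory = memory;
        this.vcores = vcores;
    }

    public long getMemorySize() { return memory; }

    public long getVirtualCores() { return vcores; }

    public static void main(String[] args) {
        SimpleResource r = new SimpleResource(4096, 2);
        System.out.println(r.getMemorySize() + " " + r.getVirtualCores());
    }
}
```

In the real patch a conversion helper (the {{ProtoUtils.convertToProtoFormat(r)}} call mentioned above) would build a PB message from these fields only when required.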
[jira] [Created] (YARN-6679) Reduce Resource instance overhead via non-PBImpl
Daryn Sharp created YARN-6679: - Summary: Reduce Resource instance overhead via non-PBImpl Key: YARN-6679 URL: https://issues.apache.org/jira/browse/YARN-6679 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.8.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Creating and using transient PB-based Resource instances during scheduling is very expensive. The overhead can be transparently reduced by internally using lightweight non-PB based instances. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6245) Add FinalResource object to reduce overhead of Resource class instancing
[ https://issues.apache.org/jira/browse/YARN-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033054#comment-16033054 ] Daryn Sharp commented on YARN-6245: --- I've been OOO. I'll be posting my collection of patches today for review. > Add FinalResource object to reduce overhead of Resource class instancing > > > Key: YARN-6245 > URL: https://issues.apache.org/jira/browse/YARN-6245 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan > Attachments: observable-resource.patch, > YARN-6245.preliminary-staled.1.patch > > > There're lots of Resource object creation in YARN Scheduler, since Resource > object is backed by protobuf, creation of such objects is expensive and > becomes bottleneck. > To address the problem, we can introduce a FinalResource (Is it better to > call it ImmutableResource?) object, which is not backed by PBImpl. We can use > this object in frequent invoke paths in the scheduler. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6245) Add FinalResource object to reduce overhead of Resource class instancing
[ https://issues.apache.org/jira/browse/YARN-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16019701#comment-16019701 ] Daryn Sharp commented on YARN-6245: --- [~jlowe] asked me to comment since we're running into 2.8 scheduler performance issues we believe are (in part) due to pb impl based objects. I think I've designed a means for resources via RPC to remain {{ResourcePBImpl}} while internally created resources are lightweight and only converted to a PB if it will be sent over the wire. At least as a start, it's a very simple patch that substitutes in a lightweight object via {{Resource.newInstance}} that simply contains 2 longs. Replaced usages of {{((ResourcePBImpl)r)#getProto()}} with {{ProtoUtils.convertToProtoFormat(Resource)}} which converts the lightweight to a pb impl as required. That's it. We're testing today. Will post a sample patch if it looks promising. > Add FinalResource object to reduce overhead of Resource class instancing > > > Key: YARN-6245 > URL: https://issues.apache.org/jira/browse/YARN-6245 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan > Attachments: observable-resource.patch, > YARN-6245.preliminary-staled.1.patch > > > There're lots of Resource object creation in YARN Scheduler, since Resource > object is backed by protobuf, creation of such objects is expensive and > becomes bottleneck. > To address the problem, we can introduce a FinalResource (Is it better to > call it ImmutableResource?) object, which is not backed by PBImpl. We can use > this object in frequent invoke paths in the scheduler. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6603) NPE in RMAppsBlock
[ https://issues.apache.org/jira/browse/YARN-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16012435#comment-16012435 ] Daryn Sharp commented on YARN-6603: --- +1 I think no test is fine due to difficulty of forcing the race condition and the patch essentially amounts to a null check. Failed tests appear unrelated. > NPE in RMAppsBlock > -- > > Key: YARN-6603 > URL: https://issues.apache.org/jira/browse/YARN-6603 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-6603.001.patch, YARN-6603.002.patch > > > We are seeing an intermittent NPE when the RM is trying to render the > /cluster URI. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6603) NPE in RMAppsBlock
[ https://issues.apache.org/jira/browse/YARN-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011414#comment-16011414 ] Daryn Sharp commented on YARN-6603: --- After getting the rmApp, you should replace: {code} RMAppAttempt appAttempt = rmApp.getAppAttempts().get(appAttemptId); {code} with: {code} RMAppAttempt appAttempt = rmApp.getAppAttempt(appAttemptId); {code} The current getAppAttempts() returns an unmodifiable collection of a non-threadsafe map, which isn't useful at all. The latter uses proper synchronization to look up the attempt. You may also be saddened to learn that a synchronized copy of the blacklist hashset is created just to get the size. Bonus points for fixing that, but not necessary. > NPE in RMAppsBlock > -- > > Key: YARN-6603 > URL: https://issues.apache.org/jira/browse/YARN-6603 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-6603.001.patch > > > We are seeing an intermittent NPE when the RM is trying to render the > /cluster URI. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
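The distinction drawn above matters because an unmodifiable wrapper only blocks mutation through the returned reference; it does nothing to make reads of a plain HashMap safe against concurrent writers. A small stand-in illustration (names hypothetical, not the RMApp API):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Contrasts the two access patterns discussed above. getAttempts() hands
// out a read-only view of a non-threadsafe map, so callers still race
// with writers; getAttempt() reads under the same lock the writers hold.
public class AttemptStore {
    private final Map<String, String> attempts = new HashMap<>();

    // Unsafe pattern: the view is immutable, the backing map is not.
    public Map<String, String> getAttempts() {
        return Collections.unmodifiableMap(attempts);
    }

    // Safe pattern: lookup synchronized against the mutators below.
    public synchronized String getAttempt(String id) {
        return attempts.get(id);
    }

    public synchronized void putAttempt(String id, String state) {
        attempts.put(id, state);
    }
}
```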
[jira] [Reopened] (YARN-3760) Log aggregation failures
[ https://issues.apache.org/jira/browse/YARN-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp reopened YARN-3760: --- Line numbers are from an old release but the error is evident.
{code}
java.lang.IllegalStateException: Cannot close TFile in the middle of key-value insertion.
	at org.apache.hadoop.io.file.tfile.TFile$Writer.close(TFile.java:310)
	at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogWriter.close(AggregatedLogFormat.java:456)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:326)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:429)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:388)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$2.run(LogAggregationService.java:387)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:722)
{code}
_AggregatedLogFormat.LogWriter_
{code}
public void close() {
  try {
    this.writer.close();
  } catch (IOException e) {
    LOG.warn("Exception closing writer", e);
  }
  IOUtils.closeStream(fsDataOStream);
}
{code}
The TFile writer's close may throw {{IllegalStateException}} if the underlying fs data stream failed. Unfortunately close() only catches IOE, so the ISE rips out w/o closing the fsdata stream. Additionally, the ctor creates the fs data stream then a TFile.Writer w/o a try/catch. If the TFile.Writer ctor throws an exception, it's impossible to close the stream. I haven't checked if there are further issues with closing the writer higher in the stack.
> Log aggregation failures > - > > Key: YARN-3760 > URL: https://issues.apache.org/jira/browse/YARN-3760 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Daryn Sharp >Priority: Critical > > The aggregated log file does not appear to be properly closed when writes > fail. This leaves a lease renewer active in the NM that spams the NN with > lease renewals. If the token is marked not to be cancelled, the renewals > appear to continue until the token expires. If the token is cancelled, the > periodic renew spam turns into a flood of failed connections until the lease > renewer gives up. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
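The reopen comment above describes two leaks: a non-IOE escaping {{close()}} before the stream is released, and a ctor that can't release the stream it just opened. A minimal sketch of a leak-proof close ordering, using stand-in {{Closeable}} fields rather than the real TFile.Writer/FSDataOutputStream classes:

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch of a close() that cannot leak the underlying stream: the
// try/finally guarantees the stream's close runs even when the writer's
// close throws an unchecked exception such as IllegalStateException.
public class SafeLogWriter {
    private final Closeable writer;      // stand-in for TFile.Writer
    private final Closeable dataStream;  // stand-in for FSDataOutputStream

    public SafeLogWriter(Closeable writer, Closeable dataStream) {
        this.writer = writer;
        this.dataStream = dataStream;
    }

    public void close() {
        try {
            writer.close();
        } catch (Exception e) {          // not just IOException
            System.err.println("Exception closing writer: " + e);
        } finally {
            try {
                dataStream.close();      // always reached
            } catch (IOException ignored) {
            }
        }
    }
}
```

The ctor half of the problem would similarly need the stream construction and the TFile.Writer construction wrapped so that a failure in the latter closes the former.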
[jira] [Commented] (YARN-4126) RM should not issue delegation tokens in unsecure mode
[ https://issues.apache.org/jira/browse/YARN-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606587#comment-15606587 ] Daryn Sharp commented on YARN-4126: --- The general contract for servers is to return null when tokens are not applicable. This violates that contract and throws an exception. How is a generalized client supposed to pre-meditate fetching a token? And how is it to handle a generic IOE? I'd rather see this reverted from trunk and never integrated. We've historically had lots of problems with all the security-enabled conditionals, which is why one of my multi-year-old endeavors is to have tokens always enabled and gut the security conditionals. I've always admired the fact that yarn unconditionally used them... This is a step backwards. > RM should not issue delegation tokens in unsecure mode > -- > > Key: YARN-4126 > URL: https://issues.apache.org/jira/browse/YARN-4126 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Bibin A Chundatt > Fix For: 3.0.0-alpha1 > > Attachments: 0001-YARN-4126.patch, 0002-YARN-4126.patch, > 0003-YARN-4126.patch, 0004-YARN-4126.patch, 0005-YARN-4126.patch, > 0006-YARN-4126.patch > > > ClientRMService#getDelegationToken is currently returning a delegation token > in insecure mode. We should not return the token if it's in insecure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
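The contract argued for above — return null when tokens don't apply, rather than throw — lets a generic client handle secure and insecure clusters with one check. A sketch using a stand-in interface (not the real ClientRMService API):

```java
// Illustrates the null-when-not-applicable contract discussed above.
// TokenService and collectToken are hypothetical stand-ins; the real
// client would be adding a Token to the job's Credentials.
public class TokenContract {
    interface TokenService {
        /** Returns a token, or null when tokens are not applicable (e.g. security off). */
        String getDelegationToken(String renewer);
    }

    static boolean collectToken(TokenService svc, String renewer) {
        String token = svc.getDelegationToken(renewer);
        if (token == null) {
            // Not an error: the server simply has no token to hand out.
            return false;
        }
        // ... add the token to the job's credentials ...
        return true;
    }
}
```

Under the exception-throwing behavior criticized above, the client would instead have to catch a generic IOE and guess whether it means "no security" or a real failure.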
[jira] [Commented] (YARN-4632) Replacing _HOST in RM_PRINCIPAL should not be the responsibility of the client code
[ https://issues.apache.org/jira/browse/YARN-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115313#comment-15115313 ] Daryn Sharp commented on YARN-4632: --- How do you intend to make the change in yarn? As you probably discovered, it's too late for the RM to make the substitution since the NN has already encoded the principal in the token. > Replacing _HOST in RM_PRINCIPAL should not be the responsibility of the > client code > --- > > Key: YARN-4632 > URL: https://issues.apache.org/jira/browse/YARN-4632 > Project: Hadoop YARN > Issue Type: Improvement > Components: api, resourcemanager >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Critical > > It is currently the client's responsibility to call > {{SecurityUtil.getServerPrincipal()}} to replace the _HOST placeholder in any > principal name used for a delegation token. This is a non-optional operation > and should not be pushed onto the client. > All client apps that followed the distributed shell as the canonical example > failed to do the replacement because distributed shell fails to do the > replacement. (See YARN-4629.) Rather than fixing the whole world, since the > whole world use distributed shell as a model, let's move the operation into > YARN where it belongs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3760) Log aggregation failures
[ https://issues.apache.org/jira/browse/YARN-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569289#comment-14569289 ] Daryn Sharp commented on YARN-3760: --- Cancelled tokens trigger the retry proxy bug. Log aggregation failures - Key: YARN-3760 URL: https://issues.apache.org/jira/browse/YARN-3760 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Daryn Sharp Priority: Critical The aggregated log file does not appear to be properly closed when writes fail. This leaves a lease renewer active in the NM that spams the NN with lease renewals. If the token is marked not to be cancelled, the renewals appear to continue until the token expires. If the token is cancelled, the periodic renew spam turns into a flood of failed connections until the lease renewer gives up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3760) Log aggregation failures
Daryn Sharp created YARN-3760: - Summary: Log aggregation failures Key: YARN-3760 URL: https://issues.apache.org/jira/browse/YARN-3760 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Daryn Sharp Priority: Critical The aggregated log file does not appear to be properly closed when writes fail. This leaves a lease renewer active in the NM that spams the NN with lease renewals. If the token is marked not to be cancelled, the renewals appear to continue until the token expires. If the token is cancelled, the periodic renew spam turns into a flood of failed connections until the lease renewer gives up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487620#comment-14487620 ] Daryn Sharp commented on YARN-3055: --- Two apps could double renew tokens (completely benign) before this patch. In practice the possibility is slim and it's harmless. However, currently it's quite buggy. Both apps renewed and then stomped over each other's dttrs in allTokens. Now both apps reference separate yet equivalent dttr instances, when the intention was only one app should reference a token. A second/duplicate timer task was also scheduled. Haven't bothered to check later fallout from the inconsistencies. Patch: A double renew can still occur (unavoidable) but only one timer is scheduled. All apps reference the same dttr instance. Moving the logic down only creates 3 loops instead of 2 loops but I'll do it if you feel strongly. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise for the existing submitted applications which share this token will not get renew any more, and for new submitted applications which share this token, the token will be renew immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See following scenario: *1).* app1 is submitted firstly, then app2, and then app3. 
In this case, there is only one token renewal timer for token1, and is scheduled when app1 is submitted *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-3055: -- Attachment: YARN-3055.patch The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise for the existing submitted applications which share this token will not get renew any more, and for new submitted applications which share this token, the token will be renew immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and is scheduled when app1 is submitted *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487482#comment-14487482 ] Daryn Sharp commented on YARN-3055: --- Thanks Vinod, I'll revise this morning. The ignores shouldn't be there. I did that for our internal emergency fix because I didn't handle proxy refresh tokens, so I didn't care that the tests failed. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise for the existing submitted applications which share this token will not get renew any more, and for new submitted applications which share this token, the token will be renew immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and is scheduled when app1 is submitted *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-3055: -- Attachment: YARN-3055.patch Haven't had a chance to run findbugs. Might grumble about sync dttr.applicationIds. Will check this afternoon. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise for the existing submitted applications which share this token will not get renew any more, and for new submitted applications which share this token, the token will be renew immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and is scheduled when app1 is submitted *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485461#comment-14485461 ] Daryn Sharp commented on YARN-3055: ---
bq. It does seem odd to get the expiration date by renewing the token
The expiration is metadata associated with the token that is only known to the token issuer's secret manager. The correct fix is for the renewer to not reschedule if the next expiration is the same as the last. The bug wasn't a real priority when tokens weren't renewed forever. If we regress to renewing forever, then it does become a problem.
bq. I think currently the sub-job won't kill the overall workflow.
Correct, I misread in my haste. It's rather the opposite: sub-jobs can override the original job's request to cancel the tokens.
bq. I think overall the current patch will work, other than few comments I have.
It works but not in a desirable way. Jason posted my patch we use internally on YARN-3439, which is duped to this jira. I'm updating it to handle the proxy refresh cases and will post shortly. The current semantics of the conf setting and the 2.x changes have been nothing but production blockers. Ref counting will solve this once and for all. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. 
Otherwise for the existing submitted applications which share this token will not get renew any more, and for new submitted applications which share this token, the token will be renew immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and is scheduled when app1 is submitted *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
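The ref counting proposed in the comment above boils down to: track how many live apps share each token, and only cancel when the last one goes away. A hedged sketch (not the actual DelegationTokenRenewer code; names are made up):

```java
import java.util.HashMap;
import java.util.Map;

// One shared entry per token; cancellation is deferred until the last
// referencing application is removed, so a finishing sub-job can never
// kill a token that other jobs in the workflow still need.
public class TokenRefCounter {
    private final Map<String, Integer> refs = new HashMap<>();

    public synchronized void appAdded(String token) {
        refs.merge(token, 1, Integer::sum);
    }

    /** Returns true only when the token should actually be cancelled. */
    public synchronized boolean appRemoved(String token) {
        Integer n = refs.get(token);
        if (n == null) {
            return false;            // unknown token: nothing to cancel
        }
        if (n > 1) {
            refs.put(token, n - 1);  // other apps still reference it
            return false;
        }
        refs.remove(token);
        return true;                 // last reference gone: safe to cancel
    }
}
```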
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485890#comment-14485890 ] Daryn Sharp commented on YARN-3055: --- The renew at job submission isn't the problem. It's actually very desirable. Years back, a job submitted with bad tokens - one that was destined to fail - would be launched anyway. The tasks failed to connect, ipc level retries occurred, then higher level retries occurred, and yarn generally caught all exceptions and retried. Tasks were retried, perhaps the app attempt was retried, etc. In the end, a job that _clearly was going to fail_ might tie up cluster resources for 20+ minutes. Why was it launched when a failed renew could have prevented the problem? Not to mention the renewer was hardcoded to assume the expiration interval was 24h... So much for being able to stress test the renewer with 1m expirations. The potential DOS problem is when a token has reached end of life expiration. Let's say the token can be renewed twice. The third and subsequent renews return the same expiration. # t1 = submit + renew # t2 = t1 + renew # t3 = t2 # t4 = t2 The renew timers fire at 90% of the delta between now and the next expiration. So as the end of life expiration approaches, the timer fires with increasing frequency. 50 threads doing that virtually non-stop would not be pretty. The solution is to stop renewing when the next expiration equals the last expiration. That can be addressed in another jira that's not a blocker because if tokens aren't renewed forever then it's a rare situation. 
The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise for the existing submitted applications which share this token will not get renew any more, and for new submitted applications which share this token, the token will be renew immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and is scheduled when app1 is submitted *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
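The schedule math in the comment above (fire at 90% of the remaining lifetime; stop once a renew returns the same expiration) can be sketched as follows; class and method names are illustrative, not the DelegationTokenRenewer code:

```java
// Sketch of the renew scheduling discussed above: schedule the next
// attempt at 90% of the time remaining, and detect the "wall" where a
// renew no longer pushes the expiration forward, so the timer can stop
// instead of rapid-firing as the max lifetime approaches.
public class RenewalSchedule {
    /** Delay before the next renew attempt, in the same units as the inputs. */
    public static long nextRenewDelay(long now, long expiration) {
        return Math.max(0, (long) ((expiration - now) * 0.90));
    }

    /** True when renewing again cannot push the expiration any further. */
    public static boolean shouldStopRenewing(long lastExpiration, long newExpiration) {
        return newExpiration <= lastExpiration;
    }
}
```

Without the stop condition, each renew near the wall returns the same expiration, the delay shrinks toward zero, and the renewer threads spin exactly as the comment warns.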
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486104#comment-14486104 ] Daryn Sharp commented on YARN-3055: --- I believe you are describing the behavior of 2.6's new proxy token refresh feature. I won't digress into how broken it appears to be except in the simplest use case. With it off, there is no fetching a new token. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise for the existing submitted applications which share this token will not get renew any more, and for new submitted applications which share this token, the token will be renew immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and is scheduled when app1 is submitted *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484117#comment-14484117 ] Daryn Sharp commented on YARN-3055: --- On cursory glance, are you sure this isn't going to leak tokens? I.e., does it remove tokens from the data structures in all cases, or can a token get left in {{allTokens}}?
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484201#comment-14484201 ] Daryn Sharp commented on YARN-3055: --- This appears to go back to the really old days of renewing the token for its entire lifetime. Most unfortunate. The renewer looks like it may turn into a DOS weapon. Renewing a token returns the next expiration. The renewer uses a timer to renew at 90% of the time before expiration. After the last renewal, the same expiration (the wall) will be returned as before, so 90% of the remaining time to the wall eventually becomes rapid-fire renewal. There's an army of 50 threads prepared to fire concurrently. My other concern is that it used to be the first job submitted with a given token that determined whether the token is to be cancelled. Now any job can influence the cancelling. This patch didn't specifically break that behavior, but the original YARN-2704 did, which precipitated YARN-2964 to break it differently, and now this jira. The ramification is that we used to tell users to make sure the first job set the conf correctly and essentially not worry after that. Now they do have to worry: any sub-job with the default of cancelling tokens will kill the overall workflow. Sub-jobs should not have jurisdiction over the tokens.
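The rapid-fire pathology Daryn describes can be sketched as a toy model in plain Java (this is an illustration of the scheduling math only, not the actual {{DelegationTokenRenewer}} code; `nextDelay` is a stand-in name):

```java
public class RenewalDriftSketch {
    // Toy model: the renewer schedules the next renewal at 90% of the time
    // remaining until the expiration the server returned. Once the token
    // reaches its max lifetime, renew() keeps returning the same expiration
    // ("the wall"), so the delay shrinks geometrically toward zero and the
    // renewal timers degenerate into rapid fire.
    static long nextDelay(long now, long expiration) {
        return (long) ((expiration - now) * 0.90);
    }

    public static void main(String[] args) {
        long now = 0;
        final long wall = 1_000_000L; // fixed expiration after max lifetime
        long delay;
        int renewals = 0;
        while ((delay = nextDelay(now, wall)) > 0) {
            now += delay; // timer fires; renew() returns the same wall again
            renewals++;
        }
        // The delay collapses from 900,000 ms to 0 in only a handful of steps.
        System.out.println("renewals before collapse: " + renewals);
    }
}
```

Each renewal leaves only 10% of the previous interval, so with 50 renewer threads the final intervals are short enough for concurrent, effectively continuous renew calls against the token's issuer.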
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481225#comment-14481225 ] Daryn Sharp commented on YARN-3055: --- Correctly handling the don't-cancel setting for jobs that submit jobs has been a recurring issue. We're internally testing a small patch to continue renewing until all jobs using the token(s) have finished. Handling the auto-fetch of proxy tokens proved a bit more difficult, so I need to complete the internal patch. I can take this over or post a partial patch if [~hitliuyi] would like to finish it.
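The "continue renewing until all jobs using the token have finished" approach amounts to reference-counting apps per token. A minimal sketch of that idea (illustrative class and method names only, not the real DelegationTokenRenewer API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch: track which apps reference a shared token and only allow
// cancellation (and timer teardown) once the last referencing app finishes.
public class SharedTokenTracker {
    private final Map<String, Set<String>> tokenToApps = new HashMap<>();

    public void appSubmitted(String token, String appId) {
        tokenToApps.computeIfAbsent(token, t -> new HashSet<>()).add(appId);
    }

    // Returns true only when no apps reference the token any more,
    // i.e. it is now safe to cancel the token and its renewal timer.
    public boolean appFinished(String token, String appId) {
        Set<String> apps = tokenToApps.get(token);
        if (apps == null) {
            return false;
        }
        apps.remove(appId);
        if (apps.isEmpty()) {
            tokenToApps.remove(token); // last user gone: stop renewing
            return true;
        }
        return false; // other apps still share it: keep the timer alive
    }
}
```

With this shape, a sub-job finishing removes only its own reference; the timer and {{allTokens}} entry survive until the workflow's last job completes, which is exactly the behavior the original description says is broken after YARN-2964.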
[jira] [Commented] (YARN-2971) RM uses conf instead of token service address to renew timeline delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313090#comment-14313090 ] Daryn Sharp commented on YARN-2971: --- +1 Looks good RM uses conf instead of token service address to renew timeline delegation tokens - Key: YARN-2971 URL: https://issues.apache.org/jira/browse/YARN-2971 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2971-v1.patch, YARN-2971-v2.patch The TimelineClientImpl renewDelegationToken uses the incorrect webaddress to renew Timeline DelegationTokens. It should read the service address out of the token to renew the delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2971) RM uses conf instead of token service address to renew timeline delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293559#comment-14293559 ] Daryn Sharp commented on YARN-2971: --- I think it's because the job cannot assume that the timeline server matches the cluster's config. I think the patch looks fine other than it should be using {{SecurityUtil.getTokenServiceAddr}} instead of directly accessing the token service.
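The fix being reviewed here is to derive the renewer's target address from the token's own service field ("host:port") rather than the client's configuration; in real Hadoop code {{SecurityUtil.getTokenServiceAddr}} does this for a {{Token}}. A minimal standalone sketch of the same idea, parsing a plain service string (the class and method names are illustrative):

```java
import java.net.InetSocketAddress;

// Sketch: resolve the renewal target from the token service string itself,
// so a token minted by a remote timeline server is renewed against that
// server, not whatever address the local conf happens to contain.
public class TokenServiceAddr {
    static InetSocketAddress fromService(String service) {
        int idx = service.lastIndexOf(':');
        if (idx < 0) {
            throw new IllegalArgumentException("service is not host:port: " + service);
        }
        return InetSocketAddress.createUnresolved(
            service.substring(0, idx),
            Integer.parseInt(service.substring(idx + 1)));
    }
}
```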
[jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
[ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247400#comment-14247400 ] Daryn Sharp commented on YARN-2964: --- [~vinodkv], can you take a look at this? RM prematurely cancels tokens for jobs that submit jobs (oozie) --- Key: YARN-2964 URL: https://issues.apache.org/jira/browse/YARN-2964 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Daryn Sharp Priority: Critical The RM used to globally track the unique set of tokens for all apps. It remembered the first job that was submitted with the token. The first job controlled the cancellation of the token. This prevented completion of sub-jobs from canceling tokens used by the main job. As of YARN-2704, the RM now tracks tokens on a per-app basis. There is no notion of the first/main job. This results in sub-jobs canceling tokens and failing the main job and other sub-jobs. It also appears to schedule multiple redundant renewals. The issue is not immediately obvious because the RM will cancel tokens ~10 min (NM liveliness interval) after log aggregation completes. The result is an oozie job, ex. pig, that launches many sub-jobs over time will fail if any sub-job is launched 10 min after any sub-job completes. If all other sub-jobs complete within that 10 min window, then the issue goes unnoticed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1915) ClientToAMTokenMasterKey should be provided to AM at launch time
[ https://issues.apache.org/jira/browse/YARN-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095471#comment-14095471 ] Daryn Sharp commented on YARN-1915: --- +1 But since I had a hand in the design, we should get a 2nd vote. ClientToAMTokenMasterKey should be provided to AM at launch time Key: YARN-1915 URL: https://issues.apache.org/jira/browse/YARN-1915 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Hitesh Shah Assignee: Jason Lowe Priority: Critical Attachments: YARN-1915.patch, YARN-1915v2.patch Currently, the AM receives the key as part of registration. This introduces a race where a client can connect to the AM when the AM has not received the key. Current Flow: 1) AM needs to start the client listening service in order to get host:port and send it to the RM as part of registration 2) RM gets the port info in register() and transitions the app to RUNNING. Responds back with client secret to AM. 3) User asks RM for client token. Gets it and pings the AM. AM hasn't received client secret from RM and so RPC itself rejects the request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1915) ClientToAMTokenMasterKey should be provided to AM at launch time
[ https://issues.apache.org/jira/browse/YARN-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095820#comment-14095820 ] Daryn Sharp commented on YARN-1915: --- I suspect it's because it removes the burden from the AM to strip the secret from the credentials so it doesn't leak to other processes.
[jira] [Commented] (YARN-1915) ClientToAMTokenMasterKey should be provided to AM at launch time
[ https://issues.apache.org/jira/browse/YARN-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096319#comment-14096319 ] Daryn Sharp commented on YARN-1915: --- Yes, I thought the ugi mangling was gone, but the AMRMToken is indeed manually removed. I'm assuming there was a valid reason why the secret is passed in the registration response, perhaps for future functionality. Rather than second-guess how/why it's done this way, I'd prefer to focus on a small immediate fix for this very tight race condition. The AM should generally receive the registration response before a client can ask the RM where the AM is and try to connect. Could we file another jira to contemplate an incompatible change?
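The race in the issue description can be modeled very simply: the AM's RPC layer cannot verify client tokens until the master key has arrived, so any client that connects in the window between step 3 and the registration response is rejected. A toy model of that gate (illustrative names, not the actual secret-manager API); delivering the key at container launch, as this jira proposes, closes the window because the key is present before the client service ever opens:

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the YARN-1915 race: client-token verification is only
// possible once the ClientToAMTokenMasterKey has been received.
public class ClientToAMRace {
    private final AtomicReference<byte[]> masterKey = new AtomicReference<>();

    // Called when the key arrives (registration response today; at
    // container launch under the proposed fix).
    public void receiveKey(byte[] key) {
        masterKey.set(key);
    }

    // The RPC layer rejects client connections while this is false.
    public boolean canVerifyClientToken() {
        return masterKey.get() != null;
    }
}
```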
[jira] [Commented] (YARN-2147) client lacks delegation token exception details when application submit fails
[ https://issues.apache.org/jira/browse/YARN-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047946#comment-14047946 ] Daryn Sharp commented on YARN-2147: --- Code looks fine. Currently the test verifies that the stringified token is in the exception's message. However, since the mock is throwing an exception explicitly with the stringified token, we don't know if the code change is actually catching and adding the token. The mock should throw a generic string of, say, "boom", then check the caught exception against something like "Failed to renew token: token: boom". client lacks delegation token exception details when application submit fails - Key: YARN-2147 URL: https://issues.apache.org/jira/browse/YARN-2147 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Attachments: YARN-2147-v2.patch, YARN-2147-v3.patch, YARN-2147-v4.patch, YARN-2147.patch When a client submits an application and the delegation token process fails, the client can lack critical details needed to understand the nature of the error. Only the message of the error exception is conveyed to the client, which sometimes isn't enough to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2147) client lacks delegation token exception details when application submit fails
[ https://issues.apache.org/jira/browse/YARN-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034612#comment-14034612 ] Daryn Sharp commented on YARN-2147: --- I don't think the patch handles the use case it's designed for. If job submission failed with a bland "Read timed out", then logging all the tokens in the RM log doesn't help the end user, nor does the RM log even answer the question of which token timed out. What you really want to do is change {{DelegationTokenRenewer#handleAppSubmitEvent}} to trap exceptions from {{renewToken}}. Wrap the exception with a more descriptive exception that stringifies to the user as "Can't renew token blah: Read timed out".
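The wrap-and-rethrow pattern being suggested looks roughly like the following sketch. The helper name and token description are stand-ins for illustration; the real change would live in {{DelegationTokenRenewer#handleAppSubmitEvent}} around its {{renewToken}} call:

```java
import java.io.IOException;

// Sketch: trap a renewal failure and rethrow it with a message that
// names the offending token, so the client sees which token failed
// rather than only the underlying low-level error.
public class DescriptiveRenewFailure {
    static void renewOrExplain(String tokenDescription, Runnable renew)
            throws IOException {
        try {
            renew.run();
        } catch (RuntimeException e) {
            throw new IOException(
                "Failed to renew token: " + tokenDescription + ": "
                    + e.getMessage(), e);
        }
    }
}
```

This is also exactly what the review comment above asks the test to verify: with the mock throwing a generic "boom", the caught message should contain both the token description and "boom", proving the wrapping code ran.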
[jira] [Commented] (YARN-2156) ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN as security configuration
[ https://issues.apache.org/jira/browse/YARN-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030639#comment-14030639 ] Daryn Sharp commented on YARN-2156: --- A warning doesn't make sense because it implies there is something you should change. There's not. The config setting, whether explicitly set or not, is entirely irrelevant. By design, yarn always uses tokens, and these tokens carry essential information that is not otherwise obtainable for non-token authenticated connections. That's why token authentication is explicitly set. ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN as security configuration --- Key: YARN-2156 URL: https://issues.apache.org/jira/browse/YARN-2156 Project: Hadoop YARN Issue Type: Bug Reporter: Svetozar Ivanov org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService#serviceStart() method has mistakenly hardcoded AuthMethod.TOKEN as Hadoop security authentication. It looks like this:
{code}
@Override
protected void serviceStart() throws Exception {
  Configuration conf = getConfig();
  YarnRPC rpc = YarnRPC.create(conf);
  InetSocketAddress masterServiceAddress = conf.getSocketAddr(
      YarnConfiguration.RM_SCHEDULER_ADDRESS,
      YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS,
      YarnConfiguration.DEFAULT_RM_SCHEDULER_PORT);
  Configuration serverConf = conf;
  // If the auth is not-simple, enforce it to be token-based.
  serverConf = new Configuration(conf);
  serverConf.set(
      CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION,
      SaslRpcServer.AuthMethod.TOKEN.toString());
  ...
}
{code}
Obviously such code makes sense only if the CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION config setting is missing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2156) ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN as security configuration
[ https://issues.apache.org/jira/browse/YARN-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029913#comment-14029913 ] Daryn Sharp commented on YARN-2156: --- Yes, this is by design. Yarn uses tokens regardless of your security setting.
[jira] [Resolved] (YARN-2156) ApplicationMasterService#serviceStart() method has hardcoded AuthMethod.TOKEN as security configuration
[ https://issues.apache.org/jira/browse/YARN-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp resolved YARN-2156. --- Resolution: Not a Problem
[jira] [Commented] (YARN-1841) YARN ignores/overrides explicit security settings
[ https://issues.apache.org/jira/browse/YARN-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939272#comment-13939272 ] Daryn Sharp commented on YARN-1841: --- bq. The fact that it behaves differently once invoked from the AM vs. just a simple API call to a remote cluster is what I am questioning. This should work. What issue are you encountering when talking to a remote service? bq. ... I understand that this really should not work (I don't even have an app deployed at the time of invocation of this code) ... You answered your own question, but the good news is it's possible. You are trying to emulate an unmanaged AM. It's not possible to just register an AM w/o first requesting an app attempt id from the RM. The subsequent registration will use an AMRM token that is issued by the RM. YARN ignores/overrides explicit security settings - Key: YARN-1841 URL: https://issues.apache.org/jira/browse/YARN-1841 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Oleg Zhurakousky core-site.xml explicitly sets authentication as SIMPLE:
{code}
<property>
  <name>hadoop.security.authentication</name>
  <value>simple</value>
  <description>Simple authentication</description>
</property>
{code}
However, any attempt to register an ApplicationMaster on the remote YARN cluster results in
{code}
org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN]
...
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1841) YARN ignores/overrides explicit security settings
[ https://issues.apache.org/jira/browse/YARN-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939353#comment-13939353 ] Daryn Sharp commented on YARN-1841: --- I thought you were having issues talking to other services like a NN. As noted, trying to communicate directly with the AMRM service is an invalid use case. By design you cannot talk to this service w/o a token issued by the RM. The RM must create the app id and app attempt for the AM prior to the AM registering. I'd suggest leveraging the unmanaged AM.
[jira] [Commented] (YARN-1841) YARN ignores/overrides explicit security settings
[ https://issues.apache.org/jira/browse/YARN-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937933#comment-13937933 ] Daryn Sharp commented on YARN-1841: --- The reason the custom AM in the related user@hadoop thread is failing is likely because it's coded incorrectly. I suspect the RM-supplied tokens were not added to the AM's ugi. In general, tokens are just a lightweight alternate authentication method that removes the need for hard authentication, ex. kerberos, which a task cannot do. Tokens within yarn are used to encode app/task identity and other information. Note that the identity is not the job's user identity, so tokens cannot be disabled. This jira should be marked invalid if Vinod agrees.
[jira] [Resolved] (YARN-1841) YARN ignores/overrides explicit security settings
[ https://issues.apache.org/jira/browse/YARN-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp resolved YARN-1841. --- Resolution: Not A Problem Oleg, the authentication config setting specifies the _external authentication_ for client-visible services, i.e. the NN, RM, etc. The _internal authentication_ within the yarn framework is an implementation detail independent of the configured auth method. Yarn does not need to log a warning or exception for its internal design. I think you are naively looking at this from the viewpoint of simple auth. Consider kerberos auth: the AM, NM, tasks, etc. cannot use kerberos to authenticate. Even if they could, the token is used to securely sign and transport tamper-resistant values. Always using tokens prevents the dreaded "why does this AM/etc break with security enabled?" After using the configured auth for job submission, the code path within yarn is common and the internal auth is of no concern to the user. There is no design problem; the api is transparently based on the token + rpc layer meshing to securely transport (whether simple or kerberos auth) the identity and resource requirements between processes. Feel free to ask Vinod or me questions offline to come up to speed on hadoop yarn's security.
[jira] [Commented] (YARN-1628) TestContainerManagerSecurity fails on trunk
[ https://issues.apache.org/jira/browse/YARN-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885677#comment-13885677 ] Daryn Sharp commented on YARN-1628: --- +1. Will check in later today. Thanks! TestContainerManagerSecurity fails on trunk --- Key: YARN-1628 URL: https://issues.apache.org/jira/browse/YARN-1628 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.2.0 Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-1628.patch The test fails with the following error:
{noformat}
java.lang.IllegalArgumentException: java.net.UnknownHostException: InvalidHost
  at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
  at org.apache.hadoop.yarn.server.security.BaseNMTokenSecretManager.newInstance(BaseNMTokenSecretManager.java:145)
  at org.apache.hadoop.yarn.server.security.BaseNMTokenSecretManager.createNMToken(BaseNMTokenSecretManager.java:136)
  at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:253)
  at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:144)
{noformat}
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Assigned] (YARN-691) Invalid NaN values in Hadoop REST API JSON response
[ https://issues.apache.org/jira/browse/YARN-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp reassigned YARN-691: Assignee: Daryn Sharp Invalid NaN values in Hadoop REST API JSON response --- Key: YARN-691 URL: https://issues.apache.org/jira/browse/YARN-691 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 0.23.6, 2.0.4-alpha Reporter: Kendall Thrapp Assignee: Daryn Sharp I've been occasionally coming across instances where Hadoop's Cluster Applications REST API (http://hadoop.apache.org/docs/r0.23.6/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API) has returned JSON that PHP's json_decode function failed to parse. I've tracked the syntax error down to the presence of the unquoted word NaN appearing as a value in the JSON. For example: "progress":NaN. NaN is not part of the JSON spec, so its presence renders the whole JSON string invalid. Hadoop needs to return something other than NaN in this case -- perhaps an empty string or the quoted string "NaN". -- This message was sent by Atlassian JIRA (v6.1#6144)
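The fix the report asks for is a serialization guard: never emit a bare NaN (or Infinity) into JSON, since neither is a valid JSON token. A minimal sketch of that guard (the helper name is illustrative, not from the actual Hadoop web services code, and substituting "0.0" is one of several valid choices):

```java
// Sketch: sanitize a float before it is written into a JSON document,
// replacing NaN/Infinity (invalid JSON) with a parseable value.
public class JsonSafeFloat {
    static String progressValue(float progress) {
        if (Float.isNaN(progress) || Float.isInfinite(progress)) {
            return "0.0"; // or "\"NaN\"" -- any valid JSON token works
        }
        return Float.toString(progress);
    }
}
```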
[jira] [Updated] (YARN-691) Invalid NaN values in Hadoop REST API JSON response
[ https://issues.apache.org/jira/browse/YARN-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-691: - Assignee: (was: Daryn Sharp)
[jira] [Commented] (YARN-986) YARN should have a ClusterId/ServiceId
[ https://issues.apache.org/jira/browse/YARN-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777523#comment-13777523 ] Daryn Sharp commented on YARN-986: -- This sounds like NN HA tokens which IMHO are rather hacky. I've been intending to take advantage of my RPCv9 auth changes for the server to tell the client the token service (or perhaps another field) it needs to decouple tokens entirely from IP/hostname. Thoughts on this approach? YARN should have a ClusterId/ServiceId -- Key: YARN-986 URL: https://issues.apache.org/jira/browse/YARN-986 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Karthik Kambatla This needs to be done to support non-ip based fail over of RM. Once the server sets the token service address to be this generic ClusterId/ServiceId, clients can translate it to appropriate final IP and then be able to select tokens via TokenSelectors. Some workarounds for other related issues were put in place at YARN-945. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished
[ https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13766738#comment-13766738 ] Daryn Sharp commented on YARN-1189: --- Oops, I thought the .1 patch was the latest so I didn't see the test. NMTokenSecretManagerInNM is not being told when applications have finished --- Key: YARN-1189 URL: https://issues.apache.org/jira/browse/YARN-1189 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta, 2.1.1-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Priority: Blocker Attachments: YARN-1189-20130912.1.patch, YARN-1189-20130913.txt The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned.
[jira] [Commented] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished
[ https://issues.apache.org/jira/browse/YARN-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13766728#comment-13766728 ] Daryn Sharp commented on YARN-1189: --- +1, but a test, even a mock that spies appFinished, would be great to avoid a regression. NMTokenSecretManagerInNM is not being told when applications have finished --- Key: YARN-1189 URL: https://issues.apache.org/jira/browse/YARN-1189 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta, 2.1.1-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Priority: Blocker Attachments: YARN-1189-20130912.1.patch, YARN-1189-20130913.txt The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned.
[jira] [Commented] (YARN-707) Add user info in the YARN ClientToken
[ https://issues.apache.org/jira/browse/YARN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757904#comment-13757904 ] Daryn Sharp commented on YARN-707: -- Ugh, the RM and AM are abusing the same secret manager impl. The RM wants the secret key to be generated, whereas the AM really wants to verify it. 2.x fixed this. Add user info in the YARN ClientToken - Key: YARN-707 URL: https://issues.apache.org/jira/browse/YARN-707 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Assignee: Jason Lowe Priority: Blocker Fix For: 3.0.0, 2.1.1-beta Attachments: YARN-707-20130822.txt, YARN-707-20130827.txt, YARN-707-20130828-2.txt, YARN-707-20130828.txt, YARN-707-20130829.txt, YARN-707-20130830.branch-0.23.txt If user info is present in the client token then it can be used to do limited authz in the AM.
[jira] [Commented] (YARN-707) Add user info in the YARN ClientToken
[ https://issues.apache.org/jira/browse/YARN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757890#comment-13757890 ] Daryn Sharp commented on YARN-707: -- Still reviewing, but an initial observation is {{ClientToAMSecretManager#getMasterKey}} is fabricating a new secret key if there is no pre-existing key for the appId. This should be an error condition. The secret manager knows the secret key for the specific app so there's no need to ever generate a secret key, right? Else I can flood the AM with invalid appIds to make it go OOM from generating secret keys for invalid appIds. Add user info in the YARN ClientToken - Key: YARN-707 URL: https://issues.apache.org/jira/browse/YARN-707 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Assignee: Jason Lowe Priority: Blocker Fix For: 3.0.0, 2.1.1-beta Attachments: YARN-707-20130822.txt, YARN-707-20130827.txt, YARN-707-20130828-2.txt, YARN-707-20130828.txt, YARN-707-20130829.txt, YARN-707-20130830.branch-0.23.txt If user info is present in the client token then it can be used to do limited authz in the AM.
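The lookup-only behavior suggested in the comment above can be sketched as follows. This is a hypothetical illustration, not the actual {{ClientToAMSecretManager}} code: the manager only returns keys registered at app submission and treats an unknown appId as an error instead of fabricating a fresh key.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a lookup-only secret-key store. Class and method
// names are illustrative; the real class discussed is ClientToAMSecretManager.
public class AppKeyStore {
    private final Map<String, byte[]> masterKeys = new ConcurrentHashMap<>();

    // Called once when the app is submitted and its key is generated.
    public void registerApp(String appId, byte[] key) {
        masterKeys.put(appId, key);
    }

    public byte[] getMasterKey(String appId) {
        byte[] key = masterKeys.get(appId);
        if (key == null) {
            // Unknown appId: fail fast rather than fabricate and cache a key,
            // which would let a flood of bogus appIds grow the map until OOM.
            throw new SecurityException("No master key registered for " + appId);
        }
        return key;
    }
}
```

Failing fast on unknown appIds is what closes the flooding vector: nothing is allocated or retained for an appId the manager never registered.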
[jira] [Commented] (YARN-707) Add user info in the YARN ClientToken
[ https://issues.apache.org/jira/browse/YARN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757926#comment-13757926 ] Daryn Sharp commented on YARN-707: -- Minor: # {{ClientToAMTokenIdentifier#getUser()}} doesn't do a null check on the client name (because it can't be null) but should perhaps still check isEmpty()? # Is {{ResourceManager#clientToAMSecretManager}} still needed now that it's in the context? # Now that the client token is generated in {{RMAppAttemptImpl}} - should it contain the attemptId, not the appId? Add user info in the YARN ClientToken - Key: YARN-707 URL: https://issues.apache.org/jira/browse/YARN-707 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Assignee: Jason Lowe Priority: Blocker Fix For: 3.0.0, 2.1.1-beta Attachments: YARN-707-20130822.txt, YARN-707-20130827.txt, YARN-707-20130828-2.txt, YARN-707-20130828.txt, YARN-707-20130829.txt, YARN-707-20130830.branch-0.23.txt If user info is present in the client token then it can be used to do limited authz in the AM.
[jira] [Commented] (YARN-1146) RM DTSM and RMStateStore mismanage sequence number
[ https://issues.apache.org/jira/browse/YARN-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757983#comment-13757983 ] Daryn Sharp commented on YARN-1146: --- Note that bug #2 will not self-correct if the following sequence occurs: # Issue token 1, 2, 3, 4 (seq=4) # Renew token 2 (seq=2) # Cancel token 3, 4 (seq=2) # Stop RM # Start RM (seq=2) and will issue token 3 and 4 again The issue is _probably_ benign given the current implementation, but is a bug if anything relies on sequence number. RM DTSM and RMStateStore mismanage sequence number -- Key: YARN-1146 URL: https://issues.apache.org/jira/browse/YARN-1146 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp {{RMDelegationTokenSecretManager}} implements {{storeNewToken}} and {{updateStoredToken}} (renew) to pass the token and its sequence number to {{RMStateStore#storeRMDelegationTokenAndSequenceNumber}}. There are two problems: # The assumption is that new tokens will be synchronously stored in-order. With an async secret manager this may not hold true and the state's sequence number may be incorrect. # A token renewal will reset the state's sequence number to _that token's_ sequence number. Bug #2 is generally masked. Creating a new token (with the first caveat) will bump the state's sequence number back up. Restoring the dtsm will first set the state's stored sequence number, then re-add all the tokens which will update the sequence number if the token's sequence number is greater than the dtsm's current sequence number.
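The monotonic update that would prevent the rollback in the sequence above can be sketched in a few lines. This is a hypothetical illustration of the fix, not the actual RMStateStore code:

```java
// Hypothetical sketch: the stored sequence number only ever moves forward, so
// persisting a renewal of an older token (seq=2) cannot roll the state back
// below sequence numbers that were already issued (seq=4).
public class SequenceTracker {
    private int storedSeq = 0;

    // Called when any token store/update is persisted, passing that
    // token's own sequence number.
    public synchronized void onTokenStored(int tokenSeq) {
        storedSeq = Math.max(storedSeq, tokenSeq);
    }

    public synchronized int getStoredSeq() {
        return storedSeq;
    }
}
```

With this, replaying the comment's scenario (issue tokens 1-4, renew token 2, cancel 3 and 4, restart) leaves the stored sequence at 4, so a restarted RM would not reissue sequence numbers 3 and 4.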
[jira] [Created] (YARN-1146) RM DTSM and RMStateStore mismanage sequence number
Daryn Sharp created YARN-1146: - Summary: RM DTSM and RMStateStore mismanage sequence number Key: YARN-1146 URL: https://issues.apache.org/jira/browse/YARN-1146 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp {{RMDelegationTokenSecretManager}} implements {{storeNewToken}} and {{updateStoredToken}} (renew) to pass the token and its sequence number to {{RMStateStore#storeRMDelegationTokenAndSequenceNumber}}. There are two problems: # The assumption is that new tokens will be synchronously stored in-order. With an async secret manager this may not hold true and the state's sequence number may be incorrect. # A token renewal will reset the state's sequence number to _that token's_ sequence number. Bug #2 is generally masked. Creating a new token (with the first caveat) will bump the state's sequence number back up. Restoring the dtsm will first set the state's stored sequence number, then re-add all the tokens which will update the sequence number if the token's sequence number is greater than the dtsm's current sequence number.
[jira] [Commented] (YARN-1146) RM DTSM and RMStateStore mismanage sequence number
[ https://issues.apache.org/jira/browse/YARN-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758184#comment-13758184 ] Daryn Sharp commented on YARN-1146: --- [~vinodkv] I'm desynch'ing the ADTSM on HADOOP-9930. Is it ok for me to exacerbate this seq number handling? RM DTSM and RMStateStore mismanage sequence number -- Key: YARN-1146 URL: https://issues.apache.org/jira/browse/YARN-1146 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.0-alpha Reporter: Daryn Sharp {{RMDelegationTokenSecretManager}} implements {{storeNewToken}} and {{updateStoredToken}} (renew) to pass the token and its sequence number to {{RMStateStore#storeRMDelegationTokenAndSequenceNumber}}. There are two problems: # The assumption is that new tokens will be synchronously stored in-order. With an async secret manager this may not hold true and the state's sequence number may be incorrect. # A token renewal will reset the state's sequence number to _that token's_ sequence number. Bug #2 is generally masked. Creating a new token (with the first caveat) will bump the state's sequence number back up. Restoring the dtsm will first set the state's stored sequence number, then re-add all the tokens which will update the sequence number if the token's sequence number is greater than the dtsm's current sequence number.
[jira] [Commented] (YARN-707) Add user info in the YARN ClientToken
[ https://issues.apache.org/jira/browse/YARN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758375#comment-13758375 ] Daryn Sharp commented on YARN-707: -- +1 Looks good enough to me. Add user info in the YARN ClientToken - Key: YARN-707 URL: https://issues.apache.org/jira/browse/YARN-707 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Assignee: Jason Lowe Priority: Blocker Fix For: 3.0.0, 2.1.1-beta Attachments: YARN-707-20130822.txt, YARN-707-20130827.txt, YARN-707-20130828-2.txt, YARN-707-20130828.txt, YARN-707-20130829.txt, YARN-707-20130830.branch-0.23.txt, YARN-707-20130904.branch-0.23.txt If user info is present in the client token then it can be used to do limited authz in the AM.
[jira] [Commented] (YARN-707) Add user info in the YARN ClientToken
[ https://issues.apache.org/jira/browse/YARN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752443#comment-13752443 ] Daryn Sharp commented on YARN-707: -- Technically you should be bumping the token ident's version number and using that to determine if the app submitter is in the ident. Otherwise, decoding of prior tokens will attempt to read the missing app submitter from the next serialized object and eventually fail spectacularly. {{RmAppImpl#createAndGetApplicationReport}} Using checks on {{UserGroupInformation.isSecurityEnabled()}} here and elsewhere will cause future incompatibility with requiring tokens w/o security, which is the direction YARN has been moving in. It would be better to check if the secret manager is not null. It's just logging if it cannot create a token? This _shouldn't_ happen, but _if/when_ it does it's going to lead to harder-to-diagnose after-the-fact errors in the client. It's unfortunate you cannot throw the checked exception {{IOException}}, so I think you need to change the method signature or throw whatever you can, like a {{YarnException}}, to fail the request. App attempt storing/restoring appears asymmetric. Storing saves off the whole credentials in the attempt, whereas restoring appears to just pluck out the amrm token and the new persisted secret? Minor: Methods using the term Token, ex. {{recoverAppAttemptTokens}} and {{getTokensFromAppAttempt}}, are misleading since it's Credentials. Vinod had me make a similar change to the method names in the AM. {{AM_CLIENT_TOKEN_MASTER_KEY_NAME}} is better defined in {{RMAppAttempt}}, rather than in the {{RMStateStore}}. Otherwise the import dependency seems backwards. 
Add user info in the YARN ClientToken - Key: YARN-707 URL: https://issues.apache.org/jira/browse/YARN-707 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Assignee: Jason Lowe Priority: Blocker Fix For: 3.0.0, 2.1.1-beta Attachments: YARN-707-20130822.txt, YARN-707-20130827.txt If user info is present in the client token then it can be used to do limited authz in the AM.
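The version-bump technique the review comment describes can be sketched like this. It is a hypothetical illustration (class and field names are invented, not the real {{ClientToAMTokenIdentifier}}): the identifier writes a version byte first, and readFields only attempts to read the new field when the serialized form is new enough, so old tokens still deserialize cleanly.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch of a versioned token identifier. A version-1 ident
// carried only clientName; version 2 appends appSubmitter. Reading is gated
// on the serialized version, so decoding a v1 token never tries to read the
// missing field from the next serialized object.
public class VersionedIdent {
    static final byte VERSION = 2;   // bumped when appSubmitter was added
    String clientName = "";
    String appSubmitter = "";        // present only when version >= 2

    void write(DataOutput out) throws IOException {
        out.writeByte(VERSION);
        out.writeUTF(clientName);
        out.writeUTF(appSubmitter);
    }

    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        clientName = in.readUTF();
        if (version >= 2) {          // old idents stop here
            appSubmitter = in.readUTF();
        }
    }
}
```

Without the version gate, a v1 payload followed by more serialized data would have its next bytes misread as the submitter string, which is the "fail spectacularly" case the comment warns about.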
[jira] [Commented] (YARN-707) Add user info in the YARN ClientToken
[ https://issues.apache.org/jira/browse/YARN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748999#comment-13748999 ] Daryn Sharp commented on YARN-707: -- It almost seems like it would be better to invert the approach to be more consistent with other tokens - the owner of the token is the user (not the app attempt) and there's a new field for the app attempt (instead of a new field for the user). Another thought would be leverage the existing real/effective user in the token. One is the submitter, the other is the app attempt. Logging that includes the UGI will show appAttempt (auth:...) via daryn (auth:...), or vice-versa for the users. Thoughts? Add user info in the YARN ClientToken - Key: YARN-707 URL: https://issues.apache.org/jira/browse/YARN-707 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Assignee: Vinod Kumar Vavilapalli Attachments: YARN-707-20130822.txt If user info is present in the client token then it can be used to do limited authz in the AM.
[jira] [Updated] (YARN-960) TestMRCredentials and TestBinaryTokenFile are failing on trunk
[ https://issues.apache.org/jira/browse/YARN-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-960: - Attachment: YARN-960.patch All tokens but the AMRM token are lost if security is disabled. The AMLauncher is using {{UGI.isSecurityEnabled()}} to decide if it should decode the existing container tokens before adding the AMRM token and re-encoding the container tokens. This is completely wrong. Tokens need to be unconditionally passed. This removes the security check. TestMRCredentials and TestBinaryTokenFile are failing on trunk --- Key: YARN-960 URL: https://issues.apache.org/jira/browse/YARN-960 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Alejandro Abdelnur Assignee: Daryn Sharp Priority: Blocker Fix For: 2.1.0-beta Attachments: YARN-960.patch Not sure, but this may be a fallout from YARN-701 and/or related to YARN-945. Making it a blocker until full impact of the issue is scoped. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
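The "unconditionally passed" behavior the patch description calls for can be sketched as a plain merge with no security guard. This is a hypothetical illustration (tokens modeled as a name-to-bytes map, not Hadoop's actual Credentials class):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the launcher always merges the AMRM token into the
// container's existing tokens, with no isSecurityEnabled() guard, so tokens
// like HDFS delegation tokens survive even in insecure mode.
public class TokenMerger {
    public static Map<String, byte[]> addAmrmToken(
            Map<String, byte[]> containerTokens, byte[] amrmToken) {
        Map<String, byte[]> merged = new HashMap<>(containerTokens);
        merged.put("AMRMToken", amrmToken);
        return merged; // existing tokens preserved unconditionally
    }
}
```

The bug was precisely the missing "decode existing tokens" step in the insecure branch: skipping it meant re-encoding only the AMRM token and silently dropping everything else.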
[jira] [Commented] (YARN-945) AM register failing after AMRMToken
[ https://issues.apache.org/jira/browse/YARN-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718975#comment-13718975 ] Daryn Sharp commented on YARN-945: -- Please be sure I get a chance to look at the patch. AM register failing after AMRMToken --- Key: YARN-945 URL: https://issues.apache.org/jira/browse/YARN-945 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Bikas Saha Assignee: Vinod Kumar Vavilapalli Priority: Blocker Fix For: 2.1.0-beta Attachments: nm.log, rm.log, yarn-site.xml 509 2013-07-19 15:53:55,569 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 54313: readAndProcess from client 127.0.0.1 threw exception [org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN]] 510 org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN] 511 at org.apache.hadoop.ipc.Server$Connection.initializeAuthContext(Server.java:1531) 512 at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1482) 513 at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:788) 514 at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:587) 515 at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:562)
[jira] [Commented] (YARN-874) Tracking YARN/MR test failures after HADOOP-9421 and YARN-827
[ https://issues.apache.org/jira/browse/YARN-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693186#comment-13693186 ] Daryn Sharp commented on YARN-874: -- +1! Tracking YARN/MR test failures after HADOOP-9421 and YARN-827 - Key: YARN-874 URL: https://issues.apache.org/jira/browse/YARN-874 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Priority: Blocker Attachments: YARN-874.1.txt, YARN-874.2.txt, YARN-874.txt HADOOP-9421 and YARN-827 broke some YARN/MR tests. Tracking those..
[jira] [Created] (YARN-690) RM exits on token cancel/renew problems
Daryn Sharp created YARN-690: Summary: RM exits on token cancel/renew problems Key: YARN-690 URL: https://issues.apache.org/jira/browse/YARN-690 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 0.23.7, 3.0.0, 2.0.5-beta Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Blocker The DelegationTokenRenewer thread is critical to the RM. When a non-IOException occurs, the thread calls System.exit to prevent the RM from running w/o the thread. It should be exiting only on non-RuntimeExceptions. The problem is especially bad in 23 because the yarn protobuf layer converts IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which causes the renewer to abort the process. An UnknownHostException takes down the RM...
[jira] [Updated] (YARN-690) RM exits on token cancel/renew problems
[ https://issues.apache.org/jira/browse/YARN-690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-690: - Attachment: YARN-690.patch 1-line change to a catch. No test added due to difficulty of testing calls to System.exit. RM exits on token cancel/renew problems --- Key: YARN-690 URL: https://issues.apache.org/jira/browse/YARN-690 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-690.patch The DelegationTokenRenewer thread is critical to the RM. When a non-IOException occurs, the thread calls System.exit to prevent the RM from running w/o the thread. It should be exiting only on non-RuntimeExceptions. The problem is especially bad in 23 because the yarn protobuf layer converts IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which causes the renewer to abort the process. An UnknownHostException takes down the RM...
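The idea behind the 1-line catch change can be sketched as follows. This is a hypothetical illustration, not the actual DelegationTokenRenewer code: a RuntimeException (such as an IOException wrapped into an UndeclaredThrowableException by the protobuf layer) is logged and the thread keeps running, instead of calling System.exit and taking down the RM.

```java
// Hypothetical sketch: survive RuntimeExceptions during renewal rather than
// exiting the process. Errors (OutOfMemoryError etc.) still propagate, which
// preserves the "die on truly fatal conditions" behavior.
public class RenewLoop {
    public static boolean renewSafely(Runnable renewal) {
        try {
            renewal.run();
            return true;
        } catch (RuntimeException e) {
            // Log and continue; one bad token (e.g. an UnknownHostException
            // surfacing as a wrapped RuntimeException) must not kill the RM.
            System.err.println("Token renew failed, continuing: " + e);
            return false;
        }
    }
}
```

A caller would invoke renewSafely per token, so a single unresolvable host only fails that one renewal.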
[jira] [Updated] (YARN-690) RM exits on token cancel/renew problems
[ https://issues.apache.org/jira/browse/YARN-690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-690: - Attachment: YARN-690.patch Doh, you're right. It was a test, and you passed! RM exits on token cancel/renew problems --- Key: YARN-690 URL: https://issues.apache.org/jira/browse/YARN-690 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-690.patch, YARN-690.patch The DelegationTokenRenewer thread is critical to the RM. When a non-IOException occurs, the thread calls System.exit to prevent the RM from running w/o the thread. It should be exiting only on non-RuntimeExceptions. The problem is especially bad in 23 because the yarn protobuf layer converts IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which causes the renewer to abort the process. An UnknownHostException takes down the RM...