[jira] [Created] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner

2018-11-12 Thread Botong Huang (JIRA)
Botong Huang created YARN-9013:
--

 Summary: [GPG] fix order of steps cleaning Registry entries in 
ApplicationCleaner
 Key: YARN-9013
 URL: https://issues.apache.org/jira/browse/YARN-9013
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


ApplicationCleaner today deletes the entries for all finished (non-running) 
applications in YarnRegistry using this logic:
 # GPG gets the list of running applications from the Router.
 # GPG gets the full list of applications in the registry.
 # GPG deletes from the registry every app in 2 that's not in 1.

The problem is that jobs started between steps 1 and 2 meet the criteria in 
step 3, and thus get deleted by mistake. The right order is 2->1->3 rather 
than 1->2->3: snapshot the registry first, then fetch the running list, so 
that any app starting in between is guaranteed to show up as running.
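
A minimal sketch of the corrected ordering (helper names are illustrative, 
not the actual GPG code):

    void cleanupFinishedApplications() {
      // old step 2 first: snapshot the registry
      Set<String> registryApps = getAllAppsFromRegistry();
      // then old step 1: fetch the currently running apps from the Router
      Set<String> runningApps = getRunningAppsFromRouter();
      // step 3: anything in the snapshot that is not running must have
      // finished before the snapshot was taken, so it is safe to delete
      for (String app : registryApps) {
        if (!runningApps.contains(app)) {
          deleteRegistryEntry(app);
        }
      }
    }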






[jira] [Created] (YARN-8933) [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in allocation response

2018-10-22 Thread Botong Huang (JIRA)
Botong Huang created YARN-8933:
--

 Summary: [AMRMProxy] Fix potential null AvailableResource and 
NumClusterNode in allocation response
 Key: YARN-8933
 URL: https://issues.apache.org/jira/browse/YARN-8933
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


After YARN-8696, the allocate response from FederationInterceptor is merged 
from the responses of a random subset of all sub-clusters, depending on the 
async heartbeat timing. As a result, cluster-wide information fields in the 
response, e.g. AvailableResources and NumClusterNodes, are inconsistent from 
heartbeat to heartbeat. They can even be null/zero when a particular response 
happens to be merged from an empty set of sub-cluster responses. 

In this patch, we let FederationInterceptor remember the last allocate 
response from every known sub-cluster, and always construct the cluster-wide 
info fields from all of them. We also moved the sub-cluster timeout from 
LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that expired 
sub-clusters (those without a successful allocate response for a while) are 
excluded from the computation. 
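
A sketch of the aggregation, with simplified types standing in for the real 
AllocateResponse bookkeeping (the map, timeout, and field names below are 
assumptions for illustration):

    // Cache the last known per-sub-cluster snapshot and rebuild the
    // cluster-wide fields from the whole cache on every AM heartbeat.
    static class SubClusterSnapshot {
      long availableMemoryMb;   // stand-in for AvailableResources
      int numClusterNodes;
      long lastSuccessMs;       // time of last successful allocate
    }

    // populated with the latest data on every successful allocate
    Map<String, SubClusterSnapshot> lastSnapshotPerSubCluster =
        new HashMap<>();
    long subClusterTimeoutMs = 60_000; // illustrative timeout

    long totalMemoryMb = 0;
    int totalNodes = 0;
    long now = System.currentTimeMillis();
    for (SubClusterSnapshot s : lastSnapshotPerSubCluster.values()) {
      // skip sub-clusters that have expired (no successful allocate
      // response for longer than the timeout)
      if (now - s.lastSuccessMs <= subClusterTimeoutMs) {
        totalMemoryMb += s.availableMemoryMb;
        totalNodes += s.numClusterNodes;
      }
    }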






[jira] [Created] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-10-16 Thread Botong Huang (JIRA)
Botong Huang created YARN-8893:
--

 Summary: [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM 
client
 Key: YARN-8893
 URL: https://issues.apache.org/jira/browse/YARN-8893
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


Fix the thread leak in AMRMClientRelayer and the UAM client used by 
FederationInterceptor when destroying the interceptor pipeline in AMRMProxy. 






[jira] [Created] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner

2018-10-09 Thread Botong Huang (JIRA)
Botong Huang created YARN-8862:
--

 Summary: [GPG] add Yarn Registry cleanup in ApplicationCleaner
 Key: YARN-8862
 URL: https://issues.apache.org/jira/browse/YARN-8862
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


In Yarn Federation, we use the Yarn Registry to store the AMToken for UAMs in 
secondary sub-clusters. Because more app attempts may come later, AMRMProxy 
cannot kill the UAM and delete the tokens when one local attempt finishes. So, 
similar to the StateStore application table, we need ApplicationCleaner in GPG 
to clean up the finished app entries in the Yarn Registry. 






[jira] [Resolved] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator

2018-09-21 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang resolved YARN-7599.

Resolution: Fixed

> [GPG] ApplicationCleaner in Global Policy Generator
> ---
>
> Key: YARN-7599
> URL: https://issues.apache.org/jira/browse/YARN-7599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
>  Labels: federation, gpg
> Attachments: YARN-7599-YARN-7402.v1.patch, 
> YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, 
> YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, 
> YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch, 
> YARN-7599-YARN-7402.v8.patch
>
>
> In Federation, we need a cleanup service for the StateStore as well as the 
> Yarn Registry. For the former, we need to remove old application records. For 
> the latter, failed and killed applications might leave records in the Yarn 
> Registry (see YARN-6128). We plan to do both cleanup tasks in 
> ApplicationCleaner in GPG.






[jira] [Created] (YARN-8760) Fix concurrent re-register due to YarnRM failover in AMRMClientRelayer

2018-09-10 Thread Botong Huang (JIRA)
Botong Huang created YARN-8760:
--

 Summary: Fix concurrent re-register due to YarnRM failover in 
AMRMClientRelayer
 Key: YARN-8760
 URL: https://issues.apache.org/jira/browse/YARN-8760
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


When the home YarnRM is failing over, the FinishApplicationMaster call from 
the AM can have multiple retry threads outstanding in FederationInterceptor. 
When the new YarnRM comes back up, all retry threads re-register with it. The 
first one succeeds, but the rest get an "Application Master is already 
registered" exception. We should catch and swallow this exception and move on. 
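
A sketch of the intended handling; the exception type and message matching 
below are illustrative assumptions, not a quote of the committed patch:

    try {
      registerResponse = rmProxy.registerApplicationMaster(request);
    } catch (InvalidApplicationMasterRequestException e) {
      if (e.getMessage() != null && e.getMessage().contains(
          "Application Master is already registered")) {
        // another retry thread won the race and already re-registered
        // with the new YarnRM: treat this as success and move on
        return;
      }
      throw e;
    }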






[jira] [Created] (YARN-8705) Refactor in preparation for YARN-8696

2018-08-23 Thread Botong Huang (JIRA)
Botong Huang created YARN-8705:
--

 Summary: Refactor in preparation for YARN-8696
 Key: YARN-8705
 URL: https://issues.apache.org/jira/browse/YARN-8705
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


Refactor the UAM heartbeat thread as well as the callback method, in 
preparation for the YARN-8696 FederationInterceptor upgrade.






[jira] [Created] (YARN-8697) LocalityMulticastAMRMProxyPolicy should fallback to random sub-cluster when cannot resolve resource

2018-08-21 Thread Botong Huang (JIRA)
Botong Huang created YARN-8697:
--

 Summary: LocalityMulticastAMRMProxyPolicy should fallback to 
random sub-cluster when cannot resolve resource
 Key: YARN-8697
 URL: https://issues.apache.org/jira/browse/YARN-8697
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


Right now in LocalityMulticastAMRMProxyPolicy, whenever we cannot resolve the 
resource name (node or rack), we always route the request to the home 
sub-cluster. However, the home sub-cluster might not always be ready to use 
(timed out, see YARN-8581) or enabled (by AMRMProxyPolicy weights). It might 
also be overwhelmed by these requests if the sub-cluster resolver has an 
issue. In this Jira, we change it to pick a random active and enabled 
sub-cluster for resource requests we cannot resolve. 
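
A sketch of the fallback (collection and field names are illustrative):

    // Instead of defaulting to home, route an unresolvable ask to a
    // random sub-cluster that is both active and enabled by the policy.
    List<String> candidates = new ArrayList<>(activeAndEnabledSubClusters);
    if (candidates.isEmpty()) {
      candidates.add(homeSubCluster); // last resort: fall back to home
    }
    String target = candidates.get(
        ThreadLocalRandom.current().nextInt(candidates.size()));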






[jira] [Created] (YARN-8696) FederationInterceptor upgrade: home sub-cluster heartbeat async

2018-08-21 Thread Botong Huang (JIRA)
Botong Huang created YARN-8696:
--

 Summary: FederationInterceptor upgrade: home sub-cluster heartbeat 
async
 Key: YARN-8696
 URL: https://issues.apache.org/jira/browse/YARN-8696
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


Today in _FederationInterceptor_, the heartbeat to the home sub-cluster is 
synchronous. After the heartbeat is sent out to the home sub-cluster, we wait 
for the home response to come back before merging and returning the (merged) 
heartbeat result to the AM. If the home sub-cluster is suffering from 
connection issues, or is down during a YarnRM master-slave switch, all 
heartbeat threads in _FederationInterceptor_ will be blocked waiting for the 
home response. As a result, the successful UAM heartbeats from secondary 
sub-clusters will not be returned to the AM at all. Additionally, because we 
kept the same heartbeat responseId between the AM and the home RM, lots of 
tricky handling is needed for the responseId resync when it comes to 
_FederationInterceptor_ (part of AMRMProxy, NM) work-preserving restart 
(YARN-6127, YARN-1336), home RM master-slave switches, etc. 

In this patch, we make the heartbeat to the home sub-cluster asynchronous, the 
same way we handle UAM heartbeats in secondaries, so that a sub-cluster being 
down or having connection issues won't prevent the AM from getting responses 
from the other sub-clusters. The responseId is also managed separately for the 
home sub-cluster and the AM, and they increment independently. The resync 
logic becomes much cleaner. 






[jira] [Created] (YARN-8673) [AMRMProxy] More robust responseId resync after an YarnRM master slave switch

2018-08-16 Thread Botong Huang (JIRA)
Botong Huang created YARN-8673:
--

 Summary: [AMRMProxy] More robust responseId resync after an YarnRM 
master slave switch
 Key: YARN-8673
 URL: https://issues.apache.org/jira/browse/YARN-8673
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


After a master-slave switch of YarnRM, an _ApplicationNotRegisteredException_ 
will be thrown by the new YarnRM. The AM will re-register and reset the 
responseId to zero. _AMRMClientRelayer_ inside _FederationInterceptor_ follows 
the same protocol, and does the automatic re-register and responseId resync. 
However, when an exception or a temporary network issue happens in the 
allocate call after re-register, the resync logic might be broken. This patch 
improves the robustness of the process by parsing the expected responseId from 
the YarnRM exception message, so that whenever the responseId is out of sync, 
for whatever reason, we can automatically resync and move on. 
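
A sketch of the recovery; the message pattern below is illustrative rather 
than the exact YarnRM wording:

    // If the allocate call fails because our responseId is out of sync,
    // recover the expected value from the exception text and resync.
    java.util.regex.Matcher m = java.util.regex.Pattern
        .compile("expect responseId to be (\\d+)")
        .matcher(e.getMessage());
    if (m.find()) {
      lastResponseId = Integer.parseInt(m.group(1)); // resync
      // then retry the allocate with the corrected responseId
    }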






[jira] [Created] (YARN-8658) Metrics for AMRMClientRelayer inside FederationInterceptor

2018-08-13 Thread Botong Huang (JIRA)
Botong Huang created YARN-8658:
--

 Summary: Metrics for AMRMClientRelayer inside FederationInterceptor
 Key: YARN-8658
 URL: https://issues.apache.org/jira/browse/YARN-8658
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Young Chen


AMRMClientRelayer (YARN-7900) was introduced for the stateful 
FederationInterceptor (YARN-7899) to keep track of all pending requests sent 
to every sub-cluster YarnRM. We need to add metrics for AMRMClientRelayer to 
show the state of things inside FederationInterceptor. 






[jira] [Created] (YARN-8581) [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy

2018-07-25 Thread Botong Huang (JIRA)
Botong Huang created YARN-8581:
--

 Summary: [AMRMProxy] Add sub-cluster timeout in 
LocalityMulticastAMRMProxyPolicy
 Key: YARN-8581
 URL: https://issues.apache.org/jira/browse/YARN-8581
 Project: Hadoop YARN
  Issue Type: Task
  Components: amrmproxy, federation
Reporter: Botong Huang
Assignee: Botong Huang


In Federation, every time an AM heartbeat comes in, 
LocalityMulticastAMRMProxyPolicy in AMRMProxy splits the asks according to the 
list of active and enabled sub-clusters. However, if we haven't been able to 
heartbeat to a sub-cluster for some time (network issues, we keep hitting some 
exception from YarnRM, a YarnRM master-slave switch is taking a long time, 
etc.), we should consider the sub-cluster unhealthy and stop routing asks 
there until the heartbeat channel becomes healthy again. 
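
A sketch of the timeout check (field names are illustrative):

    // Before splitting the asks, drop any sub-cluster whose last
    // successful heartbeat is older than the configured timeout.
    long now = System.currentTimeMillis();
    for (Map.Entry<String, Long> e : lastHeartbeatSuccess.entrySet()) {
      if (now - e.getValue() > subClusterTimeoutMs) {
        activeAndEnabledSubClusters.remove(e.getKey());
      }
    }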






[jira] [Created] (YARN-8536) Add max heap config option for Federation Router

2018-07-13 Thread Botong Huang (JIRA)
Botong Huang created YARN-8536:
--

 Summary: Add max heap config option for Federation Router
 Key: YARN-8536
 URL: https://issues.apache.org/jira/browse/YARN-8536
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang









[jira] [Created] (YARN-8534) Add max heap config option for Federation Router and GPG

2018-07-13 Thread Botong Huang (JIRA)
Botong Huang created YARN-8534:
--

 Summary: Add max heap config option for Federation Router and GPG
 Key: YARN-8534
 URL: https://issues.apache.org/jira/browse/YARN-8534
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang









[jira] [Created] (YARN-8481) AMRMProxyPolicies should accept heartbeat response from new/unknown subclusters

2018-06-29 Thread Botong Huang (JIRA)
Botong Huang created YARN-8481:
--

 Summary: AMRMProxyPolicies should accept heartbeat response from 
new/unknown subclusters
 Key: YARN-8481
 URL: https://issues.apache.org/jira/browse/YARN-8481
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang


Currently BroadcastAMRMProxyPolicy assumes that we only span the application 
to the sub-clusters it has itself selected via _splitResourceRequests_. 
However, with AMRMProxy HA, a second attempt of the application might come up 
in multiple sub-clusters initially, without consulting the AMRMProxyPolicy at 
all. This leads to exceptions in _notifyOfResponse_. It should simply accept 
heartbeat responses from new/unknown sub-clusters. 






[jira] [Created] (YARN-8451) Multiple NM heartbeat thread created when a slow NM resync with RM

2018-06-22 Thread Botong Huang (JIRA)
Botong Huang created YARN-8451:
--

 Summary: Multiple NM heartbeat thread created when a slow NM 
resync with RM
 Key: YARN-8451
 URL: https://issues.apache.org/jira/browse/YARN-8451
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang


During an NM resync with RM (say, after RM does a master-slave switch), if 
the NM is running slow, more than one RESYNC event may be put into the NM 
dispatcher by the existing heartbeat thread before they are processed. As a 
result, multiple new heartbeat threads are later created and start to 
heartbeat to RM concurrently, each with its own responseId. If at some point 
one thread falls more than one step behind the others, RM will send back a 
resync signal in that heartbeat response, killing all containers on this NM. 

See comments below for details on how this can happen. 






[jira] [Created] (YARN-8433) TestAMRestart flaky in trunk

2018-06-16 Thread Botong Huang (JIRA)
Botong Huang created YARN-8433:
--

 Summary: TestAMRestart flaky in trunk
 Key: YARN-8433
 URL: https://issues.apache.org/jira/browse/YARN-8433
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang


 
[org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testContainersFromPreviousAttemptsWithRMRestart[FAIR]|https://builds.apache.org/job/PreCommit-YARN-Build/21002/testReport/org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager/TestAMRestart/testContainersFromPreviousAttemptsWithRMRestart_FAIR_/]
Attempt state is not correct (timeout). expected: but was:
 
[org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testPreemptedAMRestartOnRMRestart[FAIR]|https://builds.apache.org/job/PreCommit-YARN-Build/21014/testReport/org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager/TestAMRestart/testPreemptedAMRestartOnRMRestart_FAIR_/]
test timed out after 6 milliseconds






[jira] [Created] (YARN-8412) Move ResourceRequest.clone logic everywhere into a proper API

2018-06-11 Thread Botong Huang (JIRA)
Botong Huang created YARN-8412:
--

 Summary: Move ResourceRequest.clone logic everywhere into a proper 
API
 Key: YARN-8412
 URL: https://issues.apache.org/jira/browse/YARN-8412
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


ResourceRequest.clone code is replicated in lots of places, some of which 
miss copying a field or two because new fields have been added over time. 
This JIRA moves the logic into a proper API so that everyone can use a single 
implementation. 






[jira] [Resolved] (YARN-8334) [GPG] Fix potential connection leak in GPGUtils

2018-05-23 Thread Botong Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang resolved YARN-8334.

Resolution: Fixed

> [GPG] Fix potential connection leak in GPGUtils
> ---
>
> Key: YARN-8334
> URL: https://issues.apache.org/jira/browse/YARN-8334
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Minor
> Attachments: YARN-8334-YARN-7402.v1.patch, 
> YARN-8334-YARN-7402.v2.patch
>
>
> Missing ClientResponse.close and Client.destroy can lead to a connection leak.






[jira] [Created] (YARN-8227) TestPlacementConstraintTransformations is failing in trunk

2018-04-27 Thread Botong Huang (JIRA)
Botong Huang created YARN-8227:
--

 Summary: TestPlacementConstraintTransformations is failing in trunk
 Key: YARN-8227
 URL: https://issues.apache.org/jira/browse/YARN-8227
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang


[ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.1 s 
<<< FAILURE! - in 
org.apache.hadoop.yarn.api.resource.TestPlacementConstraintTransformations
[ERROR] 
testCardinalityConstraint(org.apache.hadoop.yarn.api.resource.TestPlacementConstraintTransformations)
  Time elapsed: 0.007 s  <<< FAILURE!
java.lang.AssertionError: expected: java.util.HashSet<[hb]> but was: 
java.util.HashSet<[hb]>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at 
org.apache.hadoop.yarn.api.resource.TestPlacementConstraintTransformations.testCardinalityConstraint(TestPlacementConstraintTransformations.java:116)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)






[jira] [Created] (YARN-8110) AMRMProxy recover should catch for all throwable retrying to recover apps

2018-04-02 Thread Botong Huang (JIRA)
Botong Huang created YARN-8110:
--

 Summary: AMRMProxy recover should catch for all throwable retrying 
to recover apps
 Key: YARN-8110
 URL: https://issues.apache.org/jira/browse/YARN-8110
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


In NM work-preserving restart, when AMRMProxy recovers applications one by 
one, the current catch block only catches IOException. If one app's recovery 
throws anything else (e.g. a RuntimeException), it will fail the entire 
AMRMProxy recovery. 
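
A sketch of the proposed recovery loop (helper and field names are 
illustrative):

    for (ApplicationId appId : appsToRecover) {
      try {
        recoverApplication(appId);
      } catch (Throwable t) {
        // catch everything, not just IOException: one bad app must not
        // abort recovery of all remaining applications
        LOG.error("Failed to recover application " + appId, t);
      }
    }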






[jira] [Created] (YARN-8010) add config in FederationRMFailoverProxy to not bypass facade cache when failing over

2018-03-07 Thread Botong Huang (JIRA)
Botong Huang created YARN-8010:
--

 Summary: add config in FederationRMFailoverProxy to not bypass 
facade cache when failing over
 Key: YARN-8010
 URL: https://issues.apache.org/jira/browse/YARN-8010
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


Today when YarnRM is failing over, the FederationRMFailoverProxy running in 
AMRMProxy performs failover: it tries to get the latest sub-cluster info from 
FederationStateStore and then retries the connection to the latest YarnRM 
master. When calling getSubCluster() on FederationStateStoreFacade, it 
bypasses the cache with a flush flag. While YarnRM is failing over, every AM 
heartbeat thread creates a different thread inside FederationInterceptor, each 
of which keeps performing failover several times. This leads to a big spike of 
getSubCluster calls to FederationStateStore. 

Depending on the cluster setup (e.g. a VIP in front of all YarnRMs), a YarnRM 
master-slave switch might not change the RM address at all. In other cases, a 
small delay in getting the latest sub-cluster information may be acceptable. 
This patch therefore adds a config option that makes it possible to ask 
FederationRMFailoverProxy not to flush the cache when calling getSubCluster(). 
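
A sketch of the gated call; the config key name below is hypothetical, not 
necessarily the one the patch adds:

    // Only bypass the facade cache during failover when explicitly
    // configured to do so.
    boolean flushCache = conf.getBoolean(
        "yarn.federation.failover.flush-cache", true); // hypothetical key
    SubClusterInfo subCluster =
        facade.getSubCluster(subClusterId, flushCache);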






[jira] [Created] (YARN-7918) TestAMRMClientPlacementConstraints.testAMRMClientWithPlacementConstraints failing in trunk

2018-02-09 Thread Botong Huang (JIRA)
Botong Huang created YARN-7918:
--

 Summary: 
TestAMRMClientPlacementConstraints.testAMRMClientWithPlacementConstraints 
failing in trunk
 Key: YARN-7918
 URL: https://issues.apache.org/jira/browse/YARN-7918
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang


java.lang.AssertionError: expected:<2> but was:<1>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.client.api.impl.TestAMRMClientPlacementConstraints.testAMRMClientWithPlacementConstraints(TestAMRMClientPlacementConstraints.java:161)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)







[jira] [Created] (YARN-7900) [AMRMProxy] AMRMClientRelayer for stateful FederationInterceptor

2018-02-06 Thread Botong Huang (JIRA)
Botong Huang created YARN-7900:
--

 Summary: [AMRMProxy] AMRMClientRelayer for stateful 
FederationInterceptor
 Key: YARN-7900
 URL: https://issues.apache.org/jira/browse/YARN-7900
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


Inside the stateful FederationInterceptor (YARN-7899), we need a component 
similar to AMRMClient that remembers all pending (outstanding) requests we've 
sent to YarnRM, automatically re-registers, and does a full pending resend 
when YarnRM fails over and throws an 
ApplicationMasterNotRegisteredException back. This JIRA adds this component 
as AMRMClientRelayer.






[jira] [Created] (YARN-7899) [AMRMProxy] Stateful FederationInterceptor for pending requests

2018-02-06 Thread Botong Huang (JIRA)
Botong Huang created YARN-7899:
--

 Summary: [AMRMProxy] Stateful FederationInterceptor for pending 
requests
 Key: YARN-7899
 URL: https://issues.apache.org/jira/browse/YARN-7899
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


Today FederationInterceptor (in AMRMProxy for YARN Federation) is stateless in 
terms of pending (outstanding) requests. Whenever the AM issues new requests, 
FI simply splits and sends them to the sub-cluster YarnRMs and forgets about 
them. This JIRA makes FI stateful, so that it remembers the pending requests 
in all relevant sub-clusters. This has two major benefits: 

1. It is a prerequisite for FI to be able to cancel a pending request in one 
sub-cluster and re-send it to other sub-clusters. This is needed for load 
balancing and to fully comply with the relax-locality fallback-to-ANY 
semantics. When we send a request to one sub-cluster, we have effectively 
restricted the allocation for this request to that sub-cluster rather than 
everywhere. If the capacity available to this app in that sub-cluster is 
exhausted, or its YarnRM is overloaded and slow, the request will be stuck 
there for a long time even if there is free capacity in other sub-clusters. 
We need FI to remember and adjust the pending requests on the fly. 

2. It makes pending-request recovery easier when a YarnRM fails over. Today, 
whenever one sub-cluster RM fails over, in order to recover the lost pending 
requests for this sub-cluster, we have to propagate the 
ApplicationMasterNotRegisteredException from the YarnRM back to the AM, 
triggering a full pending resend from the AM. This resend contains pending 
requests not only for the failing-over sub-cluster, but for all of them. 
Since our split-merge (AMRMProxyPolicy) does not guarantee idempotency, the 
same request we sent to sub-cluster-1 earlier might be resent to 
sub-cluster-2. If neither of these YarnRMs has failed over, both will 
allocate for this request, leading to over-allocation. These full pending 
resends also put unnecessary load on every YarnRM in the cluster every time 
one YarnRM fails over. With a stateful FederationInterceptor, since we 
remember the pending requests we have sent out earlier, we can shield the 
ApplicationMasterNotRegisteredException from the AM and resend the pending 
requests only to the failed-over YarnRM. This eliminates the over-allocation 
and minimizes the recovery overhead. 






[jira] [Created] (YARN-7720) [Federation] Race condition between second app attempt and UAM heartbeat when first attempt node is down

2018-01-08 Thread Botong Huang (JIRA)
Botong Huang created YARN-7720:
--

 Summary: [Federation] Race condition between second app attempt 
and UAM heartbeat when first attempt node is down
 Key: YARN-7720
 URL: https://issues.apache.org/jira/browse/YARN-7720
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Botong Huang
Assignee: Botong Huang


In Federation, multiple attempts of an application share the same UAM in each 
secondary sub-cluster. When the first attempt fails, we rely on the fact that 
the secondary RM won't kill the existing UAM before the AM heartbeat timeout 
(default 10 min). When the second attempt comes up in the home sub-cluster, 
it will pick up the UAM token from the Yarn Registry and resume the UAM 
heartbeat to the secondary RMs. 

The default heartbeat timeout for NM and AM is 10 minutes for both. The 
problem is that when the first attempt's node goes down or loses its 
connection, only after 10 minutes will the home RM mark the first attempt as 
failed and schedule the second attempt on some other node. By then the UAMs 
in the secondaries are already timing out, and they might not survive until 
the second attempt comes up. 






[jira] [Created] (YARN-7676) Fix inconsistent priority ordering in Priority and SchedulerRequestKey

2017-12-20 Thread Botong Huang (JIRA)
Botong Huang created YARN-7676:
--

 Summary: Fix inconsistent priority ordering in Priority and 
SchedulerRequestKey
 Key: YARN-7676
 URL: https://issues.apache.org/jira/browse/YARN-7676
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


Today the priority ordering in _Priority.compareTo()_ and 
_SchedulerRequestKey.compareTo()_ is inconsistent: both _compareTo_ methods 
try to reverse the order. 

P0.compareTo(P1) > 0 means that priority-wise P0 < P1. However, 
SK(P0).compareTo(SK(P1)) < 0 means that priority-wise SK(P0) > SK(P1). 

This JIRA fixes that by undoing both reversals, so that priority-wise 
P0 > P1 and SK(P0) > SK(P1). 
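
An illustrative sketch of the consistent direction (not the committed patch):

    // After the fix, "compares greater" means "higher priority" in both
    // classes. In YARN a numerically smaller value is a higher priority,
    // hence the single, deliberate inversion below.
    @Override
    public int compareTo(Priority other) {
      return Integer.compare(other.getPriority(), this.getPriority());
    }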






[jira] [Created] (YARN-7631) ResourceRequest with different Capacity (Resource) overrides each other in RM

2017-12-08 Thread Botong Huang (JIRA)
Botong Huang created YARN-7631:
--

 Summary: ResourceRequest with different Capacity (Resource) 
overrides each other in RM
 Key: YARN-7631
 URL: https://issues.apache.org/jira/browse/YARN-7631
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang


Today in AMRMClientImpl, the ResourceRequests (RRs) are keyed as: RequestId -> 
Priority -> ResourceName -> ExecutionType -> Resource (Capacity) -> 
ResourceRequestInfo (the actual RR). 

This means that only RRs with the same (requestId, priority, resourceName, 
executionType, resource) tuple are grouped and aggregated together. 

On the RM side, the mapping is SchedulerRequestKey (RequestId, Priority) -> 
LocalityAppPlacementAllocator (ResourceName -> RR). 

The issue is that on the RM side, Resource is not part of the key to the RR 
at all. (Note that executionType is also absent on the RM side, but that is 
fine because RM handles it separately as container update requests.) This 
means that under the same (requestId, priority, resourceName), RRs with 
different Resource values are grouped together and override each other in 
the RM. As a result, some of the container requests are lost and will never 
be allocated. Furthermore, since the two RRs are kept under different keys 
on the AMRMClient side, allocation of RR1 will only trigger a cancel for 
RR1; the pending RR2 will not get resent either. 

I've attached a unit test (resourcebug.patch), failing in trunk, that 
illustrates this issue. 






[jira] [Created] (YARN-7630) Fix AMRMToken handling in AMRMProxy

2017-12-08 Thread Botong Huang (JIRA)
Botong Huang created YARN-7630:
--

 Summary: Fix AMRMToken handling in AMRMProxy
 Key: YARN-7630
 URL: https://issues.apache.org/jira/browse/YARN-7630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


Symptom: after RM rolls over the master key for the AMRMToken, whenever the 
RPC connection from FederationInterceptor to RM breaks due to a transient 
network issue and reconnects, heartbeats to RM start failing with an "Invalid 
AMRMToken" exception. Whenever it hits, it happens for both the home RM and 
the secondary RMs. 

Related facts: 
1. When RM issues a new AMRMToken, it always sends it with the service name 
field set to an empty string. The RPC layer on the AM side sets it properly 
before starting to use it. 
2. UGI keeps all tokens in a map from serviceName -> Token. Initially, 
AMRMClientUtils.createRMProxy() is used to load the first token and start the 
RM connection. 
3. When RM renews the token, YarnServerSecurityUtils.updateAMRMToken() is 
used to load it into the UGI and replace the existing token (under the same 
serviceName key). 

Bug: 
The bug is that (2) AMRMClientUtils.createRMProxy() and (3) 
YarnServerSecurityUtils.updateAMRMToken() do not handle this sequence 
consistently. We always need to load the token (with the empty service name) 
into the UGI first, before we set the serviceName, so that the previous 
AMRMToken is overridden. But (2) does it in the reverse order. That's why, 
after RM rolls the AMRMToken, the UGI ends up with two tokens. Whenever the 
RPC connection breaks and reconnects, the wrong token can be picked, 
triggering the exception. 
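
A simplified sketch of the consistent sequence the description prescribes:

    // Load the token into the UGI while its service name is still the
    // empty string it arrived with, so it replaces the previous token
    // under the same map key...
    ugi.addToken(amrmToken);
    // ...and only then set the service name for the RPC layer.
    SecurityUtil.setTokenService(amrmToken, rmAddress);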






[jira] [Created] (YARN-7599) Application cleaner and subcluster cleaner in Global Policy Generator

2017-12-01 Thread Botong Huang (JIRA)
Botong Huang created YARN-7599:
--

 Summary: Application cleaner and subcluster cleaner in Global 
Policy Generator
 Key: YARN-7599
 URL: https://issues.apache.org/jira/browse/YARN-7599
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


In Federation, we need a cleanup service for the StateStore as well as the 
Yarn Registry. For the former, we need to remove old application records as 
well as inactive sub-clusters. For the latter, failed and killed applications 
might leave records in the Yarn Registry (see YARN-6128). We plan to add both 
cleanup services in GPG.






[jira] [Created] (YARN-7479) TestContainerManagerSecurity.testContainerManager[Simple] flaky in trunk

2017-11-12 Thread Botong Huang (JIRA)
Botong Huang created YARN-7479:
--

 Summary: TestContainerManagerSecurity.testContainerManager[Simple] 
flaky in trunk
 Key: YARN-7479
 URL: https://issues.apache.org/jira/browse/YARN-7479
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang


Was waiting for container_1_0001_01_00 to get to state COMPLETE but was in 
state RUNNING after the timeout

java.lang.AssertionError: Was waiting for container_1_0001_01_00 to get to 
state COMPLETE but was in state RUNNING after the timeout
at org.junit.Assert.fail(Assert.java:88)
at 
org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:431)
at 
org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:360)
at 
org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:171)

Pasting some exception messages from the test run here: 

org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not 
enabled.  Available:[TOKEN]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 Given NMToken for application : appattempt_1_0001_01 seems to have been 
generated illegally.
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491)
at org.apache.hadoop.ipc.Client.call(Client.java:1437)
at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 Given NMToken for application : appattempt_1_0001_01 is not valid for 
current node manager.expected : localhost:46649 found : InvalidHost:1234
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491)
at org.apache.hadoop.ipc.Client.call(Client.java:1437)
at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)






[jira] [Created] (YARN-7339) LocalityMulticastAMRMProxyPolicy should handle cancel request properly

2017-10-16 Thread Botong Huang (JIRA)
Botong Huang created YARN-7339:
--

 Summary: LocalityMulticastAMRMProxyPolicy should handle cancel 
request properly
 Key: YARN-7339
 URL: https://issues.apache.org/jira/browse/YARN-7339
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


Currently inside AMRMProxy, LocalityMulticastAMRMProxyPolicy does not handle 
and split cancel requests from the AM properly: 

# For a node cancel request, we should not treat it as a localized resource 
request. Otherwise it can lead to an all-zero-weight issue when computing the 
localized resource weights. 

# For an ANY cancel, we should broadcast to all known sub-clusters, not just 
the ones associated with localized resources. 






[jira] [Created] (YARN-7317) Fix overallocation resulted from ceiling in LocalityMulticastAMRMProxyPolicy

2017-10-11 Thread Botong Huang (JIRA)
Botong Huang created YARN-7317:
--

 Summary: Fix overallocation resulted from ceiling in 
LocalityMulticastAMRMProxyPolicy
 Key: YARN-7317
 URL: https://issues.apache.org/jira/browse/YARN-7317
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


When LocalityMulticastAMRMProxyPolicy splits up the ANY requests among 
sub-clusters, we currently do Ceil(N * weight), leading to over-allocation 
overall. It is better to do Floor(N * weight) for each sub-cluster and then 
assign the residue randomly according to the weights, so that the total 
number of containers we ask for across all sub-clusters sums up to exactly N. 
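
A sketch of the floor-plus-residue split, assuming a non-empty weights array 
that sums to 1 (names illustrative):

    // Ask floor(N * weight) from each sub-cluster, then hand out the
    // remaining containers randomly in proportion to the weights, so the
    // per-sub-cluster counts sum to exactly N.
    int[] split(int n, double[] weights) {
      int[] counts = new int[weights.length];
      int assigned = 0;
      for (int i = 0; i < weights.length; i++) {
        counts[i] = (int) Math.floor(n * weights[i]);
        assigned += counts[i];
      }
      java.util.Random rand = new java.util.Random();
      while (assigned < n) {            // distribute the residue
        double pick = rand.nextDouble(), cum = 0;
        for (int i = 0; i < weights.length; i++) {
          cum += weights[i];
          // the last index also catches floating-point shortfall in cum
          if (pick <= cum || i == weights.length - 1) {
            counts[i]++;
            assigned++;
            break;
          }
        }
      }
      return counts;
    }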






[jira] [Created] (YARN-7281) Auto inject AllocationRequestId in AMRMClient.ContainerRequest when not supplied

2017-10-02 Thread Botong Huang (JIRA)
Botong Huang created YARN-7281:
--

 Summary: Auto inject AllocationRequestId in 
AMRMClient.ContainerRequest when not supplied
 Key: YARN-7281
 URL: https://issues.apache.org/jira/browse/YARN-7281
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


AllocationRequestId was introduced in YARN-4879 to simplify the resource 
allocation protocol inside the AM-RM heartbeat. Many new features (e.g. Yarn 
Federation) are or will be built preferring the AllocationRequestId to be 
present. 

This Jira modifies AMRMClient so that when the AM does not supply an 
AllocationRequestId, one is auto-generated in the constructor of 
AMRMClient.ContainerRequest. 
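
A minimal sketch of the auto-generation, with a simplified stand-in 
constructor rather than the real ContainerRequest signature:

    import java.util.concurrent.atomic.AtomicLong;

    class ContainerRequestSketch {
      private static final AtomicLong AUTO_ID = new AtomicLong(0);
      final long allocationRequestId;

      ContainerRequestSketch(Long suppliedId) {
        // generate a unique id when the AM does not supply one, so the
        // AllocationRequestId-based protocol can always be used
        this.allocationRequestId = (suppliedId != null)
            ? suppliedId : AUTO_ID.incrementAndGet();
      }
    }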






[jira] [Created] (YARN-7203) Add container ExecutionType into ContainerReport

2017-09-14 Thread Botong Huang (JIRA)
Botong Huang created YARN-7203:
--

 Summary: Add container ExecutionType into ContainerReport
 Key: YARN-7203
 URL: https://issues.apache.org/jira/browse/YARN-7203
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor









[jira] [Created] (YARN-7199) TestAMRMClientContainerRequest.testOpportunisticAndGuaranteedRequests is failing in trunk

2017-09-14 Thread Botong Huang (JIRA)
Botong Huang created YARN-7199:
--

 Summary: 
TestAMRMClientContainerRequest.testOpportunisticAndGuaranteedRequests is 
failing in trunk
 Key: YARN-7199
 URL: https://issues.apache.org/jira/browse/YARN-7199
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang


java.lang.IllegalArgumentException: The profile name cannot be null
at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
at 
org.apache.hadoop.yarn.api.records.ProfileCapability.newInstance(ProfileCapability.java:68)
at 
org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.addContainerRequest(AMRMClientImpl.java:512)
at 
org.apache.hadoop.yarn.client.api.impl.TestAMRMClientContainerRequest.testOpportunisticAndGuaranteedRequests(TestAMRMClientContainerRequest.java:59)






[jira] [Created] (YARN-7102) NM heartbeat stuck when responseId overflows MAX_INT

2017-08-25 Thread Botong Huang (JIRA)
Botong Huang created YARN-7102:
--

 Summary: NM heartbeat stuck when responseId overflows MAX_INT
 Key: YARN-7102
 URL: https://issues.apache.org/jira/browse/YARN-7102
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang


ResponseId overflow problem in the NM-RM heartbeat. This is the same issue as 
the AM-RM heartbeat overflow in YARN-6640; please refer to YARN-6640 for 
details. 






[jira] [Created] (YARN-7074) Fix NM state store update comment

2017-08-22 Thread Botong Huang (JIRA)
Botong Huang created YARN-7074:
--

 Summary: Fix NM state store update comment
 Key: YARN-7074
 URL: https://issues.apache.org/jira/browse/YARN-7074
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor









[jira] [Created] (YARN-6962) Federation interceptor should support full allocate request/response api

2017-08-07 Thread Botong Huang (JIRA)
Botong Huang created YARN-6962:
--

 Summary: Federation interceptor should support full allocate 
request/response api
 Key: YARN-6962
 URL: https://issues.apache.org/jira/browse/YARN-6962
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor









[jira] [Created] (YARN-6955) Concurrent registerAM thread in Federation Interceptor

2017-08-04 Thread Botong Huang (JIRA)
Botong Huang created YARN-6955:
--

 Summary: Concurrent registerAM thread in Federation Interceptor
 Key: YARN-6955
 URL: https://issues.apache.org/jira/browse/YARN-6955
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


The timeout between AM and AMRMProxy is shorter than the timeout-plus-failover 
between FederationInterceptor (AMRMProxy) and RM. When the first register 
thread in FI is blocked because of an RM failover, the AM can time out and 
resend the register call, leading to two outstanding register calls inside FI. 

Eventually, when RM comes back up, one thread succeeds in registering and the 
other gets an "application already registered" exception. FI should swallow 
the exception and return success back to the AM in both threads. 






[jira] [Created] (YARN-6902) Update SQL server note in License.txt

2017-07-28 Thread Botong Huang (JIRA)
Botong Huang created YARN-6902:
--

 Summary: Update SQL server note in License.txt
 Key: YARN-6902
 URL: https://issues.apache.org/jira/browse/YARN-6902
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor









[jira] [Created] (YARN-6730) Make sure NM state store is not null consistently

2017-06-22 Thread Botong Huang (JIRA)
Botong Huang created YARN-6730:
--

 Summary: Make sure NM state store is not null consistently
 Key: YARN-6730
 URL: https://issues.apache.org/jira/browse/YARN-6730
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


In the NM state store for NM restart, there are a lot of places where we 
check whether stateStore != null. This is true in the existing codebase too. 
Ideally, the stateStore should never be null, because we have the 
NullStateStore implementation, and we should not have to perform so many 
defensive checks.






[jira] [Created] (YARN-6704) Add Federation Interceptor restart when work preserving NM is enabled

2017-06-09 Thread Botong Huang (JIRA)
Botong Huang created YARN-6704:
--

 Summary: Add Federation Interceptor restart when work preserving 
NM is enabled
 Key: YARN-6704
 URL: https://issues.apache.org/jira/browse/YARN-6704
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


YARN-1336 added the ability to restart the NM without losing any running 
containers. {{AMRMProxy}} restart was added in YARN-6127. In a Federated YARN 
environment, there is additional state in the {{FederationInterceptor}} to 
allow for spanning across multiple sub-clusters, so we need to enhance 
{{FederationInterceptor}} to support work-preserving restart.






[jira] [Created] (YARN-6667) Handle containerId duplicate without throwing in Federation Interceptor

2017-05-30 Thread Botong Huang (JIRA)
Botong Huang created YARN-6667:
--

 Summary: Handle containerId duplicate without throwing in 
Federation Interceptor
 Key: YARN-6667
 URL: https://issues.apache.org/jira/browse/YARN-6667
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor









[jira] [Created] (YARN-6666) Fix unit test in TestRouterClientRMService

2017-05-30 Thread Botong Huang (JIRA)
Botong Huang created YARN-6666:
--

 Summary: Fix unit test in TestRouterClientRMService
 Key: YARN-6666
 URL: https://issues.apache.org/jira/browse/YARN-6666
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


Running org.apache.hadoop.yarn.server.router.clientrm.TestRouterClientRMService
Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.041 sec <<< 
FAILURE! - in 
org.apache.hadoop.yarn.server.router.clientrm.TestRouterClientRMService
testRouterClientRMServiceE2E(org.apache.hadoop.yarn.server.router.clientrm.TestRouterClientRMService)
  Time elapsed: 0.07 sec  <<< ERROR!
java.lang.reflect.UndeclaredThrowableException: null
at 
org.apache.hadoop.yarn.server.MockResourceManagerFacade.forceKillApplication(MockResourceManagerFacade.java:457)
at 
org.apache.hadoop.yarn.server.router.clientrm.DefaultClientRequestInterceptor.forceKillApplication(DefaultClientRequestInterceptor.java:166)
at 
org.apache.hadoop.yarn.server.router.clientrm.PassThroughClientRequestInterceptor.forceKillApplication(PassThroughClientRequestInterceptor.java:105)
at 
org.apache.hadoop.yarn.server.router.clientrm.PassThroughClientRequestInterceptor.forceKillApplication(PassThroughClientRequestInterceptor.java:105)
at 
org.apache.hadoop.yarn.server.router.clientrm.PassThroughClientRequestInterceptor.forceKillApplication(PassThroughClientRequestInterceptor.java:105)
at 
org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.forceKillApplication(RouterClientRMService.java:217)
at 
org.apache.hadoop.yarn.server.router.clientrm.BaseRouterClientRMTest$3.run(BaseRouterClientRMTest.java:218)
at 
org.apache.hadoop.yarn.server.router.clientrm.BaseRouterClientRMTest$3.run(BaseRouterClientRMTest.java:212)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965)
at 
org.apache.hadoop.yarn.server.router.clientrm.BaseRouterClientRMTest.forceKillApplication(BaseRouterClientRMTest.java:212)
at 
org.apache.hadoop.yarn.server.router.clientrm.TestRouterClientRMService.testRouterClientRMServiceE2E(TestRouterClientRMService.java:111)






[jira] [Created] (YARN-6648) Add FederationStateStore interfaces for Global Policy Generator

2017-05-25 Thread Botong Huang (JIRA)
Botong Huang created YARN-6648:
--

 Summary: Add FederationStateStore interfaces for Global Policy 
Generator
 Key: YARN-6648
 URL: https://issues.apache.org/jira/browse/YARN-6648
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor









[jira] [Created] (YARN-6640) AM heartbeat stuck when responseId overflows MAX_INT

2017-05-24 Thread Botong Huang (JIRA)
Botong Huang created YARN-6640:
--

 Summary:  AM heartbeat stuck when responseId overflows MAX_INT
 Key: YARN-6640
 URL: https://issues.apache.org/jira/browse/YARN-6640
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


The current code in {{ApplicationMasterService}}: 

    if ((request.getResponseId() + 1) == lastResponse.getResponseId()) {
      /* old heartbeat */
      return lastResponse;
    } else if (request.getResponseId() + 1 < lastResponse.getResponseId()) {
      throw ...
    }
    process the heartbeat...

When a heartbeat comes in, in the usual case we expect 
request.getResponseId() == lastResponse.getResponseId(). 

The "if" is for a duplicate heartbeat that is one step old, and the "else if" 
throws and complains for heartbeats that are two or more steps old; otherwise 
we accept the new heartbeat and process it.

So the bug is: when lastResponse.getResponseId() == MAX_INT and the newest 
heartbeat comes in with responseId == MAX_INT, responseId + 1 wraps around to 
MIN_INT. We then fall into the "else if" case, RM throws, and the AM 
heartbeat is stuck forever. 
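
One wrap-safe way to write the check (an illustrative sketch, not necessarily 
the committed patch): keep the "+1 ==" equality for duplicates, which still 
holds under int wrap-around, and replace the ordered "<" comparison, which 
breaks at MAX_INT, with a plain inequality:

    int got = request.getResponseId();
    int last = lastResponse.getResponseId();
    if (got + 1 == last) {
      // duplicate heartbeat: int overflow wraps, so this also holds
      // when got == MAX_INT and last == MIN_INT
      return lastResponse;
    } else if (got != last) {
      throw new InvalidApplicationMasterRequestException(
          "Out-of-sync responseId: expected " + last + " but got " + got);
    }
    // got == last: accept and process the new heartbeat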






[jira] [Created] (YARN-6565) Fix memory leak and finish app trigger in AMRMProxy

2017-05-05 Thread Botong Huang (JIRA)
Botong Huang created YARN-6565:
--

 Summary: Fix memory leak and finish app trigger in AMRMProxy
 Key: YARN-6565
 URL: https://issues.apache.org/jira/browse/YARN-6565
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


Two issues in AMRMProxy:

1. When an application finishes, AMRMTokenSecretManager is not updated to 
remove the related data, leading to a memory leak. 

2. When we kill an application, we should remove the pipeline only after the 
AM container is killed. 

The FINISH_APPLICATION event is sent while the AM container is still being 
killed. If we remove the pipeline at that point, we might still get 
heartbeats from the AM, triggering exception messages. 

Instead, we should wait for the APPLICATION_RESOURCES_CLEANEDUP event, which 
is sent after the AM container is killed. 






[jira] [Created] (YARN-6511) Federation Intercepting and propagating AM-RM communications (part two: secondary subclusters added)

2017-04-21 Thread Botong Huang (JIRA)
Botong Huang created YARN-6511:
--

 Summary: Federation Intercepting and propagating AM-RM 
communications (part two: secondary subclusters added)
 Key: YARN-6511
 URL: https://issues.apache.org/jira/browse/YARN-6511
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang









[jira] [Created] (YARN-6404) Avoid misleading NoClassDefFoundError caused by ExceptionInInitializerError in FederationStateStoreFacade

2017-03-28 Thread Botong Huang (JIRA)
Botong Huang created YARN-6404:
--

 Summary: Avoid misleading NoClassDefFoundError caused by 
ExceptionInInitializerError in FederationStateStoreFacade
 Key: YARN-6404
 URL: https://issues.apache.org/jira/browse/YARN-6404
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


Currently the singleton is created in a static initializer: 

private static final FederationStateStoreFacade FACADE =
  new FederationStateStoreFacade();

If the constructor fails and throws, the first use of the class shows the full 
exception stack, wrapped in an {{ExceptionInInitializerError}}. After that, 
however, the JVM marks the class as failed to initialize, and every later use 
throws a misleading {{NoClassDefFoundError}}. 

Here's more explanation from Stack Overflow:
http://stackoverflow.com/questions/34413/why-am-i-getting-a-noclassdeffounderror-in-java
The earlier failure could be a ClassNotFoundException or an 
ExceptionInInitializerError (indicating a failure in the static initialization 
block) or any number of other problems. The point is, a NoClassDefFoundError is 
not necessarily a classpath problem.
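
The behavior is easy to reproduce in isolation (the class below is a 
stand-in, not the real facade): the first access throws 
ExceptionInInitializerError with the real cause attached, and every later 
access throws NoClassDefFoundError with the cause lost: 

public class StaticInitDemo {
  static class BadSingleton {
    private static final BadSingleton INSTANCE = new BadSingleton();
    private BadSingleton() { throw new RuntimeException("constructor failed"); }
    static BadSingleton get() { return INSTANCE; }
  }

  public static void main(String[] args) {
    for (int i = 0; i < 2; i++) {
      try {
        BadSingleton.get();
      } catch (Throwable t) {
        // 1st iteration: java.lang.ExceptionInInitializerError (real cause)
        // 2nd iteration: java.lang.NoClassDefFoundError (cause is lost)
        System.out.println(t);
      }
    }
  }
}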




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6370) Properly handle rack requests for non-active subclusters in LocalityMulticastAMRMProxyPolicy

2017-03-21 Thread Botong Huang (JIRA)
Botong Huang created YARN-6370:
--

 Summary: Properly handle rack requests for non-active subclusters 
in LocalityMulticastAMRMProxyPolicy
 Key: YARN-6370
 URL: https://issues.apache.org/jira/browse/YARN-6370
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6282) Recreate interceptor chain when different attempt in the same node in AMRMProxy

2017-03-03 Thread Botong Huang (JIRA)
Botong Huang created YARN-6282:
--

 Summary: Recreate interceptor chain when different attempt in the 
same node in AMRMProxy
 Key: YARN-6282
 URL: https://issues.apache.org/jira/browse/YARN-6282
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


In AMRMProxy, an interceptor chain is created per application attempt, but the 
pipeline mapping is keyed by application Id. So when a different attempt 
arrives on the same node, we need to recreate the interceptor chain for it 
instead of reusing the existing one, as the sketch below illustrates.
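
A sketch of the idea, assuming hypothetical types rather than the real 
AMRMProxy classes: keep the appId-keyed map, but detect a newer attempt and 
rebuild the chain for it: 

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AttemptAwarePipelines {
  static final class Pipeline {
    final int attemptId;
    Pipeline(int attemptId) { this.attemptId = attemptId; }
    void shutdown() { /* release the old attempt's interceptor resources */ }
  }

  private final Map<String, Pipeline> pipelines = new ConcurrentHashMap<>();

  Pipeline getOrCreate(String appId, int attemptId) {
    Pipeline existing = pipelines.get(appId);
    if (existing != null && existing.attemptId != attemptId) {
      // A different attempt landed on the same node: the cached chain
      // belongs to the previous attempt, so shut it down and rebuild.
      existing.shutdown();
      pipelines.remove(appId, existing);
    }
    return pipelines.computeIfAbsent(appId, id -> new Pipeline(attemptId));
  }
}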



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6281) Cleanup when AMRMProxy fails to initialize a new interceptor chain

2017-03-03 Thread Botong Huang (JIRA)
Botong Huang created YARN-6281:
--

 Summary: Cleanup when AMRMProxy fails to initialize a new 
interceptor chain
 Key: YARN-6281
 URL: https://issues.apache.org/jira/browse/YARN-6281
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


When an app starts, AMRMProxy.initializePipeline creates a new interceptor 
chain, adds it to its pipeline mapping, then initializes the chain and 
returns. The problem is that when the chain initialization throws (e.g. 
because of a configuration error, an interceptor class not found, etc.), the 
chain is not removed from AMRMProxy's pipeline mapping. 
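
A hedged sketch of the fix (names are illustrative, not the real AMRMProxy 
signatures): undo the mapping insert if the chain initialization throws: 

import java.util.HashMap;
import java.util.Map;

public class PipelineInitCleanup {
  interface InterceptorChain { void init() throws Exception; }

  private final Map<String, InterceptorChain> pipelines = new HashMap<>();

  void initializePipeline(String appId, InterceptorChain chain)
      throws Exception {
    pipelines.put(appId, chain);
    try {
      chain.init();
    } catch (Exception e) {
      // e.g. configuration error or interceptor class not found: remove
      // the half-initialized chain so it doesn't leak in the mapping.
      pipelines.remove(appId);
      throw e;
    }
  }
}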



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6247) Add SubClusterResolver into FederationStateStoreFacade

2017-02-27 Thread Botong Huang (JIRA)
Botong Huang created YARN-6247:
--

 Summary: Add SubClusterResolver into FederationStateStoreFacade
 Key: YARN-6247
 URL: https://issues.apache.org/jira/browse/YARN-6247
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


Add SubClusterResolver into FederationStateStoreFacade. Since the resolver 
might involve some overhead (e.g. reading a file in the background, 
potentially periodically), it is better to put it inside the 
FederationStateStoreFacade singleton, so that only one instance is ever 
created. 
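
A minimal sketch of the intent, with a plain Object standing in for the real 
SubClusterResolver: hang the resolver off the existing facade singleton so 
its background refresh cost is paid exactly once per process: 

public final class FacadeSingletonSketch {
  private static final FacadeSingletonSketch FACADE =
      new FacadeSingletonSketch();

  // One shared resolver per process: any background file-reading or
  // periodic refresh work it owns is set up exactly once.
  private final Object subClusterResolver =
      new Object(); // stand-in for the real SubClusterResolver

  private FacadeSingletonSketch() { }

  public static FacadeSingletonSketch getInstance() { return FACADE; }

  public Object getSubClusterResolver() { return subClusterResolver; }
}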



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6213) Failure handling and retry for performFailover in RetryInvocationHandler

2017-02-21 Thread Botong Huang (JIRA)
Botong Huang created YARN-6213:
--

 Summary: Failure handling and retry for performFailover in 
RetryInvocationHandler 
 Key: YARN-6213
 URL: https://issues.apache.org/jira/browse/YARN-6213
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


In {{RetryInvocationHandler}}, when the method invocation fails, we rely on 
{{FailoverProxyProvider}} to performFailover and get a new proxy, so that we 
can retry the method invocation. 

However, performFailover and getting the new proxy can itself fail (throw an 
exception or return a null proxy). This is currently not handled properly: we 
end up throwing the exception out of the retry loop. Instead, we should catch 
the exception (or check for a null proxy) and retry performFailover, until 
the failover count reaches the configured maximum, as sketched below. 
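
An illustrative retry loop, simplified from the real RetryInvocationHandler 
and using a hypothetical provider interface: failures of performFailover or 
getProxy are caught and retried until the failover count reaches the maximum: 

public class FailoverRetrySketch {
  interface Provider {
    Object getProxy();
    void performFailover(Object current) throws Exception;
  }

  interface Call { Object invoke(Object proxy) throws Exception; }

  private final Provider provider;
  private final int maxFailovers;

  FailoverRetrySketch(Provider provider, int maxFailovers) {
    this.provider = provider;
    this.maxFailovers = maxFailovers;
  }

  Object invokeWithRetry(Call call) throws Exception {
    Object proxy = provider.getProxy();
    int failovers = 0;
    while (true) {
      try {
        return call.invoke(proxy);
      } catch (Exception invokeError) {
        // The call failed; fail over, retrying the failover itself if it
        // throws or yields a null proxy, up to the configured maximum.
        while (true) {
          if (failovers++ >= maxFailovers) throw invokeError;
          try {
            provider.performFailover(proxy);
            Object next = provider.getProxy();
            if (next != null) { proxy = next; break; }
            // null proxy counts as a failed failover; loop and retry it
          } catch (Exception failoverError) {
            // performFailover threw; retry instead of escaping the loop
          }
        }
      }
    }
  }
}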



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6203) Occasional test failure in TestWeightedRandomRouterPolicy

2017-02-15 Thread Botong Huang (JIRA)
Botong Huang created YARN-6203:
--

 Summary: Occasional test failure in TestWeightedRandomRouterPolicy
 Key: YARN-6203
 URL: https://issues.apache.org/jira/browse/YARN-6203
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Botong Huang
Assignee: Carlo Curino
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6190) Bug fixes in federation polices

2017-02-14 Thread Botong Huang (JIRA)
Botong Huang created YARN-6190:
--

 Summary: Bug fixes in federation polices
 Key: YARN-6190
 URL: https://issues.apache.org/jira/browse/YARN-6190
 Project: Hadoop YARN
  Issue Type: Bug
  Components: federation
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6093) Invalid AMRM token exception when RM renew AMRMtoken and FederationRMFailoverProxyProvider failover

2017-01-13 Thread Botong Huang (JIRA)
Botong Huang created YARN-6093:
--

 Summary: Invalid AMRM token exception when RM renew AMRMtoken and 
FederationRMFailoverProxyProvider failover
 Key: YARN-6093
 URL: https://issues.apache.org/jira/browse/YARN-6093
 Project: Hadoop YARN
  Issue Type: Bug
  Components: federation
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor
 Fix For: YARN-2915


AMRMProxy uses an expired AMRMToken to talk to RM, leading to the "Invalid 
AMRMToken" exception. The bug is triggered when both conditions are met: 
1. RM rolls the master key and renews the AMRMToken for a running AM.
2. The existing RPC connection between AMRMProxy and RM drops, and we attempt 
to reconnect via failover in FederationRMFailoverProxyProvider. 

Here's what happened: 

In DefaultRequestInterceptor.init(), we create a proxy ugi, load it with the 
initial AMRMToken issued by RM, and use it to initiate rmClient. 

Then we arrive at FederationRMFailoverProxyProvider.init(), where a full copy 
of the ugi's tokens is saved locally; we then create the actual RM proxy and 
set up the RPC connection. 

Later when RM rolls master key and issues a new AMRMToken, 
DefaultRequestInterceptor.updateAMRMToken() updates it into the proxy ugi. 

However the new token is never used until the existing RPC connection between 
AMRMProxy and RM drops for other reasons (say master RM crashes). 

At this point, since the service name of the new AMRMToken is not yet set 
correctly in DefaultRequestInterceptor.updateAMRMToken(), RPC finds no valid 
AMRMToken when trying to set up the new connection. 

We first hit a "Client cannot authenticate via:[TOKEN]" exception. This is 
expected. 

Next, FederationRMFailoverProxyProvider fails over: we reset the service token 
via ClientRMProxy.getRMAddress() and reconnect. Supposedly this would have 
worked. 

However, since DefaultRequestInterceptor does not use the proxy user for later 
calls to rmClient, when performing failover in 
FederationRMFailoverProxyProvider we are not running as the proxy user. 

Currently the code solves this by reloading the current ugi with all tokens 
saved locally in originalTokens, in method addOriginalTokens(). 

The problem is that the original AMRMToken loaded this way is no longer 
accepted by RM, and thus we keep hitting the "Invalid AMRMToken" exception 
until the AM fails. 

The correct fix is, rather than saving the original tokens in the proxy ugi, 
to save the original ugi itself. 

Every time we perform failover and create the new RM proxy, we use the 
original ugi, which is always loaded with the up-to-date AMRMToken (see the 
sketch below). 
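
A hedged sketch of that fix with simplified signatures: keep a reference to 
the original UGI itself instead of a snapshot of its tokens, so each failover 
re-enters a UGI that already holds the renewed AMRMToken: 

import java.io.IOException;
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public class OriginalUgiFailover {
  private UserGroupInformation originalUgi;

  void init(UserGroupInformation proxyUgi) {
    // Save the UGI reference, not a copy of its tokens: tokens added later
    // by updateAMRMToken() stay visible through this same reference.
    this.originalUgi = proxyUgi;
  }

  Object performFailover() throws IOException, InterruptedException {
    // Re-enter the original UGI on every failover, so the new RM proxy is
    // created with the freshest AMRMToken the UGI holds.
    return originalUgi.doAs(
        (PrivilegedExceptionAction<Object>) this::createRMProxy);
  }

  private Object createRMProxy() {
    return new Object(); // stand-in for the real RM proxy creation
  }
}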



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6016) Bugs in AMRMProxy handling AMRMToken and local AMRMToken

2016-12-20 Thread Botong Huang (JIRA)
Botong Huang created YARN-6016:
--

 Summary: Bugs in AMRMProxy handling AMRMToken and local AMRMToken
 Key: YARN-6016
 URL: https://issues.apache.org/jira/browse/YARN-6016
 Project: Hadoop YARN
  Issue Type: Bug
  Components: federation
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


Two AMRMProxy bugs: 

First, the AMRMToken from RM should not be propagated to the AM, since 
AMRMProxy creates a local AMRMToken for it. 

Second, the AMRMProxy Context currently parses the localAMRMTokenKeyId from 
amrmToken, but it should be parsed from localAmrmToken. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5836) NMToken passwd not checked in ContainerManagerImpl, so that malicious AM can fake the Token and kill containers of other apps at will

2016-11-04 Thread Botong Huang (JIRA)
Botong Huang created YARN-5836:
--

 Summary: NMToken passwd not checked in ContainerManagerImpl, so 
that malicious AM can fake the Token and kill containers of other apps at will
 Key: YARN-5836
 URL: https://issues.apache.org/jira/browse/YARN-5836
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor


When an AM calls an NM via stopContainers in ContainerManagementProtocol, the 
NMToken (generated by RM) is passed along via the user ugi. However, 
ContainerManagerImpl currently does not validate this token correctly, 
specifically in authorizeGetAndStopContainerRequest. It blindly trusts the 
content of the NMTokenIdentifier without verifying the password (the 
RM-generated signature) in the NMToken, so a malicious AM can simply fake the 
content of the NMTokenIdentifier and pass it to NMs. Moreover, even in the 
plain-text checks that do exist, when the appId doesn't match, all the code 
does is log a warning and then continue to kill the container. 

For startContainers the NMToken is likewise not checked correctly in 
authorizeUser; however, the ContainerToken is verified properly, by 
regenerating and comparing the password in 
verifyAndGetContainerTokenIdentifier, so a malicious AM cannot launch 
containers at will. A sketch of the missing NMToken check follows. 
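
A hedged sketch of the missing validation (hypothetical names, not the real 
NM classes): recompute the password from the shared secret and compare it 
with the one the AM presented, mirroring what 
verifyAndGetContainerTokenIdentifier does for ContainerTokens: 

import java.security.MessageDigest;

public class NMTokenCheckSketch {
  interface SecretManager { byte[] retrievePassword(byte[] identifier); }

  void verifyNMToken(SecretManager secretManager,
                     byte[] tokenIdentifier,
                     byte[] presentedPassword) {
    // Regenerate the expected RM signature for this identifier.
    byte[] expected = secretManager.retrievePassword(tokenIdentifier);
    // Constant-time comparison against the password the AM presented.
    if (!MessageDigest.isEqual(expected, presentedPassword)) {
      throw new SecurityException("NMToken password check failed");
    }
  }
}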



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org