[jira] [Commented] (YARN-6539) Create SecureLogin inside Router

2019-08-09 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904254#comment-16904254
 ] 

Subru Krishnan commented on YARN-6539:
--

[~yifan.stan], great to hear that you are running Federation in a secure 
cluster! I would love to hear more details about it.

I thought I had mentioned it to [~shenyinjie], but I guess not - I am not familiar 
with the security code. Hopefully [~bibinchundatt] or [~Prabhu Joseph] can 
help? Also, would it be possible to add a test?

Thanks.

> Create SecureLogin inside Router
> 
>
> Key: YARN-6539
> URL: https://issues.apache.org/jira/browse/YARN-6539
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Giovanni Matteo Fumarola
>Assignee: Xie YiFan
>Priority: Minor
> Attachments: YARN-6359_1.patch, YARN-6359_2.patch, YARN-6539_3.patch
>
>







[jira] [Commented] (YARN-9425) Make initialDelay configurable for FederationStateStoreService#scheduledExecutorService

2019-05-30 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852165#comment-16852165
 ] 

Subru Krishnan commented on YARN-9425:
--

[~giovanni.fumarola], can you take a look?

[~shenyinjie], can you fix the Yetus warnings above?

> Make initialDelay configurable for 
> FederationStateStoreService#scheduledExecutorService
> ---
>
> Key: YARN-9425
> URL: https://issues.apache.org/jira/browse/YARN-9425
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.1.0
>Reporter: Shen Yinjie
>Assignee: Shen Yinjie
>Priority: Major
> Attachments: YARN-9425_1.patch
>
>
> When YARN federation is enabled, the sub-cluster info in the Router Web UI cannot 
> be loaded immediately, and clients cannot find any active sub-clusters for 5 
> minutes by default, which is controlled by 
> "yarn.federation.state-store.heartbeat-interval-secs".
> IMO, we should separate 'initialDelay' and 'delay' for 
> FederationStateStoreService#scheduledExecutorService.
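A minimal sketch of the proposed split, assuming a hypothetical "initial-delay" key 
(the interval key is the existing one quoted above):

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class HeartbeatScheduleSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Existing key: steady-state heartbeat interval (5 minutes by default).
    long intervalSecs = conf.getLong(
        "yarn.federation.state-store.heartbeat-interval-secs", 300);
    // Hypothetical key for the new, much shorter initial delay.
    long initialDelaySecs = conf.getLong(
        "yarn.federation.state-store.heartbeat.initial-delay-secs", 10);

    // Placeholder task: register/refresh this sub-cluster's info in the state store.
    Runnable heartbeat = () -> System.out.println("heartbeat");

    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // First run after initialDelaySecs, then every intervalSecs.
    scheduler.scheduleWithFixedDelay(heartbeat, initialDelaySecs, intervalSecs,
        TimeUnit.SECONDS);
  }
}
{code}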






[jira] [Commented] (YARN-9586) [QA] Need more doc for yarn.federation.policy-manager-params when LoadBasedRouterPolicy is used

2019-05-30 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852164#comment-16852164
 ] 

Subru Krishnan commented on YARN-9586:
--

[~shenyinjie], please check the Javadocs for config information:

https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/federation/policies/router/LoadBasedRouterPolicy.java#L38

> [QA] Need more doc for yarn.federation.policy-manager-params when 
> LoadBasedRouterPolicy is used
> ---
>
> Key: YARN-9586
> URL: https://issues.apache.org/jira/browse/YARN-9586
> Project: Hadoop YARN
>  Issue Type: Wish
>  Components: federation
>Reporter: Shen Yinjie
>Priority: Major
>
> We picked LoadBasedRouterPolicy for YARN federation, but had no idea what to 
> set yarn.federation.policy-manager-params to. Is there a demo config or a more 
> detailed description for this?






[jira] [Commented] (YARN-2915) Enable YARN RM scale out via federation using multiple RM's

2019-03-04 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783922#comment-16783922
 ] 

Subru Krishnan commented on YARN-2915:
--

[~liuxun323], great to hear about your interest in YARN Federation. It's available 
from 2.9+, so you are good with both 3.0.0 & 3.2.0 :).

> Enable YARN RM scale out via federation using multiple RM's
> ---
>
> Key: YARN-2915
> URL: https://issues.apache.org/jira/browse/YARN-2915
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager
>Reporter: Sriram Rao
>Assignee: Subru Krishnan
>Priority: Major
>  Labels: federation
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: FEDERATION_CAPACITY_ALLOCATION_JIRA.pdf, 
> Federation-BoF.pdf, YARN-Federation-Hadoop-Summit_final.pptx, 
> Yarn_federation_design_v1.pdf, federation-prototype.patch
>
>
> This is an umbrella JIRA that proposes to scale out YARN to support large 
> clusters comprising tens of thousands of nodes. That is, rather than 
> limiting a YARN-managed cluster to about 4k nodes in size, the proposal is to 
> enable the YARN-managed cluster to be elastically scalable.






[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-11-14 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687190#comment-16687190
 ] 

Subru Krishnan commented on YARN-8898:
--

{quote}Unfortunately we didn't write 
ApplicationHomeSubCluster.getProto.getBytes to znode{quote}

Thanks [~bibinchundatt] for bringing this to my attention. The intention was to 
persist _ApplicationHomeSubCluster_ and that's why it was defined as a proto 
object in the first place.

So I feel it might be better to fix it there, since at least the API is correct - 
I mean, add the trimmed _ApplicationSubmissionContext_ to 
_ApplicationHomeSubCluster_ and persist the entire _ApplicationHomeSubCluster_ in 
ZK.

For SQL, it's just adding a new column, so it should be safe as well.
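A rough sketch of the ZK side - the znode path and helper are hypothetical, only 
the idea of writing the full serialized proto is from this discussion:

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class AppHomeSubClusterZKSketch {
  /**
   * Persist the serialized ApplicationHomeSubCluster proto under a hypothetical
   * znode path; the real ZKFederationStateStore layout may differ.
   */
  static void storeAppHomeSubCluster(ZooKeeper zk, String appId, byte[] protoBytes)
      throws KeeperException, InterruptedException {
    String znode = "/federationstore/applications/" + appId;
    if (zk.exists(znode, false) == null) {
      zk.create(znode, protoBytes, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } else {
      // Overwrite the whole record so that new fields (e.g. a trimmed
      // ApplicationSubmissionContext) round-trip on recovery.
      zk.setData(znode, protoBytes, -1);
    }
  }
}
{code}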

 

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Priority: Major
> Attachments: YARN-8898.wip.patch
>
>
> FederationInterceptor#mergeAllocateResponses skips application_priority in the 
> returned response.
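For context, the gist of the fix being discussed is to carry the priority over 
when merging - a rough sketch only, assuming the standard AllocateResponse 
accessors rather than the actual FederationInterceptor code:

{code:java}
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;

public class MergePrioritySketch {
  /** Copy the application priority from the home sub-cluster response into the
   *  merged response so it is not dropped. */
  static void mergePriority(AllocateResponse merged, AllocateResponse homeResponse) {
    if (homeResponse.getApplicationPriority() != null) {
      merged.setApplicationPriority(homeResponse.getApplicationPriority());
    }
  }
}
{code}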






[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-11-13 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686009#comment-16686009
 ] 

Subru Krishnan commented on YARN-8898:
--

Thanks [~bibinchundatt] for the detailed clarification. +1 on trimming as 
that's what we do for RM HA as well.

We should be able to use _ApplicationHomeSubCluster_ itself, as the addition of a 
field should still be backward compatible, right?

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-8898.wip.patch
>
>
> FederationInterceptor#mergeAllocateResponses skips application_priority in the 
> returned response.






[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-11-12 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684634#comment-16684634
 ] 

Subru Krishnan commented on YARN-8898:
--

Thanks [~bibinchundatt] and [~botong] for providing context.

I feel the solution has 2 parts:
 # Save the {{ApplicationSubmissionContext}} in the _FederationStateStore_ and 
use it to submit UAMs.
 # Delegate certain APIs to _AMRMProxy_ via the Router, like we do presently 
for *killApplication*.

So for the scope of this Jira I prefer solution 2 as:
 * it doesn't involve changes to the core wire protocol
 * it is future-proof if we require more (or different) fields in the future.

[~bibinchundatt], does that make sense? I sincerely apologize for the delay, as I 
see you already have a patch for solution 1.

 

Also, it looks to me that only the _ApplicationSubmissionContext_ (in 
non-federated mode) is persisted in the _RMStateStore_, so if there's an update 
of the application priority followed by an RM failover, the priority will revert 
to the original one from submission?

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-8898.wip.patch
>
>
> FederationInterceptor#mergeAllocateResponses skips application_priority in the 
> returned response.






[jira] [Comment Edited] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-11-06 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677499#comment-16677499
 ] 

Subru Krishnan edited comment on YARN-8898 at 11/7/18 1:31 AM:
---

[~bibinchundatt]/[~botong], thanks for working on this.

I am trying to get up to speed and I have a basic question - what are the 
client APIs that you are referring to, which we need to support at AMRMProxy 
level?

{quote}Initially i was under the impression that its only application priority 
and label, On further analysis found that we might require a few more for all 
client API's to work.{quote}


was (Author: subru):
[~bibinchundatt]/[~botong], thanks for working on this.

I am trying to get up to speed and I have a basic question - what are the 
client APIs that you are referring to, which we need to support at AMRMProxy 
level?

{quote} Initially i was under the impression that its only application priority 
and label, On further analysis found that we might require a few more for all 
client API's to work.

 

 

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
>
> FederationInterceptor#mergeAllocateResponses skips application_priority in the 
> returned response.






[jira] [Comment Edited] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-11-06 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677499#comment-16677499
 ] 

Subru Krishnan edited comment on YARN-8898 at 11/7/18 1:30 AM:
---

[~bibinchundatt]/[~botong], thanks for working on this.

I am trying to get up to speed and I have a basic question - what are the 
client APIs that you are referring to, which we need to support at AMRMProxy 
level?

{quote}Initially i was under the impression that its only application priority 
and label, On further analysis found that we might require a few more for all 
client API's to work.{quote}


was (Author: subru):
[~bibinchundatt]/[~botong], thanks for working on this.

I am trying to get up to speed and I have a basic question - what are the 
client APIs that you are referring to, which we need to support at AMRMProxy 
level?

 ?? Initially i was under the impression that its only application priority and 
label, On further analysis found that we might require a few more for all 
client API's to work.??

 

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
>
> FederationInterceptor#mergeAllocateResponses skips application_priority in the 
> returned response.






[jira] [Comment Edited] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-11-06 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677499#comment-16677499
 ] 

Subru Krishnan edited comment on YARN-8898 at 11/7/18 1:30 AM:
---

[~bibinchundatt]/[~botong], thanks for working on this.

I am trying to get up to speed and I have a basic question - what are the 
client APIs that you are referring to, which we need to support at AMRMProxy 
level?

{quote}Initially i was under the impression that its only application priority 
and label, On further analysis found that we might require a few more for all 
client API's to work.{quote}


was (Author: subru):
[~bibinchundatt]/[~botong], thanks for working on this.

I am trying to get up to speed and I have a basic question - what are the 
client APIs that you are referring to, which we need to support at AMRMProxy 
level?

{quote} Initially i was under the impression that its only application priority 
and label, On further analysis found that we might require a few more for all 
client API's to work.

 

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
>
> FederationInterceptor#mergeAllocateResponses skips application_priority in the 
> returned response.






[jira] [Commented] (YARN-8898) Fix FederationInterceptor#allocate to set application priority in allocateResponse

2018-11-06 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677499#comment-16677499
 ] 

Subru Krishnan commented on YARN-8898:
--

[~bibinchundatt]/[~botong], thanks for working on this.

I am trying to get up to speed and I have a basic question - what are the 
client APIs that you are referring to, which we need to support at AMRMProxy 
level?

 ?? Initially i was under the impression that its only application priority and 
label, On further analysis found that we might require a few more for all 
client API's to work.??

 

> Fix FederationInterceptor#allocate to set application priority in 
> allocateResponse
> --
>
> Key: YARN-8898
> URL: https://issues.apache.org/jira/browse/YARN-8898
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
>
> FederationInterceptor#mergeAllocateResponses skips application_priority in the 
> returned response.






[jira] [Commented] (YARN-8979) Spark on yarn job failed with yarn federation enabled

2018-11-06 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677493#comment-16677493
 ] 

Subru Krishnan commented on YARN-8979:
--

[~shenyinjie], thanks for reporting this. This is a known issue caused by 
YARN-4083.

 

We work around this by separating the client configuration (i.e. Spark, Tez, 
MR) from the server configuration (i.e. NM, RM, Router, etc.). Unfortunately this 
involves a code change that requires an independent conf dir for clients, which 
might break existing deployments (as everyone would need to clone their conf 
dirs), so it has never been committed.

> Spark on yarn job failed  with yarn federation enabled
> --
>
> Key: YARN-8979
> URL: https://issues.apache.org/jira/browse/YARN-8979
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Shen Yinjie
>Priority: Major
>
> When I ran a Spark job on YARN with YARN federation enabled, the job failed and 
> threw an exception, as shown in the snapshot.
> PS: MR and Tez jobs are OK.






[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml

2018-11-02 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673832#comment-16673832
 ] 

Subru Krishnan commented on YARN-7592:
--

Thanks [~rahulanand90] for the clarification. Can you update the patch after 
removing the flag (which I should mention is great) and quickly revalidate that 
there's no regression?

+1 from my side pending that.

 

> yarn.federation.failover.enabled missing in yarn-default.xml
> 
>
> Key: YARN-7592
> URL: https://issues.apache.org/jira/browse/YARN-7592
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.0.0-beta1
>Reporter: Gera Shegalov
>Priority: Major
> Attachments: IssueReproduce.patch
>
>
> yarn.federation.failover.enabled should be documented in yarn-default.xml. I 
> am also not sure why it should be true by default and force the HA retry 
> policy in {{RMProxy#createRMProxy}}






[jira] [Commented] (YARN-6900) ZooKeeper based implementation of the FederationStateStore

2018-11-02 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673830#comment-16673830
 ] 

Subru Krishnan commented on YARN-6900:
--

[~rahulanand90], I agree with you that the parameters are tricky to identify. 
Programmatically, what we need is a serialized conf as defined 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/federation/policies/manager/FederationPolicyManager.java#L101].

Manually, we could start with a key-value map where the predefined keys could be 
router/amrmproxy weights or headroomAlpha. Thoughts?
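To make the manual option concrete, a minimal sketch of such a key-value map - the 
key names and weight format below are purely illustrative, not an existing 
convention:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class PolicyParamsSketch {
  public static void main(String[] args) {
    // Illustrative keys only; the serialized form would ultimately be whatever
    // the FederationPolicyManager conf serialization (linked above) defines.
    Map<String, String> policyParams = new HashMap<>();
    policyParams.put("router.weights", "subcluster1:0.7,subcluster2:0.3");
    policyParams.put("amrmproxy.weights", "subcluster1:0.5,subcluster2:0.5");
    policyParams.put("headroomAlpha", "1.0");
    System.out.println(policyParams);
  }
}
{code}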

 

> ZooKeeper based implementation of the FederationStateStore
> --
>
> Key: YARN-6900
> URL: https://issues.apache.org/jira/browse/YARN-6900
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation, nodemanager, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-6900-002.patch, YARN-6900-003.patch, 
> YARN-6900-004.patch, YARN-6900-005.patch, YARN-6900-006.patch, 
> YARN-6900-007.patch, YARN-6900-008.patch, YARN-6900-009.patch, 
> YARN-6900-010.patch, YARN-6900-011.patch, YARN-6900-YARN-2915-000.patch, 
> YARN-6900-YARN-2915-001.patch
>
>
> YARN-5408 defines the unified {{FederationStateStore}} API. Currently we only 
> support SQL-based stores; this JIRA tracks adding a ZooKeeper-based 
> implementation to simplify deployment, as ZooKeeper is already popularly used 
> for {{RMStateStore}}.






[jira] [Commented] (YARN-6900) ZooKeeper based implementation of the FederationStateStore

2018-10-17 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654036#comment-16654036
 ] 

Subru Krishnan commented on YARN-6900:
--

Thanks [~rahulanand90] and [~elgoiri] for raising this. I agree that we do need 
to add a tool to simplify updating parameters; YARN-3657 was created for that 
purpose.

[~rahulanand90], any chance you are interested in working on it :)?

> ZooKeeper based implementation of the FederationStateStore
> --
>
> Key: YARN-6900
> URL: https://issues.apache.org/jira/browse/YARN-6900
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation, nodemanager, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-6900-002.patch, YARN-6900-003.patch, 
> YARN-6900-004.patch, YARN-6900-005.patch, YARN-6900-006.patch, 
> YARN-6900-007.patch, YARN-6900-008.patch, YARN-6900-009.patch, 
> YARN-6900-010.patch, YARN-6900-011.patch, YARN-6900-YARN-2915-000.patch, 
> YARN-6900-YARN-2915-001.patch
>
>
> YARN-5408 defines the unified {{FederationStateStore}} API. Currently we only 
> support SQL-based stores; this JIRA tracks adding a ZooKeeper-based 
> implementation to simplify deployment, as ZooKeeper is already popularly used 
> for {{RMStateStore}}.






[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml

2018-10-09 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643949#comment-16643949
 ] 

Subru Krishnan commented on YARN-7592:
--

I want to make sure I fully understand the proposal - we will revert the changes 
in {{RMProxy}} and create the {{FederationClientRMProxy}} (I feel we can skip 
"custom") directly if *yarn.federation.enabled* is set?

I like the idea; can you ensure a couple of things:
 * This works both with HA enabled and without it (for the NM, Router and AMRMProxy).
 * Assuming the above is true, can we remove the *yarn.federation.failover.enabled* 
flag completely?

 

Thanks for working on this!

 

> yarn.federation.failover.enabled missing in yarn-default.xml
> 
>
> Key: YARN-7592
> URL: https://issues.apache.org/jira/browse/YARN-7592
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.0.0-beta1
>Reporter: Gera Shegalov
>Priority: Major
> Attachments: IssueReproduce.patch
>
>
> yarn.federation.failover.enabled should be documented in yarn-default.xml. I 
> am also not sure why it should be true by default and force the HA retry 
> policy in {{RMProxy#createRMProxy}}






[jira] [Commented] (YARN-8637) [GPG] Add FederationStateStore getAppInfo API for GlobalPolicyGenerator

2018-09-14 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16615397#comment-16615397
 ] 

Subru Krishnan commented on YARN-8637:
--

+1 on your proposal [~botong] from my side as I feel we already have too many 
configs and your approach also ensures that we don't have to change the API.

> [GPG] Add FederationStateStore getAppInfo API for GlobalPolicyGenerator
> ---
>
> Key: YARN-8637
> URL: https://issues.apache.org/jira/browse/YARN-8637
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8637-YARN-7402.v1.patch
>
>
> The core API for the FederationStateStore is provided in _FederationStateStore_. 
> In this patch, we add a _FederationGPGStateStore_ API just for the GPG. 
> Specifically, we add an API to get the full application info from the 
> state store along with the starting timestamp of the app entry, so that the 
> _ApplicationCleaner_ (YARN-7599) in the GPG can delete and clean up old entries 
> in the table.






[jira] [Commented] (YARN-8755) Add clean up for FederationStore apps

2018-09-13 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614158#comment-16614158
 ] 

Subru Krishnan commented on YARN-8755:
--

Thanks [~bibinchundatt]!

I see that [~botong] is working on addressing your feedback. I do have a 
request - can both of you check whether YARN-6648 needs to be updated with your 
comments, and also include that as part of YARN-7599?

> Add clean up for FederationStore apps
> -
>
> Key: YARN-8755
> URL: https://issues.apache.org/jira/browse/YARN-8755
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Priority: Major
>
> We should add clean-up logic for the application-to-home-cluster mapping in 
> the federation state store.






[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml

2018-09-13 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614143#comment-16614143
 ] 

Subru Krishnan commented on YARN-7592:
--

Thanks [~jira.shegalov] for raising this and [~bibinchundatt] and 
[~rahulanand90] for the detailed analysis.

 

[~bibinchundatt], I agree that this is related to YARN-8434. Looks like in our 
test setup we specify {{FederationRMFailoverProxyProvider}} for the non-HA setup 
and {{ConfiguredRMFailoverProxyProvider}} for the HA setup in yarn-site.

Before we change the server/client proxies, is it possible to remove the 
*yarn.federation.enabled* flag from yarn-site and check? After (re)looking at 
the code, it may not be necessary in NMs (only in RMs).
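Roughly, that wiring looks like the sketch below (class names and keys should be 
double-checked against trunk; the non-HA/HA split is just what we do in our test 
setup):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FederationProxyWiringSketch {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    conf.setBoolean(YarnConfiguration.FEDERATION_ENABLED, true);

    // Non-HA setup: federation-aware provider that resolves the RM via the state store.
    conf.set(YarnConfiguration.CLIENT_FAILOVER_PROXY_PROVIDER,
        "org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider");

    // HA setup: use the stock provider that walks the configured rm-ids instead.
    // conf.set(YarnConfiguration.CLIENT_FAILOVER_PROXY_PROVIDER,
    //     "org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider");
  }
}
{code}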

> yarn.federation.failover.enabled missing in yarn-default.xml
> 
>
> Key: YARN-7592
> URL: https://issues.apache.org/jira/browse/YARN-7592
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.0.0-beta1
>Reporter: Gera Shegalov
>Priority: Major
> Attachments: IssueReproduce.patch
>
>
> yarn.federation.failover.enabled should be documented in yarn-default.xml. I 
> am also not sure why it should be true by default and force the HA retry 
> policy in {{RMProxy#createRMProxy}}






[jira] [Resolved] (YARN-8755) Add clean up for FederationStore apps

2018-09-07 Thread Subru Krishnan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan resolved YARN-8755.
--
Resolution: Duplicate

[~bibinchundatt], this should be addressed by YARN-6648 & YARN-7599. Your 
review of the latter will be appreciated.

Thanks.

> Add clean up for FederationStore apps
> -
>
> Key: YARN-8755
> URL: https://issues.apache.org/jira/browse/YARN-8755
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Priority: Major
>
> We should add clean-up logic for the application-to-home-cluster mapping in 
> the federation state store.






[jira] [Commented] (YARN-5597) YARN Federation improvements

2018-09-05 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605094#comment-16605094
 ] 

Subru Krishnan commented on YARN-5597:
--

[~bibinchundatt], we use an RDBMS (SQL) for the Federation store and ZK for the RM 
store because 1) there's no leader election in Federation and 2) we only store 
metadata, for which a DB performs great and which is not what ZK is intended for 
(IMHO ZK has been abused/misused a lot).

That said, [~elgoiri] has a deployment with ZK for both the Federation and RM 
stores, so he should be able to guide you.

> YARN Federation improvements
> 
>
> Key: YARN-5597
> URL: https://issues.apache.org/jira/browse/YARN-5597
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
>Priority: Major
>
> This umbrella JIRA tracks a set of improvements over the YARN Federation MVP 
> (YARN-2915).






[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml

2018-09-05 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605092#comment-16605092
 ] 

Subru Krishnan commented on YARN-7592:
--

[~bibinchundatt]/[~jira.shegalov], I have tested multiple times with a similar 
setup (for 2.9 release) and never faced any issues.

 

FYI the FEDERATION_FAILOVER_ENABLED is automatically set by 
{{FederationProxyProviderUtil}} if HA is enabled as you can see 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/federation/failover/FederationProxyProviderUtil.java#L128].

> yarn.federation.failover.enabled missing in yarn-default.xml
> 
>
> Key: YARN-7592
> URL: https://issues.apache.org/jira/browse/YARN-7592
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.0.0-beta1
>Reporter: Gera Shegalov
>Priority: Major
>
> yarn.federation.failover.enabled should be documented in yarn-default.xml. I 
> am also not sure why it should be true by default and force the HA retry 
> policy in {{RMProxy#createRMProxy}}






[jira] [Commented] (YARN-8637) [GPG] Add FederationStateStore getAppInfo API for GlobalPolicyGenerator

2018-08-08 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574087#comment-16574087
 ] 

Subru Krishnan commented on YARN-8637:
--

Thanks [~botong] for the patch. It looks mostly good; I have a few minor comments:
 * Set the default to either in-memory or ZK (essentially, keep it consistent 
with the current *FederationStateStore* configs), as a SQL dependency shouldn't be 
expected out of the box. Also, I don't see the need to add a value in 
yarn-default.
 * The implementation for ZK is missing.
 * Please add a test for the in-memory impl as well.
 * I feel we should use some other name than *ApplicationsInfo* (and the 
corresponding getters/setters), as that looks too close to *AppInfo*; I am worried 
it may cause some confusion.

> [GPG] Add FederationStateStore getAppInfo API for GlobalPolicyGenerator
> ---
>
> Key: YARN-8637
> URL: https://issues.apache.org/jira/browse/YARN-8637
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8637-YARN-7402.v1.patch
>
>
> The core API for the FederationStateStore is provided in _FederationStateStore_. 
> In this patch, we add a _FederationGPGStateStore_ API just for the GPG. 
> Specifically, we add an API to get the full application info from the 
> state store along with the starting timestamp of the app entry, so that the 
> _ApplicationCleaner_ (YARN-7599) in the GPG can delete and clean up old entries 
> in the table.






[jira] [Commented] (YARN-8626) Create HomePolicyManager that sends all the requests to the home subcluster

2018-08-06 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570999#comment-16570999
 ] 

Subru Krishnan commented on YARN-8626:
--

Thanks [~elgoiri] for addressing my comments, +1 on the latest patch (v8) 
pending Yetus.

> Create HomePolicyManager that sends all the requests to the home subcluster
> ---
>
> Key: YARN-8626
> URL: https://issues.apache.org/jira/browse/YARN-8626
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Giovanni Matteo Fumarola
>Assignee: Íñigo Goiri
>Priority: Minor
> Fix For: 3.2.0
>
> Attachments: YARN-8626.000.patch, YARN-8626.001.patch, 
> YARN-8626.002.patch, YARN-8626.003.patch, YARN-8626.004.patch, 
> YARN-8626.005.patch, YARN-8626.006.patch, YARN-8626.007.patch, 
> YARN-8626.008.patch
>
>
> To have the same behavior as a regular non-federated deployment, one should 
> be able to submit jobs to the local RM and get the job constrained to that 
> subcluster.
> This JIRA creates an AMRMProxyPolicy that sends resources to the home 
> subcluster and mimics the behavior of a non-federated cluster.






[jira] [Commented] (YARN-8626) Create LocalPolicyManager that sends all the requests to the home subcluster

2018-08-06 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570604#comment-16570604
 ] 

Subru Krishnan commented on YARN-8626:
--

Thanks [~elgoiri] for the patch. I looked at it; please find my comments below:
 * We generally use _home_ and not _local_. In this case I suggest replacing 
_local_ with either _home_ or _reflexive_?
 * If possible, can you remove the empty *notifyOfResponse* impls from the 
stateless AMRMProxyPolicies, as those are now redundant?
 * In the {{LocalAMRMProxyPolicy}}, add a check to validate that the _home SC_ is 
indeed active (and a corresponding test); see the sketch after this list.
 * The *FederationPolicyInitializationContext* will not have the _home SC_ set 
in {{LocalRouterPolicy}}, as it's the responsibility of the router to do so 
(chicken-or-egg situation :)). If you have capacity reserved, then the ideal 
approach would be to query the _StateStore_ to figure out which SC has capacity 
and select that as the _home SC_. If you don't have capacity reserved, then you 
should use *UniformRandomRouterPolicy* directly.
 * Add a test for {{LocalRouterPolicy}}, if it's still required based on the 
above comment.
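A rough sketch of the active-home-sub-cluster check - the facade call and exception 
type are assumptions about the surrounding policy code, so treat it as illustrative:

{code:java}
import java.util.Map;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.server.federation.policies.exceptions.FederationPolicyException;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterId;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterInfo;
import org.apache.hadoop.yarn.server.federation.utils.FederationStateStoreFacade;

public class HomeSubClusterCheckSketch {
  /** Throw if the configured home sub-cluster is not among the active ones. */
  static void validateHomeSubCluster(FederationStateStoreFacade facade,
      SubClusterId homeSubCluster) throws YarnException {
    // true == filter out inactive sub-clusters (assumed facade semantics).
    Map<SubClusterId, SubClusterInfo> active = facade.getSubClusters(true);
    if (!active.containsKey(homeSubCluster)) {
      throw new FederationPolicyException(
          "Home sub-cluster " + homeSubCluster + " is not active");
    }
  }
}
{code}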

> Create LocalPolicyManager that sends all the requests to the home subcluster
> 
>
> Key: YARN-8626
> URL: https://issues.apache.org/jira/browse/YARN-8626
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Giovanni Matteo Fumarola
>Assignee: Íñigo Goiri
>Priority: Minor
> Fix For: 3.2.0
>
> Attachments: YARN-8626.000.patch, YARN-8626.001.patch, 
> YARN-8626.002.patch, YARN-8626.003.patch, YARN-8626.004.patch, 
> YARN-8626.005.patch
>
>
> To have the same behavior as a regular non-federated deployment, one should 
> be able to submit jobs to the local RM and get the job constrained to that 
> subcluster.
> This JIRA creates an AMRMProxyPolicy that sends resources to the home 
> subcluster and mimics the behavior of a non-federated cluster.






[jira] [Commented] (YARN-7833) [PERF/TEST] Extend SLS to support simulation of a Federated Environment

2018-08-03 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568597#comment-16568597
 ] 

Subru Krishnan commented on YARN-7833:
--

[~tanujnay], thanks for the contribution as it's extremely useful. 
Unfortunately I am not familiar with SLS codebase so hopefully 
[~leftnoteasy]/[~curino] have some bandwidth to take a look.

> [PERF/TEST] Extend SLS to support simulation of a Federated Environment
> ---
>
> Key: YARN-7833
> URL: https://issues.apache.org/jira/browse/YARN-7833
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Tanuj Nayak
>Priority: Major
> Attachments: YARN-7833.v1.patch, YARN-7833.v2.patch, 
> YARN-7833.v3.patch, YARN-7833.v4.patch, YARN-7833.v5.patch, 
> YARN-7833.v6.patch, YARN-7833.v7.patch
>
>
> To develop algorithms for federation, it would be of great help to have a 
> version of SLS that supports multiple RMs and the GPG.






[jira] [Commented] (YARN-8434) Update federation documentation of Nodemanager configurations

2018-07-12 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542264#comment-16542264
 ] 

Subru Krishnan commented on YARN-8434:
--

Thanks [~elgoiri] for your feedback. I agree that both of the points you raised 
are valid, and we do call out pointing clients to the {{AMRMProxy}} in the 
[doc|http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/Federation.html#Running_a_Sample_Job]. 
For the HADOOP_CLIENT_CONF, we should track it in the existing Jira - YARN-4083.

[~bibinchundatt], do cherry-pick to branch-2/2.9 as well when you commit. 
Thanks!

 

> Update federation documentation of Nodemanager configurations
> -
>
> Key: YARN-8434
> URL: https://issues.apache.org/jira/browse/YARN-8434
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: YARN-8434.001.patch, YARN-8434.002.patch, 
> YARN-8434.003.patch
>
>
> FederationRMFailoverProxyProvider doesn't handle connecting to active RM. 






[jira] [Commented] (YARN-8434) Update federation documentation of Nodemanager configurations

2018-07-11 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540428#comment-16540428
 ] 

Subru Krishnan commented on YARN-8434:
--

Thanks [~bibinchundatt] for understanding/verifying! +1 from my side on latest 
patch (v3).

[~elgoiri], do you have any other documentation fixes before this goes in?

> Update federation documentation of Nodemanager configurations
> -
>
> Key: YARN-8434
> URL: https://issues.apache.org/jira/browse/YARN-8434
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: YARN-8434.001.patch, YARN-8434.002.patch, 
> YARN-8434.003.patch
>
>
> FederationRMFailoverProxyProvider doesn't handle connecting to active RM. 






[jira] [Commented] (YARN-8434) Nodemanager not registering to active RM in federation

2018-07-10 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539415#comment-16539415
 ] 

Subru Krishnan commented on YARN-8434:
--

Thanks [~bibinchundatt] for the clarification, I understand the confusion now. 
That documentation is outdated and has to be fixed: we now automatically set 
{{FederationRMFailoverProxyProvider}} internally via {{FederationProxyProviderUtil}}, 
so the NM config overriding is not required. My bad, I apologize.

If it works for you, we can re-purpose the Jira to fix the doc?

> Nodemanager not registering to active RM in federation
> --
>
> Key: YARN-8434
> URL: https://issues.apache.org/jira/browse/YARN-8434
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-8434.001.patch, YARN-8434.002.patch
>
>
> FederationRMFailoverProxyProvider doesn't handle connecting to active RM. 






[jira] [Commented] (YARN-8484) Fix NPE during ServiceStop in Router classes

2018-07-09 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537794#comment-16537794
 ] 

Subru Krishnan commented on YARN-8484:
--

Thanks [~giovanni.fumarola] for the clarification, +1 from my side.

> Fix NPE during ServiceStop in Router classes
> 
>
> Key: YARN-8484
> URL: https://issues.apache.org/jira/browse/YARN-8484
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Minor
> Attachments: YARN-8484.v1.patch
>
>
> Fix NPE during ServiceStop in Router classes.






[jira] [Commented] (YARN-7953) [GQ] Data structures for federation global queues calculations

2018-07-09 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537437#comment-16537437
 ] 

Subru Krishnan commented on YARN-7953:
--

Thanks [~abmodi] for working on this and [~botong] for the review. 

I looked at the patch and have a quick comment - since we are fully wire 
compliant with YARN APIs in Federated mode, the data structures should be part 
of *GPG* and not *RM*. IIUC they are only used for convenience in GPG for 
recalculating queue hierarchies.

> [GQ] Data structures for federation global queues calculations
> --
>
> Key: YARN-7953
> URL: https://issues.apache.org/jira/browse/YARN-7953
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-7953-YARN-7402.v1.patch, 
> YARN-7953-YARN-7402.v2.patch, YARN-7953-YARN-7402.v3.patch, 
> YARN-7953-YARN-7402.v4.patch, YARN-7953-YARN-7402.v5.patch, 
> YARN-7953-YARN-7402.v6.patch, YARN-7953.v1.patch
>
>
> This Jira tracks data structures and helper classes used by the core 
> algorithms of YARN-7402 umbrella Jira (currently YARN-7403, and YARN-7834).






[jira] [Commented] (YARN-8434) Nodemanager not registering to active RM in federation

2018-07-09 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537402#comment-16537402
 ] 

Subru Krishnan commented on YARN-8434:
--

Thanks [~bibinchundatt] for the patch. In our deployment, we have separate 
config directories for the client and the server. This allows us not only to 
control client behavior independently of the server for scenarios exactly like 
this one, but it is also more secure, as the server configs are then not 
leaked/shared with clients. Will a similar approach work for you?

> Nodemanager not registering to active RM in federation
> --
>
> Key: YARN-8434
> URL: https://issues.apache.org/jira/browse/YARN-8434
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-8434.001.patch, YARN-8434.002.patch
>
>
> FederationRMFailoverProxyProvider doesn't handle connecting to active RM. 






[jira] [Commented] (YARN-8484) Fix NPE during ServiceStop in Router classes

2018-07-09 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537385#comment-16537385
 ] 

Subru Krishnan commented on YARN-8484:
--

Thanks [~giovanni.fumarola] for the quick fix. How did you validate it? Can you 
add a test?

> Fix NPE during ServiceStop in Router classes
> 
>
> Key: YARN-8484
> URL: https://issues.apache.org/jira/browse/YARN-8484
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Giovanni Matteo Fumarola
>Assignee: Giovanni Matteo Fumarola
>Priority: Minor
> Attachments: YARN-8484.v1.patch
>
>
> Fix NPE during ServiceStop in Router classes.






[jira] [Commented] (YARN-8434) Nodemanager not registering to active RM in federation

2018-06-22 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520852#comment-16520852
 ] 

Subru Krishnan commented on YARN-8434:
--

[~bibinchundatt], thanks for reporting this. I would like to understand the 
context more: are you trying to use the {{FederationRMFailoverProxyProvider}} 
for NM-RM communication, where we use {{RequestHedgingRMFailoverProxyProvider}}? 
We currently use {{FederationRMFailoverProxyProvider}} for the AM-RM protocol.

> Nodemanager not registering to active RM in federation
> --
>
> Key: YARN-8434
> URL: https://issues.apache.org/jira/browse/YARN-8434
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
>
> FederationRMFailoverProxyProvider doesn't handle connecting to active RM. 






[jira] [Assigned] (YARN-7953) [GQ] Data structures for federation global queues calculations

2018-05-29 Thread Subru Krishnan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan reassigned YARN-7953:


Assignee: Abhishek Modi  (was: Carlo Curino)

> [GQ] Data structures for federation global queues calculations
> --
>
> Key: YARN-7953
> URL: https://issues.apache.org/jira/browse/YARN-7953
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-7953.v1.patch
>
>
> This Jira tracks data structures and helper classes used by the core 
> algorithms of YARN-7402 umbrella Jira (currently YARN-7403, and YARN-7834).






[jira] [Assigned] (YARN-7405) [GQ] Bias container allocations based on global view

2018-05-25 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan reassigned YARN-7405:


Assignee: Abhishek Modi  (was: Arun Suresh)

> [GQ] Bias container allocations based on global view
> 
>
> Key: YARN-7405
> URL: https://issues.apache.org/jira/browse/YARN-7405
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation
>Reporter: Carlo Curino
>Assignee: Abhishek Modi
>Priority: Major
>
> Each RM in a federation should bias its local allocations of containers based 
> on the global over/under utilization of queues. As part of this the local RM 
> should account for the work that other RMs will be doing in between the 
> updates we receive via the heartbeats of YARN-7404 (the mechanics used for 
> synchronization).






[jira] [Assigned] (YARN-7404) [GQ] propagate to GPG queue-level utilization/pending information

2018-05-25 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan reassigned YARN-7404:


Assignee: Abhishek Modi  (was: Jose Miguel Arreola)

> [GQ] propagate to GPG queue-level utilization/pending information
> -
>
> Key: YARN-7404
> URL: https://issues.apache.org/jira/browse/YARN-7404
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation
>Reporter: Carlo Curino
>Assignee: Abhishek Modi
>Priority: Major
>







[jira] [Commented] (YARN-7900) [AMRMProxy] AMRMClientRelayer for stateful FederationInterceptor

2018-05-18 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481046#comment-16481046
 ] 

Subru Krishnan commented on YARN-7900:
--

[~botong]/[~asuresh], don't we need this in branch-2 as well?

> [AMRMProxy] AMRMClientRelayer for stateful FederationInterceptor
> 
>
> Key: YARN-7900
> URL: https://issues.apache.org/jira/browse/YARN-7900
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7900.v1.patch, YARN-7900.v2.patch, 
> YARN-7900.v3.patch, YARN-7900.v4.patch, YARN-7900.v5.patch, 
> YARN-7900.v6.patch, YARN-7900.v7.patch, YARN-7900.v8.patch, YARN-7900.v9.patch
>
>
> Inside the stateful FederationInterceptor (YARN-7899), we need a component 
> similar to AMRMClient that remembers all pending (outstanding) requests we've 
> sent to YarnRM, auto re-registers, and re-sends all pending requests when YarnRM 
> fails over and throws ApplicationMasterNotRegisteredException back. This JIRA 
> adds this component as AMRMClientRelayer.






[jira] [Updated] (YARN-8110) AMRMProxy recover should catch for all throwable to avoid premature exit

2018-04-06 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-8110:
-
Summary: AMRMProxy recover should catch for all throwable to avoid 
premature exit  (was: AMRMProxy recover should catch for all throwable retrying 
to recover apps)

> AMRMProxy recover should catch for all throwable to avoid premature exit
> 
>
> Key: YARN-8110
> URL: https://issues.apache.org/jira/browse/YARN-8110
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8110.v1.patch
>
>
> In NM work-preserving restart, when AMRMProxy recovers applications one by 
> one, the current code only catches IOException. If one app's recovery throws 
> something else (e.g. a RuntimeException), it will fail the entire AMRMProxy 
> recovery.
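A minimal sketch of the per-application catch-all described above - the method and 
type names are placeholders, not the actual AMRMProxyService code:

{code:java}
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RecoverySketch {
  private static final Logger LOG = LoggerFactory.getLogger(RecoverySketch.class);

  /** Placeholder for whatever recovers a single application. */
  interface AppRecoverer {
    void recover(String appId) throws Exception;
  }

  /** Recover each app independently; one bad app must not abort the whole recovery. */
  static void recoverAll(List<String> appIds, AppRecoverer recoverer) {
    for (String appId : appIds) {
      try {
        recoverer.recover(appId);
      } catch (Throwable t) { // previously only IOException was caught here
        LOG.error("Failed to recover application {}, skipping it", appId, t);
      }
    }
  }
}
{code}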






[jira] [Commented] (YARN-8010) Add config in FederationRMFailoverProxy to not bypass facade cache when failing over

2018-03-27 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416529#comment-16416529
 ] 

Subru Krishnan commented on YARN-8010:
--

Thanks [~botong] for the contribution and [~giovanni.fumarola] for the review, 
I have committed this to trunk/branch-3.1/branch-2/branch-2.9.

[~botong], it didn't apply cleanly to branch-3.0, so feel free to reopen and 
provide a patch if you want this in 3.0.2+.

> Add config in FederationRMFailoverProxy to not bypass facade cache when 
> failing over
> 
>
> Key: YARN-8010
> URL: https://issues.apache.org/jira/browse/YARN-8010
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
> Attachments: YARN-8010.v1.patch, YARN-8010.v1.patch, 
> YARN-8010.v2.patch, YARN-8010.v3.patch
>
>
> Today, when YarnRM is failing over, the FederationRMFailoverProxy running in 
> AMRMProxy will perform failover, try to get the latest sub-cluster info from 
> the FederationStateStore, and then retry connecting to the latest YarnRM master. 
> When calling getSubCluster() on the FederationStateStoreFacade, it bypasses the 
> cache with a flush flag. When YarnRM is failing over, every AM heartbeat thread 
> creates a different thread inside FederationInterceptor, each of which keeps 
> performing failover several times. This leads to a big spike of getSubCluster 
> calls to the FederationStateStore.
> Depending on the cluster setup (e.g. putting a VIP before all YarnRMs), a 
> YarnRM master/slave change might not result in an RM address change. In other 
> cases, a small delay in getting the latest sub-cluster information may be 
> acceptable. This patch thus adds a config option, so that it is possible to 
> ask the FederationRMFailoverProxy not to flush the cache when calling 
> getSubCluster().
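A rough sketch of the knob being added - the config key below is illustrative, not 
the actual name introduced by this patch:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterId;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterInfo;
import org.apache.hadoop.yarn.server.federation.utils.FederationStateStoreFacade;

public class FailoverCacheSketch {
  /** On failover, only bypass the facade cache when the flag asks for it. */
  static SubClusterInfo lookupOnFailover(Configuration conf,
      FederationStateStoreFacade facade, SubClusterId subClusterId)
      throws YarnException {
    // Illustrative key; see the patch for the real one.
    boolean flushCache = conf.getBoolean(
        "yarn.federation.failover.flush-cache-on-failover", true);
    return facade.getSubCluster(subClusterId, flushCache);
  }
}
{code}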






[jira] [Updated] (YARN-8010) Add config in FederationRMFailoverProxy to not bypass facade cache when failing over

2018-03-27 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-8010:
-
Summary: Add config in FederationRMFailoverProxy to not bypass facade cache 
when failing over  (was: add config in FederationRMFailoverProxy to not bypass 
facade cache when failing over)

> Add config in FederationRMFailoverProxy to not bypass facade cache when 
> failing over
> 
>
> Key: YARN-8010
> URL: https://issues.apache.org/jira/browse/YARN-8010
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
> Attachments: YARN-8010.v1.patch, YARN-8010.v1.patch, 
> YARN-8010.v2.patch, YARN-8010.v3.patch
>
>
> Today when YarnRM is failing over, the FederationRMFailoverProxy running in 
> AMRMProxy will perform failover, try to get the latest subcluster info from 
> FederationStateStore, and then retry connecting to the new YarnRM master. When 
> calling getSubCluster() on FederationStateStoreFacade, it bypasses the cache 
> with a flush flag. When YarnRM is failing over, every AM heartbeat thread 
> creates a different thread inside FederationInterceptor, each of which keeps 
> performing failover several times. This leads to a big spike of getSubCluster 
> calls to FederationStateStore. 
> Depending on the cluster setup (e.g. putting a VIP before all YarnRMs), a 
> YarnRM master-slave change might not result in an RM address change. In other 
> cases, a small delay in getting the latest subcluster information may be 
> acceptable. This patch therefore creates a config option, so that it is 
> possible to ask the FederationRMFailoverProxy not to flush the cache when 
> calling getSubCluster(). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8010) add config in FederationRMFailoverProxy to not bypass facade cache when failing over

2018-03-14 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399177#comment-16399177
 ] 

Subru Krishnan commented on YARN-8010:
--

Thanks [~botong] for the patch and [~giovanni.fumarola] for reviewing it.

 

Do you want this in trunk/branch-2 or YARN-7402?

> add config in FederationRMFailoverProxy to not bypass facade cache when 
> failing over
> 
>
> Key: YARN-8010
> URL: https://issues.apache.org/jira/browse/YARN-8010
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
> Attachments: YARN-8010.v1.patch, YARN-8010.v1.patch, 
> YARN-8010.v2.patch, YARN-8010.v3.patch
>
>
> Today when YarnRM is failing over, the FederationRMFailoverProxy running in 
> AMRMProxy will perform failover, try to get the latest subcluster info from 
> FederationStateStore, and then retry connecting to the new YarnRM master. When 
> calling getSubCluster() on FederationStateStoreFacade, it bypasses the cache 
> with a flush flag. When YarnRM is failing over, every AM heartbeat thread 
> creates a different thread inside FederationInterceptor, each of which keeps 
> performing failover several times. This leads to a big spike of getSubCluster 
> calls to FederationStateStore. 
> Depending on the cluster setup (e.g. putting a VIP before all YarnRMs), a 
> YarnRM master-slave change might not result in an RM address change. In other 
> cases, a small delay in getting the latest subcluster information may be 
> acceptable. This patch therefore creates a config option, so that it is 
> possible to ask the FederationRMFailoverProxy not to flush the cache when 
> calling getSubCluster(). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7405) [GQ] Bias container allocations based on global view

2018-03-05 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan reassigned YARN-7405:


Assignee: Arun Suresh  (was: Subru Krishnan)

> [GQ] Bias container allocations based on global view
> 
>
> Key: YARN-7405
> URL: https://issues.apache.org/jira/browse/YARN-7405
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation
>Reporter: Carlo Curino
>Assignee: Arun Suresh
>Priority: Major
>
> Each RM in a federation should bias its local allocations of containers based 
> on the global over/under utilization of queues. As part of this the local RM 
> should account for the work that other RMs will be doing in between the 
> updates we receive via the heartbeats of YARN-7404 (the mechanics used for 
> synchronization).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7945) Java Doc error in UnmanagedAMPoolManager for branch-2

2018-02-20 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370770#comment-16370770
 ] 

Subru Krishnan commented on YARN-7945:
--

[~rohithsharma]/[~jlowe], thanks for bringing it to my attention.

[~jlowe], I am not sure how the import got dropped as it's in the patch and we 
specifically ran yetus against branch-2 successfully before committing.

[~botong], do you want to provide the quick fix?

> Java Doc error in UnmanagedAMPoolManager for branch-2
> -
>
> Key: YARN-7945
> URL: https://issues.apache.org/jira/browse/YARN-7945
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0, 2.9.1
>Reporter: Rohith Sharma K S
>Priority: Major
>
> In branch-2, I see a javadoc error while building the package. 
> {code}
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/uam/UnmanagedAMPoolManager.java:151:
>  error: reference not found
> [ERROR]* @see ApplicationSubmissionContext
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/uam/UnmanagedAMPoolManager.java:204:
>  error: reference not found
> [ERROR]* @see ApplicationSubmissionContext
> {code}
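
For reference, the error disappears once the class referenced by the {{@see}} tag is importable from the file, or the reference is fully qualified; a small sketch:
{code}
// Option 1: keep the import the @see tag relies on.
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

/**
 * ...
 * @see ApplicationSubmissionContext
 */

// Option 2: fully qualify the reference so no import is needed.
/**
 * @see org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext
 */
{code}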



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7945) Java Doc error in UnmanagedAMPoolManager for branch-2

2018-02-20 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370770#comment-16370770
 ] 

Subru Krishnan edited comment on YARN-7945 at 2/21/18 12:02 AM:


[~rohithsharma]/[~jlowe], thanks for bringing it to my attention.

[~jlowe], I am not sure how the import got dropped as it's in the patch and we 
specifically ran yetus against branch-2 successfully before committing. The only 
likelihood is a regression caused by trying to fix an unused-import checkstyle 
warning at commit.

[~botong], do you want to provide the quick fix?


was (Author: subru):
[~rohithsharma]/[~jlowe], thanks for bringing it to my attention.

[~jlowe], I am not sure how the import got dropped as it's in the patch and we 
specifically ran yetus against branch-2 successfully before committing.

[~botong], do you want to provide the quick fix?

> Java Doc error in UnmanagedAMPoolManager for branch-2
> -
>
> Key: YARN-7945
> URL: https://issues.apache.org/jira/browse/YARN-7945
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0, 2.9.1
>Reporter: Rohith Sharma K S
>Priority: Major
>
> In branch-2, I see a javadoc error while building the package. 
> {code}
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/uam/UnmanagedAMPoolManager.java:151:
>  error: reference not found
> [ERROR]* @see ApplicationSubmissionContext
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/uam/UnmanagedAMPoolManager.java:204:
>  error: reference not found
> [ERROR]* @see ApplicationSubmissionContext
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7934) [GQ] Refactor preemption calculators to allow overriding for Federation Global Algos

2018-02-16 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368055#comment-16368055
 ] 

Subru Krishnan commented on YARN-7934:
--

Thanks [~curino] for updating the patch, the latest rev (v4) LGTM. The test 
failure looks unrelated, can you confirm?

[~leftnoteasy], do you want to take a quick look before we commit?

> [GQ] Refactor preemption calculators to allow overriding for Federation 
> Global Algos
> 
>
> Key: YARN-7934
> URL: https://issues.apache.org/jira/browse/YARN-7934
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Carlo Curino
>Priority: Major
> Attachments: YARN-7934.v1.patch, YARN-7934.v2.patch, 
> YARN-7934.v3.patch, YARN-7934.v4.patch
>
>
> This Jira tracks minimal changes in the capacity scheduler preemption 
> mechanics that allow for sub-classing and overriding of certain behaviors, 
> which we use to implement federation global algorithms, e.g., in YARN-7403.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7934) [GQ] Refactor preemption calculators to allow overriding for Federation Global Algos

2018-02-15 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366351#comment-16366351
 ] 

Subru Krishnan commented on YARN-7934:
--

Thanks [~curino] for the patch, it looks fairly straightforward. I have only one 
nit - can you add Javadocs for the new public and protected methods (especially 
so that overriding expectations are clear)? Also, I don't see any consumers for 
the public methods; is that in a subsequent patch?

> [GQ] Refactor preemption calculators to allow overriding for Federation 
> Global Algos
> 
>
> Key: YARN-7934
> URL: https://issues.apache.org/jira/browse/YARN-7934
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Carlo Curino
>Priority: Major
> Attachments: YARN-7934.v1.patch, YARN-7934.v2.patch
>
>
> This Jira tracks minimal changes in the capacity scheduler preemption 
> mechanics that allow for sub-classing and overriding of certain behaviors, 
> which we use to implement federation global algorithms, e.g., in YARN-7403.
>  
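
To make the overriding expectations concrete, a hypothetical subclass could look like the sketch below (which hooks are actually opened up is defined by the patch itself, not by this example):
{code}
import org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy;

// Hypothetical example of the sub-classing pattern the refactoring enables.
public class FederationGlobalPreemptionPolicy
    extends ProportionalCapacityPreemptionPolicy {

  @Override
  public void editSchedule() {
    // A federation-aware policy could bias per-queue targets using the
    // global view before (or instead of) the purely local preemption pass.
    super.editSchedule();
  }
}
{code}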



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7508) NPE in FiCaSchedulerApp when debug log enabled in async-scheduling mode

2018-01-09 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-7508:
-
Target Version/s:   (was: 3.1.0, 2.9.1)

> NPE in FiCaSchedulerApp when debug log enabled in async-scheduling mode
> ---
>
> Key: YARN-7508
> URL: https://issues.apache.org/jira/browse/YARN-7508
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.9.0, 3.0.0-alpha4
>Reporter: Tao Yang
>Assignee: Tao Yang
> Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.1
>
> Attachments: YARN-7508.001.patch
>
>
> YARN-6678 fixed the IllegalStateException problem, but the debug log it 
> added may cause an NPE when trying to print the containerId of a non-existent 
> reserved container on this node. Replacing 
> {{schedulerContainer.getSchedulerNode().getReservedContainer().getContainerId()}}
>  with {{schedulerContainer.getSchedulerNode().getReservedContainer()}} can 
> fix this problem.
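
In other words, the debug statement should log the (possibly null) reserved container itself rather than dereferencing it; roughly:
{code}
// Sketch of the guarded debug logging: string concatenation tolerates a null
// reserved container, whereas calling getContainerId() on it throws the NPE.
if (LOG.isDebugEnabled()) {
  LOG.debug("Reserved container on node "
      + schedulerContainer.getSchedulerNode().getNodeID() + ": "
      + schedulerContainer.getSchedulerNode().getReservedContainer());
}
{code}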



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7508) NPE in FiCaSchedulerApp when debug log enabled in async-scheduling mode

2018-01-09 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-7508:
-
Fix Version/s: 2.9.1

> NPE in FiCaSchedulerApp when debug log enabled in async-scheduling mode
> ---
>
> Key: YARN-7508
> URL: https://issues.apache.org/jira/browse/YARN-7508
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.9.0, 3.0.0-alpha4
>Reporter: Tao Yang
>Assignee: Tao Yang
> Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.1
>
> Attachments: YARN-7508.001.patch
>
>
> YARN-6678 fixed the IllegalStateException problem, but the debug log it 
> added may cause an NPE when trying to print the containerId of a non-existent 
> reserved container on this node. Replacing 
> {{schedulerContainer.getSchedulerNode().getReservedContainer().getContainerId()}}
>  with {{schedulerContainer.getSchedulerNode().getReservedContainer()}} can 
> fix this problem.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1709) Admission Control: Reservation subsystem

2017-12-22 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302037#comment-16302037
 ] 

Subru Krishnan commented on YARN-1709:
--

[~xingbao], thanks for your interest. I have responded to you in YARN-1051 
[here|https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=16302033].

> Admission Control: Reservation subsystem
> 
>
> Key: YARN-1709
> URL: https://issues.apache.org/jira/browse/YARN-1709
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Carlo Curino
>Assignee: Subru Krishnan
> Fix For: 2.6.0
>
> Attachments: YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, 
> YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch
>
>
> This JIRA is about the key data structure used to track resources over time 
> to enable YARN-1051. The Reservation subsystem is conceptually a "plan" of 
> how the scheduler will allocate resources over time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.

2017-12-22 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302033#comment-16302033
 ] 

Subru Krishnan edited comment on YARN-1051 at 12/22/17 10:42 PM:
-

[~xingbao], the behavior depends on whether any job is using more than its 
guaranteed resources on the specific node and whether preemption is enabled in 
the cluster. 

If there's no job using excess resources on the specific node, then either:
* relax locality to rack
* wait for one of the running job AMs to release container(s)

If there is at least one job using excess resources on the specific node, then:
* If preemption is enabled (refer [here | 
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Capacity_Scheduler_container_preemption]
 on how to enable it), the over-allocated container(s) will be preempted
* wait for one of the running job AMs to release container(s)


was (Author: subru):
[~xingbao], the behavior depends on whether there's any job that's using more 
than it's guaranteed resources in the specific node and if preemption is 
enabled or not in the cluster. 

If there's no job using excess resources in the specific node, then either:
* relax locality to rack
* wait for one of the running job AMs to release container(s)

If there is at least one job which is using excess resources in the specific 
node, then:
* If you have preemption is enabled (refer 
[http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Capacity_Scheduler_container_preemption|here]
 on how to enable it), the over allocated container(s) will get preempted
*  wait for one of the running job AMs to release container(s)

> YARN Admission Control/Planner: enhancing the resource allocation model with 
> time.
> --
>
> Key: YARN-1051
> URL: https://issues.apache.org/jira/browse/YARN-1051
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, resourcemanager, scheduler
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Fix For: 2.6.0
>
> Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, 
> YARN-1051.patch, curino_MSR-TR-2013-108.pdf, socc14-paper15.pdf, 
> techreport.pdf
>
>
> In this umbrella JIRA we propose to extend the YARN RM to handle time 
> explicitly, allowing users to "reserve" capacity over time. This is an 
> important step towards SLAs, long-running services, and workflows, and helps 
> with gang scheduling.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.

2017-12-22 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302033#comment-16302033
 ] 

Subru Krishnan commented on YARN-1051:
--

[~xingbao], the behavior depends on whether any job is using more than its 
guaranteed resources on the specific node and whether preemption is enabled in 
the cluster. 

If there's no job using excess resources on the specific node, then either:
* relax locality to rack
* wait for one of the running job AMs to release container(s)

If there is at least one job using excess resources on the specific node, then:
* If preemption is enabled (refer 
[here|http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Capacity_Scheduler_container_preemption]
 on how to enable it), the over-allocated container(s) will be preempted (see 
the config sketch below)
* wait for one of the running job AMs to release container(s)
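
For completeness, a minimal sketch of turning on CapacityScheduler preemption programmatically (the same properties would normally go in yarn-site.xml, as described in the linked documentation):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch: enable the scheduler monitor that drives CapacityScheduler preemption.
Configuration conf = new YarnConfiguration();
conf.setBoolean("yarn.resourcemanager.scheduler.monitor.enable", true);
conf.set("yarn.resourcemanager.scheduler.monitor.policies",
    "org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity."
        + "ProportionalCapacityPreemptionPolicy");
{code}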

> YARN Admission Control/Planner: enhancing the resource allocation model with 
> time.
> --
>
> Key: YARN-1051
> URL: https://issues.apache.org/jira/browse/YARN-1051
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, resourcemanager, scheduler
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Fix For: 2.6.0
>
> Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, 
> YARN-1051.patch, curino_MSR-TR-2013-108.pdf, socc14-paper15.pdf, 
> techreport.pdf
>
>
> In this umbrella JIRA we propose to extend the YARN RM to handle time 
> explicitly, allowing users to "reserve" capacity over time. This is an 
> important step towards SLAs, long-running services, and workflows, and helps 
> with gang scheduling.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7630) Fix AMRMToken rollover handling in AMRMProxy

2017-12-14 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-7630:
-
Summary: Fix AMRMToken rollover handling in AMRMProxy  (was: Fix AMRMToken 
handling in AMRMProxy)

> Fix AMRMToken rollover handling in AMRMProxy
> 
>
> Key: YARN-7630
> URL: https://issues.apache.org/jira/browse/YARN-7630
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
> Attachments: YARN-7630.v1.patch, YARN-7630.v1.patch
>
>
> Symptom: after RM rolls over the master key for AMRMToken, whenever the RPC 
> connection from FederationInterceptor to RM breaks due to a transient network 
> issue and reconnects, heartbeats to RM start failing because of the “Invalid 
> AMRMToken” exception. Whenever it hits, it happens for both the home RM and 
> secondary RMs. 
> Related facts: 
> 1. When RM issues a new AMRMToken, it always sends it with the service name 
> field as an empty string. The RPC layer on the AM side will set it properly 
> before starting to use it. 
> 2. UGI keeps all tokens in a map from serviceName->Token. Initially 
> AMRMClientUtils.createRMProxy() is used to load the first token and start the 
> RM connection. 
> 3. When RM renews the token, YarnServerSecurityUtils.updateAMRMToken() is used 
> to load it into UGI and replace the existing token (with the same serviceName 
> key). 
> Bug: 
> The bug is that 2-AMRMClientUtils.createRMProxy() and 
> 3-YarnServerSecurityUtils.updateAMRMToken() do not handle the sequence 
> consistently. We always need to load the token (with the empty service name) 
> into UGI first, before we set the serviceName, so that the previous AMRMToken 
> will be overridden. But 2 does it in reverse. That’s why, after RM rolls the 
> amrmToken, the UGI ends up with two tokens. Whenever the RPC connection breaks 
> and reconnects, the wrong token could be picked, triggering the exception. 
> Fix: 
> Load the AMRMToken into UGI first and then update the service name field for 
> RPC (see the sketch below).
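
A minimal sketch of that ordering, assuming the standard UserGroupInformation/SecurityUtil APIs (an illustration, not the literal patch):
{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.security.AMRMTokenIdentifier;

// Sketch: load the renewed token (still carrying the empty service name it
// was issued with) into the UGI first, and only then point its service name
// at the RM address for the RPC layer, per the description above.
void updateAMRMToken(Token<AMRMTokenIdentifier> amrmToken,
    InetSocketAddress rmAddress) throws IOException {
  UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
  ugi.addToken(amrmToken);
  SecurityUtil.setTokenService(amrmToken, rmAddress);
}
{code}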



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor

2017-12-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-7652:
-
Description: We (cc [~goiri]/[~botong]) observed that the 
{{FederationInterceptor}} in {{AMRMProxy}} (and consequently the AM) is blocked 
if the _StateStore_ has outdated info about a _SubCluster_. This is because we 
handle AM register requests synchronously. This jira proposes to move to async 
similar to how we operate with allocate invocations.  (was: We (cc 
[~goiri]/[~botong]) observed that the {{FederationInterceptor}} in 
{{AMRMProxy}} and consequently the application is blocked if the _StateStore_ 
has outdated info about a _SubCluster_. This is because we handle AM register 
requests synchronously. This jira proposes to move to async similar to how we 
operate with allocate invocations.)

> Handle AM register requests asynchronously in FederationInterceptor
> ---
>
> Key: YARN-7652
> URL: https://issues.apache.org/jira/browse/YARN-7652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: amrmproxy, federation
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Subru Krishnan
>Assignee: Botong Huang
>
> We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in 
> {{AMRMProxy}} (and consequently the AM) is blocked if the _StateStore_ has 
> outdated info about a _SubCluster_. This is because we handle AM register 
> requests synchronously. This jira proposes to move to async similar to how we 
> operate with allocate invocations.
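
A rough sketch of the async hand-off idea (the executor usage and helper name are illustrative; the actual mechanics live inside FederationInterceptor):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: register with secondary sub-clusters on a worker thread so the
// AM's registerApplicationMaster() call is not blocked by a sub-cluster
// whose StateStore entry is stale or unreachable.
ExecutorService registerExecutor = Executors.newSingleThreadExecutor();
registerExecutor.submit(() -> {
  try {
    registerWithSecondarySubClusters(request);   // hypothetical helper
  } catch (Exception e) {
    LOG.warn("Async registration with secondary sub-clusters failed", e);
  }
});
{code}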



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7642) Container execution type is not updated after promotion/demotion in NMContext

2017-12-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-7642:
-
Target Version/s: 2.9.1, 3.0.1

> Container execution type is not updated after promotion/demotion in NMContext
> -
>
> Key: YARN-7642
> URL: https://issues.apache.org/jira/browse/YARN-7642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Attachments: YARN-7642.001.patch
>
>
> Found this bug while working on YARN-7617. After calling the API to promote a 
> container from OPPORTUNISTIC to GUARANTEED, the node manager web page still 
> shows the container execution type as OPPORTUNISTIC. It looks like the 
> container execution type in NMContext was not updated accordingly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7649) RMContainer state transition exception after container update

2017-12-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-7649:
-
Target Version/s: 2.9.1, 3.0.1

> RMContainer state transition exception after container update
> -
>
> Key: YARN-7649
> URL: https://issues.apache.org/jira/browse/YARN-7649
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.0
>Reporter: Weiwei Yang
>Assignee: Arun Suresh
>
> I've seen this in a cluster deployment as well as in UT; running 
> {{TestAMRMClient#testAMRMClientWithContainerPromotion}} can reproduce it. 
> It doesn't fail the test case, but the following error message shows up in the 
> log
> {noformat}
> 2017-12-13 19:41:31,817 ERROR rmcontainer.RMContainerImpl 
> (RMContainerImpl.java:handle(480)) - Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> RELEASED at ALLOCATED
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:478)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:675)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1586)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:155)
>   at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>   at java.lang.Thread.run(Thread.java:748)
> 2017-12-13 19:41:31,817 ERROR rmcontainer.RMContainerImpl 
> (RMContainerImpl.java:handle(481)) - Invalid event RELEASED on container 
> container_1513165290804_0001_01_03
> {noformat}
> this seems to be related to YARN-6251.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7652) Handle AM register requests asynchronously in FederationInterceptor

2017-12-13 Thread Subru Krishnan (JIRA)
Subru Krishnan created YARN-7652:


 Summary: Handle AM register requests asynchronously in 
FederationInterceptor
 Key: YARN-7652
 URL: https://issues.apache.org/jira/browse/YARN-7652
 Project: Hadoop YARN
  Issue Type: Bug
  Components: amrmproxy, federation
Affects Versions: 2.9.0, 3.0.0
Reporter: Subru Krishnan
Assignee: Botong Huang


We (cc [~goiri]/[~botong]) observed that the {{FederationInterceptor}} in 
{{AMRMProxy}} and consequently the application is blocked if the _StateStore_ 
has outdated info about a _SubCluster_. This is because we handle AM register 
requests synchronously. This jira proposes to move to async similar to how we 
operate with allocate invocations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7511) NPE in ContainerLocalizer when localization failed for running container

2017-12-08 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-7511:
-
Target Version/s: 3.1.0, 2.9.1

> NPE in ContainerLocalizer when localization failed for running container
> 
>
> Key: YARN-7511
> URL: https://issues.apache.org/jira/browse/YARN-7511
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha4, 2.9.1
>Reporter: Tao Yang
>Assignee: Tao Yang
> Attachments: YARN-7511.001.patch
>
>
> Error log:
> {noformat}
> 2017-09-30 20:14:32,839 FATAL [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>         at 
> java.util.concurrent.ConcurrentHashMap.replaceNode(ConcurrentHashMap.java:1106)
>         at 
> java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1097)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceSet.resourceLocalizationFailed(ResourceSet.java:151)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:821)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:813)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1335)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:95)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1372)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1365)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>         at java.lang.Thread.run(Thread.java:834)
> 2017-09-30 20:14:32,842 INFO [AsyncDispatcher ShutDown handler] 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
> {noformat}
> Reproduce this problem:
> 1. Container was running and ContainerManagerImpl#localize was called for 
> this container
> 2. Localization failed in ResourceLocalizationService$LocalizerRunner#run and 
> sent out ContainerResourceFailedEvent with null LocalResourceRequest.
> 3. NPE when ResourceLocalizationFailedWhileRunningTransition#transition --> 
> container.resourceSet.resourceLocalizationFailed(null)
> I think we can fix this problem by ensuring that the request is not null 
> before removing it (a minimal sketch follows below).
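
A minimal sketch of that guard (the surrounding field name is assumed for illustration):
{code}
// Sketch: ignore the null LocalResourceRequest delivered with the
// ContainerResourceFailedEvent instead of passing it to remove().
public void resourceLocalizationFailed(LocalResourceRequest request) {
  if (request == null) {
    return;
  }
  pendingResources.remove(request);   // field name assumed for illustration
}
{code}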



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7508) NPE in FiCaSchedulerApp when debug log enabled in async-scheduling mode

2017-12-08 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-7508:
-
Target Version/s: 3.1.0, 2.9.1

> NPE in FiCaSchedulerApp when debug log enabled in async-scheduling mode
> ---
>
> Key: YARN-7508
> URL: https://issues.apache.org/jira/browse/YARN-7508
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.9.0, 3.0.0-alpha4
>Reporter: Tao Yang
>Assignee: Tao Yang
> Attachments: YARN-7508.001.patch
>
>
> YARN-6678 fixed the IllegalStateException problem, but the debug log it 
> added may cause an NPE when trying to print the containerId of a non-existent 
> reserved container on this node. Replacing 
> {{schedulerContainer.getSchedulerNode().getReservedContainer().getContainerId()}}
>  with {{schedulerContainer.getSchedulerNode().getReservedContainer()}} can 
> fix this problem.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7591) NPE in async-scheduling mode of CapacityScheduler

2017-12-08 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284441#comment-16284441
 ] 

Subru Krishnan commented on YARN-7591:
--

Thanks [~Tao Yang] for the contribution and [~leftnoteasy] for the 
review/commit. 

[~leftnoteasy], I see the commit in trunk but not in branch-2/2.9, so are you 
planning to cherry-pick it down?

> NPE in async-scheduling mode of CapacityScheduler
> -
>
> Key: YARN-7591
> URL: https://issues.apache.org/jira/browse/YARN-7591
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.0.0-alpha4, 2.9.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-7591.001.patch, YARN-7591.002.patch
>
>
> Currently, in async-scheduling mode of CapacityScheduler, an NPE may be raised 
> in the special scenarios below.
> (1) The user is removed after its last application finishes; an NPE may be 
> raised if async-scheduling threads read the user object without a null check.
> (2) An NPE may be raised when trying to fulfill a reservation for a finished 
> application in {{CapacityScheduler#allocateContainerOnSingleNode}}.
> {code}
> RMContainer reservedContainer = node.getReservedContainer();
> if (reservedContainer != null) {
>   FiCaSchedulerApp reservedApplication = getCurrentAttemptForContainer(
>   reservedContainer.getContainerId());
>   // NPE here: reservedApplication could be null after this application 
> finished
>   // Try to fulfill the reservation
>   LOG.info(
>   "Trying to fulfill reservation for application " + 
> reservedApplication
>   .getApplicationId() + " on node: " + node.getNodeID());
> {code}
> (3) If proposal1 (allocate containerX on node1) and proposal2 (reserve 
> containerY on node1) were generated by different async-scheduling threads 
> around the same time and proposal2 was submitted in front of proposal1, NPE 
> is raised when trying to submit proposal2 in 
> {{FiCaSchedulerApp#commonCheckContainerAllocation}}.
> {code}
> if (reservedContainerOnNode != null) {
>   // NPE here: allocation.getAllocateFromReservedContainer() should be 
> null for proposal2 in this case
>   RMContainer fromReservedContainer =
>   allocation.getAllocateFromReservedContainer().getRmContainer();
>   if (fromReservedContainer != reservedContainerOnNode) {
> if (LOG.isDebugEnabled()) {
>   LOG.debug(
>   "Try to allocate from a non-existed reserved container");
> }
> return false;
>   }
> }
> {code}
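
For scenario (3), the guard could look roughly like the sketch below, reusing the names from the snippet above (not necessarily the committed fix):
{code}
if (reservedContainerOnNode != null) {
  // Sketch: proposal2 reserves a new container and therefore has no source
  // reserved container, so check for null before dereferencing it.
  if (allocation.getAllocateFromReservedContainer() == null) {
    return false;
  }
  RMContainer fromReservedContainer =
      allocation.getAllocateFromReservedContainer().getRmContainer();
  if (fromReservedContainer != reservedContainerOnNode) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Try to allocate from a non-existed reserved container");
    }
    return false;
  }
}
{code}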



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6704) Add support for work preserving NM restart when FederationInterceptor is enabled in AMRMProxyService

2017-12-08 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-6704:
-
Target Version/s:   (was: 3.1.0, 2.9.1)

> Add support for work preserving NM restart when FederationInterceptor is 
> enabled in AMRMProxyService
> 
>
> Key: YARN-6704
> URL: https://issues.apache.org/jira/browse/YARN-6704
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
> Fix For: 3.1.0, 2.10.0, 2.9.1
>
> Attachments: YARN-6704-YARN-2915.v1.patch, 
> YARN-6704-YARN-2915.v2.patch, YARN-6704.v3.patch, YARN-6704.v4.patch, 
> YARN-6704.v5.patch, YARN-6704.v6.patch, YARN-6704.v7.patch, 
> YARN-6704.v8.patch, YARN-6704.v9.patch
>
>
> YARN-1336 added the ability to restart NM without losing any running 
> containers. {{AMRMProxy}} restart is added in YARN-6127. In a Federated YARN 
> environment, there's additional state in the {{FederationInterceptor}} to 
> allow for spanning across multiple sub-clusters, so we need to enhance 
> {{FederationInterceptor}} to support work-preserving restart.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6704) Add support for work preserving NM restart when FederationInterceptor is enabled in AMRMProxyService

2017-12-08 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-6704:
-
Summary: Add support for work preserving NM restart when 
FederationInterceptor is enabled in AMRMProxyService  (was: Add Federation 
Interceptor restart when work preserving NM is enabled)

> Add support for work preserving NM restart when FederationInterceptor is 
> enabled in AMRMProxyService
> 
>
> Key: YARN-6704
> URL: https://issues.apache.org/jira/browse/YARN-6704
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
> Attachments: YARN-6704-YARN-2915.v1.patch, 
> YARN-6704-YARN-2915.v2.patch, YARN-6704.v3.patch, YARN-6704.v4.patch, 
> YARN-6704.v5.patch, YARN-6704.v6.patch, YARN-6704.v7.patch, 
> YARN-6704.v8.patch, YARN-6704.v9.patch
>
>
> YARN-1336 added the ability to restart NM without losing any running 
> containers. {{AMRMProxy}} restart is added in YARN-6127. In a Federated YARN 
> environment, there's additional state in the {{FederationInterceptor}} to 
> allow for spanning across multiple sub-clusters, so we need to enhance 
> {{FederationInterceptor}} to support work-preserving restart.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5871) Add support for reservation-based routing.

2017-12-05 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-5871:
-
Parent Issue: YARN-7402  (was: YARN-5597)

> Add support for reservation-based routing.
> --
>
> Key: YARN-5871
> URL: https://issues.apache.org/jira/browse/YARN-5871
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: YARN-2915
>Reporter: Carlo Curino
>Assignee: Carlo Curino
>  Labels: federation
> Attachments: YARN-5871-YARN-2915.01.patch, 
> YARN-5871-YARN-2915.01.patch, YARN-5871-YARN-2915.02.patch, 
> YARN-5871-YARN-2915.03.patch, YARN-5871-YARN-2915.04.patch
>
>
> Adding policies that can route reservations, and that then route applications 
> to where the reservations have been placed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6704) Add Federation Interceptor restart when work preserving NM is enabled

2017-12-05 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279354#comment-16279354
 ] 

Subru Krishnan commented on YARN-6704:
--

Thanks [~botong] for updating the patch and for the clarification.

{code}I've changed the UAM token storage to use local NMSS instead when 
AMRMProxy HA is not enabled. {code}

Can you update the test to assert for the above and we are good to go!

> Add Federation Interceptor restart when work preserving NM is enabled
> -
>
> Key: YARN-6704
> URL: https://issues.apache.org/jira/browse/YARN-6704
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
> Attachments: YARN-6704-YARN-2915.v1.patch, 
> YARN-6704-YARN-2915.v2.patch, YARN-6704.v3.patch, YARN-6704.v4.patch, 
> YARN-6704.v5.patch, YARN-6704.v6.patch, YARN-6704.v7.patch
>
>
> YARN-1336 added the ability to restart NM without losing any running 
> containers. {{AMRMProxy}} restart is added in YARN-6127. In a Federated YARN 
> environment, there's additional state in the {{FederationInterceptor}} to 
> allow for spanning across multiple sub-clusters, so we need to enhance 
> {{FederationInterceptor}} to support work-preserving restart.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7591) NPE in async-scheduling mode of CapacityScheduler

2017-12-01 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-7591:
-
Target Version/s: 2.9.1

> NPE in async-scheduling mode of CapacityScheduler
> -
>
> Key: YARN-7591
> URL: https://issues.apache.org/jira/browse/YARN-7591
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.0.0-alpha4, 2.9.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-7591.001.patch
>
>
> Currently, in async-scheduling mode of CapacityScheduler, an NPE may be raised 
> in the special scenarios below.
> (1) The user is removed after its last application finishes; an NPE may be 
> raised if async-scheduling threads read the user object without a null check.
> (2) An NPE may be raised when trying to fulfill a reservation for a finished 
> application in {{CapacityScheduler#allocateContainerOnSingleNode}}.
> {code}
> RMContainer reservedContainer = node.getReservedContainer();
> if (reservedContainer != null) {
>   FiCaSchedulerApp reservedApplication = getCurrentAttemptForContainer(
>   reservedContainer.getContainerId());
>   // NPE here: reservedApplication could be null after this application 
> finished
>   // Try to fulfill the reservation
>   LOG.info(
>   "Trying to fulfill reservation for application " + 
> reservedApplication
>   .getApplicationId() + " on node: " + node.getNodeID());
> {code}
> (3) If proposal1 (allocate containerX on node1) and proposal2 (reserve 
> containerY on node1) were generated by different async-scheduling threads 
> around the same time and proposal2 was submitted in front of proposal1, NPE 
> is raised when trying to submit proposal2 in 
> {{FiCaSchedulerApp#commonCheckContainerAllocation}}.
> {code}
> if (reservedContainerOnNode != null) {
>   // NPE here: allocation.getAllocateFromReservedContainer() should be 
> null for proposal2 in this case
>   RMContainer fromReservedContainer =
>   allocation.getAllocateFromReservedContainer().getRmContainer();
>   if (fromReservedContainer != reservedContainerOnNode) {
> if (LOG.isDebugEnabled()) {
>   LOG.debug(
>   "Try to allocate from a non-existed reserved container");
> }
> return false;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7509) AsyncScheduleThread and ResourceCommitterService are still running after RM is transitioned to standby

2017-11-27 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267425#comment-16267425
 ] 

Subru Krishnan commented on YARN-7509:
--

[~leftnoteasy], the fix version says 2.9.1 but it has not been cherry-picked to 
branch-2.9. Can you go ahead and do that? Thanks.

> AsyncScheduleThread and ResourceCommitterService are still running after RM 
> is transitioned to standby
> --
>
> Key: YARN-7509
> URL: https://issues.apache.org/jira/browse/YARN-7509
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha4, 2.9.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.1.0, 2.9.1, 3.0.1
>
> Attachments: YARN-7509.001.patch
>
>
> After RM is transitioned to standby, AsyncScheduleThread and 
> ResourceCommitterService will receive an interrupt signal. When a thread is 
> sleeping, it will ignore the interrupt signal, since the InterruptedException 
> is caught inside and the interrupt status is cleared.
> For AsyncScheduleThread, the InterruptedException is caught and ignored in 
> CapacityScheduler#schedule.
> For ResourceCommitterService, the InterruptedException is caught and ignored 
> in ResourceCommitterService#run. 
> We should let the interrupt signal propagate and make these threads exit.
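
The usual idiom for that is to restore the interrupt status when the exception is caught so the outer loop can observe it; a generic sketch (not the actual patch):
{code}
// Sketch: re-assert the interrupt flag instead of silently swallowing it,
// so the surrounding loop condition sees the signal and the thread exits.
while (!Thread.currentThread().isInterrupted()) {
  try {
    doScheduling();                      // hypothetical unit of work
    Thread.sleep(scheduleInterval);
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();  // preserve the interrupt status
  }
}
{code}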



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7548) TestCapacityOverTimePolicy.testAllocation is flaky

2017-11-21 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261646#comment-16261646
 ] 

Subru Krishnan commented on YARN-7548:
--

Thanks [~haibo.chen] for reporting this. Adding [~curino] as he wrote the cool 
logic to generate allocations in tests :).

> TestCapacityOverTimePolicy.testAllocation is flaky
> --
>
> Key: YARN-7548
> URL: https://issues.apache.org/jira/browse/YARN-7548
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: reservation system
>Affects Versions: 3.0.0-beta1
>Reporter: Haibo Chen
>
> It failed in both YARN-7337 and YARN-6921 jenkins jobs.
> org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy.testAllocation[Duration
>  90,000,000, height 0.25, numSubmission 1, periodic 8640)]
> *Stacktrace*
> junit.framework.AssertionFailedError: null
>   at junit.framework.Assert.fail(Assert.java:55)
>   at junit.framework.Assert.fail(Assert.java:64)
>   at junit.framework.TestCase.fail(TestCase.java:235)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.BaseSharingPolicyTest.runTest(BaseSharingPolicyTest.java:146)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy.testAllocation(TestCapacityOverTimePolicy.java:136)
> *Standard Output*
> 2017-11-20 23:57:03,759 INFO  [main] recovery.RMStateStore 
> (RMStateStore.java:transition(538)) - Storing reservation 
> allocation.reservation_-9026698577416205920_6337917439559340517
> 2017-11-20 23:57:03,759 INFO  [main] recovery.RMStateStore 
> (MemoryRMStateStore.java:storeReservationState(247)) - Storing 
> reservationallocation for 
> reservation_-9026698577416205920_6337917439559340517 for plan dedicated
> 2017-11-20 23:57:03,760 INFO  [main] reservation.InMemoryPlan 
> (InMemoryPlan.java:addReservation(373)) - Successfully added reservation: 
> reservation_-9026698577416205920_6337917439559340517 to plan.
> In-memory Plan: Parent Queue: dedicatedTotal Capacity:  vCores:1000>Step: 1000reservation_-9026698577416205920_6337917439559340517 
> user:u1 startTime: 0 endTime: 8640 Periodiciy: 8640 alloc:
> [Period: 8640
> 0: 
>  3423748: 
>  86223748: 
>  8640: 
>  9223372036854775807: null
>  ] 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7551) yarn.resourcemanager.reservation-system.max-periodicity is not in yarn-default.xml

2017-11-21 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261641#comment-16261641
 ] 

Subru Krishnan commented on YARN-7551:
--

Generally, we have been following the practice of exposing only what we 
consider as core configs in yarn-default. All advanced configs, we skip as I 
feel that we have way too many knobs in the first place.

> yarn.resourcemanager.reservation-system.max-periodicity is not in 
> yarn-default.xml
> --
>
> Key: YARN-7551
> URL: https://issues.apache.org/jira/browse/YARN-7551
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: reservation system
>Affects Versions: 3.0.0
>Reporter: Daniel Templeton
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6704) Add Federation Interceptor restart when work preserving NM is enabled

2017-11-21 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261588#comment-16261588
 ] 

Subru Krishnan commented on YARN-6704:
--

Thanks [~botong] for updating the patch. I went through it and have a few minor 
comments:
* Where are we cleaning up the registry/NMSS entries? This should be done when 
AM completes and should be covered in tests.
* Can we proceed with recovery, even though it is not work-preserving for 
secondaries, when there's no registry set up, since we now have HA and restart 
intermixed?
* Nit: move log level to debug for recovery of individual containers in home SC.

> Add Federation Interceptor restart when work preserving NM is enabled
> -
>
> Key: YARN-6704
> URL: https://issues.apache.org/jira/browse/YARN-6704
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
> Attachments: YARN-6704-YARN-2915.v1.patch, 
> YARN-6704-YARN-2915.v2.patch, YARN-6704.v3.patch, YARN-6704.v4.patch, 
> YARN-6704.v5.patch
>
>
> YARN-1336 added the ability to restart NM without loosing any running 
> containers. {{AMRMProxy}} restart is added in YARN-6127. In a Federated YARN 
> environment, there's additional state in the {{FederationInterceptor}} to 
> allow for spanning across multiple sub-clusters, so we need to enhance 
> {{FederationInterceptor}} to support work-preserving restart.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7390) All reservation related test cases failed when TestYarnClient runs against Fair Scheduler.

2017-11-20 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259991#comment-16259991
 ] 

Subru Krishnan commented on YARN-7390:
--

[~yufeigu]/[~haibo.chen], thanks for fixing this. Shouldn't it be included in 
branch-2/2.9, as you have 2.9.0 in the affected versions? It will be great if 
you can run the test against branch-2/2.9 before pushing. Thanks!

> All reservation related test cases failed when TestYarnClient runs against 
> Fair Scheduler.
> --
>
> Key: YARN-7390
> URL: https://issues.apache.org/jira/browse/YARN-7390
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, reservation system
>Affects Versions: 2.9.0, 3.0.0, 3.1.0
>Reporter: Yufei Gu
>Assignee: Yufei Gu
> Fix For: 3.0.1
>
> Attachments: YARN-7390.001.patch, YARN-7390.002.patch, 
> YARN-7390.003.patch, YARN-7390.004.patch, YARN-7390.005.patch
>
>
> All reservation related test cases failed when {{TestYarnClient}} runs 
> against Fair Scheduler. To reproduce it, you need to set scheduler class to 
> Fair Scheduler in yarn-default.xml.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6645) Bug fix in ContainerImpl when calling the symLink of LinuxContainerExecutor

2017-11-20 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-6645:
-
Fix Version/s: (was: 2.9.0)
   2.9.1

> Bug fix in ContainerImpl when calling the symLink of LinuxContainerExecutor
> ---
>
> Key: YARN-6645
> URL: https://issues.apache.org/jira/browse/YARN-6645
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Bingxue Qiu
> Fix For: 2.9.1
>
> Attachments: error when creating symlink.png
>
>
> When creating a symlink after the resource is localized in our clusters, an 
> IOException is thrown because the nmPrivateDir doesn't exist. We add a patch 
> to fix it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7278) LinuxContainer in docker mode will be failed when nodemanager restart, because timeout for docker is too slow.

2017-11-20 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-7278:
-
Fix Version/s: (was: 2.9.0)
   2.9.1

> LinuxContainer in docker mode will be failed when nodemanager restart, 
> because timeout for docker is too slow.
> --
>
> Key: YARN-7278
> URL: https://issues.apache.org/jira/browse/YARN-7278
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
> Environment: CentOS
>Reporter: zhengchenyu
> Fix For: 2.9.1
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> In our cluster, nodemanager recovery is turned on, and we use LinuxContainer 
> in docker mode.
> Containers may fail when the nodemanager restarts; the exception is below:
> {code}
> [2017-09-29T15:47:14.433+08:00] [INFO] 
> containermanager.monitor.ContainersMonitorImpl.run(ContainersMonitorImpl.java 
> 472) [Container Monitor] : Memory usage of ProcessTree 120523 for 
> container-id container_1506600355508_0023_01_04: -1B of 10 GB physical 
> memory used; -1B of 31 GB virtual memory used
> [2017-09-29T15:47:15.219+08:00] [ERROR] 
> containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java
>  93) [ContainersLauncher #1] : Unable to recover container 
> container_1506600355508_0023_01_04
> java.io.IOException: Timeout while waiting for exit code from 
> container_1506600355508_0023_01_04
> [2017-09-29T15:47:15.220+08:00] [INFO] 
> containermanager.container.ContainerImpl.handle(ContainerImpl.java 1142) 
> [AsyncDispatcher event handler] : Container 
> container_1506600355508_0023_01_04 transitioned from RUNNING to 
> EXITED_WITH_FAILURE
> [2017-09-29T15:47:15.221+08:00] [INFO] 
> containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java
>  440) [AsyncDispatcher event handler] : Cleaning up container 
> container_1506600355508_0023_01_04
> {code}
> I guess the process is done, but 2 seconds later (the variable is msecLeft) 
> the *.pid.exitcode file still had not been created. Then I changed the 
> variable to 2ms, and the container succeeded when the nodemanager restarted.
> So I think the timeout is too short for the docker container to complete the 
> work.
> In docker mode of LinuxContainer, the NM monitors the real task, which is 
> launched by the "docker run" command. Then the "docker wait" command waits 
> for the exit code, and "docker rm" deletes the docker container. Lastly, 
> container-executor writes the exit code. So if any of these docker commands 
> is slow enough, the NM cannot monitor the container. In fact, docker rm is 
> always slow. 
> I think the exit code of docker rm does not matter to the real task, so I 
> think we could move the write of "*.pid.exitcode" before the docker rm 
> command, or monitor the docker wait process instead of the real task.
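
A minimal sketch of the recovery-side wait that produces the "Timeout while waiting for exit code" error above, assuming the 2-second budget ({{msecLeft}}) mentioned in the report; variable names are illustrative and this is not the committed code:

{code}
// Hypothetical sketch: on NM restart, poll for the <container>.pid.exitcode
// file for a bounded time; if docker wait / docker rm have not finished yet,
// give up with the IOException seen in the log above.
File exitCodeFile = new File(pidFilePath + ".exitcode");
long msecLeft = 2000;                    // the 2-second budget in the report
while (!exitCodeFile.exists() && msecLeft > 0) {
  Thread.sleep(100);
  msecLeft -= 100;
}
if (!exitCodeFile.exists()) {
  throw new IOException("Timeout while waiting for exit code from "
      + containerIdStr);
}
int exitCode = Integer.parseInt(
    new String(Files.readAllBytes(exitCodeFile.toPath()), "UTF-8").trim());
{code}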



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6606) The implementation of LocalizationStatus in ContainerStatusProto

2017-11-20 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-6606:
-
Fix Version/s: (was: 2.9.0)
   2.9.1

> The implementation of LocalizationStatus in ContainerStatusProto
> 
>
> Key: YARN-6606
> URL: https://issues.apache.org/jira/browse/YARN-6606
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: nodemanager
>Affects Versions: 2.9.0
>Reporter: Bingxue Qiu
> Fix For: 2.9.1
>
> Attachments: YARN-6606.1.patch, YARN-6606.2.patch
>
>
> We have a use case where the full implementation of localization status in 
> ContainerStatusProto 
> [Continuous-resource-localization|https://issues.apache.org/jira/secure/attachment/12825041/Continuous-resource-localization.pdf]
> needs to be done, so we implemented it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6661) Too many CLEANUP events hang the ApplicationMasterLauncher thread pool

2017-11-20 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-6661:
-
Fix Version/s: (was: 2.9.0)
   2.9.1

> Too many CLEANUP events hang the ApplicationMasterLauncher thread pool
> -
>
> Key: YARN-6661
> URL: https://issues.apache.org/jira/browse/YARN-6661
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
> Environment: hadoop 2.7.2 
>Reporter: JackZhou
> Fix For: 2.9.1
>
>
> Someone else has already run into a similar problem and fixed it; see 
> https://issues.apache.org/jira/browse/YARN-3809 for details.
> But I think that fix does not solve the problem completely. Below is the 
> problem I encountered:
> There are about 1000 nodes in my hadoop cluster, and I submit about 1800 apps.
> I fail over my active RM, and the RM recovers all of those 1800 apps.
> When an application is recovered, the RM waits for the AM container to 
> register itself. But there is a bug in my AM (introduced intentionally), so 
> it never registers.
> The RM therefore waits about 10 minutes for the AM to expire, and then sends 
> a CLEANUP event to the ApplicationMasterLauncher thread pool. Because there 
> are about 1800 apps, this hangs the ApplicationMasterLauncher thread pool for 
> a long time. I have already applied the 
> patch(https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch), 
> so a CLEANUP event hangs a thread for 10 * 20 = 200s. But I have 1800 apps, 
> so each thread hangs for 1800 / 50 * 200s = 7200s, i.e. about 2 hours.
> Because the AM has not registered within 10 minutes, the RM retries and 
> creates a new application attempt. 
> The application attempt gets a container from the RM and sends a LAUNCH event 
> to the ApplicationMasterLauncher thread pool.
> But the 1800 CLEANUP events hang the 50 threads for about 2 hours, so the 
> application attempt cannot start the AM container within 10 minutes. 
> It then expires as well and sends another CLEANUP event to the 
> ApplicationMasterLauncher thread pool.
> As you can see, none of my applications can really run. 
> Each of them has 5 application attempts as follows, and each of them keeps 
> retrying.
> appattempt_1495786030132_4000_05
> appattempt_1495786030132_4000_04
> appattempt_1495786030132_4000_03
> appattempt_1495786030132_4000_02  
> appattempt_1495786030132_4000_01
> So all of my apps hang for several hours, and none of them can really run. 
> I think this is a bug. We could treat CLEANUP and LAUNCH as different events 
> and use a separate thread pool to handle LAUNCH events, or find another way.
> Sorry, my English is poor; I hope I have described it clearly.
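
A minimal sketch of the separation suggested above, assuming the existing {{ApplicationMasterLauncher}} event types and its {{createRunnableLauncher}} helper; this only illustrates the idea and is not a committed fix:

{code}
// Hypothetical sketch: route CLEANUP events to their own executor so slow
// cleanups cannot starve LAUNCH events in ApplicationMasterLauncher.
private final ExecutorService launchPool = Executors.newFixedThreadPool(50);
private final ExecutorService cleanupPool = Executors.newFixedThreadPool(50);

@Override
public void handle(AMLauncherEvent appEvent) {
  Runnable launcher =
      createRunnableLauncher(appEvent.getAppAttempt(), appEvent.getType());
  if (appEvent.getType() == AMLauncherEventType.CLEANUP) {
    cleanupPool.execute(launcher);   // cleanups queue separately
  } else {
    launchPool.execute(launcher);    // launches are never blocked by cleanups
  }
}
{code}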



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7190) Ensure only NM classpath in 2.x gets TSv2 related hbase jars, not the user classpath

2017-11-20 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-7190:
-
Fix Version/s: (was: 2.9.0)
   2.9.1

> Ensure only NM classpath in 2.x gets TSv2 related hbase jars, not the user 
> classpath
> 
>
> Key: YARN-7190
> URL: https://issues.apache.org/jira/browse/YARN-7190
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineclient, timelinereader, timelineserver
>Reporter: Vrushali C
>Assignee: Varun Saxena
> Fix For: YARN-5355_branch2, 2.9.1
>
> Attachments: YARN-7190-YARN-5355_branch2.01.patch, 
> YARN-7190-YARN-5355_branch2.02.patch, YARN-7190-YARN-5355_branch2.03.patch, 
> YARN-7190.01.patch
>
>
> [~jlowe] had a good observation about the user classpath getting extra jars 
> in hadoop 2.x brought in with TSv2. If users start picking up Hadoop 2.x's 
> version of the HBase jars instead of the ones they shipped with their job, it 
> could be a problem.
> So when TSv2 is to be used in 2.x, the HBase-related jars should go onto 
> only the NM classpath, not the user classpath.
> Here is a list of some jars
> {code}
> commons-csv-1.0.jar
> commons-el-1.0.jar
> commons-httpclient-3.1.jar
> disruptor-3.3.0.jar
> findbugs-annotations-1.3.9-1.jar
> hbase-annotations-1.2.6.jar
> hbase-client-1.2.6.jar
> hbase-common-1.2.6.jar
> hbase-hadoop2-compat-1.2.6.jar
> hbase-hadoop-compat-1.2.6.jar
> hbase-prefix-tree-1.2.6.jar
> hbase-procedure-1.2.6.jar
> hbase-protocol-1.2.6.jar
> hbase-server-1.2.6.jar
> htrace-core-3.1.0-incubating.jar
> jamon-runtime-2.4.1.jar
> jasper-compiler-5.5.23.jar
> jasper-runtime-5.5.23.jar
> jcodings-1.0.8.jar
> joni-2.1.2.jar
> jsp-2.1-6.1.14.jar
> jsp-api-2.1-6.1.14.jar
> jsr311-api-1.1.1.jar
> metrics-core-2.2.0.jar
> servlet-api-2.5-6.1.14.jar
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6128) Add support for AMRMProxy HA

2017-11-17 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257858#comment-16257858
 ] 

Subru Krishnan commented on YARN-6128:
--

Thanks [~botong]. I have committed to trunk but couldn't cherry-pick cleanly to 
branch-2, can you please provide a branch-2 patch?

> Add support for AMRMProxy HA
> 
>
> Key: YARN-6128
> URL: https://issues.apache.org/jira/browse/YARN-6128
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, nodemanager
>Reporter: Subru Krishnan
>Assignee: Botong Huang
> Attachments: YARN-6128.v0.patch, YARN-6128.v1.patch, 
> YARN-6128.v1.patch, YARN-6128.v10.patch, YARN-6128.v10.patch, 
> YARN-6128.v2.patch, YARN-6128.v3.patch, YARN-6128.v3.patch, 
> YARN-6128.v4.patch, YARN-6128.v5.patch, YARN-6128.v6.patch, 
> YARN-6128.v7.patch, YARN-6128.v8.patch, YARN-6128.v9.patch
>
>
> YARN-556 added the ability for RM failover without losing any running 
> applications. In a Federated YARN environment, there's additional state in 
> the {{AMRMProxy}} to allow for spanning across multiple sub-clusters, so we 
> need to enhance {{AMRMProxy}} to support HA.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5075) Fix findbugs warning in hadoop-yarn-common module

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-5075:
-
Fix Version/s: (was: 3.0.0)
   3.0.0-alpha1

> Fix findbugs warning in hadoop-yarn-common module
> -
>
> Key: YARN-5075
> URL: https://issues.apache.org/jira/browse/YARN-5075
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Akira Ajisaka
>Assignee: Arun Suresh
> Fix For: 2.9.0, 3.0.0-alpha1
>
> Attachments: YARN-5075.001.patch, YARN-5075.002.patch, findbugs.html
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5049) Extend NMStateStore to save queued container information

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-5049:
-
Fix Version/s: (was: 3.0.0)
   3.0.0-alpha1

> Extend NMStateStore to save queued container information
> 
>
> Key: YARN-5049
> URL: https://issues.apache.org/jira/browse/YARN-5049
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Arun Suresh
> Fix For: 2.9.0, 3.0.0-alpha1
>
> Attachments: YARN-5049-addendum.branch-2.001.patch, 
> YARN-5049.001.patch, YARN-5049.002.patch, YARN-5049.003.patch
>
>
> This JIRA is about extending the NMStateStore to save queued container 
> information whenever a new container is added to the NM queue. 
> It also removes the information from the state store when the queued 
> container starts its execution.
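
A heavily hedged sketch of the intent, with hypothetical method names and key layout (the actual API added by the patch may differ):

{code}
// Hypothetical sketch: persist a per-container "queued" marker in the NM
// state store when a container is queued, and delete it once the container
// starts running. Key layout and method names are illustrative only.
public void storeContainerQueued(ContainerId containerId) throws IOException {
  db.put(bytes("ContainerManager/containers/" + containerId + "/queued"),
      new byte[0]);
}

public void removeContainerQueued(ContainerId containerId) throws IOException {
  db.delete(bytes("ContainerManager/containers/" + containerId + "/queued"));
}
{code}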



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5073) Refactor startContainerInternal() in ContainerManager to remove unused parameter

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-5073:
-
Fix Version/s: (was: 3.0.0)
   3.0.0-alpha1

> Refactor startContainerInternal() in ContainerManager to remove unused 
> parameter
> 
>
> Key: YARN-5073
> URL: https://issues.apache.org/jira/browse/YARN-5073
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Konstantinos Karanasos
>Priority: Minor
> Fix For: 2.9.0, 3.0.0-alpha1
>
> Attachments: YARN-5073.001.patch
>
>
> The nmTokenIdentifier is no longer needed as a parameter in the 
> startContainerInternal() method of the ContainerManager.
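
A sketch of the before/after shape of the method, with signatures approximated from the description rather than copied from the patch:

{code}
// Before (approximate): the NMTokenIdentifier argument is accepted but unused.
//   startContainerInternal(NMTokenIdentifier nmTokenIdentifier,
//       ContainerTokenIdentifier containerTokenIdentifier,
//       StartContainerRequest request)
// After: the unused parameter is dropped.
//   startContainerInternal(ContainerTokenIdentifier containerTokenIdentifier,
//       StartContainerRequest request)
{code}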



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4412) Create ClusterMonitor to compute ordered list of preferred NMs for OPPORTUNISTIC containers

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-4412:
-
Fix Version/s: (was: 3.0.0)
   3.0.0-alpha1

> Create ClusterMonitor to compute ordered list of preferred NMs for 
> OPPORTUNISTIC containers
> --
>
> Key: YARN-4412
> URL: https://issues.apache.org/jira/browse/YARN-4412
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Fix For: 2.9.0, 3.0.0-alpha1
>
> Attachments: YARN-4412-yarn-2877.v1.patch, 
> YARN-4412-yarn-2877.v2.patch, YARN-4412-yarn-2877.v3.patch, 
> YARN-4412-yarn-2877.v4.patch, YARN-4412-yarn-2877.v5.patch, 
> YARN-4412-yarn-2877.v6.patch, YARN-4412.007.patch, YARN-4412.008.patch, 
> YARN-4412.009.patch, YARN-4412.addendum-001.patch, YARN-4412.find-bugs.patch
>
>
> Introduce a Cluster Monitor that aggregates load information from individual 
> Node Managers and computes an ordered list of preferred Node Managers to be 
> used as target nodes for OPPORTUNISTIC container allocations. 
> This list can be pushed out to the Node Manager (specifically the AMRMProxy 
> running on the node) via the Allocate Response. This will be used to make 
> local scheduling decisions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4738) Notify the RM about the status of OPPORTUNISTIC containers

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-4738:
-
Fix Version/s: (was: 3.0.0)
   3.0.0-alpha1

> Notify the RM about the status of OPPORTUNISTIC containers
> --
>
> Key: YARN-4738
> URL: https://issues.apache.org/jira/browse/YARN-4738
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Konstantinos Karanasos
> Fix For: 2.9.0, 3.0.0-alpha1
>
> Attachments: YARN-4738-yarn-2877.001.patch, 
> YARN-4738-yarn-2877.002.patch, YARN-4738.002.patch, YARN-4738.003.patch, 
> YARN-4738.004.patch, YARN-4738.005.patch, YARN-4738.006.patch
>
>
> When an OPPORTUNISTIC container finishes its execution (either successfully 
> or because it failed/got killed), the RM needs to be notified.
> This way the AM also gets notified in turn about the successfully completed 
> tasks, as well as for rescheduling failed/killed tasks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4991) Fix ContainerRequest Constructor to set nodelabelExpression correctly

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-4991:
-
Fix Version/s: (was: 3.0.0)
   3.0.0-alpha1

> Fix ContainerRequest Constructor to set nodelabelExpression correctly
> -
>
> Key: YARN-4991
> URL: https://issues.apache.org/jira/browse/YARN-4991
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: test
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Fix For: 2.9.0, 3.0.0-alpha1
>
> Attachments: 0001-YARN-4991.patch
>
>
> TestAMRMClient#testAskWithInvalidNodeLabels and 
> TestAMRMClient#testAskWithNodeLabels are failing because 
> {{ContainerRequest}} node labels are always set to {{null}}:
> {code}
> public ContainerRequest(Resource capability, String[] nodes, String[] 
> racks,
> Priority priority, boolean relaxLocality, String 
> nodeLabelsExpression) {
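>   // Bug: the nodeLabelsExpression parameter is ignored; null is passed to the chained constructor instead.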
>   this(capability, nodes, racks, priority, relaxLocality, null,
>   ExecutionType.GUARANTEED);
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-2995) Enhance UI to show cluster resource utilization of various container Execution types

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-2995:
-
Fix Version/s: (was: 3.0.0)
   3.0.0-alpha2

> Enhance UI to show cluster resource utilization of various container 
> Execution types
> 
>
> Key: YARN-2995
> URL: https://issues.apache.org/jira/browse/YARN-2995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sriram Rao
>Assignee: Konstantinos Karanasos
>Priority: Blocker
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: YARN-2995.001.patch, YARN-2995.002.patch, 
> YARN-2995.003.patch, YARN-2995.004.patch, all-nodes.png, all-nodes.png, 
> opp-container.png
>
>
> This JIRA proposes to extend the Resource manager UI to show how cluster 
> resources are being used to run *guaranteed start* and *queueable* 
> containers.  For example, a graph that shows over time, the fraction of  
> running containers that are *guaranteed start* and the fraction of running 
> containers that are *queueable*. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4335) Allow ResourceRequests to specify ExecutionType of a request ask

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-4335:
-
Fix Version/s: (was: 3.0.0)
   3.0.0-alpha1

> Allow ResourceRequests to specify ExecutionType of a request ask
> 
>
> Key: YARN-4335
> URL: https://issues.apache.org/jira/browse/YARN-4335
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Konstantinos Karanasos
> Fix For: 2.9.0, 3.0.0-alpha1
>
> Attachments: YARN-4335-yarn-2877.001.patch, YARN-4335.002.patch, 
> YARN-4335.003.patch
>
>
> YARN-2882 introduced container types that are internal (not user-facing) and 
> are used by the ContainerManager during execution at the NM.
> With this JIRA we are introducing (user-facing) resource request types that 
> are used by the AM to specify the type of the ResourceRequest.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-2888) Corrective mechanisms for rebalancing NM container queues

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-2888:
-
Fix Version/s: (was: 3.0.0)
   3.0.0-alpha1

> Corrective mechanisms for rebalancing NM container queues
> -
>
> Key: YARN-2888
> URL: https://issues.apache.org/jira/browse/YARN-2888
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Arun Suresh
> Fix For: 2.9.0, 3.0.0-alpha1
>
> Attachments: YARN-2888-yarn-2877.001.patch, 
> YARN-2888-yarn-2877.002.patch, YARN-2888.003.patch, YARN-2888.004.patch, 
> YARN-2888.005.patch, YARN-2888.006.patch, YARN-2888.007.patch, 
> YARN-2888.008.patch, YARN-2888.009.patch, YARN-2888.010.patch, 
> YARN-2888.011.patch
>
>
> Bad queuing decisions by the LocalRMs (e.g., due to the distributed nature of 
> the scheduling decisions or due to having a stale image of the system) may 
> lead to an imbalance in the waiting times of the NM container queues. This 
> can in turn have an impact in job execution times and cluster utilization.
> To this end, we introduce corrective mechanisms that may remove (whenever 
> needed) container requests from overloaded queues, adding them to less-loaded 
> ones.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-2885) Create AMRMProxy request interceptor for distributed scheduling decisions for queueable containers

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-2885:
-
Fix Version/s: (was: 3.0.0)
   3.0.0-alpha1

> Create AMRMProxy request interceptor for distributed scheduling decisions for 
> queueable containers
> --
>
> Key: YARN-2885
> URL: https://issues.apache.org/jira/browse/YARN-2885
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Arun Suresh
> Fix For: 2.9.0, 3.0.0-alpha1
>
> Attachments: YARN-2885-yarn-2877.001.patch, 
> YARN-2885-yarn-2877.002.patch, YARN-2885-yarn-2877.full-2.patch, 
> YARN-2885-yarn-2877.full-3.patch, YARN-2885-yarn-2877.full.patch, 
> YARN-2885-yarn-2877.v4.patch, YARN-2885-yarn-2877.v5.patch, 
> YARN-2885-yarn-2877.v6.patch, YARN-2885-yarn-2877.v7.patch, 
> YARN-2885-yarn-2877.v8.patch, YARN-2885-yarn-2877.v9.patch, 
> YARN-2885.010.patch, YARN-2885.011.patch, YARN-2885.012.patch, 
> YARN-2885_api_changes.patch
>
>
> We propose to add a Local ResourceManager (LocalRM) to the NM in order to 
> support distributed scheduling decisions. 
> Architecturally we leverage the RMProxy, introduced in YARN-2884. 
> The LocalRM makes distributed decisions for queueable container requests. 
> Guaranteed-start requests are still handled by the central RM.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-2883) Queuing of container requests in the NM

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-2883:
-
Fix Version/s: (was: 3.0.0)
   3.0.0-alpha1

> Queuing of container requests in the NM
> ---
>
> Key: YARN-2883
> URL: https://issues.apache.org/jira/browse/YARN-2883
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Konstantinos Karanasos
> Fix For: 2.9.0, 3.0.0-alpha1
>
> Attachments: YARN-2883-trunk.004.patch, YARN-2883-trunk.005.patch, 
> YARN-2883-trunk.006.patch, YARN-2883-trunk.007.patch, 
> YARN-2883-trunk.008.patch, YARN-2883-trunk.009.patch, 
> YARN-2883-trunk.010.patch, YARN-2883-trunk.011.patch, 
> YARN-2883-trunk.012.patch, YARN-2883-trunk.013.patch, 
> YARN-2883-yarn-2877.001.patch, YARN-2883-yarn-2877.002.patch, 
> YARN-2883-yarn-2877.003.patch, YARN-2883-yarn-2877.004.patch, 
> YARN-2883.013.patch, YARN-2883.014.patch, YARN-2883.015.patch
>
>
> We propose to add a queue in each NM, where queueable container requests can 
> be held.
> Based on the available resources in the node and the containers in the queue, 
> the NM will decide when to allow the execution of a queued container.
> In order to ensure the instantaneous start of a guaranteed-start container, 
> the NM may decide to pre-empt/kill running queueable containers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-2884) Proxying all AM-RM communications

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-2884:
-
Fix Version/s: (was: 3.0.0)
   2.8.0
   3.0.0-alpha1

> Proxying all AM-RM communications
> -
>
> Key: YARN-2884
> URL: https://issues.apache.org/jira/browse/YARN-2884
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Carlo Curino
>Assignee: Kishore Chaliparambil
> Fix For: 2.8.0, 2.9.0, 3.0.0-alpha1
>
> Attachments: YARN-2884-V1.patch, YARN-2884-V10.patch, 
> YARN-2884-V11.patch, YARN-2884-V12.patch, YARN-2884-V13.patch, 
> YARN-2884-V2.patch, YARN-2884-V3.patch, YARN-2884-V4.patch, 
> YARN-2884-V5.patch, YARN-2884-V6.patch, YARN-2884-V7.patch, 
> YARN-2884-V8.patch, YARN-2884-V9.patch
>
>
> We introduce the notion of an RMProxy, running on each node (or once per 
> rack). Upon start, the AM is forced (via tokens and configuration) to direct 
> all its requests to a new service running on the NM that provides a proxy to 
> the central RM. 
> This gives us a place to:
> 1) perform distributed scheduling decisions
> 2) throttle misbehaving AMs
> 3) mask the access to a federation of RMs



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6918) Remove acls after queue delete to avoid memory leak

2017-11-13 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249897#comment-16249897
 ] 

Subru Krishnan edited comment on YARN-6918 at 11/13/17 6:04 PM:


Thanks [~bibinchundatt] for raising this and [~sunilg] for bringing this to my 
attention. I took a look at it and feel it's relevant but not really a blocker:
* The new queue management is an _experimental_ feature that users/admins have 
to explicitly *opt in* to.
* Deleting a queue should be a rare event (unlike application management), so it 
is not in the routine or normal code path of app execution.

Accordingly I have set the priority of this JIRA to major and target version to 
2.9.1. 


was (Author: subru):
Thanks [~bibinchundatt] for raising this and [~sunilg] for bringing this to my 
attention. I took a look at it and feel it's relevant but not really a blocker:
* The new queue management is an _experimental_ feature that users/admins have 
to explicitly *opt in* to.
* Deleting a queue should be a rare event (unlike application management).

Accordingly I have set the priority of this JIRA to major and target version to 
2.9.1. 

> Remove acls after queue delete to avoid memory leak
> ---
>
> Key: YARN-6918
> URL: https://issues.apache.org/jira/browse/YARN-6918
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: YARN-6918.001.patch
>
>
> ACLs for a deleted queue need to be removed from allAcls to avoid a leak 
> (Priority, YarnAuthorizer).
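
A minimal sketch of the cleanup the summary asks for, with hypothetical method names (the actual patch touches the priority and YARN authorizer ACL maps):

{code}
// Hypothetical sketch: when a queue is removed, drop its ACL entries so the
// allAcls map does not grow without bound. Names are illustrative only.
public void removeQueueAcls(PrivilegedEntity queueEntity) {
  allAcls.remove(queueEntity);   // e.g. in the configured YARN authorizer
}
{code}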



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6918) Remove acls after queue delete to avoid memory leak

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-6918:
-
Target Version/s: 2.9.1

> Remove acls after queue delete to avoid memory leak
> ---
>
> Key: YARN-6918
> URL: https://issues.apache.org/jira/browse/YARN-6918
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: YARN-6918.001.patch
>
>
> ACLs for a deleted queue need to be removed from allAcls to avoid a leak 
> (Priority, YarnAuthorizer).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6918) Remove acls after queue delete to avoid memory leak

2017-11-13 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249897#comment-16249897
 ] 

Subru Krishnan commented on YARN-6918:
--

Thanks [~bibinchundatt] for raising this and [~sunilg] for bringing this to my 
attention. I took a look at it and feel it's relevant but not really a blocker:
* The new queue management is an _experimental_ feature that users/admins have 
to explicitly *opt in* to.
* Deleting a queue should be a rare event (unlike application management).

Accordingly I have set the priority of this JIRA to major and target version to 
2.9.1. 

> Remove acls after queue delete to avoid memory leak
> ---
>
> Key: YARN-6918
> URL: https://issues.apache.org/jira/browse/YARN-6918
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: YARN-6918.001.patch
>
>
> ACLs for a deleted queue need to be removed from allAcls to avoid a leak 
> (Priority, YarnAuthorizer).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6918) Remove acls after queue delete to avoid memory leak

2017-11-13 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-6918:
-
Priority: Major  (was: Critical)

> Remove acls after queue delete to avoid memory leak
> ---
>
> Key: YARN-6918
> URL: https://issues.apache.org/jira/browse/YARN-6918
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: YARN-6918.001.patch
>
>
> ACLs for a deleted queue need to be removed from allAcls to avoid a leak 
> (Priority, YarnAuthorizer).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7478) TEST-cetest fails in branch-2

2017-11-12 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan resolved YARN-7478.
--
   Resolution: Implemented
Fix Version/s: 2.9.0

Thanks [~varun_saxena] for bringing this up and [~leftnoteasy] for pointing out 
that YARN-7412 fixes it, so I simply cherry-picked YARN-7412 to 
branch-2/branch-2.9/branch-2.9.0.

> TEST-cetest fails in branch-2
> -
>
> Key: YARN-7478
> URL: https://issues.apache.org/jira/browse/YARN-7478
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Subru Krishnan
>Priority: Minor
> Fix For: 2.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7476) Fix miscellaneous issues in ATSv2 after merge to branch-2

2017-11-12 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248932#comment-16248932
 ] 

Subru Krishnan edited comment on YARN-7476 at 11/12/17 6:03 PM:


Thanks [~varun_saxena] for the fix, created YARN-7478 to track the test 
failure. I have committed this to branch-2/branch-2.9/branch-2.9.0.


was (Author: subru):
Thanks [~varun_saxena] for the fix, I have committed this to 
branch-2/branch-2.9/branch-2.9.0.

> Fix miscellaneous issues in ATSv2 after merge to branch-2
> -
>
> Key: YARN-7476
> URL: https://issues.apache.org/jira/browse/YARN-7476
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Fix For: 2.9.0
>
> Attachments: YARN-7476-branch-2.01.patch
>
>
> a) We are still using Resource#getMemory in 
> NMTimelinePublisher#publishContainerCreatedEvent. This has been deprecated 
> since YARN-4844. Better to use getMemorySize instead (see the sketch below).
> b) Post YARN-5865, application priority should be fetched from RMAppImpl 
> instead of app submission context. But we are still fetching it from 
> submission context while publishing entities to timeline service. This would 
> mean that if priority is updated, it will not be published to timeline 
> service.
> c) The order of app_collectors in NodeHeartbeatResponseProto is different 
> from trunk. Better to make it consistent.
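
A one-line sketch of change (a) above; variable names are illustrative and the authoritative change is in the attached patch:

{code}
// Sketch of change (a): use the long-valued, non-deprecated getter when
// publishing the container-created event.
Resource resource = container.getResource();
long memoryMB = resource.getMemorySize();   // was resource.getMemory(), deprecated
{code}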



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7478) TEST-cetest fails in branch-2

2017-11-12 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248923#comment-16248923
 ] 

Subru Krishnan commented on YARN-7478:
--

Refer to Yetus report in YARN-7476/YARN-5049 etc.

> TEST-cetest fails in branch-2
> -
>
> Key: YARN-7478
> URL: https://issues.apache.org/jira/browse/YARN-7478
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Subru Krishnan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7475) Fix Container log link in new YARN UI

2017-11-12 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-7475:
-
Summary: Fix Container log link in new YARN UI  (was: container log link is 
not working in new YARN UI)

> Fix Container log link in new YARN UI
> -
>
> Key: YARN-7475
> URL: https://issues.apache.org/jira/browse/YARN-7475
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.0.0-beta1
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: YARN-7475.001.patch
>
>
> Container log link is broken



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5049) Extend NMStateStore to save queued container information

2017-11-12 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248919#comment-16248919
 ] 

Subru Krishnan commented on YARN-5049:
--

+1 to the addendum patch. Thanks [~varun_saxena] for reporting it and 
[~asuresh] for fixing it. The test failure is unrelated and tracked in 
YARN-7478.

> Extend NMStateStore to save queued container information
> 
>
> Key: YARN-5049
> URL: https://issues.apache.org/jira/browse/YARN-5049
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Arun Suresh
> Fix For: 2.9.0, 3.0.0
>
> Attachments: YARN-5049-addendum.branch-2.001.patch, 
> YARN-5049.001.patch, YARN-5049.002.patch, YARN-5049.003.patch
>
>
> This JIRA is about extending the NMStateStore to save queued container 
> information whenever a new container is added to the NM queue. 
> It also removes the information from the state store when the queued 
> container starts its execution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7478) TEST-cetest fails in branch-2

2017-11-12 Thread Subru Krishnan (JIRA)
Subru Krishnan created YARN-7478:


 Summary: TEST-cetest fails in branch-2
 Key: YARN-7478
 URL: https://issues.apache.org/jira/browse/YARN-7478
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Subru Krishnan
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7476) Fix miscellaneous issues in ATSv2 after merge to branch-2

2017-11-11 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248715#comment-16248715
 ] 

Subru Krishnan edited comment on YARN-7476 at 11/11/17 11:07 PM:
-

Thanks [~varun_saxena] for the thorough investigation to dig this up. I 
compared the files between trunk and branch-2 and you are spot on, +1 on the 
patch (pending Yetus).


was (Author: subru):
Thanks [~varun_saxena] for the thorough investigation to dig this up. +1 on the 
patch (pending Yetus).

> Fix miscellaneous issues in ATSv2 after merge to branch-2
> -
>
> Key: YARN-7476
> URL: https://issues.apache.org/jira/browse/YARN-7476
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-7476-branch-2.01.patch
>
>
> a) We are still using Resource#getMemory in 
> NMTimelinePublisher#publishContainerCreatedEvent. This has been deprecated 
> since YARN-4844. Better to use getMemorySize instead.
> b) Post YARN-5865, application priority should be fetched from RMAppImpl 
> instead of app submission context. But we are still fetching it from 
> submission context while publishing entities to timeline service. This would 
> mean that if priority is updated, it will not be published to timeline 
> service.
> c) The order of app_collectors in NodeHeartbeatResponseProto is different 
> from trunk. Better to make it consistent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7476) Fix miscellaneous issues in ATSv2 after merge to branch-2

2017-11-11 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248715#comment-16248715
 ] 

Subru Krishnan commented on YARN-7476:
--

Thanks [~varun_saxena] for the thorough investigation to dig this up. +1 on the 
patch (pending Yetus).

> Fix miscellaneous issues in ATSv2 after merge to branch-2
> -
>
> Key: YARN-7476
> URL: https://issues.apache.org/jira/browse/YARN-7476
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-7476-branch-2.01.patch
>
>
> a) We are still using Resource#getMemory in 
> NMTimelinePublisher#publishContainerCreatedEvent. This has been deprecated 
> since YARN-4844. Better to use getMemorySize instead.
> b) Post YARN-5865, application priority should be fetched from RMAppImpl 
> instead of app submission context. But we are still fetching it from 
> submission context while publishing entities to timeline service. This would 
> mean that if priority is updated, it will not be published to timeline 
> service.
> c) The order of app_collectors in NodeHeartbeatResponseProto is different 
> from trunk. Better to make it consistent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


