[jira] [Commented] (YARN-9681) AM resource limit is incorrect for queue

2019-07-29 Thread ANANDA G B (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895778#comment-16895778
 ] 

ANANDA G B commented on YARN-9681:
--

Hi [~sunilg] [~leftnoteasy] [~bibinchundatt]

Can you please review the code?

> AM resource limit is incorrect for queue
> 
>
> Key: YARN-9681
> URL: https://issues.apache.org/jira/browse/YARN-9681
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1, 3.1.2
>Reporter: ANANDA G B
>Assignee: ANANDA G B
>Priority: Major
>  Labels: patch
> Attachments: After running job on queue1.png, Before running job on 
> queue1.png, YARN-9681.0001.patch, YARN-9681.0002.patch, YARN-9681.0003.patch
>
>
> After running a job on Queue1 of Partition1, Queue1 of DEFAULT_PARTITION's 
> 'Max Application Master Resources' is calculated wrongly. Please find the 
> attachments.






[jira] [Commented] (YARN-9694) UI always show default-rack for all the nodes while running SLS.

2019-07-29 Thread Abhishek Modi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895637#comment-16895637
 ] 

Abhishek Modi commented on YARN-9694:
-

Thanks [~elgoiri] for reviewing it.

GenerateNodeTableMapping generates a node-to-rack mapping file which is then 
used by TableMapping to resolve rack names. The format required by TableMapping 
is a two-column text file where the first column specifies the node name and 
the second column specifies the rack name. I am generating this file as part of 
generateNodeTableMapping.
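For illustration, the mapping file consumed by TableMapping would look roughly 
like this (host and rack names below are made up):
{code}
nodemanager-1.example.com /rack-1
nodemanager-2.example.com /rack-1
nodemanager-3.example.com /rack-2
{code}
TableMapping is typically wired in by setting net.topology.node.switch.mapping.impl 
to org.apache.hadoop.net.TableMapping and pointing net.topology.table.file.name at 
the generated file.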

 

Maybe I will remove the format name from the file. I will upload an updated 
patch changing the format and adding better javadoc.


> UI always show default-rack for all the nodes while running SLS.
> 
>
> Key: YARN-9694
> URL: https://issues.apache.org/jira/browse/YARN-9694
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9694.001.patch, YARN-9694.002.patch
>
>
> Currently, independent of the specification of the nodes in SLS.json or 
> nodes.json, the UI always shows that the rack of the node is default-rack.






[jira] [Commented] (YARN-9694) UI always show default-rack for all the nodes while running SLS.

2019-07-29 Thread Íñigo Goiri (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895572#comment-16895572
 ] 

Íñigo Goiri commented on YARN-9694:
---

Thanks [~abmodi] for the patch, a few comments:
* Can you add a javadoc explaining {{testGenerateNodeTableMapping()}}?
* Why do you delete the file first in {{testGenerateNodeTableMapping()}}?
* Can you explain a little better what's happening in 
{{generateNodeTableMapping()}}? The whole "node rack" format should be referenced 
somewhere.
* It doesn't look like CSV format for tableMapping.csv.

> UI always show default-rack for all the nodes while running SLS.
> 
>
> Key: YARN-9694
> URL: https://issues.apache.org/jira/browse/YARN-9694
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9694.001.patch, YARN-9694.002.patch
>
>
> Currently, independent of the specification of the nodes in SLS.json or 
> nodes.json, the UI always shows that the rack of the node is default-rack.






[jira] [Commented] (YARN-9694) UI always show default-rack for all the nodes while running SLS.

2019-07-29 Thread Abhishek Modi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1689#comment-1689
 ] 

Abhishek Modi commented on YARN-9694:
-

[~elgoiri], could you please review it? Thanks.

> UI always show default-rack for all the nodes while running SLS.
> 
>
> Key: YARN-9694
> URL: https://issues.apache.org/jira/browse/YARN-9694
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-9694.001.patch, YARN-9694.002.patch
>
>
> Currently, independent of the specification of the nodes in SLS.json or 
> nodes.json, the UI always shows that the rack of the node is default-rack.






[jira] [Commented] (YARN-9702) Backport YARN-5788 to branch-2.8

2019-07-29 Thread Eric Payne (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895454#comment-16895454
 ] 

Eric Payne commented on YARN-9702:
--

Thanks [~samkhan] for the backport.

+1. Will commit to branch-2.8

> Backport YARN-5788 to branch-2.8
> 
>
> Key: YARN-9702
> URL: https://issues.apache.org/jira/browse/YARN-9702
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.6
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: YARN-9702-branch-2.8.001.patch
>
>
> Backport YARN-5788 to branch-2.8.






[jira] [Updated] (YARN-9701) Yarn service cli commands do not connect to ssl enabled RM using ssl-client.xml configs

2019-07-29 Thread Tarun Parimi (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-9701:
---
Description: 
Yarn service commands use the YARN service REST API. When SSL is enabled for the 
RM, the yarn service commands fail because they don't read the ssl-client.xml 
configs to create an SSL connection to the REST API.

This becomes a problem especially for self-signed certificates, as the truststore 
location specified at ssl.client.truststore.location is not considered by the 
commands.

As a workaround, we need to import the certificates into the Java default cacerts 
for the yarn service commands to work over SSL. It would be more proper if the 
yarn service commands made use of the configs in ssl-client.xml to configure and 
create an SSL client connection. This workaround may not even work if there are 
additional properties configured in ssl-client.xml that are necessary apart from 
the truststore-related properties.

  was:
Yarn service commands use the YARN service REST API. When SSL is enabled for the 
RM, the yarn service commands fail because they don't read the ssl-client.xml 
configs to create an SSL connection to the REST API.

This becomes a problem especially for self-signed certificates, as the truststore 
location specified at ssl.client.truststore.location is not considered by the 
commands.

As a workaround, we need to import the certificates into the Java default cacerts 
for the yarn service commands to work over SSL. It would be more proper if the 
yarn service commands made use of the configs in ssl-client.xml to configure and 
create an SSL client connection.


> Yarn service cli commands do not connect to ssl enabled RM using 
> ssl-client.xml configs
> ---
>
> Key: YARN-9701
> URL: https://issues.apache.org/jira/browse/YARN-9701
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
>
> Yarn service commands use the YARN service REST API. When SSL is enabled for the 
> RM, the yarn service commands fail because they don't read the ssl-client.xml 
> configs to create an SSL connection to the REST API.
> This becomes a problem especially for self-signed certificates, as the truststore 
> location specified at ssl.client.truststore.location is not considered by the 
> commands.
> As a workaround, we need to import the certificates into the Java default cacerts 
> for the yarn service commands to work over SSL. It would be more proper if the 
> yarn service commands made use of the configs in ssl-client.xml to configure and 
> create an SSL client connection. This workaround may not even work if there are 
> additional properties configured in ssl-client.xml that are necessary apart from 
> the truststore-related properties.
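For reference, a minimal sketch of the proposed direction, assuming Hadoop's 
SSLFactory is used to pick up ssl-client.xml; the RM address and REST path below 
are placeholders:
{code:java}
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.ssl.SSLFactory;

public class SslClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // SSLFactory reads the resource named by hadoop.ssl.client.conf
    // (ssl-client.xml by default), including ssl.client.truststore.location.
    SSLFactory sslFactory = new SSLFactory(SSLFactory.Mode.CLIENT, conf);
    sslFactory.init();
    try {
      // Placeholder RM address and YARN services REST path.
      URL url = new URL("https://rm-host:8090/app/v1/services/sleeper");
      HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
      conn.setSSLSocketFactory(sslFactory.createSSLSocketFactory());
      conn.setHostnameVerifier(sslFactory.getHostnameVerifier());
      System.out.println("HTTP status: " + conn.getResponseCode());
    } finally {
      sslFactory.destroy();
    }
  }
}
{code}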






[jira] [Commented] (YARN-9586) [QA] Need more doc for yarn.federation.policy-manager-params when LoadBasedRouterPolicy is used

2019-07-29 Thread qiuliang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895211#comment-16895211
 ] 

qiuliang commented on YARN-9586:


Hi [~shenyinjie], I still don't know how to write this JSON file. Can you give 
me an example? Thank you~

> [QA] Need more doc for yarn.federation.policy-manager-params when 
> LoadBasedRouterPolicy is used
> ---
>
> Key: YARN-9586
> URL: https://issues.apache.org/jira/browse/YARN-9586
> Project: Hadoop YARN
>  Issue Type: Wish
>  Components: federation
>Reporter: Shen Yinjie
>Priority: Major
>
> We picked LoadBasedRouterPolicy for YARN federation, but have no idea what to 
> set for yarn.federation.policy-manager-params. Is there a demo config or a more 
> detailed description for this?






[jira] [Comment Edited] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-07-29 Thread zhoukang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895048#comment-16895048
 ] 

zhoukang edited comment on YARN-7621 at 7/29/19 8:27 AM:
-

NIT: Since in our cluster both 'root.a.a1' and 'root.b.a1' can exist, we also 
support initializing queues with their full path names.
{code:java}
@Override
public void addQueue(String queueName, CSQueue queue) {
  queueName = queue.getQueuePath();
  this.queues.put(queueName, queue);
}
{code}
Do you think this feature should be supported in the community version, 
[~cheersyang] [~Tao Yang]?


was (Author: cane):
NIT: Since in our cluster there will exist 'root.a.a1' and 'root.b.a1'.
So we also supported initialize queues with full path name.
{code:java}
@Override
public void addQueue(String queueName, CSQueue queue) {
  queueName = queue.getQueuePath();
  this.queues.put(queueName, queue);
}
{code}
Do you think this feature should be supported in community version 
[~cheersyang][~Tao Yang]

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently there is a difference in the queue definition in 
> ApplicationSubmissionContext between CapacityScheduler and FairScheduler: 
> FairScheduler needs the queue path but CapacityScheduler needs the queue name. 
> There is no doubt about the correctness of the queue definition for 
> CapacityScheduler, because it does not allow duplicate leaf queue names, but it 
> is hard to switch between FairScheduler and CapacityScheduler. I propose to 
> support submitting apps with a queue path for CapacityScheduler to make the 
> interface clearer and the scheduler switch smoother.
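For context, a rough sketch of what submitting with the full queue path could 
look like from a client once this is supported; the YarnClient API calls below 
already exist, but CapacityScheduler today interprets the value as a leaf queue 
name:
{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class QueuePathSubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();
    try {
      YarnClientApplication app = yarnClient.createApplication();
      ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
      // Full queue path disambiguates 'a1' under root.a from 'a1' under root.b.
      ctx.setQueue("root.a.a1");
      // ... set AM container spec, resources, etc. before submitting ...
      // yarnClient.submitApplication(ctx);
    } finally {
      yarnClient.stop();
    }
  }
}
{code}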






[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-07-29 Thread zhoukang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895048#comment-16895048
 ] 

zhoukang commented on YARN-7621:


NIT: Since in our cluster both 'root.a.a1' and 'root.b.a1' can exist, we also 
support initializing queues with their full path names.
{code:java}
@Override
public void addQueue(String queueName, CSQueue queue) {
  queueName = queue.getQueuePath();
  this.queues.put(queueName, queue);
}
{code}
Do you think this feature should be supported in the community version, 
[~cheersyang] [~Tao Yang]?

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently there is a difference in the queue definition in 
> ApplicationSubmissionContext between CapacityScheduler and FairScheduler: 
> FairScheduler needs the queue path but CapacityScheduler needs the queue name. 
> There is no doubt about the correctness of the queue definition for 
> CapacityScheduler, because it does not allow duplicate leaf queue names, but it 
> is hard to switch between FairScheduler and CapacityScheduler. I propose to 
> support submitting apps with a queue path for CapacityScheduler to make the 
> interface clearer and the scheduler switch smoother.






[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-07-29 Thread zhoukang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895039#comment-16895039
 ] 

zhoukang commented on YARN-7621:


I have not yet learned how to review a Hadoop PR (I will learn later), but I 
have reviewed the attached patch and the modification is the same as our 
internal version, so LGTM.

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently there is a difference in the queue definition in 
> ApplicationSubmissionContext between CapacityScheduler and FairScheduler: 
> FairScheduler needs the queue path but CapacityScheduler needs the queue name. 
> There is no doubt about the correctness of the queue definition for 
> CapacityScheduler, because it does not allow duplicate leaf queue names, but it 
> is hard to switch between FairScheduler and CapacityScheduler. I propose to 
> support submitting apps with a queue path for CapacityScheduler to make the 
> interface clearer and the scheduler switch smoother.






[jira] [Comment Edited] (YARN-9712) ResourceManager goes into a deadlock while transitioning to standby

2019-07-29 Thread Tarun Parimi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895033#comment-16895033
 ] 

Tarun Parimi edited comment on YARN-9712 at 7/29/19 8:05 AM:
-

{quote}2. While transitioning to standby, a java.lang.InterruptedException 
occurs in RMStateStore while removing/storing RMDelegationToken. This is 
because RMSecretManagerService will be stopped while transitioning to standby.
{quote}
Looks like this scenario can be prevented with the fix in YARN-6647 from 
version 3.0.0 onwards.


was (Author: tarunparimi):
bq. 2. While transitioning to standby, a java.lang.InterruptedException occurs 
in RMStateStore while removing/storing RMDelegationToken. This is because 
RMSecretManagerService will be stopped while transitioning to standby.
Looks like this scenario can prevented with the fix in YARN-6647. 

> ResourceManager goes into a deadlock while transitioning to standby
> ---
>
> Key: YARN-9712
> URL: https://issues.apache.org/jira/browse/YARN-9712
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, RM
>Affects Versions: 2.9.0
>Reporter: Tarun Parimi
>Priority: Major
>
> We have observed the RM go into a deadlock while transitioning to standby in a 
> heavily loaded production cluster which can experience random connection loss to 
> a ZooKeeper session and also has a large number of RMDelegationToken requests 
> due to Oozie jobs.
> On analyzing the jstack and the logs, this seems to happen when the below 
> sequence of events occurs.
> 1. The ZooKeeper session is lost, so the ActiveStandbyElector service calls 
> transitionToStandby. This transitionToStandby is a synchronized method and 
> so will acquire a lock on ResourceManager. 
> {code:java}
> 2019-07-25 14:31:24,497 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:processWatchEvent(621)) - Session expired. 
> Entering neutral mode and rejoining... 
> 2019-07-25 14:31:28,084 INFO resourcemanager.ResourceManager 
> (ResourceManager.java:transitionToStandby(1134)) - Transitioning to standby 
> state 
> {code}
> 2. While transitioning to standby, a java.lang.InterruptedException occurs in 
> RMStateStore while removing/storing RMDelegationToken. This is because 
> RMSecretManagerService will be stopped while transitioning to standby.
> {code:java}
> 2019-07-25 14:31:28,576 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(373)) - Error While Removing RMDelegationToken 
> and SequenceNumber
> java.lang.InterruptedException
> 2019-07-25 14:31:28,576 ERROR recovery.RMStateStore 
> (RMStateStore.java:notifyStoreOperationFailedInternal(992)) - State store 
> operation failed 
> java.lang.InterruptedException 
> {code}
> 3. When a state store error occurs, an RMFatalEvent of type STATE_STORE_FENCED 
> will be sent. 
> {code:java}
> 2019-07-25 14:31:28,579 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(767)) - Received RMFatalEvent of type 
> STATE_STORE_FENCED, caused by java.lang.InterruptedException 
> {code}
> 4. The problem occurs when the RMFatalEventDispatcher calls getConfig(). 
> This also needs a lock on ResourceManager since it is a synchronized method. 
> This will cause the rmDispatcher eventHandlingThread to become blocked.
> {code:java}
> private class RMFatalEventDispatcher implements EventHandler<RMFatalEvent> {
> @Override
> public void handle(RMFatalEvent event) {
>   LOG.error("Received " + event);
>   if (HAUtil.isHAEnabled(getConfig())) {
> // If we're in an HA config, the right answer is always to go into
> // standby.
> LOG.warn("Transitioning the resource manager to standby.");
> handleTransitionToStandByInNewThread();
> {code}
> 5. The transitionToStandby will wait forever as the eventHandlingThread of 
> rmDispatcher is blocked. This causes a deadlock and RM will not become active 
> until restarted.
> Below are the relevant threads in the jstack captured.
> The transitionToStandby thread that waits forever.
> {code:java}
> "main-EventThread" #138239 daemon prio=5 os_prio=0 tid=0x7fea473b2800 
> nid=0x2f411 in Object.wait() [0x7fda5bef5000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1245)
> - locked <0x7fdb6c5059a0> (a java.lang.Thread)
> at java.lang.Thread.join(Thread.java:1319)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:161)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> - locked <0x7fdb6c538ca0> (a java.lang.Object)
> at 
> 

[jira] [Assigned] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-07-29 Thread zhoukang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang reassigned YARN-7621:
--

Assignee: (was: zhoukang)

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently there is a difference in the queue definition in 
> ApplicationSubmissionContext between CapacityScheduler and FairScheduler: 
> FairScheduler needs the queue path but CapacityScheduler needs the queue name. 
> There is no doubt about the correctness of the queue definition for 
> CapacityScheduler, because it does not allow duplicate leaf queue names, but it 
> is hard to switch between FairScheduler and CapacityScheduler. I propose to 
> support submitting apps with a queue path for CapacityScheduler to make the 
> interface clearer and the scheduler switch smoother.






[jira] [Assigned] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-07-29 Thread zhoukang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang reassigned YARN-7621:
--

Assignee: zhoukang  (was: Tao Yang)

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: zhoukang
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently there is a difference in the queue definition in 
> ApplicationSubmissionContext between CapacityScheduler and FairScheduler: 
> FairScheduler needs the queue path but CapacityScheduler needs the queue name. 
> There is no doubt about the correctness of the queue definition for 
> CapacityScheduler, because it does not allow duplicate leaf queue names, but it 
> is hard to switch between FairScheduler and CapacityScheduler. I propose to 
> support submitting apps with a queue path for CapacityScheduler to make the 
> interface clearer and the scheduler switch smoother.






[jira] [Commented] (YARN-9712) ResourceManager goes into a deadlock while transitioning to standby

2019-07-29 Thread Tarun Parimi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895033#comment-16895033
 ] 

Tarun Parimi commented on YARN-9712:


bq. 2. While transitioning to standby, a java.lang.InterruptedException occurs 
in RMStateStore while removing/storing RMDelegationToken. This is because 
RMSecretManagerService will be stopped while transitioning to standby.
Looks like this scenario can be prevented with the fix in YARN-6647. 

> ResourceManager goes into a deadlock while transitioning to standby
> ---
>
> Key: YARN-9712
> URL: https://issues.apache.org/jira/browse/YARN-9712
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, RM
>Affects Versions: 2.9.0
>Reporter: Tarun Parimi
>Priority: Major
>
> We have observed the RM go into a deadlock while transitioning to standby in a 
> heavily loaded production cluster which can experience random connection loss to 
> a ZooKeeper session and also has a large number of RMDelegationToken requests 
> due to Oozie jobs.
> On analyzing the jstack and the logs, this seems to happen when the below 
> sequence of events occurs.
> 1. The ZooKeeper session is lost, so the ActiveStandbyElector service calls 
> transitionToStandby. This transitionToStandby is a synchronized method and 
> so will acquire a lock on ResourceManager. 
> {code:java}
> 2019-07-25 14:31:24,497 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:processWatchEvent(621)) - Session expired. 
> Entering neutral mode and rejoining... 
> 2019-07-25 14:31:28,084 INFO resourcemanager.ResourceManager 
> (ResourceManager.java:transitionToStandby(1134)) - Transitioning to standby 
> state 
> {code}
> 2. While transitioning to standby, a java.lang.InterruptedException occurs in 
> RMStateStore while removing/storing RMDelegationToken. This is because 
> RMSecretManagerService will be stopped while transitioning to standby.
> {code:java}
> 2019-07-25 14:31:28,576 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(373)) - Error While Removing RMDelegationToken 
> and SequenceNumber
> java.lang.InterruptedException
> 2019-07-25 14:31:28,576 ERROR recovery.RMStateStore 
> (RMStateStore.java:notifyStoreOperationFailedInternal(992)) - State store 
> operation failed 
> java.lang.InterruptedException 
> {code}
> 3. When a state store error occurs, an RMFatalEvent of type STATE_STORE_FENCED 
> will be sent. 
> {code:java}
> 2019-07-25 14:31:28,579 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(767)) - Received RMFatalEvent of type 
> STATE_STORE_FENCED, caused by java.lang.InterruptedException 
> {code}
> 4. The problem occurs when the RMFatalEventDispatcher calls getConfig(). 
> This also needs a lock on ResourceManager since it is a synchronized method. 
> This will cause the rmDispatcher eventHandlingThread to become blocked.
> {code:java}
> private class RMFatalEventDispatcher implements EventHandler<RMFatalEvent> {
> @Override
> public void handle(RMFatalEvent event) {
>   LOG.error("Received " + event);
>   if (HAUtil.isHAEnabled(getConfig())) {
> // If we're in an HA config, the right answer is always to go into
> // standby.
> LOG.warn("Transitioning the resource manager to standby.");
> handleTransitionToStandByInNewThread();
> {code}
> 5. The transitionToStandby will wait forever as the eventHandlingThread of 
> rmDispatcher is blocked. This causes a deadlock and RM will not become active 
> until restarted.
> Below are the relevant threads in the jstack captured.
> The transitionToStandby thread that waits forever.
> {code:java}
> "main-EventThread" #138239 daemon prio=5 os_prio=0 tid=0x7fea473b2800 
> nid=0x2f411 in Object.wait() [0x7fda5bef5000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1245)
> - locked <0x7fdb6c5059a0> (a java.lang.Thread)
> at java.lang.Thread.join(Thread.java:1319)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:161)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> - locked <0x7fdb6c538ca0> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.resetRMContext(ResourceManager.java:1323)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(ResourceManager.java:1091)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1139)
> - locked <0x7fdb33e418f0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
> at 
> 

[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler

2019-07-29 Thread zhoukang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895027#comment-16895027
 ] 

zhoukang commented on YARN-7621:


Sure, I will [~cheersyang].

> Support submitting apps with queue path for CapacityScheduler
> -
>
> Key: YARN-7621
> URL: https://issues.apache.org/jira/browse/YARN-7621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-7621.001.patch, YARN-7621.002.patch
>
>
> Currently there is a difference in the queue definition in 
> ApplicationSubmissionContext between CapacityScheduler and FairScheduler: 
> FairScheduler needs the queue path but CapacityScheduler needs the queue name. 
> There is no doubt about the correctness of the queue definition for 
> CapacityScheduler, because it does not allow duplicate leaf queue names, but it 
> is hard to switch between FairScheduler and CapacityScheduler. I propose to 
> support submitting apps with a queue path for CapacityScheduler to make the 
> interface clearer and the scheduler switch smoother.






[jira] [Created] (YARN-9712) ResourceManager goes into a deadlock while transitioning to standby

2019-07-29 Thread Tarun Parimi (JIRA)
Tarun Parimi created YARN-9712:
--

 Summary: ResourceManager goes into a deadlock while transitioning 
to standby
 Key: YARN-9712
 URL: https://issues.apache.org/jira/browse/YARN-9712
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, RM
Affects Versions: 2.9.0
Reporter: Tarun Parimi


We have observed the RM go into a deadlock while transitioning to standby in a 
heavily loaded production cluster which can experience random connection loss to a 
ZooKeeper session and also has a large number of RMDelegationToken requests due 
to Oozie jobs.

On analyzing the jstack and the logs, this seems to happen when the below 
sequence of events occurs.

1. The ZooKeeper session is lost, so the ActiveStandbyElector service calls 
transitionToStandby. This transitionToStandby is a synchronized method and so 
will acquire a lock on ResourceManager. 
{code:java}
2019-07-25 14:31:24,497 INFO ha.ActiveStandbyElector 
(ActiveStandbyElector.java:processWatchEvent(621)) - Session expired. Entering 
neutral mode and rejoining... 
2019-07-25 14:31:28,084 INFO resourcemanager.ResourceManager 
(ResourceManager.java:transitionToStandby(1134)) - Transitioning to standby 
state 
{code}


2. While transitioning to standby, a java.lang.InterruptedException occurs in 
RMStateStore while removing/storing RMDelegationToken. This is because 
RMSecretManagerService will be stopped while transitioning to standby.
{code:java}
2019-07-25 14:31:28,576 ERROR recovery.RMStateStore 
(RMStateStore.java:transition(373)) - Error While Removing RMDelegationToken 
and SequenceNumber
java.lang.InterruptedException
2019-07-25 14:31:28,576 ERROR recovery.RMStateStore 
(RMStateStore.java:notifyStoreOperationFailedInternal(992)) - State store 
operation failed 
java.lang.InterruptedException 
{code}


3. When a state store error occurs, an RMFatalEvent of type STATE_STORE_FENCED 
will be sent. 

{code:java}
2019-07-25 14:31:28,579 ERROR resourcemanager.ResourceManager 
(ResourceManager.java:handle(767)) - Received RMFatalEvent of type 
STATE_STORE_FENCED, caused by java.lang.InterruptedException 
{code}


4. The problem occurs when the RMFatalEventDispatcher calls getConfig(). This 
also needs a lock on ResourceManager since it is a synchronized method. This will 
cause the rmDispatcher eventHandlingThread to become blocked.

{code:java}
private class RMFatalEventDispatcher implements EventHandler<RMFatalEvent> {
@Override
public void handle(RMFatalEvent event) {
  LOG.error("Received " + event);

  if (HAUtil.isHAEnabled(getConfig())) {
// If we're in an HA config, the right answer is always to go into
// standby.
LOG.warn("Transitioning the resource manager to standby.");
handleTransitionToStandByInNewThread();
{code}

5. The transitionToStandby will wait forever as the eventHandlingThread of 
rmDispatcher is blocked. This causes a deadlock and RM will not become active 
until restarted.
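To make the lock cycle concrete, here is a simplified, hypothetical sketch (not 
the actual RM code) of the two synchronized methods and the dispatcher thread 
involved:
{code:java}
import java.util.concurrent.CountDownLatch;

// Hypothetical reduction of the deadlock: transitionToStandby() holds the object
// lock and waits for the dispatcher thread, while the dispatcher's handler needs
// the same lock to call getConfig().
public class RmStandbyDeadlockSketch {

  private final CountDownLatch handlerRunning = new CountDownLatch(1);

  // Stands in for the AsyncDispatcher eventHandlingThread processing the RMFatalEvent.
  private final Thread dispatcherThread = new Thread(() -> {
    handlerRunning.countDown();
    getConfig();                       // blocks: lock is held by transitionToStandby()
  });

  // Stands in for ResourceManager.getConfig(), a synchronized method.
  public synchronized void getConfig() { }

  // Stands in for ResourceManager.transitionToStandby(), which stops the
  // dispatcher (AsyncDispatcher.serviceStop() joins its thread) while holding the lock.
  public synchronized void transitionToStandby() throws InterruptedException {
    dispatcherThread.start();
    handlerRunning.await();            // handler is now stuck waiting for this lock
    dispatcherThread.join();           // waits forever -> deadlock
  }

  public static void main(String[] args) throws Exception {
    new RmStandbyDeadlockSketch().transitionToStandby();   // never returns
  }
}
{code}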

Below are the relevant threads in the jstack captured.

The transitionToStandby thread that waits forever.
{code:java}
"main-EventThread" #138239 daemon prio=5 os_prio=0 tid=0x7fea473b2800 
nid=0x2f411 in Object.wait() [0x7fda5bef5000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1245)
- locked <0x7fdb6c5059a0> (a java.lang.Thread)
at java.lang.Thread.join(Thread.java:1319)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:161)
at 
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
- locked <0x7fdb6c538ca0> (a java.lang.Object)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.resetRMContext(ResourceManager.java:1323)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(ResourceManager.java:1091)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1139)
- locked <0x7fdb33e418f0> (a 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:355)
- locked <0x7fdb33e41828> (a 
org.apache.hadoop.yarn.server.resourcemanager.AdminService)
at 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:147)
at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:970)
at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:480)
- locked <0x7fdb33e7bb88> (a 
org.apache.hadoop.ha.ActiveStandbyElector)
at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
at