[jira] [Created] (YARN-11302) hadoop-yarn-applications-mawo-core module publishes tar file during maven deploy

2022-09-13 Thread Steven Rand (Jira)
Steven Rand created YARN-11302:
--

 Summary: hadoop-yarn-applications-mawo-core module publishes tar 
file during maven deploy
 Key: YARN-11302
 URL: https://issues.apache.org/jira/browse/YARN-11302
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications, yarn
Affects Versions: 3.3.4
Reporter: Steven Rand


The {{hadoop-yarn-applications-mawo-core}} module will currently publish a file 
with extension {{bin.tar.gz}} during the maven deploy step: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-mawo/hadoop-yarn-applications-mawo-core/src/assembly/bin.xml#L16.

I don't know whether the community considers this to be a bug, but I'm creating 
a ticket because the deploy step typically produces only JAR and POM files. Some 
maven repositories that are intended to host only JARs enforce allowlists of 
file extensions that block tarballs from being published, which causes the maven 
deploy operation to fail with this error:

{code}
Caused by: org.apache.maven.wagon.TransferFailedException: Failed to transfer 
file: 
https:///artifactory//org/apache/hadoop/applications/mawo/hadoop-yarn-applications-mawo-core//hadoop-yarn-applications-mawo-core--bin.tar.gz.
 Return code is: 409, ReasonPhrase: .
{code}

Feel free to close if the community doesn't consider this to be a problem, but 
note that it is a regression relative to versions predating mawo, when only JAR 
and POM files were published in the deploy step.






[jira] [Commented] (YARN-11184) fenced active RM not failing over correctly in HA setup

2022-06-14 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554318#comment-17554318
 ] 

Steven Rand commented on YARN-11184:


Possibly [ZOOKEEPER-2251|https://issues.apache.org/jira/browse/ZOOKEEPER-2251] 
is related? The thread dump is different, but it appears to be a similar 
problem of the {{StandByTransitionThread}} waiting indefinitely for a response. 
The ZK version used client-side by Hadoop does not include the fix for that 
issue.
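
If the root cause really is a client call that never returns, the generic pattern below illustrates the idea of bounding the wait. It is purely a JDK-based illustration; it is not the ZOOKEEPER-2251 fix and not what Hadoop's RM actually does.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustration only: run a possibly-never-returning call on a worker thread
// and bound how long the caller is willing to wait for it.
public class TimeoutGuardSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> reply = pool.submit(TimeoutGuardSketch::blockingCall);
        try {
            System.out.println("got: " + reply.get(30, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            reply.cancel(true); // give up instead of waiting indefinitely
            System.out.println("no response within 30s, aborting");
        } finally {
            pool.shutdownNow();
        }
    }

    private static String blockingCall() throws InterruptedException {
        Thread.sleep(60_000); // stand-in for a response that never arrives
        return "reply";
    }
}
{code}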

> fenced active RM not failing over correctly in HA setup
> ---
>
> Key: YARN-11184
> URL: https://issues.apache.org/jira/browse/YARN-11184
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.3
>Reporter: Steven Rand
>Priority: Major
> Attachments: image-2022-06-14-16-38-00-336.png, 
> image-2022-06-14-16-39-50-278.png, image-2022-06-14-16-41-39-742.png, 
> image-2022-06-14-16-44-45-101.png
>
>
> We've observed an issue recently on a production cluster running 3.2.3 in 
> which a fenced Resource Manager remains active, but does not communicate with 
> the ZK state store, and therefore cannot function correctly. This did not 
> occur while running 3.2.2 on the same cluster.
> In more detail, what seems to happen is: 
> 1. The active RM gets a {{NodeExists}} error from ZK while storing an app in 
> the state store. I suspect that this is caused by some transient connection 
> issue that causes the first node creation request to succeed, but for the 
> response to not reach the RM, triggering a duplicate request which fails with 
> this error.
> !image-2022-06-14-16-38-00-336.png!
> 2. Because of this error, the active RM is fenced.
> !image-2022-06-14-16-39-50-278.png!
> 3. Because it is fenced, the active RM starts to transition to standby.
> !image-2022-06-14-16-41-39-742.png!
>
> 4. However, the RM never fully transitions to standby. It never logs 
> {{Transitioning RM to Standby mode}} from the run method of 
> {{StandByTransitionRunnable}}: 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java#L1195]. 
> Related to this, a jstack of the RM shows that thread being {{RUNNABLE}}, 
> but evidently not making progress:
>
> !image-2022-06-14-16-44-45-101.png!
> So the RM doesn't work because it is fenced, but remains active, which causes 
> an outage until a failover is manually initiated.






[jira] [Created] (YARN-11184) fenced active RM not failing over correctly in HA setup

2022-06-14 Thread Steven Rand (Jira)
Steven Rand created YARN-11184:
--

 Summary: fenced active RM not failing over correctly in HA setup
 Key: YARN-11184
 URL: https://issues.apache.org/jira/browse/YARN-11184
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.2.3
Reporter: Steven Rand
 Attachments: image-2022-06-14-16-38-00-336.png, 
image-2022-06-14-16-39-50-278.png, image-2022-06-14-16-41-39-742.png, 
image-2022-06-14-16-44-45-101.png

We've observed an issue recently on a production cluster running 3.2.3 in which 
a fenced Resource Manager remains active, but does not communicate with the ZK 
state store, and therefore cannot function correctly. This did not occur while 
running 3.2.2 on the same cluster.

In more detail, what seems to happen is: 

1. The active RM gets a {{NodeExists}} error from ZK while storing an app in 
the state store. I suspect that this is caused by some transient connection 
issue that causes the first node creation request to succeed, but for the 
response to not reach the RM, triggering a duplicate request which fails with 
this error.

!image-2022-06-14-16-38-00-336.png!

2. Because of this error, the active RM is fenced.

!image-2022-06-14-16-39-50-278.png!

3. Because it is fenced, the active RM starts to transition to standby.

!image-2022-06-14-16-41-39-742.png!

4. However, the RM never fully transitions to standby. It never logs 
{{Transitioning RM to Standby mode}} from the run method of 
{{StandByTransitionRunnable}}: 
[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java#L1195]. 
Related to this, a jstack of the RM shows that thread being {{RUNNABLE}}, but 
evidently not making progress:

!image-2022-06-14-16-44-45-101.png!

So the RM doesn't work because it is fenced, but remains active, which causes 
an outage until a failover is manually initiated.
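
To make the failure shape easier to follow, here is a minimal sketch. This is not the ResourceManager's actual code; the class and helper names are invented for illustration. The point it shows: the standby transition happens on a single background runnable, so if any step before the {{Transitioning RM to Standby mode}} log blocks forever, the RM stays fenced but nominally active, which matches the observed outage.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only; not the actual ResourceManager implementation.
// Once fenced, a single runnable is responsible for moving to standby; if an
// early step blocks indefinitely (e.g. tearing down a state-store connection
// that never responds), the final log line is never reached.
class StandbyTransitionSketch implements Runnable {
    private final AtomicBoolean transitionInProgress = new AtomicBoolean(false);

    @Override
    public void run() {
        // Only one transition attempt at a time.
        if (!transitionInProgress.compareAndSet(false, true)) {
            return;
        }
        closeStateStore();     // in the incident, the thread appears stuck around here
        stopActiveServices();
        System.out.println("Transitioning RM to Standby mode"); // never logged in the report
    }

    // Hypothetical helpers standing in for the real teardown steps.
    private void closeStateStore() { /* blocking cleanup of the ZK-backed store */ }

    private void stopActiveServices() { /* stop schedulers, token renewers, etc. */ }

    public static void main(String[] args) {
        new Thread(new StandbyTransitionSketch(), "StandByTransitionThread-sketch").start();
    }
}
{code}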






[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2

2020-10-15 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214933#comment-17214933
 ] 

Steven Rand commented on YARN-10244:


Thanks [~aajisaka]!

> backport YARN-9848 to branch-3.2
> 
>
> Key: YARN-10244
> URL: https://issues.apache.org/jira/browse/YARN-10244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-10244-branch-3.2.001.patch, 
> YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch
>
>
> Backporting YARN-9848 to branch-3.2.






[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2

2020-09-16 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196839#comment-17196839
 ] 

Steven Rand commented on YARN-10244:


Just to be clear, I consider this finished on my end unless someone tells me 
that I've misunderstood something. The test failures are caused by YARN-10249, 
not by the patch, so I don't think further action is needed.

> backport YARN-9848 to branch-3.2
> 
>
> Key: YARN-10244
> URL: https://issues.apache.org/jira/browse/YARN-10244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-10244-branch-3.2.001.patch, 
> YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch
>
>
> Backporting YARN-9848 to branch-3.2.






[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2

2020-09-12 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194719#comment-17194719
 ] 

Steven Rand commented on YARN-10244:


Thanks all for helping with the backport of this to branch-3.2. My guess is 
that the tests will keep failing until we resolve YARN-10249.

> backport YARN-9848 to branch-3.2
> 
>
> Key: YARN-10244
> URL: https://issues.apache.org/jira/browse/YARN-10244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-10244-branch-3.2.001.patch, 
> YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch
>
>
> Backporting YARN-9848 to branch-3.2.






[jira] [Commented] (YARN-10249) Various ResourceManager tests are failing on branch-3.2

2020-05-03 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098400#comment-17098400
 ] 

Steven Rand commented on YARN-10249:


Also happening in YARN-10244

> Various ResourceManager tests are failing on branch-3.2
> ---
>
> Key: YARN-10249
> URL: https://issues.apache.org/jira/browse/YARN-10249
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.0
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10249.branch-3.2.POC001.patch
>
>
> Various tests are failing on branch-3.2. Some examples can be found in 
> YARN-10003, YARN-10002, and YARN-10237. The common thread seems to be that 
> all of the failing tests are RM/Capacity Scheduler related, and that the 
> failures are flaky.






[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2

2020-05-03 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098397#comment-17098397
 ] 

Steven Rand commented on YARN-10244:


The test failures for all three patches are caused by YARN-10249, not by the 
patches themselves.

> backport YARN-9848 to branch-3.2
> 
>
> Key: YARN-10244
> URL: https://issues.apache.org/jira/browse/YARN-10244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-10244-branch-3.2.001.patch, 
> YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch
>
>
> Backporting YARN-9848 to branch-3.2.






[jira] [Updated] (YARN-10244) backport YARN-9848 to branch-3.2

2020-05-02 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-10244:
---
Attachment: YARN-10244-branch-3.2.003.patch

> backport YARN-9848 to branch-3.2
> 
>
> Key: YARN-10244
> URL: https://issues.apache.org/jira/browse/YARN-10244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-10244-branch-3.2.001.patch, 
> YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch
>
>
> Backporting YARN-9848 to branch-3.2.






[jira] [Updated] (YARN-10244) backport YARN-9848 to branch-3.2

2020-05-02 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-10244:
---
Attachment: YARN-10244-branch-3.2.002.patch

> backport YARN-9848 to branch-3.2
> 
>
> Key: YARN-10244
> URL: https://issues.apache.org/jira/browse/YARN-10244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-10244-branch-3.2.001.patch, 
> YARN-10244-branch-3.2.002.patch
>
>
> Backporting YARN-9848 to branch-3.2.






[jira] [Comment Edited] (YARN-9848) revert YARN-4946

2020-04-26 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092756#comment-17092756
 ] 

Steven Rand edited comment on YARN-9848 at 4/26/20, 3:21 PM:
-

I created YARN-10244 for branch-3.2.

For resolving this issue, I'm not a committer, so I think someone else will 
have to merge the patch to trunk and branch-3.3.0.


was (Author: steven rand):
I created YARN-10244 for branch-3.2.

For resolving this issue, I'm not a committer, so I think someone else will 
have to merge the patch to trunk and branch-3.3.0.
 

> revert YARN-4946
> 
>
> Key: YARN-9848
> URL: https://issues.apache.org/jira/browse/YARN-9848
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Blocker
> Attachments: YARN-9848-01.patch, YARN-9848.002.patch, 
> YARN-9848.003.patch
>
>
> In YARN-4946, we've been discussing a revert due to the potential for keeping 
> more applications in the state store than desired, and the potential to 
> greatly increase RM recovery times.
>  
> I'm in favor of reverting the patch, but other ideas along the lines of 
> YARN-9571 would work as well.






[jira] [Commented] (YARN-9848) revert YARN-4946

2020-04-26 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092756#comment-17092756
 ] 

Steven Rand commented on YARN-9848:
---

I created YARN-10244 for branch-3.2.

For resolving this issue, I'm not a committer, so I think someone else will 
have to merge the patch to trunk and branch-3.3.0.
 

> revert YARN-4946
> 
>
> Key: YARN-9848
> URL: https://issues.apache.org/jira/browse/YARN-9848
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Blocker
> Attachments: YARN-9848-01.patch, YARN-9848.002.patch, 
> YARN-9848.003.patch
>
>
> In YARN-4946, we've been discussing a revert due to the potential for keeping 
> more applications in the state store than desired, and the potential to 
> greatly increase RM recovery times.
>  
> I'm in favor of reverting the patch, but other ideas along the lines of 
> YARN-9571 would work as well.






[jira] [Updated] (YARN-10244) backport YARN-9848 to branch-3.2

2020-04-26 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-10244:
---
Attachment: YARN-10244-branch-3.2.001.patch

> backport YARN-9848 to branch-3.2
> 
>
> Key: YARN-10244
> URL: https://issues.apache.org/jira/browse/YARN-10244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-10244-branch-3.2.001.patch
>
>
> Backporting YARN-9848 to branch-3.2.






[jira] [Created] (YARN-10244) backport YARN-9848 to branch-3.2

2020-04-26 Thread Steven Rand (Jira)
Steven Rand created YARN-10244:
--

 Summary: backport YARN-9848 to branch-3.2
 Key: YARN-10244
 URL: https://issues.apache.org/jira/browse/YARN-10244
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation, resourcemanager
Reporter: Steven Rand
Assignee: Steven Rand


Backporting YARN-9848 to branch-3.2.






[jira] [Commented] (YARN-9848) revert YARN-4946

2020-04-20 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087990#comment-17087990
 ] 

Steven Rand commented on YARN-9848:
---

Thanks all. I also have a patch for branch-3.2 so that we can include it in a 
3.2 maintenance release like [~vinodkv] suggested. Do I upload the patch to 
this JIRA, or is it better to make a new one?

> revert YARN-4946
> 
>
> Key: YARN-9848
> URL: https://issues.apache.org/jira/browse/YARN-9848
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Blocker
> Attachments: YARN-9848-01.patch, YARN-9848.002.patch, 
> YARN-9848.003.patch
>
>
> In YARN-4946, we've been discussing a revert due to the potential for keeping 
> more applications in the state store than desired, and the potential to 
> greatly increase RM recovery times.
>  
> I'm in favor of reverting the patch, but other ideas along the lines of 
> YARN-9571 would work as well.






[jira] [Comment Edited] (YARN-9848) revert YARN-4946

2020-04-14 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083449#comment-17083449
 ] 

Steven Rand edited comment on YARN-9848 at 4/14/20, 5:30 PM:
-

From trying to apply the patch locally, it seems that trunk has changed since 
I wrote it, and it no longer applies cleanly. I'll upload a new one soon.

EDIT: The {{YARN-9848.003.patch}} file accounts for the changes from YARN-9886 
and applies to trunk.
 


was (Author: steven rand):
From trying to apply the patch locally, it seems that trunk has changed since 
I wrote it, and it no longer applies cleanly. I'll upload a new one soon.

> revert YARN-4946
> 
>
> Key: YARN-9848
> URL: https://issues.apache.org/jira/browse/YARN-9848
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Priority: Blocker
> Attachments: YARN-9848-01.patch, YARN-9848.002.patch, 
> YARN-9848.003.patch
>
>
> In YARN-4946, we've been discussing a revert due to the potential for keeping 
> more applications in the state store than desired, and the potential to 
> greatly increase RM recovery times.
>  
> I'm in favor of reverting the patch, but other ideas along the lines of 
> YARN-9571 would work as well.






[jira] [Updated] (YARN-9848) revert YARN-4946

2020-04-14 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-9848:
--
Attachment: YARN-9848.003.patch

> revert YARN-4946
> 
>
> Key: YARN-9848
> URL: https://issues.apache.org/jira/browse/YARN-9848
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Priority: Blocker
> Attachments: YARN-9848-01.patch, YARN-9848.002.patch, 
> YARN-9848.003.patch
>
>
> In YARN-4946, we've been discussing a revert due to the potential for keeping 
> more applications in the state store than desired, and the potential to 
> greatly increase RM recovery times.
>  
> I'm in favor of reverting the patch, but other ideas along the lines of 
> YARN-9571 would work as well.






[jira] [Commented] (YARN-9848) revert YARN-4946

2020-04-14 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083449#comment-17083449
 ] 

Steven Rand commented on YARN-9848:
---

From trying to apply the patch locally, it seems that trunk has changed since 
I wrote it, and it no longer applies cleanly. I'll upload a new one soon.

> revert YARN-4946
> 
>
> Key: YARN-9848
> URL: https://issues.apache.org/jira/browse/YARN-9848
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Priority: Blocker
> Attachments: YARN-9848-01.patch, YARN-9848.002.patch
>
>
> In YARN-4946, we've been discussing a revert due to the potential for keeping 
> more applications in the state store than desired, and the potential to 
> greatly increase RM recovery times.
>  
> I'm in favor of reverting the patch, but other ideas along the lines of 
> YARN-9571 would work as well.






[jira] [Updated] (YARN-9848) revert YARN-4946

2020-04-14 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-9848:
--
Attachment: YARN-9848.002.patch

> revert YARN-4946
> 
>
> Key: YARN-9848
> URL: https://issues.apache.org/jira/browse/YARN-9848
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Priority: Blocker
> Attachments: YARN-9848-01.patch, YARN-9848.002.patch
>
>
> In YARN-4946, we've been discussing a revert due to the potential for keeping 
> more applications in the state store than desired, and the potential to 
> greatly increase RM recovery times.
>  
> I'm in favor of reverting the patch, but other ideas along the lines of 
> YARN-9571 would work as well.






[jira] [Commented] (YARN-8990) Fix fair scheduler race condition in app submit and queue cleanup

2020-01-27 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024799#comment-17024799
 ] 

Steven Rand commented on YARN-8990:
---

How would people feel about cherrypicking this and YARN-8992 to {{branch-3.2}}? 
It seems like we should do that before {{branch-3.2.2}} gets cut for an 
eventual 3.2.2 release.

> Fix fair scheduler race condition in app submit and queue cleanup
> -
>
> Key: YARN-8990
> URL: https://issues.apache.org/jira/browse/YARN-8990
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Blocker
> Fix For: 3.2.0, 3.3.0
>
> Attachments: YARN-8990.001.patch, YARN-8990.002.patch
>
>
> With the introduction of dynamic queue deletion in YARN-8191, a race 
> condition was introduced that can cause a queue to be removed while an 
> application submit is in progress.
> The issue occurs in {{FairScheduler.addApplication()}} when an application is 
> submitted to a dynamic queue which is empty or does not exist yet. 
> If, during the processing of the application submit, the 
> {{AllocationFileLoaderService}} kicks off an update, the queue cleanup will 
> be run first. The application submit first creates the queue and gets a 
> reference back to the queue. 
> Other checks are performed, and as the last action before getting ready to 
> generate an AppAttempt, the queue is updated to show the submitted 
> application ID.
> The time between the queue creation and the queue update to show the submit 
> is long enough for the queue to be removed. The application, however, is 
> lost and will never get any resources assigned.
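
Since the quoted description is the heart of the issue, here is a compact, generic sketch of the check-then-act race it describes. It is not FairScheduler code and the names are invented: thread A resolves or creates the dynamic queue and only later registers the application in it, while thread B's cleanup (triggered by a configuration reload) removes queues that look empty in between.

{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Generic illustration of the race described above (not FairScheduler code).
class DynamicQueueRaceSketch {
    static final class Queue {
        final List<String> apps = new CopyOnWriteArrayList<>();
    }

    private final Map<String, Queue> queues = new ConcurrentHashMap<>();

    // Thread A: application submission.
    void submit(String queueName, String appId) {
        Queue q = queues.computeIfAbsent(queueName, k -> new Queue()); // create or look up
        // ... placement rules, ACL checks, etc. run here and take non-zero time ...
        q.apps.add(appId); // if cleanup ran in between, q is no longer reachable via `queues`
    }

    // Thread B: periodic cleanup triggered by a config reload.
    void cleanupEmptyQueues() {
        queues.entrySet().removeIf(e -> e.getValue().apps.isEmpty());
    }
}
{code}

A typical fix for this shape of race is to perform the lookup/creation and the registration under a common lock, or to mark the queue as having an in-flight submission so the cleanup pass skips it.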






[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state

2020-01-08 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011446#comment-17011446
 ] 

Steven Rand commented on YARN-4946:
---

Any update on what we want to do here? It seems like we're starting to plan new 
releases, and I think it'd be good to either revert or make some adjustment 
before those come out.

> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-4946.001.patch, YARN-4946.002.patch, 
> YARN-4946.003.patch, YARN-4946.004.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed 
> from its history) until the aggregation status has reached a terminal state 
> (e.g. SUCCEEDED, FAILED, TIME_OUT).
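
As a tiny sketch of the "terminal state" check the description calls for. The enum below is a simplified stand-in rather than Hadoop's actual {{LogAggregationStatus}} type, which has additional values:

{code:java}
// Simplified model of the rule described above: an app should only be treated
// as fully completed once its log aggregation has reached a terminal state.
enum LogAggregationState { RUNNING, SUCCEEDED, FAILED, TIME_OUT }

final class LogAggregationCheck {
    private LogAggregationCheck() { }

    static boolean isTerminal(LogAggregationState state) {
        return state == LogAggregationState.SUCCEEDED
            || state == LogAggregationState.FAILED
            || state == LogAggregationState.TIME_OUT;
    }
}
{code}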






[jira] [Comment Edited] (YARN-8990) Fix fair scheduler race condition in app submit and queue cleanup

2019-11-04 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967099#comment-16967099
 ] 

Steven Rand edited comment on YARN-8990 at 11/4/19 11:43 PM:
-

Hi all,

Unfortunately, this patch never made its way into the 3.2.1 release, which is 
affected by this race condition. I think what happened is that it was committed 
to trunk and backported to branch-3.2.0, but not to branch-3.2 (or 
branch-3.2.1).

And unless I'm misinterpreting the git history, the 3.2.1 release is also 
missing YARN-8992, despite the fix version of that ticket. 

We should at minimum make sure that the fixes for these race conditions are in 
3.2.2. Since this was a blocker and the impact is pretty serious, there may be 
more things we want to do, e.g., messaging and/or expediting the 3.2.2 release, 
but I'll leave that up to you to decide.


was (Author: steven rand):
Hi all,

Unfortunately, this patch never made its way into the 3.2.1 release, which is 
affected by this race condition. I think what happened is that it was committed 
to trunk and backported to branch-3.2.0, but not to branch-3.2 (or 
branch-3.2.1).

And unless I'm misinterpreting the git history, the 3.2.1 release is also 
missing YARN-8992, despite the fix version of that ticket. 

We should at minimum make sure that the fixes for these race conditions are in 
3.2.2. Since this was a blocker and the impact is pretty serious, there may be 
more things we want to do, e.g., messaging or expediting the 3.2.2 release, but 
I'll leave that up you to decide.

> Fix fair scheduler race condition in app submit and queue cleanup
> -
>
> Key: YARN-8990
> URL: https://issues.apache.org/jira/browse/YARN-8990
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Blocker
> Fix For: 3.2.0, 3.3.0
>
> Attachments: YARN-8990.001.patch, YARN-8990.002.patch
>
>
> With the introduction of dynamic queue deletion in YARN-8191, a race 
> condition was introduced that can cause a queue to be removed while an 
> application submit is in progress.
> The issue occurs in {{FairScheduler.addApplication()}} when an application is 
> submitted to a dynamic queue which is empty or does not exist yet. 
> If, during the processing of the application submit, the 
> {{AllocationFileLoaderService}} kicks off an update, the queue cleanup will 
> be run first. The application submit first creates the queue and gets a 
> reference back to the queue. 
> Other checks are performed, and as the last action before getting ready to 
> generate an AppAttempt, the queue is updated to show the submitted 
> application ID.
> The time between the queue creation and the queue update to show the submit 
> is long enough for the queue to be removed. The application, however, is 
> lost and will never get any resources assigned.






[jira] [Commented] (YARN-8990) Fix fair scheduler race condition in app submit and queue cleanup

2019-11-04 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967099#comment-16967099
 ] 

Steven Rand commented on YARN-8990:
---

Hi all,

Unfortunately, this patch never made its way into the 3.2.1 release, which is 
affected by this race condition. I think what happened is that it was committed 
to trunk and backported to branch-3.2.0, but not to branch-3.2 (or 
branch-3.2.1).

And unless I'm misinterpreting the git history, the 3.2.1 release is also 
missing YARN-8992, despite the fix version of that ticket. 

We should at minimum make sure that the fixes for these race conditions are in 
3.2.2. Since this was a blocker and the impact is pretty serious, there may be 
more things we want to do, e.g., messaging or expediting the 3.2.2 release, but 
I'll leave that up to you to decide.

> Fix fair scheduler race condition in app submit and queue cleanup
> -
>
> Key: YARN-8990
> URL: https://issues.apache.org/jira/browse/YARN-8990
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Blocker
> Fix For: 3.2.0, 3.3.0
>
> Attachments: YARN-8990.001.patch, YARN-8990.002.patch
>
>
> With the introduction of dynamic queue deletion in YARN-8191, a race 
> condition was introduced that can cause a queue to be removed while an 
> application submit is in progress.
> The issue occurs in {{FairScheduler.addApplication()}} when an application is 
> submitted to a dynamic queue which is empty or does not exist yet. 
> If, during the processing of the application submit, the 
> {{AllocationFileLoaderService}} kicks off an update, the queue cleanup will 
> be run first. The application submit first creates the queue and gets a 
> reference back to the queue. 
> Other checks are performed, and as the last action before getting ready to 
> generate an AppAttempt, the queue is updated to show the submitted 
> application ID.
> The time between the queue creation and the queue update to show the submit 
> is long enough for the queue to be removed. The application, however, is 
> lost and will never get any resources assigned.






[jira] [Commented] (YARN-8470) Fair scheduler exception with SLS

2019-10-08 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947163#comment-16947163
 ] 

Steven Rand commented on YARN-8470:
---

Hi [~snemeth], [~szegedim],

Friendly ping on this ticket. We've hit this issue in a production cluster 
running 3.2.1.

> Fair scheduler exception with SLS
> -
>
> Key: YARN-8470
> URL: https://issues.apache.org/jira/browse/YARN-8470
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Miklos Szegedi
>Assignee: Szilard Nemeth
>Priority: Major
>
> I ran into the following exception with sls:
> 2018-06-26 13:34:04,358 ERROR resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> FSPreemptionThread, that exited unexpectedly: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptOnNode(FSPreemptionThread.java:207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptForOneContainer(FSPreemptionThread.java:161)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreempt(FSPreemptionThread.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.run(FSPreemptionThread.java:81)






[jira] [Created] (YARN-9850) document or revert change in which DefaultContainerExecutor no longer propagates NM env to containers

2019-09-21 Thread Steven Rand (Jira)
Steven Rand created YARN-9850:
-

 Summary: document or revert change in which 
DefaultContainerExecutor no longer propagates NM env to containers
 Key: YARN-9850
 URL: https://issues.apache.org/jira/browse/YARN-9850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Steven Rand


After 
[https://github.com/apache/hadoop/commit/9d4d30243b0fc9630da51a2c17b543ef671d035c],
 containers launched by the {{DefaultContainerExecutor}} no longer inherit the 
environment of the NodeManager.

I don't object to the commit (I actually prefer the new behavior), but I do 
think that it's a notable breaking change, as people may be relying on 
variables in the NM environment for their containers to behave correctly.

As far as I can tell, we don't currently include this behavior change in the 
release notes for Hadoop 3, and it's a particularly tricky one to track down, 
since there's no JIRA ticket for it.

I think that we should at least include this change in the release notes for 
the 3.0.0 release. Arguably it's worth having the DefaultContainerExecutor set 
{{inheritParentEnv}} to true when it creates its {{ShellCommandExecutor}} since 
that preserves the old behavior and is less surprising to users, but I don't 
feel strongly either way.
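
As a generic illustration of the difference being discussed (using only the JDK's ProcessBuilder, not Hadoop's Shell/ShellCommandExecutor code), the snippet below contrasts launching a child process that inherits the parent's environment with one that starts from a cleared environment and passes variables through explicitly. The use of {{printenv}} and the choice of variable are illustrative and assume a Unix-like system.

{code:java}
import java.io.IOException;
import java.util.Map;

// Generic JDK illustration of the behavior change described above; it does not
// use Hadoop's container-launch code.
public class EnvInheritanceDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Old behavior: the child inherits the parent's (NM-like) environment.
        ProcessBuilder inheriting = new ProcessBuilder("printenv");
        inheriting.inheritIO();
        inheriting.start().waitFor();

        // New behavior: start from a clean environment and pass through only
        // the variables you explicitly choose.
        ProcessBuilder clean = new ProcessBuilder("printenv");
        Map<String, String> env = clean.environment();
        env.clear();
        String javaHome = System.getenv("JAVA_HOME");
        if (javaHome != null) {
            env.put("JAVA_HOME", javaHome); // selective pass-through
        }
        clean.inheritIO();
        clean.start().waitFor();
    }
}
{code}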






[jira] [Commented] (YARN-9552) FairScheduler: NODE_UPDATE can cause NoSuchElementException

2019-09-20 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934105#comment-16934105
 ] 

Steven Rand commented on YARN-9552:
---

Thanks!

> FairScheduler: NODE_UPDATE can cause NoSuchElementException
> ---
>
> Key: YARN-9552
> URL: https://issues.apache.org/jira/browse/YARN-9552
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9552-001.patch, YARN-9552-002.patch, 
> YARN-9552-003.patch, YARN-9552-004.patch
>
>
> We observed a race condition inside YARN with the following stack trace:
> {noformat}
> 18/11/07 06:45:09.559 SchedulerEventDispatcher:Event Processor ERROR 
> EventDispatcher: Error in handling event type NODE_UPDATE to the Event 
> Dispatcher
> java.util.NoSuchElementException
> at 
> java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
> at 
> java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1373)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:353)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1094)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:961)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1183)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> This is basically the same as the one described in YARN-7382, but the root 
> cause is different.
> When we create an application attempt, we create an {{FSAppAttempt}} object. 
> This contains an {{AppSchedulingInfo}} which contains a set of 
> {{SchedulerRequestKey}}. Initially, this set is empty and only initialized a 
> bit later on a separate thread during a state transition:
> {noformat}
> 2019-05-07 15:58:02,659 INFO  [RM StateStore dispatcher] 
> recovery.RMStateStore (RMStateStore.java:transition(239)) - Storing info for 
> app: application_1557237478804_0001
> 2019-05-07 15:58:02,684 INFO  [RM Event dispatcher] rmapp.RMAppImpl 
> (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change 
> from NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED
> 2019-05-07 15:58:02,690 INFO  [SchedulerEventDispatcher:Event Processor] 
> fair.FairScheduler (FairScheduler.java:addApplication(490)) - Accepted 
> application application_1557237478804_0001 from user: bacskop, in queue: 
> root.bacskop, currently num of applications: 1
> 2019-05-07 15:58:02,698 INFO  [RM Event dispatcher] rmapp.RMAppImpl 
> (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change 
> from SUBMITTED to ACCEPTED on event = APP_ACCEPTED
> 2019-05-07 15:58:02,731 INFO  [RM Event dispatcher] 
> resourcemanager.ApplicationMasterService 
> (ApplicationMasterService.java:registerAppAttempt(434)) - Registering app 
> attempt : appattempt_1557237478804_0001_01
> 2019-05-07 15:58:02,732 INFO  [RM Event dispatcher] attempt.RMAppAttemptImpl 
> (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_01 
> State change from NEW to SUBMITTED on event = START
> 2019-05-07 15:58:02,746 INFO  [SchedulerEventDispatcher:Event Processor] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:<init>(207)) - *** In the constructor of 
> SchedulerApplicationAttempt
> 2019-05-07 15:58:02,747 INFO  [SchedulerEventDispatcher:Event Processor] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:<init>(230)) - *** Contents of 
> appSchedulingInfo: []
> 2019-05-07 15:58:02,752 INFO  [SchedulerEventDispatcher:Event Processor] 
> fair.FairScheduler (FairScheduler.java:addApplicationAttempt(546)) - Added 
> Application Attempt appattempt_1557237478804_0001_01 to scheduler from 
> user: bacskop
> 
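
The bottom of the quoted stack trace is the key detail: {{ConcurrentSkipListSet.first()}} throws when the set is empty, which is exactly the situation the description lays out (the scheduler-key set is created empty and populated later on another thread). A standalone demonstration of that JDK behavior, independent of any YARN code:

{code:java}
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentSkipListSet;

// Minimal reproduction of the exception at the bottom of the stack trace:
// calling first() on a not-yet-populated ConcurrentSkipListSet throws
// NoSuchElementException, which is what NODE_UPDATE hits when it reaches an
// attempt whose scheduler keys have not been initialized yet.
public class EmptySkipListSetDemo {
    public static void main(String[] args) {
        ConcurrentSkipListSet<Integer> keys = new ConcurrentSkipListSet<>();
        try {
            keys.first();
        } catch (NoSuchElementException e) {
            System.out.println("first() on an empty set: " + e);
        }
    }
}
{code}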

[jira] [Commented] (YARN-9848) revert YARN-4946

2019-09-20 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934095#comment-16934095
 ] 

Steven Rand commented on YARN-9848:
---

Attached a patch which reverts YARN-4946 on trunk. The revert applied cleanly 
to the logic in {{RMAppManager}}, but had several conflicts in 
{{TestAppManager}}.

Tagging [~ccondit], [~wangda], [~rkanter], [~snemeth]

> revert YARN-4946
> 
>
> Key: YARN-9848
> URL: https://issues.apache.org/jira/browse/YARN-9848
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Priority: Major
> Attachments: YARN-9848-01.patch
>
>
> In YARN-4946, we've been discussing a revert due to the potential for keeping 
> more applications in the state store than desired, and the potential to 
> greatly increase RM recovery times.
>  
> I'm in favor of reverting the patch, but other ideas along the lines of 
> YARN-9571 would work as well.






[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state

2019-09-20 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934091#comment-16934091
 ] 

Steven Rand commented on YARN-4946:
---

I created YARN-9848 for reverting.

> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-4946.001.patch, YARN-4946.002.patch, 
> YARN-4946.003.patch, YARN-4946.004.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed 
> from its history) until the aggregation status has reached a terminal state 
> (e.g. SUCCEEDED, FAILED, TIME_OUT).






[jira] [Updated] (YARN-9848) revert YARN-4946

2019-09-20 Thread Steven Rand (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-9848:
--
Attachment: YARN-9848-01.patch

> revert YARN-4946
> 
>
> Key: YARN-9848
> URL: https://issues.apache.org/jira/browse/YARN-9848
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Priority: Major
> Attachments: YARN-9848-01.patch
>
>
> In YARN-4946, we've been discussing a revert due to the potential for keeping 
> more applications in the state store than desired, and the potential to 
> greatly increase RM recovery times.
>  
> I'm in favor of reverting the patch, but other ideas along the lines of 
> YARN-9571 would work as well.






[jira] [Created] (YARN-9848) revert YARN-4946

2019-09-19 Thread Steven Rand (Jira)
Steven Rand created YARN-9848:
-

 Summary: revert YARN-4946
 Key: YARN-9848
 URL: https://issues.apache.org/jira/browse/YARN-9848
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation, resourcemanager
Reporter: Steven Rand


In YARN-4946, we've been discussing a revert due to the potential for keeping 
more applications in the state store than desired, and the potential to greatly 
increase RM recovery times.

 

I'm in favor of reverting the patch, but other ideas along the lines of 
YARN-9571 would work as well.






[jira] [Commented] (YARN-9552) FairScheduler: NODE_UPDATE can cause NoSuchElementException

2019-09-19 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934050#comment-16934050
 ] 

Steven Rand commented on YARN-9552:
---

This seems like an important fix since it prevents the RM from crashing – any 
chance we can backport it to the 3.2 and 3.1 maintenance releases?

> FairScheduler: NODE_UPDATE can cause NoSuchElementException
> ---
>
> Key: YARN-9552
> URL: https://issues.apache.org/jira/browse/YARN-9552
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9552-001.patch, YARN-9552-002.patch, 
> YARN-9552-003.patch, YARN-9552-004.patch
>
>
> We observed a race condition inside YARN with the following stack trace:
> {noformat}
> 18/11/07 06:45:09.559 SchedulerEventDispatcher:Event Processor ERROR 
> EventDispatcher: Error in handling event type NODE_UPDATE to the Event 
> Dispatcher
> java.util.NoSuchElementException
> at 
> java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
> at 
> java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1373)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:353)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1094)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:961)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1183)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> This is basically the same as the one described in YARN-7382, but the root 
> cause is different.
> When we create an application attempt, we create an {{FSAppAttempt}} object. 
> This contains an {{AppSchedulingInfo}} which contains a set of 
> {{SchedulerRequestKey}}. Initially, this set is empty and only initialized a 
> bit later on a separate thread during a state transition:
> {noformat}
> 2019-05-07 15:58:02,659 INFO  [RM StateStore dispatcher] 
> recovery.RMStateStore (RMStateStore.java:transition(239)) - Storing info for 
> app: application_1557237478804_0001
> 2019-05-07 15:58:02,684 INFO  [RM Event dispatcher] rmapp.RMAppImpl 
> (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change 
> from NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED
> 2019-05-07 15:58:02,690 INFO  [SchedulerEventDispatcher:Event Processor] 
> fair.FairScheduler (FairScheduler.java:addApplication(490)) - Accepted 
> application application_1557237478804_0001 from user: bacskop, in queue: 
> root.bacskop, currently num of applications: 1
> 2019-05-07 15:58:02,698 INFO  [RM Event dispatcher] rmapp.RMAppImpl 
> (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change 
> from SUBMITTED to ACCEPTED on event = APP_ACCEPTED
> 2019-05-07 15:58:02,731 INFO  [RM Event dispatcher] 
> resourcemanager.ApplicationMasterService 
> (ApplicationMasterService.java:registerAppAttempt(434)) - Registering app 
> attempt : appattempt_1557237478804_0001_01
> 2019-05-07 15:58:02,732 INFO  [RM Event dispatcher] attempt.RMAppAttemptImpl 
> (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_01 
> State change from NEW to SUBMITTED on event = START
> 2019-05-07 15:58:02,746 INFO  [SchedulerEventDispatcher:Event Processor] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:<init>(207)) - *** In the constructor of 
> SchedulerApplicationAttempt
> 2019-05-07 15:58:02,747 INFO  [SchedulerEventDispatcher:Event Processor] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:<init>(230)) - *** Contents of 
> appSchedulingInfo: []
> 2019-05-07 15:58:02,752 INFO  [SchedulerEventDispatcher:Event Processor] 
> fair.FairScheduler 

[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state

2019-08-06 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900669#comment-16900669
 ] 

Steven Rand commented on YARN-4946:
---

I reverted this patch in our fork, and now RM recovery time is back to normal, 
and the number of apps being stored in ZK respects the configured maximum again.

Friendly ping for [~wangda] and/or [~ccondit] on the question of either 
reverting this or pursuing a followup along the lines of YARN-9571.

> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-4946.001.patch, YARN-4946.002.patch, 
> YARN-4946.003.patch, YARN-4946.004.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed 
> from its history) until the aggregation status has reached a terminal state 
> (e.g. SUCCEEDED, FAILED, TIME_OUT).






[jira] [Comment Edited] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state

2019-08-02 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898599#comment-16898599
 ] 

Steven Rand edited comment on YARN-4946 at 8/2/19 6:17 AM:
---

I noticed after upgrading a cluster to 3.2.0 that RM recovery now takes about 
20 minutes, whereas before it took less than one minute.

I checked the RM's logs, and noticed that it hits the code path added in this 
patch more than 18 million times:
{code:java}
# The log rotation settings allow for only 20 log files, so actually this 
number is lower than the real count.
$ grep 'but not removing' hadoop--resourcemanager-.log* | wc -l
18092893
{code}
I checked in ZK, and according to {{./zkCli.sh ls 
/rmstore/ZKRMStateRoot/RMAppRoot}}, I have 9,755 apps in the RM state store, 
even though the configured max is 1,000.

I think that what happens when RM recovery starts is:
 * Some number of apps in the state store cause us to handle an 
{{APP_COMPLETED}} event during recovery. I'm not sure exactly how many – 
presumably just those that are finished?
 * Each time we handle one of these events, we call 
{{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}}, 
and in both cases we realize that there are more apps both in ZK and in memory 
than is allowed (limit for both is 1,000).
 * So for each of these events, we go through the for loops in both 
{{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}} 
that try to remove apps from ZK and from memory.
 * For whatever reason – probably a separate issue on this cluster – log 
aggregation isn't complete for any of these apps. So the for loops never manage 
to delete apps. And since the for loops are deterministic, they try to delete 
the same apps every time, but never make progress.

And I think the repetition of these for loops for each {{APP_COMPLETED}} event 
explains the 18 million number – if we can have at most 9,755 finished apps in 
the state store, and for each of those apps we trigger 2 for loops that can 
have at most 8,755 iterations, we very quickly wind up with a lot of iterations.
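
To make that arithmetic concrete, here is a rough sketch -- not the actual 
RMAppManager code, and it assumes one {{APP_COMPLETED}} event per recovered 
finished app, i.e. the worst case described above:
{code:java}
// Hedged sketch only: illustrates the blow-up described above, not the real
// RMAppManager logic. The app counts are the ones observed on this cluster.
public class RecoveryLoopSketch {
  public static void main(String[] args) {
    int appsInStore = 9_755;       // completed apps recovered from ZK
    int maxCompletedApps = 1_000;  // configured maximum
    long iterations = 0;

    // Assume one APP_COMPLETED event per finished app during recovery...
    for (int event = 0; event < appsInStore; event++) {
      // ...and each event walks the same removal candidates in two loops
      // (state store + memory) without deleting anything, because log
      // aggregation never reaches a terminal state for any of them.
      int removalCandidates = appsInStore - maxCompletedApps; // 8,755
      iterations += 2L * removalCandidates;
    }

    // Prints 170810050 -- an upper bound consistent with the >18 million
    // "but not removing" log lines that survive log rotation.
    System.out.println(iterations);
  }
}
{code}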

Because this change can lead to much longer RM recovery times in circumstances 
like this one, I think I prefer option {{a}} from the two listed above.

Or, I think it's also reasonable to modify the patch from YARN-9571 to have a 
hardcoded TTL.


was (Author: steven rand):
I noticed after upgrading a cluster to 3.2.0 that RM recovery now takes about 
20 minutes, whereas before it took less than one minute.

I checked the RM's logs, and noticed that it hits the code path added in this 
patch more than 18 million times
{code:java}
# The log rotation settings allow for only 20 log files, so actually this 
number is lower than the real count.
$ grep 'but not removing' hadoop-palantir-resourcemanager-.log* | wc 
-l
18092893
{code}
I checked in ZK, and according to {{./zkCli.sh ls 
/rmstore/ZKRMStateRoot/RMAppRoot}}, I have 9,755 apps in the RM state store, 
even though the configured max is 1,000.

I think that what happens when RM recovery starts is:
 * Some number of apps in the state store cause us to handle an 
{{APP_COMPLETED}} event during recovery. I'm not sure exactly how many – 
presumably just those that are finished?
 * Each time we handle one of these events, we call 
{{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}}, 
and in both cases we realize that there are more apps both in ZK and in memory 
than is allowed (limit for both is 1,000).
 * So for each of these events, we go through the for loops in both 
{{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}} 
that try to remove apps from ZK and from memory.
 * For whatever reason – probably a separate issue on this cluster – log 
aggregation isn't complete for any of these apps. So the for loops never manage 
to delete apps. And since the for loops are deterministic, they try to delete 
the same apps every time, but never make progress.

And I think the repetition of these for loops for each {{APP_COMPLETED}} event 
explains the 18 million number – if we can have at most 9,755 finished apps in 
the state store, and for each of those apps we trigger 2 for loops that can 
have at most 8,755 iterations, we very quickly wind up with a lot of iterations.

Because this change can lead to much longer RM recovery times in circumstances 
like this one, I think I prefer option {{a}} from the two listed above.

Or, I think it's also reasonable to modify the patch from YARN-9571 to have a 
hardcoded TTL.

> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue 

[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state

2019-08-02 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898599#comment-16898599
 ] 

Steven Rand commented on YARN-4946:
---

I noticed after upgrading a cluster to 3.2.0 that RM recovery now takes about 
20 minutes, whereas before it took less than one minute.

I checked the RM's logs, and noticed that it hits the code path added in this 
patch more than 18 million times
{code:java}
# The log rotation settings allow for only 20 log files, so actually this 
number is lower than the real count.
$ grep 'but not removing' hadoop-palantir-resourcemanager-.log* | wc 
-l
18092893
{code}
I checked in ZK, and according to {{./zkCli.sh ls 
/rmstore/ZKRMStateRoot/RMAppRoot}}, I have 9,755 apps in the RM state store, 
even though the configured max is 1,000.

I think that what happens when RM recovery starts is:
 * Some number of apps in the state store cause us to handle an 
{{APP_COMPLETED}} event during recovery. I'm not sure exactly how many – 
presumably just those that are finished?
 * Each time we handle one of these events, we call 
{{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}}, 
and in both cases we realize that there are more apps both in ZK and in memory 
than is allowed (limit for both is 1,000).
 * So for each of these events, we go through the for loops in both 
{{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}} 
that try to remove apps from ZK and from memory.
 * For whatever reason – probably a separate issue on this cluster – log 
aggregation isn't complete for any of these apps. So the for loops never manage 
to delete apps. And since the for loops are deterministic, they try to delete 
the same apps every time, but never make progress.

And I think the repetition of these for loops for each {{APP_COMPLETED}} event 
explains the 18 million number – if we can have at most 9,755 finished apps in 
the state store, and for each of those apps we trigger 2 for loops that can 
have at most 8,755 iterations, we very quickly wind up with a lot of iterations.

Because this change can lead to much longer RM recovery times in circumstances 
like this one, I think I prefer option {{a}} from the two listed above.

Or, I think it's also reasonable to modify the patch from YARN-9571 to have a 
hardcoded TTL.

> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-4946.001.patch, YARN-4946.002.patch, 
> YARN-4946.003.patch, YARN-4946.004.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed 
> from its history) until the aggregation status has reached a terminal state 
> (e.g. SUCCEEDED, FAILED, TIME_OUT).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9277) Add more restrictions In FairScheduler Preemption

2019-02-12 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766783#comment-16766783
 ] 

Steven Rand commented on YARN-9277:
---

{code}
+// We should not preempt container which has been running for a long time.
+if ((System.currentTimeMillis() - container.getCreationTime()) >=
+getQueue().getFSContext().getPreemptionConfig()
+.getToBePreemptedContainerRuntimeThreshold()) {
+  logPreemptContainerPreCheckInfo(
+  "this container already run a long time!");
+  return false;
+}
+
{code}

I disagree with this because it allows for situations in which starved 
applications can't preempt applications that are over their fair shares. If 
application A is starved and application B is over its fair share, but happens 
to have all its containers running for more than the threshold, then 
application A is unable to preempt and will remain starved.

It might be reasonable to sort preemptable containers by runtime and preempt 
those that have started most recently. However, I worry that this unfairly 
biases the scheduler against applications with shorter-lived tasks.

If code can't be optimized, and really does require very long-running tasks, 
then these jobs can be run in a queue from which preemption isn't allowed via 
the {{allowPreemptionFrom}} property.
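
For what it's worth, the ordering alternative mentioned above would look 
roughly like the sketch below. This is not a proposed patch; the 
{{Candidate}} interface is only illustrative and assumes the same 
{{getCreationTime()}} accessor used in the quoted patch:
{code:java}
import java.util.Comparator;
import java.util.List;

final class PreemptionOrderingSketch {
  // Illustrative stand-in for whatever container type the scheduler uses.
  interface Candidate {
    long getCreationTime();
  }

  // Newest containers first: long-running containers stay preemptable,
  // but only as a last resort once nothing newer can satisfy the request.
  static void orderNewestFirst(List<Candidate> candidates) {
    Comparator<Candidate> newestFirst =
        Comparator.comparingLong(Candidate::getCreationTime).reversed();
    candidates.sort(newestFirst);
  }
}
{code}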

> Add more restrictions In FairScheduler Preemption 
> --
>
> Key: YARN-9277
> URL: https://issues.apache.org/jira/browse/YARN-9277
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Zhaohui Xin
>Assignee: Zhaohui Xin
>Priority: Major
> Attachments: YARN-9277.001.patch, YARN-9277.002.patch
>
>
>  
> I think we should add more restrictions in fair scheduler preemption. 
>  * We should not preempt self
>  * We should not preempt high priority job
>  * We should not preempt container which has been running for a long time.
>  * ...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9041) Optimize FSPreemptionThread#identifyContainersToPreempt method

2018-11-28 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702538#comment-16702538
 ] 

Steven Rand commented on YARN-9041:
---

bq. If we not allowed relax locality, it will executes three statements before 
used this patch. Otherwise it executes only one statement after used this 
patch. So I think reorder the conditions can improve the performance.

Yes, but it could also be true that {{bestContainers}} is {{null}}, which would 
short-circuit the other three checks, or that 
{{ResourceRequest.isAnyLocation(rr.getResourceName())}} is true, which would 
also short-circuit the other three. It's not immediately clear to me which 
condition is most likely to not be met / which one makes the most sense to put 
first in the hope of short-circuiting the others.

Anyway though, all four checks should be very cheap since all just involve 
looking at some object that's already in memory, and none have to make RPC 
calls or do any computation. So I'm okay with any order.
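
Just to spell out the short-circuit point, here is a hedged sketch -- the 
names mirror the conditions listed in the description, not the exact trunk 
code:
{code:java}
final class ShortCircuitSketch {
  static final class BestContainers {
    int numAMContainers;
  }

  static boolean shouldExpandSearchSpace(boolean relaxLocality,
                                         boolean isAnyLocation,
                                         BestContainers bestContainers) {
    // With &&, evaluation stops at the first false operand, so the condition
    // that fails most often saves the most work if it comes first. One hard
    // constraint on any reordering: the null check must stay ahead of the
    // numAMContainers access.
    return relaxLocality
        && !isAnyLocation
        && bestContainers != null
        && bestContainers.numAMContainers > 0;
  }
}
{code}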

> Optimize FSPreemptionThread#identifyContainersToPreempt method
> --
>
> Key: YARN-9041
> URL: https://issues.apache.org/jira/browse/YARN-9041
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler preemption
>Reporter: Wanqiang Ji
>Assignee: Wanqiang Ji
>Priority: Major
> Attachments: YARN-9041.001.patch, YARN-9041.002.patch, 
> YARN-9041.003.patch, YARN-9041.004.patch, YARN-9041.005.patch
>
>
> In FSPreemptionThread#identifyContainersToPreempt method, I suggest if AM 
> preemption, and locality relaxation is allowed, then the search space is 
> expanded to all nodes changed to the remaining nodes. The remaining nodes are 
> equal to all nodes minus the potential nodes.
> Judging condition changed to:
>  # rr.getRelaxLocality()
>  # !ResourceRequest.isAnyLocation(rr.getResourceName())
>  # bestContainers != null
>  # bestContainers.numAMContainers > 0
> If I understand the deviation, please criticize me. thx~



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9066) Deprecate Fair Scheduler min share

2018-11-27 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701348#comment-16701348
 ] 

Steven Rand commented on YARN-9066:
---

+1 -- I agree with the attached doc that since a schedulable's fair share is 
already its guaranteed minimum allocation, it's redundant/confusing to have a 
min share as well. 

> Deprecate Fair Scheduler min share
> --
>
> Key: YARN-9066
> URL: https://issues.apache.org/jira/browse/YARN-9066
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Haibo Chen
>Priority: Major
> Attachments: Proposal_Deprecate_FS_Min_Share.pdf
>
>
> See the attached docs for details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9041) Optimize FSPreemptionThread#identifyContainersToPreempt method

2018-11-26 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699843#comment-16699843
 ] 

Steven Rand commented on YARN-9041:
---

Yes, the v2 patch resolves my concern -- thanks [~jiwq] for fixing that.

I'm curious, what's the motivation for reordering the conditions in the {{if}} 
block?

> Optimize FSPreemptionThread#identifyContainersToPreempt method
> --
>
> Key: YARN-9041
> URL: https://issues.apache.org/jira/browse/YARN-9041
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler preemption
>Reporter: Wanqiang Ji
>Assignee: Wanqiang Ji
>Priority: Major
> Attachments: YARN-9041.001.patch, YARN-9041.002.patch
>
>
> In FSPreemptionThread#identifyContainersToPreempt method, I suggest if AM 
> preemption, and locality relaxation is allowed, then the search space is 
> expanded to all nodes changed to the remaining nodes. The remaining nodes are 
> equal to all nodes minus the potential nodes.
> Judging condition changed to:
>  # rr.getRelaxLocality()
>  # !ResourceRequest.isAnyLocation(rr.getResourceName())
>  # bestContainers != null
>  # bestContainers.numAMContainers > 0
> If I understand the deviation, please criticize me. thx~



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9041) Optimize FSPreemptionThread#identifyContainersToPreempt method

2018-11-20 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694329#comment-16694329
 ] 

Steven Rand commented on YARN-9041:
---

I'm not sure that this is correct. I think that it can lead to failure to 
preempt in cases where we should be preempting. This will happen if the initial 
{{potentialNodes}} contain preemptible containers, but the remaining nodes 
don't.

Example to illustrate what I'm thinking:

* We have nodes A, B, and C
* At first {{potentialNodes}} includes only node A because we're preempting for 
a node-local request for that node
* We find that we can preempt a container on node A, but it's an 
ApplicationMaster
* With this patch, we change the search space to be only nodes B and C (without 
the patch, the search space becomes A, B, and C)
* There are no preemptible containers on nodes B and C

The outcome in this example is that we don't preempt at all. However, what we 
want to do is preempt the AM container on node A.

Hopefully that makes sense, but let me know if I'm misunderstanding.
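
To put the example in code -- a hedged illustration only, using plain sets 
of node names rather than real scheduler objects:
{code:java}
import java.util.HashSet;
import java.util.Set;

final class SearchSpaceExample {
  public static void main(String[] args) {
    Set<String> allNodes = Set.of("A", "B", "C");
    // Node-local request for A; the only preemptable container on A is an AM.
    Set<String> potentialNodes = Set.of("A");

    // With the patch: only the remaining nodes are searched, so A drops out.
    Set<String> withPatch = new HashSet<>(allNodes);
    withPatch.removeAll(potentialNodes);   // {B, C}

    // Without the patch: the search expands to all nodes, so the AM on A
    // can still be preempted when B and C have nothing preemptable.
    Set<String> withoutPatch = allNodes;   // {A, B, C}

    System.out.println(withPatch + " vs " + withoutPatch);
  }
}
{code}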

> Optimize FSPreemptionThread#identifyContainersToPreempt method
> --
>
> Key: YARN-9041
> URL: https://issues.apache.org/jira/browse/YARN-9041
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler preemption
>Reporter: Wanqiang Ji
>Assignee: Wanqiang Ji
>Priority: Major
> Attachments: YARN-9041.001.patch
>
>
> In FSPreemptionThread#identifyContainersToPreempt method, I suggest if AM 
> preemption, and locality relaxation is allowed, then the search space is 
> expanded to all nodes changed to the remaining nodes. The remaining nodes are 
> equal to all nodes minus the potential nodes.
> Judging condition changed to:
>  # rr.getRelaxLocality()
>  # !ResourceRequest.isAnyLocation(rr.getResourceName())
>  # bestContainers != null
>  # bestContainers.numAMContainers > 0
> If I understand the deviation, please criticize me. thx~



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8903) when NM becomes unhealthy due to local disk usage, have option to kill application using most space instead of releasing all containers on node

2018-10-17 Thread Steven Rand (JIRA)
Steven Rand created YARN-8903:
-

 Summary: when NM becomes unhealthy due to local disk usage, have 
option to kill application using most space instead of releasing all containers 
on node
 Key: YARN-8903
 URL: https://issues.apache.org/jira/browse/YARN-8903
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 3.1.1
Reporter: Steven Rand


We sometimes experience an issue in which a single application, usually a Spark 
job, causes at least one node in a YARN cluster to become unhealthy by filling 
up the local dir(s) on that node past the threshold at which the node is 
considered unhealthy.

When this happens, the impact is potentially large depending on what else is 
running on that node, as all containers on that node are lost. Sometimes not 
much else is running on the node and it's fine, but other times we lose AM 
containers from other apps and/or non-AM containers with long-running tasks.

I thought that it would be helpful to add an option (default false) whereby if 
a node is going to become unhealthy due to full local disk(s), it instead 
identifies the application that's using the most local disk space on that node, 
and kills that application. (Roughly analogous to how the OOM killer in Linux 
picks one process to kill rather than letting the machine crash.)

The benefit is that only one application is impacted, and no other application 
loses any containers. This prevents one user's poorly written code that 
shuffles/spills huge amounts of data from negatively impacting other users.

The downside is that we're killing the entire application, not just the task(s) 
responsible for the local disk usage. I believe it's necessary to kill the whole 
application rather than identify the specific container running the relevant 
task(s), because identifying that container would require more knowledge of the 
internal state of the aux services responsible for shuffling than YARN has, as 
far as I understand.

If this seems reasonable, I can work on the implementation.
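
As a rough illustration of the selection step only -- a hedged sketch; the 
per-app usage map and how it would be populated are assumptions, not 
existing NodeManager APIs:
{code:java}
// Hedged sketch: given per-application local disk usage on this node, pick
// the single application to kill, analogous to the OOM killer picking one
// process. Populating the usage map is the part that needs real NM work.
import java.util.Map;
import java.util.Optional;

final class LargestLocalDiskUserSketch {
  static Optional<String> pickApplicationToKill(Map<String, Long> bytesUsedByAppId) {
    return bytesUsedByAppId.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey); // empty if nothing is using local disk here
  }
}
{code}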



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7903) Method getStarvedResourceRequests() only consider the first encountered resource

2018-02-08 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357923#comment-16357923
 ] 

Steven Rand commented on YARN-7903:
---

Agreed that having a concept of delay scheduling for preemption is a good idea 
and would help with both JIRAs. We might be able to use 
{{FSAppAttempt.getAllowedLocalityLevel}} or 
{{FSAppAttempt.getAllowedLocalityLevelByTime}}, since those already have logic 
for checking whether the app has waited longer than the threshold for requests 
with some {{SchedulerKey}} (which seems to really just mean priority?). I'll 
defer to others though on whether it makes sense for delay logic in preemption 
to match delay logic in allocation -- possibly there are differences between 
the two that call for separate logic.

I'm also quite confused as to how we should be thinking about different RRs 
from the same app at the same priority. I spent some time digging through the 
code today, but don't really understand it yet. There are a couple pieces of 
code I found that deal with deduping/deconflicting RRs, but I wasn't sure how 
to interpret them:

* {{VisitedResourceRequestTracker}} seems to consider RRs with the same 
priority and capability to be logically the same
* {{AppSchedulingInfo#internalAddResourceRequests}} seems to consider RRs with 
the same {{SchedulerRequestKey}} and resourceName to be logically the same

> Method getStarvedResourceRequests() only consider the first encountered 
> resource
> 
>
> Key: YARN-7903
> URL: https://issues.apache.org/jira/browse/YARN-7903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Yufei Gu
>Priority: Major
>
> We need to specify rack and ANY while submitting a node local resource 
> request, as YARN-7561 discussed. For example:
> {code}
> ResourceRequest nodeRequest =
> createResourceRequest(GB, node1.getHostName(), 1, 1, false);
> ResourceRequest rackRequest =
> createResourceRequest(GB, node1.getRackName(), 1, 1, false);
> ResourceRequest anyRequest =
> createResourceRequest(GB, ResourceRequest.ANY, 1, 1, false);
> List resourceRequests =
> Arrays.asList(nodeRequest, rackRequest, anyRequest);
> {code}
> However, method getStarvedResourceRequests() only consider the first 
> encountered resource, which most likely is ResourceRequest.ANY. That's a 
> mismatch for locality request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7655) Avoid AM preemption caused by RRs for specific nodes or racks

2018-02-08 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357913#comment-16357913
 ] 

Steven Rand commented on YARN-7655:
---

Thanks [~yufeigu]. I filed YARN-7910 for the {{TODO}} in the unit test. 
Unfortunately I also realized that I made a mistake in how I interpreted the 
value of {{ResourceRequest.getRelaxLocality}} -- filed YARN-7911 for that.

> Avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-7655-001.patch, YARN-7655-002.patch, 
> YARN-7655-003.patch, YARN-7655-004.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial implementation of this. We've been running it on a few 
> clusters, and have seen AM preemptions drop from double-digit occurrences on 
> many days to zero.
> Of course, the tradeoff is some loss of locality, since the starved app is 
> less likely to be allocated resources at the most specific locality level 
> that it asked for. My opinion is that this tradeoff is worth it, but 
> interested to hear what others think as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7911) Method identifyContainersToPreempt uses ResourceRequest#getRelaxLocality incorrectly

2018-02-08 Thread Steven Rand (JIRA)
Steven Rand created YARN-7911:
-

 Summary: Method identifyContainersToPreempt uses 
ResourceRequest#getRelaxLocality incorrectly
 Key: YARN-7911
 URL: https://issues.apache.org/jira/browse/YARN-7911
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler, resourcemanager
Affects Versions: 3.1.0
Reporter: Steven Rand
Assignee: Steven Rand


After YARN-7655, in {{identifyContainersToPreempt}} we expand the search space 
to all nodes if we had previously only considered a subset to satisfy a 
{{NODE_LOCAL}} or {{RACK_LOCAL}} RR, and were going to preempt AM containers as 
a result, and the RR allowed locality to be relaxed:

{code}
// Don't preempt AM containers just to satisfy local requests if relax
// locality is enabled.
if (bestContainers != null
&& bestContainers.numAMContainers > 0
&& !ResourceRequest.isAnyLocation(rr.getResourceName())
&& rr.getRelaxLocality()) {
  bestContainers = identifyContainersToPreemptForOneContainer(
  scheduler.getNodeTracker().getAllNodes(), rr);
}
{code}

This turns out to be based on a misunderstanding of what 
{{rr.getRelaxLocality}} means. I had believed that it means that locality can 
be relaxed _from_ that level. However, it actually means that locality can be 
relaxed _to_ that level: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ResourceRequest.java#L450.

For example, suppose we have {{relaxLocality}} set to {{true}} at the node 
level, but {{false}} at the rack and {{ANY}} levels. This is saying that we 
cannot relax locality to the rack level. However, the current behavior after 
YARN-7655 is to interpret relaxLocality being true at the node level as saying 
that it's okay to satisfy the request elsewhere.

What we should do instead is check whether relaxLocality is enabled for the 
corresponding RR at the next level. So if we're considering a node-level RR, we 
should find the corresponding rack-level RR and check whether relaxLocality is 
enabled for it. And similarly, if we're considering a rack-level RR, we should 
check the corresponding any-level RR.

It may also be better to use {{FSAppAttempt#getAllowedLocalityLevel}} instead 
of explicitly checking {{relaxLocality}}, but I'm not sure which is correct.
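
To illustrate the example above with plain booleans -- hedged, no real 
{{ResourceRequest}} objects involved:
{code:java}
// relaxLocality per level: node=true, rack=false, i.e. "do NOT relax to the rack".
final class RelaxLocalityExample {
  public static void main(String[] args) {
    boolean nodeLevelRelax = true;
    boolean rackLevelRelax = false;

    // Current reading after YARN-7655: the node-level flag itself is taken
    // to mean "this request may be satisfied elsewhere".
    boolean currentExpandsSearchSpace = nodeLevelRelax;   // true

    // Reading proposed above: relaxing a node-level RR is governed by the
    // corresponding rack-level RR's flag.
    boolean proposedExpandsSearchSpace = rackLevelRelax;  // false

    System.out.println(currentExpandsSearchSpace + " vs " + proposedExpandsSearchSpace);
  }
}
{code}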



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7910) Fix TODO in TestFairSchedulerPreemption#testRelaxLocalityToNotPreemptAM

2018-02-08 Thread Steven Rand (JIRA)
Steven Rand created YARN-7910:
-

 Summary: Fix TODO in 
TestFairSchedulerPreemption#testRelaxLocalityToNotPreemptAM
 Key: YARN-7910
 URL: https://issues.apache.org/jira/browse/YARN-7910
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler, test
Affects Versions: 3.1.0
Reporter: Steven Rand
Assignee: Steven Rand


In YARN-7655, we left a {{TODO}} in the newly added test:

{code}
// TODO (YARN-7655) The starved app should be allocated 4 containers.
// It should be possible to modify the RRs such that this is true
// after YARN-7903.
verifyPreemption(0, 4);
{code}

This JIRA is to track resolving that after YARN-7903 is resolved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7903) Method getStarvedResourceRequests() only consider the first encountered resource

2018-02-07 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356327#comment-16356327
 ] 

Steven Rand commented on YARN-7903:
---

Agreed that it seems weird/wrong to ignore locality when considering which of 
an app's RRs to preempt for. I think it's worth noting though that if we change 
the code to choose the most local request, then we increase the frequency of 
the failure mode described in YARN-6956, where we fail to preempt because 
{{getStarvedResourceRequests}} returns only {{NODE_LOCAL}} RRs, and there 
aren't any preemptable containers on those nodes (even though there are 
preemptable containers on other nodes). I think that we should try to make 
progress on that JIRA as well as this one.

> Method getStarvedResourceRequests() only consider the first encountered 
> resource
> 
>
> Key: YARN-7903
> URL: https://issues.apache.org/jira/browse/YARN-7903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Yufei Gu
>Priority: Major
>
> We need to specify rack and ANY while submitting a node local resource 
> request, as YARN-7561 discussed. For example:
> {code}
> ResourceRequest nodeRequest =
> createResourceRequest(GB, node1.getHostName(), 1, 1, false);
> ResourceRequest rackRequest =
> createResourceRequest(GB, node1.getRackName(), 1, 1, false);
> ResourceRequest anyRequest =
> createResourceRequest(GB, ResourceRequest.ANY, 1, 1, false);
> List resourceRequests =
> Arrays.asList(nodeRequest, rackRequest, anyRequest);
> {code}
> However, method getStarvedResourceRequests() only consider the first 
> encountered resource, which most likely is ResourceRequest.ANY. That's a 
> mismatch for locality request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2018-02-07 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356308#comment-16356308
 ] 

Steven Rand commented on YARN-7655:
---

Sounds good, I revised the patch to mention YARN-7903 in a comment in the test. 
Is this patch blocked on YARN-7903, or is it enough to leave the {{TODO}} for 
now and revise the RRs after that JIRA is resolved?

Relatedly, I thought it was odd that {{FSAppAttempt#hasContainerForNode}} only 
considers the size of the off-switch ask, even when there also exists a 
RACK_LOCAL request and/or a NODE_LOCAL request. I don't understand that code 
super well though, so it might be correct.

> avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-7655-001.patch, YARN-7655-002.patch, 
> YARN-7655-003.patch, YARN-7655-004.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial implementation of this. We've been running it on a few 
> clusters, and have seen AM preemptions drop from double-digit occurrences on 
> many days to zero.
> Of course, the tradeoff is some loss of locality, since the starved app is 
> less likely to be allocated resources at the most specific locality level 
> that it asked for. My opinion is that this tradeoff is worth it, but 
> interested to hear what others think as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2018-02-07 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-7655:
--
Attachment: YARN-7655-004.patch

> avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-7655-001.patch, YARN-7655-002.patch, 
> YARN-7655-003.patch, YARN-7655-004.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial implementation of this. We've been running it on a few 
> clusters, and have seen AM preemptions drop from double-digit occurrences on 
> many days to zero.
> Of course, the tradeoff is some loss of locality, since the starved app is 
> less likely to be allocated resources at the most specific locality level 
> that it asked for. My opinion is that this tradeoff is worth it, but 
> interested to hear what others think as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2018-02-05 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-7655:
--
Attachment: YARN-7655-003.patch

> avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-7655-001.patch, YARN-7655-002.patch, 
> YARN-7655-003.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial implementation of this. We've been running it on a few 
> clusters, and have seen AM preemptions drop from double-digit occurrences on 
> many days to zero.
> Of course, the tradeoff is some loss of locality, since the starved app is 
> less likely to be allocated resources at the most specific locality level 
> that it asked for. My opinion is that this tradeoff is worth it, but 
> interested to hear what others think as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2018-02-05 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353279#comment-16353279
 ] 

Steven Rand commented on YARN-7655:
---

The concern I have with all three RRs being the same size is that we don't 
necessarily consider the {{NODE_LOCAL}} RR for preemption. My understanding is 
that we might wind up preempting for one of the other RRs, in which case we're 
no longer testing the change to the production code. Let me know if I'm 
misunderstanding though.

> avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-7655-001.patch, YARN-7655-002.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial implementation of this. We've been running it on a few 
> clusters, and have seen AM preemptions drop from double-digit occurrences on 
> many days to zero.
> Of course, the tradeoff is some loss of locality, since the starved app is 
> less likely to be allocated resources at the most specific locality level 
> that it asked for. My opinion is that this tradeoff is worth it, but 
> interested to hear what others think as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2018-02-04 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351704#comment-16351704
 ] 

Steven Rand commented on YARN-7655:
---

Thanks [~yufeigu], new patch is attached. 

Unfortunately I'm still struggling to have the starved app be allocated the 
right number of containers in the test (though the preemption part happens 
correctly). The details of that are in my first comment above. It seems like 
the options are:

* What the current patch does, which is just leave a TODO above where we check 
for allocation.
* Only test that the preemption went as expected, and don't test allocation, 
i.e., don't call {{verifyPreemption}}.
* Find a way to have the allocation work out while still guaranteeing that the 
RR we consider for preemption is the {{NODE_LOCAL}} one. I thought I'd be able 
to figure this out, but have to admit I've been unsuccessful.

> avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-7655-001.patch, YARN-7655-002.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial implementation of this. We've been running it on a few 
> clusters, and have seen AM preemptions drop from double-digit occurrences on 
> many days to zero.
> Of course, the tradeoff is some loss of locality, since the starved app is 
> less likely to be allocated resources at the most specific locality level 
> that it asked for. My opinion is that this tradeoff is worth it, but 
> interested to hear what others think as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2018-02-04 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-7655:
--
Attachment: YARN-7655-002.patch

> avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-7655-001.patch, YARN-7655-002.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial implementation of this. We've been running it on a few 
> clusters, and have seen AM preemptions drop from double-digit occurrences on 
> many days to zero.
> Of course, the tradeoff is some loss of locality, since the starved app is 
> less likely to be allocated resources at the most specific locality level 
> that it asked for. My opinion is that this tradeoff is worth it, but 
> interested to hear what others think as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2018-01-29 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344335#comment-16344335
 ] 

Steven Rand commented on YARN-7655:
---

Sounds good, thanks!

> avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-7655-001.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial implementation of this. We've been running it on a few 
> clusters, and have seen AM preemptions drop from double-digit occurrences on 
> many days to zero.
> Of course, the tradeoff is some loss of locality, since the starved app is 
> less likely to be allocated resources at the most specific locality level 
> that it asked for. My opinion is that this tradeoff is worth it, but 
> interested to hear what others think as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2018-01-26 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341964#comment-16341964
 ] 

Steven Rand commented on YARN-7655:
---

I'm not sure whether many AMs wind up on a limited number of NMs. It's quite 
possible -- my guess based on application patterns is that these clusters are 
running more AMs per node than most other clusters are. 

Thanks for the two links. It does look like both of those things would let us 
spread out the AMs better, which should lead to fewer total AM preemptions, 
though it wouldn't necessarily prevent local requests from causing them.

Do you think the patch is worth pursuing? I'll buy that the clusters I have in 
mind likely were seeing so many AM preemptions due to a combination of custom 
config and access patterns involving many YARN applications, and therefore many 
AMs. On the other hand, the patch is a small change, and should be beneficial 
if you value not having to retry your app due to AM preemption more than you 
value the associated loss of locality, which I suspect most people do.

> avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-7655-001.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial implementation of this. We've been running it on a few 
> clusters, and have seen AM preemptions drop from double-digit occurrences on 
> many days to zero.
> Of course, the tradeoff is some loss of locality, since the starved app is 
> less likely to be allocated resources at the most specific locality level 
> that it asked for. My opinion is that this tradeoff is worth it, but 
> interested to hear what others think as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2018-01-14 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325928#comment-16325928
 ] 

Steven Rand edited comment on YARN-7655 at 1/15/18 6:49 AM:


Thanks [~yufeigu] for taking a look. The cluster sizes and nodes should be 
pretty reasonable – for the three clusters I have in mind, the nodes are AWS 
ec2 instances with around 120 GB of RAM and around 20 vcores. The clusters 
range in size from double-digits to low triple-digits.

That said, there is some configuration in place at these clusters which could 
explain high rates of AM preemption. Specifically:
 * The default max AM share is set to -1. Unfortunately the max AM share 
feature, while totally reasonable as far as I can tell, was causing a good deal 
of confusion when apps would fail to start for no apparent reason upon hitting 
the limit, and we disabled it in the hope that having one less variable would 
make the scheduler's behavior easier to understand.
 * The default fair share preemption threshold is set to 1.0. This was also an 
attempt to reduce confusion, as failure to preempt while below fair share (but 
above fair share * the threshold) was commonly misinterpreted as a bug.
 * The preemption timeouts for fair share and min share are also non-default – 
they're set to one second each.

Possibly the configuration overrides, along with access patterns that include 
apps frequently starting up or increasing their demand via Spark's dynamic 
allocation feature, are the issue here, in which case we don't need to pursue 
this JIRA further. Data on whether or not other YARN deployments experience 
this issue would be useful, though not easy to come by, as I had to add custom 
logging to identify NODE_LOCAL requests as the cause of most AM preemptions at 
these clusters.


was (Author: steven rand):
Thanks [~yufeigu] for taking a look. The cluster sizes and nodes should be 
pretty reasonable -- for the three clusters I have in mind, the nodes are AWS 
ec2 instances with around 120 GB of RAM and around 20 vcores. The clusters 
range in size from double-digits to low triple-digits.

That said, there is some configuration in place at these clusters which could 
explain high rates of AM preemption. Specifically:

* The default max AM share is set to -1. Unfortunately the max AM share 
feature, while totally reasonable as far as I can tell, was causing a good deal 
of confusion when apps would fail to start for no apparent reason upon 
hitting the limit, and we disabled it in the hope that having one less variable 
would make the scheduler's behavior easier to understand.
* The default fair share preemption threshold is set to 1.0. This was also an 
attempt to reduce confusion, as failure to preempt while below fair share (but 
above fair share * the threshold) was commonly misinterpreted as a bug.
* The preemption timeouts for fair share and min share are also non-default -- 
they're set to one second each.

Possibly the configuration overrides, along with access patterns that include 
apps frequently starting up or increasing their demand via Spark's dynamic 
allocation feature, are the issue here, in which case we don't need to pursue 
this JIRA further. Data on whether or not other YARN deployments experience 
this issue would be useful, though not easy to come by, as I had to add custom 
logging to identify NODE_LOCAL requests as the cause of most AM preemptions at 
these clusters.

> avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-7655-001.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial 

[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2018-01-14 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325928#comment-16325928
 ] 

Steven Rand commented on YARN-7655:
---

Thanks [~yufeigu] for taking a look. The cluster sizes and nodes should be 
pretty reasonable -- for the three clusters I have in mind, the nodes are AWS 
ec2 instances with around 120 GB of RAM and around 20 vcores. The clusters 
range in size from double-digits to low triple-digits.

That said, there is some configuration in place at these clusters which could 
explain high rates of AM preemption. Specifically:

* The default max AM share is set to -1. Unfortunately the max AM share 
feature, while totally reasonable as far as I can tell, was causing a good deal 
of confusion when apps would fail to start for no apparent reason upon 
hitting the limit, and we disabled it in the hope that having one less variable 
would make the scheduler's behavior easier to understand.
* The default fair share preemption threshold is set to 1.0. This was also an 
attempt to reduce confusion, as failure to preempt while below fair share (but 
above fair share * the threshold) was commonly misinterpreted as a bug.
* The preemption timeouts for fair share and min share are also non-default -- 
they're set to one second each.

Possibly the configuration overrides, along with access patterns that include 
apps frequently starting up or increasing their demand via Spark's dynamic 
allocation feature, are the issue here, in which case we don't need to pursue 
this JIRA further. Data on whether or not other YARN deployments experience 
this issue would be useful, though not easy to come by, as I had to add custom 
logging to identify NODE_LOCAL requests as the cause of most AM preemptions at 
these clusters.

> avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-7655-001.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial implementation of this. We've been running it on a few 
> clusters, and have seen AM preemptions drop from double-digit occurrences on 
> many days to zero.
> Of course, the tradeoff is some loss of locality, since the starved app is 
> less likely to be allocated resources at the most specific locality level 
> that it asked for. My opinion is that this tradeoff is worth it, but 
> interested to hear what others think as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2018-01-03 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310553#comment-16310553
 ] 

Steven Rand commented on YARN-7655:
---

Tagging [~yufeigu] and [~templedf] for thoughts. I can work through the above 
weirdness with the test case, but I'm interested to hear what people think of 
the proposed change.

> avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-7655-001.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial implementation of this. We've been running it on a few 
> clusters, and have seen AM preemptions drop from double-digit occurrences on 
> many days to zero.
> Of course, the tradeoff is some loss of locality, since the starved app is 
> less likely to be allocated resources at the most specific locality level 
> that it asked for. My opinion is that this tradeoff is worth it, but 
> interested to hear what others think as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2017-12-13 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290073#comment-16290073
 ] 

Steven Rand commented on YARN-7655:
---

One issue I'm having with the test in the patch is that preemption works as 
expected, but the starved app doesn't have any containers allocated to it. I 
think the series of events that causes this is:

* For purposes of the test, I'm only interested in requesting resources on a 
particular node. But as discussed in YARN-7561, this requires me to also make a 
rack-local request and a request for any node at the same priority.
* To make sure that the RR that we consider for preemption is the node-local 
one, I made the other two RRs too big to be satisfied, so that way 
{{getStarvedResourceRequests}} skips them.
* However, when we go to allocate the preempted resources to the starving app, 
it turns out that {{FSAppAttempt#hasContainerForNode}} only looks at the 
capacity of the off-switch ask: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1071.
 This causes it to decide that the starving app can't be allocated resources on 
the node, since I intentionally made the off-switch RR too big to fit on any of 
the test nodes. The fact that the node-local request (for the other node) is 
small enough to fit on this node gets ignored.

I'm having trouble figuring out what to do about this. I had assumed that if 
relaxLocality was true for an RR, then it could be satisfied on node B even 
though it asked for node A. Is this not correct? Or should 
FSAppAttempt#hasContainerForNode be modified to also check the sizes of the 
asks at the rack and node levels (if those exist)?
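
For concreteness, the kind of check I have in mind would look roughly like the 
sketch below. This is only an illustration with stand-in parameters -- not the 
actual {{FSAppAttempt#hasContainerForNode}} signature -- treating the node as a 
match if any of the three asks at the priority fits in its unallocated resources:

{code}
// Sketch only, not the real FSAppAttempt code: besides the off-switch ask, also
// consider the rack-local and node-local asks (which may be null if not made).
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class HasContainerForNodeSketch {
  static boolean hasContainerForNode(Resource offSwitchAsk, Resource rackLocalAsk,
      Resource nodeLocalAsk, Resource unallocated) {
    return Resources.fitsIn(offSwitchAsk, unallocated)
        || (rackLocalAsk != null && Resources.fitsIn(rackLocalAsk, unallocated))
        || (nodeLocalAsk != null && Resources.fitsIn(nodeLocalAsk, unallocated));
  }
}
{code}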

> avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-7655-001.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial implementation of this. We've been running it on a few 
> clusters, and have seen AM preemptions drop from double-digit occurrences on 
> many days to zero.
> Of course, the tradeoff is some loss of locality, since the starved app is 
> less likely to be allocated resources at the most specific locality level 
> that it asked for. My opinion is that this tradeoff is worth it, but 
> interested to hear what others think as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2017-12-13 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-7655:
--
Attachment: YARN-7655-001.patch

> avoid AM preemption caused by RRs for specific nodes or racks
> -
>
> Key: YARN-7655
> URL: https://issues.apache.org/jira/browse/YARN-7655
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-7655-001.patch
>
>
> We frequently see AM preemptions when 
> {{starvedApp.getStarvedResourceRequests()}} in 
> {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
> that request containers on a specific node. Since this causes us to only 
> consider one node to preempt containers on, the really good work that was 
> done in YARN-5830 doesn't save us from AM preemption. Even though there might 
> be multiple nodes on which we could preempt enough non-AM containers to 
> satisfy the app's starvation, we often wind up preempting one or more AM 
> containers on the single node that we're considering.
> A proposed solution is that if we're going to preempt one or more AM 
> containers for an RR that specifies a node or rack, then we should instead 
> expand the search space to consider all nodes. That way we take advantage of 
> YARN-5830, and only preempt AMs if there's no alternative. I've attached a 
> patch with an initial implementation of this. We've been running it on a few 
> clusters, and have seen AM preemptions drop from double-digit occurrences on 
> many days to zero.
> Of course, the tradeoff is some loss of locality, since the starved app is 
> less likely to be allocated resources at the most specific locality level 
> that it asked for. My opinion is that this tradeoff is worth it, but 
> interested to hear what others think as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks

2017-12-13 Thread Steven Rand (JIRA)
Steven Rand created YARN-7655:
-

 Summary: avoid AM preemption caused by RRs for specific nodes or 
racks
 Key: YARN-7655
 URL: https://issues.apache.org/jira/browse/YARN-7655
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 3.0.0
Reporter: Steven Rand
Assignee: Steven Rand


We frequently see AM preemptions when 
{{starvedApp.getStarvedResourceRequests()}} in 
{{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs 
that request containers on a specific node. Since this causes us to only 
consider one node to preempt containers on, the really good work that was done 
in YARN-5830 doesn't save us from AM preemption. Even though there might be 
multiple nodes on which we could preempt enough non-AM containers to satisfy 
the app's starvation, we often wind up preempting one or more AM containers on 
the single node that we're considering.

A proposed solution is that if we're going to preempt one or more AM containers 
for an RR that specifies a node or rack, then we should instead expand the 
search space to consider all nodes. That way we take advantage of YARN-5830, 
and only preempt AMs if there's no alternative. I've attached a patch with an 
initial implementation of this. We've been running it on a few clusters, and 
have seen AM preemptions drop from double-digit occurrences on many days to 
zero.

Of course, the tradeoff is some loss of locality, since the starved app is less 
likely to be allocated resources at the most specific locality level that it 
asked for. My opinion is that this tradeoff is worth it, but interested to hear 
what others think as well.
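
To make the idea concrete, here's a rough sketch of the proposed control flow, 
using stand-in types rather than the actual {{FSPreemptionThread}} structures 
(the attached patch is the real change; this is only an illustration):

{code}
import java.util.List;
import java.util.Optional;

public class AmFriendlyPreemptionSketch {
  /** Stand-in for the set of containers identified for preemption on one node. */
  static class Candidate {
    final boolean includesAmContainer;
    Candidate(boolean includesAmContainer) {
      this.includesAmContainer = includesAmContainer;
    }
  }

  /**
   * If the candidate found while honoring a node- or rack-specific request would
   * preempt an AM, widen the search to all nodes and prefer any AM-free candidate,
   * falling back to the original only when no such candidate exists.
   */
  static Candidate choose(Candidate forRequestedLocation, List<Candidate> allNodes) {
    if (forRequestedLocation == null || !forRequestedLocation.includesAmContainer) {
      return forRequestedLocation;
    }
    Optional<Candidate> amFree =
        allNodes.stream().filter(c -> !c.includesAmContainer).findFirst();
    return amFree.orElse(forRequestedLocation);
  }
}
{code}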



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7290) canContainerBePreempted can return true when it shouldn't

2017-11-22 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-7290:
--
Attachment: YARN-7290.005.patch

Thanks, [~yufeigu]. Attaching a new patch which removes the list of containers, 
and changes {{resourcesToPreemptByApp}} to a {{Map}}.

Re: the checkstyle issues, one of them no longer applies to the new patch. The 
other one is that the new {{containersByApp}} variable should be made private 
and an accessor method should be created for it. I'm happy to do that, but it 
also would be inconsistent with the other variables in 
{{PreemptableContainers}}, which aren't private and don't have getters. I don't 
have a strong opinion, so happy to handle this however people prefer.
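
To illustrate the {{Map}} change described above, here's a minimal sketch of the 
bookkeeping (the key and value types are assumptions for illustration, not 
necessarily what the patch uses):

{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Keeps a running per-app total of resources already selected for preemption on
// the node being examined, so the fair-share check can see the cumulative effect
// rather than judging each container in isolation.
public class PreemptableContainersSketch {
  private final Map<ApplicationAttemptId, Resource> resourcesToPreemptByApp =
      new HashMap<>();

  void track(ApplicationAttemptId app, Resource allocation) {
    Resource soFar = resourcesToPreemptByApp.computeIfAbsent(
        app, a -> Resources.createResource(0, 0));
    Resources.addTo(soFar, allocation);
  }

  Resource alreadySelectedFor(ApplicationAttemptId app) {
    return resourcesToPreemptByApp.getOrDefault(app, Resources.none());
  }
}
{code}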

> canContainerBePreempted can return true when it shouldn't
> -
>
> Key: YARN-7290
> URL: https://issues.apache.org/jira/browse/YARN-7290
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-7290-failing-test.patch, YARN-7290.001.patch, 
> YARN-7290.002.patch, YARN-7290.003.patch, YARN-7290.004.patch, 
> YARN-7290.005.patch
>
>
> In FSAppAttempt#canContainerBePreempted, we make sure that preempting the 
> given container would not put the app below its fair share:
> {code}
> // Check if the app's allocation will be over its fairshare even
> // after preempting this container
> Resource usageAfterPreemption = Resources.clone(getResourceUsage());
> // Subtract resources of containers already queued for preemption
> synchronized (preemptionVariablesLock) {
>   Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted);
> }
> // Subtract this container's allocation to compute usage after preemption
> Resources.subtractFrom(
> usageAfterPreemption, container.getAllocatedResource());
> return !isUsageBelowShare(usageAfterPreemption, getFairShare());
> {code}
> However, this only considers one container in isolation, and fails to 
> consider containers for the same app that we already added to 
> {{preemptableContainers}} in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a 
> case where we preempt multiple containers from the same app, none of which by 
> itself puts the app below fair share, but which cumulatively do so.
> I've attached a patch with a test to show this behavior. The flow is:
> 1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated 
> all the resources (8g and 8vcores)
> 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 
> containers, each of which is 3g and 3vcores in size. At this point both 
> greedyApp and starvingApp have a fair share of 4g (with DRF not in use).
> 3. For the first container requested by starvedApp, we (correctly) preempt 3 
> containers from greedyApp, each of which is 1g and 1vcore.
> 4. For the second container requested by starvedApp, we again (this time 
> incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below 
> its fair share, but happens anyway because all six times that we call 
> {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the 
> value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using 
> debugger).
> So in addition to accounting for {{resourcesToBePreempted}}, we also need to 
> account for containers that we're already planning on preempting in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7290) canContainerBePreempted can return true when it shouldn't

2017-11-21 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-7290:
--
Attachment: YARN-7290.004.patch

Thanks for reviewing, [~yufeigu]! I've attached a new patch which addresses the 
comments:

* Agreed that it makes sense to move the logic of computing app resource usage 
after preemption into its own method. I added a new method called 
{{getUsageAfterPreemptingContainer}} below {{canContainerBePreempted}}; a rough 
sketch of the idea follows this list.
* The unit test does cover the second issue. Actually I hadn't noticed the 
second issue from looking at the code, and only noticed it when the test still 
failed after addressing the first issue.
* Agreed that {{identifyContainersToPreemptOnNode}} has quite a lot of logic in 
it now. I moved the map from appId to resources we're considering for 
preemption into {{PreemptableContainers}}, which seems like the right place for 
it, and simplifies that method.
* Unused import is now removed -- nice catch.
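
Here's roughly what the helper computes -- a sketch with stand-in parameters, not 
the exact code in the patch:

{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class UsageAfterPreemptionSketch {
  // Start from current usage, subtract what's already queued for preemption,
  // subtract what has been tentatively selected on the node being examined, then
  // subtract the candidate container itself. The caller compares the result
  // against the app's fair share.
  static Resource getUsageAfterPreemptingContainer(Resource currentUsage,
      Resource alreadyQueuedForPreemption, Resource selectedOnThisNode,
      Resource candidateAllocation) {
    Resource after = Resources.clone(currentUsage);
    Resources.subtractFrom(after, alreadyQueuedForPreemption);
    Resources.subtractFrom(after, selectedOnThisNode);
    Resources.subtractFrom(after, candidateAllocation);
    return after;
  }
}
{code}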

> canContainerBePreempted can return true when it shouldn't
> -
>
> Key: YARN-7290
> URL: https://issues.apache.org/jira/browse/YARN-7290
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-7290-failing-test.patch, YARN-7290.001.patch, 
> YARN-7290.002.patch, YARN-7290.003.patch, YARN-7290.004.patch
>
>
> In FSAppAttempt#canContainerBePreempted, we make sure that preempting the 
> given container would not put the app below its fair share:
> {code}
> // Check if the app's allocation will be over its fairshare even
> // after preempting this container
> Resource usageAfterPreemption = Resources.clone(getResourceUsage());
> // Subtract resources of containers already queued for preemption
> synchronized (preemptionVariablesLock) {
>   Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted);
> }
> // Subtract this container's allocation to compute usage after preemption
> Resources.subtractFrom(
> usageAfterPreemption, container.getAllocatedResource());
> return !isUsageBelowShare(usageAfterPreemption, getFairShare());
> {code}
> However, this only considers one container in isolation, and fails to 
> consider containers for the same app that we already added to 
> {{preemptableContainers}} in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a 
> case where we preempt multiple containers from the same app, none of which by 
> itself puts the app below fair share, but which cumulatively do so.
> I've attached a patch with a test to show this behavior. The flow is:
> 1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated 
> all the resources (8g and 8vcores)
> 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 
> containers, each of which is 3g and 3vcores in size. At this point both 
> greedyApp and starvingApp have a fair share of 4g (with DRF not in use).
> 3. For the first container requested by starvedApp, we (correctly) preempt 3 
> containers from greedyApp, each of which is 1g and 1vcore.
> 4. For the second container requested by starvedApp, we again (this time 
> incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below 
> its fair share, but happens anyway because all six times that we call 
> {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the 
> value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using 
> debugger).
> So in addition to accounting for {{resourcesToBePreempted}}, we also need to 
> account for containers that we're already planning on preempting in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7290) canContainerBePreempted can return true when it shouldn't

2017-11-19 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-7290:
--
Attachment: YARN-7290.003.patch

Uploaded a new patch to try to make the test a bit nicer.

[~templedf], would it be possible for you or someone else to take a look? This 
bug seems to still exist on trunk, and I think it'd be good to fix it.

> canContainerBePreempted can return true when it shouldn't
> -
>
> Key: YARN-7290
> URL: https://issues.apache.org/jira/browse/YARN-7290
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-7290-failing-test.patch, YARN-7290.001.patch, 
> YARN-7290.002.patch, YARN-7290.003.patch
>
>
> In FSAppAttempt#canContainerBePreempted, we make sure that preempting the 
> given container would not put the app below its fair share:
> {code}
> // Check if the app's allocation will be over its fairshare even
> // after preempting this container
> Resource usageAfterPreemption = Resources.clone(getResourceUsage());
> // Subtract resources of containers already queued for preemption
> synchronized (preemptionVariablesLock) {
>   Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted);
> }
> // Subtract this container's allocation to compute usage after preemption
> Resources.subtractFrom(
> usageAfterPreemption, container.getAllocatedResource());
> return !isUsageBelowShare(usageAfterPreemption, getFairShare());
> {code}
> However, this only considers one container in isolation, and fails to 
> consider containers for the same app that we already added to 
> {{preemptableContainers}} in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a 
> case where we preempt multiple containers from the same app, none of which by 
> itself puts the app below fair share, but which cumulatively do so.
> I've attached a patch with a test to show this behavior. The flow is:
> 1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated 
> all the resources (8g and 8vcores)
> 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 
> containers, each of which is 3g and 3vcores in size. At this point both 
> greedyApp and starvingApp have a fair share of 4g (with DRF not in use).
> 3. For the first container requested by starvedApp, we (correctly) preempt 3 
> containers from greedyApp, each of which is 1g and 1vcore.
> 4. For the second container requested by starvedApp, we again (this time 
> incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below 
> its fair share, but happens anyway because all six times that we call 
> {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the 
> value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using 
> debugger).
> So in addition to accounting for {{resourcesToBePreempted}}, we also need to 
> account for containers that we're already planning on preempting in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7391) Consider square root instead of natural log for size-based weight

2017-10-29 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-7391:
--
Attachment: YARN-7391-001.patch

I know this is still under discussion, but I've attached a patch just to make 
the intent and scope of the proposed change totally clear.

> Consider square root instead of natural log for size-based weight
> -
>
> Key: YARN-7391
> URL: https://issues.apache.org/jira/browse/YARN-7391
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Steven Rand
> Attachments: YARN-7391-001.patch
>
>
> Currently for size-based weight, we compute the weight of an app using this 
> code from 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L377:
> {code}
>   if (sizeBasedWeight) {
> // Set weight based on current memory demand
> weight = Math.log1p(app.getDemand().getMemorySize()) / Math.log(2);
>   }
> {code}
> Because the natural log function grows slowly, the weights of two apps with 
> hugely different memory demands can be quite similar. For example, {{weight}} 
> evaluates to 14.3 for an app with a demand of 20 GB, and evaluates to 19.9 
> for an app with a demand of 1000 GB. The app with the much larger demand will 
> still have a higher weight, but not by a large amount relative to the sum of 
> those weights.
> I think it's worth considering a switch to a square root function, which will 
> grow more quickly. In the above example, the app with a demand of 20 GB now 
> has a weight of 143, while the app with a demand of 1000 GB now has a weight 
> of 1012. These weights seem more reasonable relative to each other given the 
> difference in demand between the two apps.
> The above example is admittedly a bit extreme, but I believe that a square 
> root function would also produce reasonable results in general.
> The code I have in mind would look something like:
> {code}
>   if (sizeBasedWeight) {
> // Set weight based on current memory demand
> weight = Math.sqrt(app.getDemand().getMemorySize());
>   }
> {code}
> Would people be comfortable with this change?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7391) Consider square root instead of natural log for size-based weight

2017-10-29 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223996#comment-16223996
 ] 

Steven Rand edited comment on YARN-7391 at 10/29/17 1:27 PM:
-

[~templedf] and [~yufeigu], thanks for commenting. Apologies for not including 
my use case in the original description. We run multiple long-running Spark 
applications, each of which uses Spark's dynamic allocation feature, and 
therefore has a demand which fluctuates over time. At any point, the demand of 
any given app can be quite low (e.g., only an AM container), or quite high 
(e.g., hundreds of executors).

Historically, we've run each app in its own leaf queue, since the Fair 
Scheduler has not always supported preemption inside a leaf queue. We've found 
that since the fair share of a parent queue is split evenly among all of its 
active leaf queues, the fair share of each app is the same, regardless of its 
demand. This causes our apps with higher demand to have fair shares that are 
too low for them to preempt enough resources to even get close to meeting their 
demand. If fair share were based on demand, then our apps with lower demand 
would be unaffected, but our apps with higher demand could have high enough 
weights to preempt a reasonable number of resources away from apps that are 
over their fair shares.

This problem led us to consider running more apps inside the same leaf queue, 
which is no longer an issue now that the Fair Scheduler supports preemption 
between apps in the same leaf queue. We'd hoped to use the size-based weight 
feature to achieve the goal of the more demanding apps having high enough fair 
shares to preempt sufficient resources away from other apps. However, in 
experimenting with this feature, the results were somewhat underwhelming. Yes, 
the more demanding apps now have higher fair shares, but not by enough to 
significantly impact allocation.

Consider, for example, the rather extreme case of 10 apps running in a leaf 
queue, where 9 of them are requesting 20GB each, and 1 of them is requesting 
1024GB. The weight of each of the 9 less demanding apps is about 14.3, and the 
weight of the highly demanding app is about 20.0. So the highly demanding app 
winds up with about 13.5% (20/148) of the queue's fair share, despite having a 
demand that's more than 5x that of the other 9 put together, as opposed to the 
10% it would have with size-based weight turned off. I know the example is a 
bit silly, but I wanted to show that even with huge differences in demand, the 
current behavior of size-based weight doesn't produce major differences in 
weights.

Does that make sense? Happy to provide more info if helpful.


was (Author: steven rand):
[~templedf] and [~yufeigu], thanks for commenting. Apologies for not including 
my use case in the original description. We run multiple long-running Spark 
applications, each of which uses Spark's dynamic allocation feature, and 
therefore has a demand which fluctuates over time. At any point, the demand of 
any given app can be quite low (e.g., only an AM container), or quite high 
(e.g., hundreds of executors).

Historically, we've run each app in its own leaf queue, since the Fair 
Scheduler has not always supported preemption inside a leaf queue. We've found 
that since the fair share of a parent queue is split evenly among all of its 
active leaf queues, the fair share of each app is the same, regardless of its 
demand. This causes our apps with higher demand to have fair shares that are 
too low for them to preempt enough resources to even get close to meeting their 
demand. If fair share were based on demand, then our apps with lower demand 
would be unaffected, but our apps with higher demand could have high enough 
weights to preempt a reasonable number of resources away from apps that are 
over their fair shares.

This problem led us to consider running more apps inside the same leaf queue, 
which is no longer an issue now that the Fair Scheduler supports preemption 
inside a leaf queue. We'd hoped to use the size-based weight feature to achieve 
the goal of the more demanding apps having high enough fair shares to preempt 
sufficient resources away from other apps. However, in experimenting with this 
feature, the results were somewhat underwhelming. Yes, the more demanding apps 
now have higher fair shares, but not by enough to significantly impact 
allocation.

Consider, for example, the rather extreme case of 10 apps running in a leaf 
queue, where 9 of them are requesting 20GB each, and 1 of them is requesting 
1024GB. The weight of each of the 9 less demanding apps is about 14.3, and the 
weight of the highly demanding app is about 20.0. So the highly demanding app 
winds up with about 13.5% (20/148) of the queue's fair share, despite having a 
demand that's more than 5x that of the other 9 put together, as opposed to the 
10% it would 

[jira] [Commented] (YARN-7391) Consider square root instead of natural log for size-based weight

2017-10-29 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223996#comment-16223996
 ] 

Steven Rand commented on YARN-7391:
---

[~templedf] and [~yufeigu], thanks for commenting. Apologies for not including 
my use case in the original description. We run multiple long-running Spark 
applications, each of which uses Spark's dynamic allocation feature, and 
therefore has a demand which fluctuates over time. At any point, the demand of 
any given app can be quite low (e.g., only an AM container), or quite high 
(e.g., hundreds of executors).

Historically, we've run each app in its own leaf queue, since the Fair 
Scheduler has not always supported preemption inside a leaf queue. We've found 
that since the fair share of a parent queue is split evenly among all of its 
active leaf queues, the fair share of each app is the same, regardless of its 
demand. This causes our apps with higher demand to have fair shares that are 
too low for them to preempt enough resources to even get close to meeting their 
demand. If fair share were based on demand, then our apps with lower demand 
would be unaffected, but our apps with higher demand could have high enough 
weights to preempt a reasonable number of resources away from apps that are 
over their fair shares.

This problem led us to consider running more apps inside the same leaf queue, 
which is no longer an issue now that the Fair Scheduler supports preemption 
inside a leaf queue. We'd hoped to use the size-based weight feature to achieve 
the goal of the more demanding apps having high enough fair shares to preempt 
sufficient resources away from other apps. However, in experimenting with this 
feature, the results were somewhat underwhelming. Yes, the more demanding apps 
now have higher fair shares, but not by enough to significantly impact 
allocation.

Consider, for example, the rather extreme case of 10 apps running in a leaf 
queue, where 9 of them are requesting 20GB each, and 1 of them is requesting 
1024GB. The weight of each of the 9 less demanding apps is about 14.3, and the 
weight of the highly demanding app is about 20.0. So the highly demanding app 
winds up with about 13.5% (20/148) of the queue's fair share, despite having a 
demand that's more than 5x that of the other 9 put together, as opposed to the 
10% it would have with size-based weight turned off. I know the example is a 
bit silly, but I wanted to show that even with huge differences in demand, the 
current behavior of size-based weight doesn't produce major differences in 
weights.

Does that make sense? Happy to provide more info if helpful.
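
If anyone wants to check the arithmetic, here's a quick standalone demo (plain 
Java, not scheduler code; demands are in MB to match 
{{app.getDemand().getMemorySize()}}):

{code}
public class SizeBasedWeightDemo {
  public static void main(String[] args) {
    long small = 20L * 1024;     // 20 GB demand, 9 apps
    long large = 1024L * 1024;   // 1024 GB demand, 1 app

    double logSmall = Math.log1p(small) / Math.log(2);   // ~14.3
    double logLarge = Math.log1p(large) / Math.log(2);   // ~20.0
    double sqrtSmall = Math.sqrt(small);                 // ~143
    double sqrtLarge = Math.sqrt(large);                 // 1024

    // Share of the queue that the demanding app's weight buys it in each scheme.
    System.out.printf("log1p: %.1f vs %.1f -> %.1f%% of fair share%n",
        logSmall, logLarge, 100 * logLarge / (9 * logSmall + logLarge));
    System.out.printf("sqrt:  %.1f vs %.1f -> %.1f%% of fair share%n",
        sqrtSmall, sqrtLarge, 100 * sqrtLarge / (9 * sqrtSmall + sqrtLarge));
  }
}
{code}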

> Consider square root instead of natural log for size-based weight
> -
>
> Key: YARN-7391
> URL: https://issues.apache.org/jira/browse/YARN-7391
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Steven Rand
>
> Currently for size-based weight, we compute the weight of an app using this 
> code from 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L377:
> {code}
>   if (sizeBasedWeight) {
> // Set weight based on current memory demand
> weight = Math.log1p(app.getDemand().getMemorySize()) / Math.log(2);
>   }
> {code}
> Because the natural log function grows slowly, the weights of two apps with 
> hugely different memory demands can be quite similar. For example, {{weight}} 
> evaluates to 14.3 for an app with a demand of 20 GB, and evaluates to 19.9 
> for an app with a demand of 1000 GB. The app with the much larger demand will 
> still have a higher weight, but not by a large amount relative to the sum of 
> those weights.
> I think it's worth considering a switch to a square root function, which will 
> grow more quickly. In the above example, the app with a demand of 20 GB now 
> has a weight of 143, while the app with a demand of 1000 GB now has a weight 
> of 1012. These weights seem more reasonable relative to each other given the 
> difference in demand between the two apps.
> The above example is admittedly a bit extreme, but I believe that a square 
> root function would also produce reasonable results in general.
> The code I have in mind would look something like:
> {code}
>   if (sizeBasedWeight) {
> // Set weight based on current memory demand
> weight = Math.sqrt(app.getDemand().getMemorySize());
>   }
> {code}
> Would people be comfortable with this change?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, 

[jira] [Created] (YARN-7391) Consider square root instead of natural log for size-based weight

2017-10-25 Thread Steven Rand (JIRA)
Steven Rand created YARN-7391:
-

 Summary: Consider square root instead of natural log for 
size-based weight
 Key: YARN-7391
 URL: https://issues.apache.org/jira/browse/YARN-7391
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 3.0.0-beta1
Reporter: Steven Rand


Currently for size-based weight, we compute the weight of an app using this 
code from 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L377:

{code}
  if (sizeBasedWeight) {
// Set weight based on current memory demand
weight = Math.log1p(app.getDemand().getMemorySize()) / Math.log(2);
  }
{code}

Because the natural log function grows slowly, the weights of two apps with 
hugely different memory demands can be quite similar. For example, {{weight}} 
evaluates to 14.3 for an app with a demand of 20 GB, and evaluates to 19.9 for 
an app with a demand of 1000 GB. The app with the much larger demand will still 
have a higher weight, but not by a large amount relative to the sum of those 
weights.

I think it's worth considering a switch to a square root function, which will 
grow more quickly. In the above example, the app with a demand of 20 GB now has 
a weight of 143, while the app with a demand of 1000 GB now has a weight of 
1012. These weights seem more reasonable relative to each other given the 
difference in demand between the two apps.

The above example is admittedly a bit extreme, but I believe that a square root 
function would also produce reasonable results in general.

The code I have in mind would look something like:

{code}
  if (sizeBasedWeight) {
// Set weight based on current memory demand
weight = Math.sqrt(app.getDemand().getMemorySize());
  }
{code}

Would people be comfortable with this change?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node

2017-10-23 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216201#comment-16216201
 ] 

Steven Rand commented on YARN-4227:
---

Maybe we could have ClusterNodeTracker#getNode check to see if 
{{nodes.get(nodeId)}} returns null, and if it does, instead log a warning and 
return a special subclass of {{FSSchedulerNode}} that overrides all methods to 
be no-ops? I know it's not pretty, but the advantage is that we don't have to 
check for null in a bunch of different places.

Also, after looking more closely, the particular NPE that I'm seeing turns out 
to have been fixed by YARN-6432. However, I still think we want a generic 
solution so that we're protected against accesses of unhealthy nodes going 
forward.
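
Roughly the shape I'm picturing, using stand-in types rather than the real 
{{ClusterNodeTracker}}/{{FSSchedulerNode}} classes -- just to show the null-object 
idea, not a proposed patch:

{code}
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class NullObjectNodeSketch {
  interface TrackedNode {
    List<String> runningContainerIds();
  }

  // Inert node returned for IDs that are no longer tracked; callers see an empty
  // container list instead of a null that they would each have to check for.
  static final TrackedNode NO_OP_NODE = () -> Collections.<String>emptyList();

  static TrackedNode getNode(Map<String, TrackedNode> nodes, String nodeId) {
    TrackedNode node = nodes.get(nodeId);
    if (node == null) {
      System.err.println(
          "Node " + nodeId + " is no longer tracked; returning no-op node");
      return NO_OP_NODE;
    }
    return node;
  }
}
{code}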

> FairScheduler: RM quits processing expired container from a removed node
> 
>
> Key: YARN-4227
> URL: https://issues.apache.org/jira/browse/YARN-4227
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.3.0, 2.5.0, 2.7.1
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, 
> YARN-4227.patch
>
>
> Under some circumstances the node is removed before an expired container 
> event is processed causing the RM to exit:
> {code}
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
> Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1436927988321_1307950_01_12 Container Transitioned from 
> ACQUIRED to EXPIRED
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
> Completed container: container_1436927988321_1307950_01_12 in state: 
> EXPIRED event:EXPIRE
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op   
>OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS  
> APPID=application_1436927988321_1307950 
> CONTAINERID=container_1436927988321_1307950_01_12
> 2015-10-04 21:14:01,063 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type CONTAINER_EXPIRED to the scheduler
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 
> and 2.6.0 by different customers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node

2017-10-23 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216148#comment-16216148
 ] 

Steven Rand commented on YARN-4227:
---

Sorry, I was mistaken when I said the patch attached to this JIRA prevents the 
NPE. Unfortunately the FSPreemptionThread accesses nodes at multiple points, 
each of which is a new opportunity for the race condition to occur and cause an 
NPE. It seems impractical to wrap each node access in an {{if (node != null)}} 
block, though admittedly I don't have any better ideas right now. Are there 
alternate solutions that I'm failing to consider that would prevent the race 
condition from happening? Happy to submit a patch if anyone has ideas.

> FairScheduler: RM quits processing expired container from a removed node
> 
>
> Key: YARN-4227
> URL: https://issues.apache.org/jira/browse/YARN-4227
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.3.0, 2.5.0, 2.7.1
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, 
> YARN-4227.patch
>
>
> Under some circumstances the node is removed before an expired container 
> event is processed causing the RM to exit:
> {code}
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
> Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1436927988321_1307950_01_12 Container Transitioned from 
> ACQUIRED to EXPIRED
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
> Completed container: container_1436927988321_1307950_01_12 in state: 
> EXPIRED event:EXPIRE
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op   
>OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS  
> APPID=application_1436927988321_1307950 
> CONTAINERID=container_1436927988321_1307950_01_12
> 2015-10-04 21:14:01,063 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type CONTAINER_EXPIRED to the scheduler
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 
> and 2.6.0 by different customers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7290) canContainerBePreempted can return true when it shouldn't

2017-10-18 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210418#comment-16210418
 ] 

Steven Rand commented on YARN-7290:
---

Thanks [~templedf]. For what it's worth, I was able to repro this on a live 
cluster as well as in the test. I let one spark-shell use the entire cluster, 
and then started a second spark-shell. The second spark-shell was able to 
preempt all of the first one's containers, including the Application Master. 
After I applied the patch, the second spark-shell was only able to preempt half 
of the cluster's resources away from the first one.

> canContainerBePreempted can return true when it shouldn't
> -
>
> Key: YARN-7290
> URL: https://issues.apache.org/jira/browse/YARN-7290
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-7290-failing-test.patch, YARN-7290.001.patch, 
> YARN-7290.002.patch
>
>
> In FSAppAttempt#canContainerBePreempted, we make sure that preempting the 
> given container would not put the app below its fair share:
> {code}
> // Check if the app's allocation will be over its fairshare even
> // after preempting this container
> Resource usageAfterPreemption = Resources.clone(getResourceUsage());
> // Subtract resources of containers already queued for preemption
> synchronized (preemptionVariablesLock) {
>   Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted);
> }
> // Subtract this container's allocation to compute usage after preemption
> Resources.subtractFrom(
> usageAfterPreemption, container.getAllocatedResource());
> return !isUsageBelowShare(usageAfterPreemption, getFairShare());
> {code}
> However, this only considers one container in isolation, and fails to 
> consider containers for the same app that we already added to 
> {{preemptableContainers}} in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a 
> case where we preempt multiple containers from the same app, none of which by 
> itself puts the app below fair share, but which cumulatively do so.
> I've attached a patch with a test to show this behavior. The flow is:
> 1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated 
> all the resources (8g and 8vcores)
> 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 
> containers, each of which is 3g and 3vcores in size. At this point both 
> greedyApp and starvingApp have a fair share of 4g (with DRF not in use).
> 3. For the first container requested by starvedApp, we (correctly) preempt 3 
> containers from greedyApp, each of which is 1g and 1vcore.
> 4. For the second container requested by starvedApp, we again (this time 
> incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below 
> its fair share, but happens anyway because all six times that we call 
> {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the 
> value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using 
> debugger).
> So in addition to accounting for {{resourcesToBePreempted}}, we also need to 
> account for containers that we're already planning on preempting in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7290) canContainerBePreempted can return true when it shouldn't

2017-10-04 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-7290:
--
Attachment: YARN-7290.002.patch

Adding a new patch to make checkstyle happy. The tests in 
TestOpportunisticContainerAllocatorAMService all pass for me locally despite 
the failure in the last Jenkins run.

> canContainerBePreempted can return true when it shouldn't
> -
>
> Key: YARN-7290
> URL: https://issues.apache.org/jira/browse/YARN-7290
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-7290.001.patch, YARN-7290.002.patch, 
> YARN-7290-failing-test.patch
>
>
> In FSAppAttempt#canContainerBePreempted, we make sure that preempting the 
> given container would not put the app below its fair share:
> {code}
> // Check if the app's allocation will be over its fairshare even
> // after preempting this container
> Resource usageAfterPreemption = Resources.clone(getResourceUsage());
> // Subtract resources of containers already queued for preemption
> synchronized (preemptionVariablesLock) {
>   Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted);
> }
> // Subtract this container's allocation to compute usage after preemption
> Resources.subtractFrom(
> usageAfterPreemption, container.getAllocatedResource());
> return !isUsageBelowShare(usageAfterPreemption, getFairShare());
> {code}
> However, this only considers one container in isolation, and fails to 
> consider containers for the same app that we already added to 
> {{preemptableContainers}} in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a 
> case where we preempt multiple containers from the same app, none of which by 
> itself puts the app below fair share, but which cumulatively do so.
> I've attached a patch with a test to show this behavior. The flow is:
> 1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated 
> all the resources (8g and 8vcores)
> 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 
> containers, each of which is 3g and 3vcores in size. At this point both 
> greedyApp and starvingApp have a fair share of 4g (with DRF not in use).
> 3. For the first container requested by starvedApp, we (correctly) preempt 3 
> containers from greedyApp, each of which is 1g and 1vcore.
> 4. For the second container requested by starvedApp, we again (this time 
> incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below 
> its fair share, but happens anyway because all six times that we call 
> {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the 
> value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using 
> debugger).
> So in addition to accounting for {{resourcesToBePreempted}}, we also need to 
> account for containers that we're already planning on preempting in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7290) canContainerBePreempted can return true when it shouldn't

2017-10-04 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-7290:
--
Attachment: YARN-7290.001.patch

Added a patch which I _think_ fixes both issues. All tests in 
{{TestFairSchedulerPreemption}} pass for me locally, including the new one, but 
the details here are tricky.

> canContainerBePreempted can return true when it shouldn't
> -
>
> Key: YARN-7290
> URL: https://issues.apache.org/jira/browse/YARN-7290
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-7290.001.patch, YARN-7290-failing-test.patch
>
>
> In FSAppAttempt#canContainerBePreempted, we make sure that preempting the 
> given container would not put the app below its fair share:
> {code}
> // Check if the app's allocation will be over its fairshare even
> // after preempting this container
> Resource usageAfterPreemption = Resources.clone(getResourceUsage());
> // Subtract resources of containers already queued for preemption
> synchronized (preemptionVariablesLock) {
>   Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted);
> }
> // Subtract this container's allocation to compute usage after preemption
> Resources.subtractFrom(
> usageAfterPreemption, container.getAllocatedResource());
> return !isUsageBelowShare(usageAfterPreemption, getFairShare());
> {code}
> However, this only considers one container in isolation, and fails to 
> consider containers for the same app that we already added to 
> {{preemptableContainers}} in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a 
> case where we preempt multiple containers from the same app, none of which by 
> itself puts the app below fair share, but which cumulatively do so.
> I've attached a patch with a test to show this behavior. The flow is:
> 1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated 
> all the resources (8g and 8vcores)
> 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 
> containers, each of which is 3g and 3vcores in size. At this point both 
> greedyApp and starvingApp have a fair share of 4g (with DRF not in use).
> 3. For the first container requested by starvedApp, we (correctly) preempt 3 
> containers from greedyApp, each of which is 1g and 1vcore.
> 4. For the second container requested by starvedApp, we again (this time 
> incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below 
> its fair share, but happens anyway because all six times that we call 
> {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the 
> value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using 
> debugger).
> So in addition to accounting for {{resourcesToBePreempted}}, we also need to 
> account for containers that we're already planning on preempting in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7290) canContainerBePreempted can return true when it shouldn't

2017-10-04 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192324#comment-16192324
 ] 

Steven Rand commented on YARN-7290:
---

An additional problem is that we call {{app.trackContainerForPreemption}} in 
{{preemptContainers}}, i.e. only after {{identifyContainersToPreempt}} has returned. 
Therefore, after we've finished the iteration for one of the 
{{rr.getNumContainers()}} containers, we will have added some containers to 
{{containersToPreempt}}, but {{resourcesToBePreempted}} will not yet have been 
updated for any app. This allows subsequent calls to 
{{canContainerBePreempted}} in the same for loop to incorrectly return {{true}}: 
we've already decided to preempt some containers, but the 
apps aren't aware of it yet.
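
To make the ordering concrete, here is a simplified, standalone sketch of one way the bookkeeping could be kept in sync during identification. The class and method names are hypothetical, and plain Java types stand in for the real {{Resource}}/{{FSAppAttempt}} APIs; this illustrates the idea rather than the actual patch.

{code}
import java.util.HashMap;
import java.util.Map;

// Standalone sketch (not the real FSAppAttempt/FSPreemptionThread code): keep a
// running total of resources we have already decided to preempt from each app
// during identification, so that later fair-share checks in the same pass can
// see them even though trackContainerForPreemption has not been called yet.
public class PreemptionPlanSketch {

  // Minimal stand-in for YARN's Resource (memory in MB plus vcores).
  static final class Res {
    final long memoryMb;
    final int vcores;
    Res(long memoryMb, int vcores) { this.memoryMb = memoryMb; this.vcores = vcores; }
    Res plus(Res o) { return new Res(memoryMb + o.memoryMb, vcores + o.vcores); }
    Res minus(Res o) { return new Res(memoryMb - o.memoryMb, vcores - o.vcores); }
    boolean atLeast(Res o) { return memoryMb >= o.memoryMb && vcores >= o.vcores; }
  }

  // Resources selected in the current identification pass, keyed by app id.
  private final Map<String, Res> plannedByApp = new HashMap<>();

  /** Record a container we intend to preempt, before preemptContainers() runs. */
  void planPreemption(String appId, Res containerAllocation) {
    plannedByApp.merge(appId, containerAllocation, Res::plus);
  }

  /**
   * Fair-share check that subtracts both the preemptions the app already knows
   * about (resourcesToBePreempted) and the ones planned in this pass.
   */
  boolean canPreemptWithoutStarving(String appId, Res appUsage,
      Res alreadyQueuedForPreemption, Res candidate, Res fairShare) {
    Res planned = plannedByApp.getOrDefault(appId, new Res(0, 0));
    Res usageAfter = appUsage.minus(alreadyQueuedForPreemption)
        .minus(planned)
        .minus(candidate);
    return usageAfter.atLeast(fairShare);
  }
}
{code}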

> canContainerBePreempted can return true when it shouldn't
> -
>
> Key: YARN-7290
> URL: https://issues.apache.org/jira/browse/YARN-7290
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-7290-failing-test.patch
>
>
> In FSAppAttempt#canContainerBePreempted, we make sure that preempting the 
> given container would not put the app below its fair share:
> {code}
> // Check if the app's allocation will be over its fairshare even
> // after preempting this container
> Resource usageAfterPreemption = Resources.clone(getResourceUsage());
> // Subtract resources of containers already queued for preemption
> synchronized (preemptionVariablesLock) {
>   Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted);
> }
> // Subtract this container's allocation to compute usage after preemption
> Resources.subtractFrom(
> usageAfterPreemption, container.getAllocatedResource());
> return !isUsageBelowShare(usageAfterPreemption, getFairShare());
> {code}
> However, this only considers one container in isolation, and fails to 
> consider containers for the same app that we already added to 
> {{preemptableContainers}} in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a 
> case where we preempt multiple containers from the same app, none of which by 
> itself puts the app below fair share, but which cumulatively do so.
> I've attached a patch with a test to show this behavior. The flow is:
> 1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated 
> all the resources (8g and 8vcores)
> 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 
> containers, each of which is 3g and 3vcores in size. At this point both 
> greedyApp and starvingApp have a fair share of 4g (with DRF not in use).
> 3. For the first container requested by starvedApp, we (correctly) preempt 3 
> containers from greedyApp, each of which is 1g and 1vcore.
> 4. For the second container requested by starvedApp, we again (this time 
> incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below 
> its fair share, but happens anyway because all six times that we call 
> {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the 
> value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using 
> debugger).
> So in addition to accounting for {{resourcesToBePreempted}}, we also need to 
> account for containers that we're already planning on preempting in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7290) canContainerBePreempted can return true when it shouldn't

2017-10-04 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand reassigned YARN-7290:
-

Assignee: Steven Rand

> canContainerBePreempted can return true when it shouldn't
> -
>
> Key: YARN-7290
> URL: https://issues.apache.org/jira/browse/YARN-7290
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-7290-failing-test.patch
>
>
> In FSAppAttempt#canContainerBePreempted, we make sure that preempting the 
> given container would not put the app below its fair share:
> {code}
> // Check if the app's allocation will be over its fairshare even
> // after preempting this container
> Resource usageAfterPreemption = Resources.clone(getResourceUsage());
> // Subtract resources of containers already queued for preemption
> synchronized (preemptionVariablesLock) {
>   Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted);
> }
> // Subtract this container's allocation to compute usage after preemption
> Resources.subtractFrom(
> usageAfterPreemption, container.getAllocatedResource());
> return !isUsageBelowShare(usageAfterPreemption, getFairShare());
> {code}
> However, this only considers one container in isolation, and fails to 
> consider containers for the same app that we already added to 
> {{preemptableContainers}} in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a 
> case where we preempt multiple containers from the same app, none of which by 
> itself puts the app below fair share, but which cumulatively do so.
> I've attached a patch with a test to show this behavior. The flow is:
> 1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated 
> all the resources (8g and 8vcores)
> 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 
> containers, each of which is 3g and 3vcores in size. At this point both 
> greedyApp and starvingApp have a fair share of 4g (with DRF not in use).
> 3. For the first container requested by starvedApp, we (correctly) preempt 3 
> containers from greedyApp, each of which is 1g and 1vcore.
> 4. For the second container requested by starvedApp, we again (this time 
> incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below 
> its fair share, but happens anyway because all six times that we call 
> {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the 
> value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using 
> debugger).
> So in addition to accounting for {{resourcesToBePreempted}}, we also need to 
> account for containers that we're already planning on preempting in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7290) canContainerBePreempted can return true when it shouldn't

2017-10-04 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-7290:
--
Attachment: YARN-7290-failing-test.patch

> canContainerBePreempted can return true when it shouldn't
> -
>
> Key: YARN-7290
> URL: https://issues.apache.org/jira/browse/YARN-7290
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Steven Rand
> Attachments: YARN-7290-failing-test.patch
>
>
> In FSAppAttempt#canContainerBePreempted, we make sure that preempting the 
> given container would not put the app below its fair share:
> {code}
> // Check if the app's allocation will be over its fairshare even
> // after preempting this container
> Resource usageAfterPreemption = Resources.clone(getResourceUsage());
> // Subtract resources of containers already queued for preemption
> synchronized (preemptionVariablesLock) {
>   Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted);
> }
> // Subtract this container's allocation to compute usage after preemption
> Resources.subtractFrom(
> usageAfterPreemption, container.getAllocatedResource());
> return !isUsageBelowShare(usageAfterPreemption, getFairShare());
> {code}
> However, this only considers one container in isolation, and fails to 
> consider containers for the same app that we already added to 
> {{preemptableContainers}} in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a 
> case where we preempt multiple containers from the same app, none of which by 
> itself puts the app below fair share, but which cumulatively do so.
> I've attached a patch with a test to show this behavior. The flow is:
> 1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated 
> all the resources (8g and 8vcores)
> 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 
> containers, each of which is 3g and 3vcores in size. At this point both 
> greedyApp and starvingApp have a fair share of 4g (with DRF not in use).
> 3. For the first container requested by starvedApp, we (correctly) preempt 3 
> containers from greedyApp, each of which is 1g and 1vcore.
> 4. For the second container requested by starvedApp, we again (this time 
> incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below 
> its fair share, but happens anyway because all six times that we call 
> {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the 
> value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using 
> debugger).
> So in addition to accounting for {{resourcesToBePreempted}}, we also need to 
> account for containers that we're already planning on preempting in 
> FSPreemptionThread#identifyContainersToPreemptOnNode. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7290) canContainerBePreempted can return true when it shouldn't

2017-10-04 Thread Steven Rand (JIRA)
Steven Rand created YARN-7290:
-

 Summary: canContainerBePreempted can return true when it shouldn't
 Key: YARN-7290
 URL: https://issues.apache.org/jira/browse/YARN-7290
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.0.0-beta1
Reporter: Steven Rand


In FSAppAttempt#canContainerBePreempted, we make sure that preempting the given 
container would not put the app below its fair share:

{code}
// Check if the app's allocation will be over its fairshare even
// after preempting this container
Resource usageAfterPreemption = Resources.clone(getResourceUsage());

// Subtract resources of containers already queued for preemption
synchronized (preemptionVariablesLock) {
  Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted);
}

// Subtract this container's allocation to compute usage after preemption
Resources.subtractFrom(
usageAfterPreemption, container.getAllocatedResource());
return !isUsageBelowShare(usageAfterPreemption, getFairShare());
{code}

However, this only considers one container in isolation, and fails to consider 
containers for the same app that we already added to {{preemptableContainers}} 
in FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have 
a case where we preempt multiple containers from the same app, none of which by 
itself puts the app below fair share, but which cumulatively do so.

I've attached a patch with a test to show this behavior. The flow is:

1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated 
all the resources (8g and 8vcores)
2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 
containers, each of which is 3g and 3vcores in size. At this point both 
greedyApp and starvingApp have a fair share of 4g (with DRF not in use).
3. For the first container requested by starvedApp, we (correctly) preempt 3 
containers from greedyApp, each of which is 1g and 1vcore.
4. For the second container requested by starvedApp, we again (this time 
incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below its 
fair share, but happens anyway because all six times that we call {{return 
!isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the value of 
{{usageAfterPreemption}} is 7g and 7vcores (confirmed using debugger).

So in addition to accounting for {{resourcesToBePreempted}}, we also need to 
account for containers that we're already planning on preempting in 
FSPreemptionThread#identifyContainersToPreemptOnNode. 
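
As a rough illustration of the cumulative accounting described above, here is a simplified, standalone sketch of per-node selection that keeps a running usage-after-preemption for the app. The names are hypothetical and plain Java types stand in for the real {{Resource}} API; this is a sketch of the idea, not the patch itself.

{code}
import java.util.ArrayList;
import java.util.List;

// Standalone sketch: when identifying containers to preempt on a node, account
// for the containers already selected from the same app in this pass, not just
// the current candidate in isolation.
public class NodePreemptionSketch {

  // Minimal stand-in for a container allocation (memory in MB plus vcores).
  static final class Alloc {
    final long memoryMb;
    final int vcores;
    Alloc(long memoryMb, int vcores) { this.memoryMb = memoryMb; this.vcores = vcores; }
  }

  /**
   * Select containers from one app on one node. Start from the app's usage
   * minus anything already queued for preemption, then subtract each selected
   * container so the fair-share check is cumulative.
   */
  static List<Alloc> selectPreemptable(List<Alloc> candidates, Alloc usage,
      Alloc alreadyQueued, Alloc fairShare) {
    List<Alloc> selected = new ArrayList<>();
    long memAfter = usage.memoryMb - alreadyQueued.memoryMb;
    int coresAfter = usage.vcores - alreadyQueued.vcores;
    for (Alloc c : candidates) {
      long mem = memAfter - c.memoryMb;
      int cores = coresAfter - c.vcores;
      // Stop once taking this container would drop the app below its fair share.
      if (mem < fairShare.memoryMb || cores < fairShare.vcores) {
        break;
      }
      selected.add(c);
      memAfter = mem;
      coresAfter = cores;
    }
    return selected;
  }
}
{code}

With the numbers from the flow above (usage 8g/8vcores, fair share 4g/4vcores, 1g/1vcore candidates, and 3g/3vcores already queued from the first request), this sketch would select only one further container for the second request instead of three.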



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5742) Serve aggregated logs of historical apps from timeline service

2017-09-12 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162609#comment-16162609
 ] 

Steven Rand commented on YARN-5742:
---

Would it also be reasonable for the Timeline Service to enforce retention on 
aggregated logs? As YARN-2985 points out, there's currently no retention unless 
the MR JHS is deployed. I was going to try to write a patch that moves 
retention into the Application History Server, but wasn't sure whether it 
belongs there or in the Timeline Service.

> Serve aggregated logs of historical apps from timeline service
> --
>
> Key: YARN-5742
> URL: https://issues.apache.org/jira/browse/YARN-5742
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Varun Saxena
>Assignee: Rohith Sharma K S
> Attachments: YARN-5742-POC-v0.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6956) preemption may only consider resource requests for one node

2017-09-05 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154671#comment-16154671
 ] 

Steven Rand commented on YARN-6956:
---

Friendly ping [~kasha] and/or [~templedf]. I'll fix the checkstyle issues in 
the next patch, but wanted to gather other feedback as well.

> preemption may only consider resource requests for one node
> ---
>
> Key: YARN-6956
> URL: https://issues.apache.org/jira/browse/YARN-6956
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.0, 3.0.0-beta1
> Environment: CDH 5.11.0
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-6956.001.patch
>
>
> I'm observing the following series of events on a CDH 5.11.0 cluster, which 
> seem to be possible after YARN-6163:
> 1. An application is considered to be starved, so {{FSPreemptionThread}} 
> calls {{identifyContainersToPreempt}}, and that calls 
> {{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
> {{ResourceRequest}} instances that are enough to address the app's starvation.
> 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
> enough to address the app's starvation, so we break out of the loop over 
> {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
>  We return only this one {{ResourceRequest}} back to the 
> {{identifyContainersToPreempt}} method.
> 3. It turns out that this particular {{ResourceRequest}} happens to have a 
> value for {{getResourceName}} that identifies a specific node in the cluster. 
> This causes preemption to only consider containers on that node, and not the 
> rest of the cluster.
> [~kasha], does that make sense? I'm happy to submit a patch if I'm 
> understanding the problem correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node

2017-09-05 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16153423#comment-16153423
 ] 

Steven Rand commented on YARN-4227:
---

[~wilfreds], I can rebase the patch if you like. It seems to be working quite 
nicely by the way -- we applied it to a cluster which was periodically 
exhibiting this problem and haven't seen it since.

> FairScheduler: RM quits processing expired container from a removed node
> 
>
> Key: YARN-4227
> URL: https://issues.apache.org/jira/browse/YARN-4227
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.3.0, 2.5.0, 2.7.1
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, 
> YARN-4227.patch
>
>
> Under some circumstances the node is removed before an expired container 
> event is processed causing the RM to exit:
> {code}
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
> Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1436927988321_1307950_01_12 Container Transitioned from 
> ACQUIRED to EXPIRED
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
> Completed container: container_1436927988321_1307950_01_12 in state: 
> EXPIRED event:EXPIRE
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op   
>OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS  
> APPID=application_1436927988321_1307950 
> CONTAINERID=container_1436927988321_1307950_01_12
> 2015-10-04 21:14:01,063 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type CONTAINER_EXPIRED to the scheduler
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 
> and 2.6.0 by different customers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares

2017-08-22 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137446#comment-16137446
 ] 

Steven Rand commented on YARN-6960:
---

Thanks, Daniel. Having thought about this some more, I don't think that either 
of the two patches I've posted is a good solution. In the first patch, inactive 
queues have fair shares of zero, and AM containers are subject to preemption 
even when running in high-priority queues. And in the second patch, 
applications running in idle queues define what their fair shares are 
irrespective of cluster-side settings, which doesn't make sense.

I'll think about this some more and try to come up with a better idea, but I'd 
also be quite interested in hearing your opinion and those of others. 

> definition of active queue allows idle long-running apps to distort fair 
> shares
> ---
>
> Key: YARN-6960
> URL: https://issues.apache.org/jira/browse/YARN-6960
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.1, 3.0.0-alpha4
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-6960.001.patch, YARN-6960.002.patch
>
>
> YARN-2026 introduced the notion of only considering active queues when 
> computing the fair share of each queue. The definition of an active queue is 
> a queue with at least one runnable app:
> {code}
>   public boolean isActive() {
> return getNumRunnableApps() > 0;
>   }
> {code}
> One case that this definition of activity doesn't account for is that of 
> long-running applications that scale dynamically. Such an application might 
> request many containers when jobs are running, but scale down to very few 
> containers, or only the AM container, when no jobs are running.
> Even when such an application has scaled down to a negligible amount of 
> demand and utilization, the queue that it's in is still considered to be 
> active, which defeats the purpose of YARN-2026. For example, consider this 
> scenario:
> 1. We have queues {{root.a}}, {{root.b}}, {{root.c}}, and {{root.d}}, all of 
> which have the same weight.
> 2. Queues {{root.a}} and {{root.b}} contain long-running applications that 
> currently have only one container each (the AM).
> 3. An application in queue {{root.c}} starts, and uses the whole cluster 
> except for the small amount in use by {{root.a}} and {{root.b}}. An 
> application in {{root.d}} starts, and has a high enough demand to be able to 
> use half of the cluster. Because all four queues are active, the app in 
> {{root.d}} can only preempt the app in {{root.c}} up to roughly 25% of the 
> cluster's resources, while the app in {{root.c}} keeps about 75%.
> Ideally in this example, the app in {{root.d}} would be able to preempt the 
> app in {{root.c}} up to 50% of the cluster, which would be possible if the 
> idle apps in {{root.a}} and {{root.b}} didn't cause those queues to be 
> considered active.
> One way to address this is to update the definition of an active queue to be 
> a queue containing 1 or more non-AM containers. This way if all apps in a 
> queue scale down to only the AM, other queues' fair shares aren't affected.
> The benefit of this approach is that it's quite simple. The downside is that 
> it doesn't account for apps that are idle and using almost no resources, but 
> still have at least one non-AM container.
> There are a couple of other options that seem plausible to me, but they're 
> much more complicated, and it seems to me that this proposal makes good 
> progress while adding minimal extra complexity.
> Does this seem like a reasonable change? I'm certainly open to better ideas 
> as well.
> Thanks,
> Steve



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node

2017-08-21 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16135938#comment-16135938
 ] 

Steven Rand commented on YARN-4227:
---

I'm seeing a similar issue on what's roughly branch-2 (CDH 5.11.0), with the 
error being:

{code}
2017-06-27 16:32:39,381 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Preemption 
Timer,5,main] threw an Exception.
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:687)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread$PreemptContainersTask.run(FSPreemptionThread.java:230)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
{code}

This error, which causes the FSPreemptionThread to die and thereby crashes the 
RM, seems to be correlated with NodeManagers being marked unhealthy due to lack 
of local disk space during large shuffles. I haven't confirmed this, but presumably 
the unhealthy nodes are removed while we're waiting for the lock, and no longer 
exist when we call {{releaseContainer}}.

I'm curious as to whether others are seeing this as well on recent versions, in 
which case maybe this is worth reopening?
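
For illustration only, here is a standalone sketch of the kind of defensive guard that would avoid the NPE when a node disappears between selecting a container and processing its completion. The names are hypothetical; this is not the actual {{FairScheduler}} code.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Standalone sketch: drop completion work for containers whose node has been
// removed (e.g. marked unhealthy) between selection and event processing,
// instead of dereferencing a null node.
public class NodeAwareCompletionSketch {

  // Hypothetical registry of live nodes, keyed by node id.
  private final Map<String, Object> liveNodes = new ConcurrentHashMap<>();

  void addNode(String nodeId) { liveNodes.put(nodeId, new Object()); }

  void removeNode(String nodeId) { liveNodes.remove(nodeId); }

  /** Returns true if the container was released, false if its node is gone. */
  boolean completeContainer(String nodeId, String containerId) {
    Object node = liveNodes.get(nodeId);
    if (node == null) {
      // The node was removed while we waited; log and skip rather than throw.
      System.err.println("Skipping completion of " + containerId
          + ": node " + nodeId + " no longer exists");
      return false;
    }
    // ... release the container's resources against the node here ...
    return true;
  }
}
{code}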

> FairScheduler: RM quits processing expired container from a removed node
> 
>
> Key: YARN-4227
> URL: https://issues.apache.org/jira/browse/YARN-4227
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.3.0, 2.5.0, 2.7.1
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, 
> YARN-4227.patch
>
>
> Under some circumstances the node is removed before an expired container 
> event is processed causing the RM to exit:
> {code}
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
> Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1436927988321_1307950_01_12 Container Transitioned from 
> ACQUIRED to EXPIRED
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
> Completed container: container_1436927988321_1307950_01_12 in state: 
> EXPIRED event:EXPIRE
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op   
>OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS  
> APPID=application_1436927988321_1307950 
> CONTAINERID=container_1436927988321_1307950_01_12
> 2015-10-04 21:14:01,063 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type CONTAINER_EXPIRED to the scheduler
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 
> and 2.6.0 by different customers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares

2017-08-20 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-6960:
--
Attachment: YARN-6960.002.patch

Attaching a slightly modified patch that sets the fair share of an inactive 
queue equal to its current utilization. This doesn't change the behavior for 
queues with no running applications, since their fair share is zero both before 
and after the patch. It does, however, protect the AM containers of queues that 
are inactive under the new definition from being preempted, since those queues 
are no longer over their fair shares.
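
As a rough, standalone sketch of the behavior described above (hypothetical names; weights and the real fair-share computation are ignored for brevity), inactive queues are pinned to their current utilization, so an empty queue still ends up at zero while a queue holding only an idle AM is no longer over its share.

{code}
// Standalone sketch of assigning fair shares with inactive queues pinned to
// their current utilization, as described in the comment above.
public class InactiveQueueFairShareSketch {

  // Minimal stand-in for a leaf queue's scheduling state (memory only).
  static final class Queue {
    final boolean active;   // active under the proposed definition
    final long usageMb;     // current utilization
    long fairShareMb;       // computed below
    Queue(boolean active, long usageMb) { this.active = active; this.usageMb = usageMb; }
  }

  /** Split clusterMb evenly among active queues; pin inactive queues to usage. */
  static void computeFairShares(Queue[] queues, long clusterMb) {
    long activeCount = 0;
    for (Queue q : queues) {
      if (q.active) {
        activeCount++;
      }
    }
    for (Queue q : queues) {
      if (q.active) {
        q.fairShareMb = clusterMb / Math.max(activeCount, 1);
      } else {
        // Inactive queue: fair share equals current usage, so an idle AM is not
        // "over" its share (and thus not preemptable), while an empty queue
        // still gets a fair share of zero.
        q.fairShareMb = q.usageMb;
      }
    }
  }
}
{code}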

> definition of active queue allows idle long-running apps to distort fair 
> shares
> ---
>
> Key: YARN-6960
> URL: https://issues.apache.org/jira/browse/YARN-6960
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.1, 3.0.0-alpha4
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-6960.001.patch, YARN-6960.002.patch
>
>
> YARN-2026 introduced the notion of only considering active queues when 
> computing the fair share of each queue. The definition of an active queue is 
> a queue with at least one runnable app:
> {code}
>   public boolean isActive() {
> return getNumRunnableApps() > 0;
>   }
> {code}
> One case that this definition of activity doesn't account for is that of 
> long-running applications that scale dynamically. Such an application might 
> request many containers when jobs are running, but scale down to very few 
> containers, or only the AM container, when no jobs are running.
> Even when such an application has scaled down to a negligible amount of 
> demand and utilization, the queue that it's in is still considered to be 
> active, which defeats the purpose of YARN-2026. For example, consider this 
> scenario:
> 1. We have queues {{root.a}}, {{root.b}}, {{root.c}}, and {{root.d}}, all of 
> which have the same weight.
> 2. Queues {{root.a}} and {{root.b}} contain long-running applications that 
> currently have only one container each (the AM).
> 3. An application in queue {{root.c}} starts, and uses the whole cluster 
> except for the small amount in use by {{root.a}} and {{root.b}}. An 
> application in {{root.d}} starts, and has a high enough demand to be able to 
> use half of the cluster. Because all four queues are active, the app in 
> {{root.d}} can only preempt the app in {{root.c}} up to roughly 25% of the 
> cluster's resources, while the app in {{root.c}} keeps about 75%.
> Ideally in this example, the app in {{root.d}} would be able to preempt the 
> app in {{root.c}} up to 50% of the cluster, which would be possible if the 
> idle apps in {{root.a}} and {{root.b}} didn't cause those queues to be 
> considered active.
> One way to address this is to update the definition of an active queue to be 
> a queue containing 1 or more non-AM containers. This way if all apps in a 
> queue scale down to only the AM, other queues' fair shares aren't affected.
> The benefit of this approach is that it's quite simple. The downside is that 
> it doesn't account for apps that are idle and using almost no resources, but 
> still have at least one non-AM container.
> There are a couple of other options that seem plausible to me, but they're 
> much more complicated, and it seems to me that this proposal makes good 
> progress while adding minimal extra complexity.
> Does this seem like a reasonable change? I'm certainly open to better ideas 
> as well.
> Thanks,
> Steve



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares

2017-08-20 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16134416#comment-16134416
 ] 

Steven Rand commented on YARN-6960:
---

[~dan...@cloudera.com], I've uploaded a patch proposing a new definition of 
queue activity. It also needs tests, but I wanted to first see how the 
community feels about this change, and revise it as necessary based on feedback 
before writing tests for it.

My understanding of a queue's demand is that it's the cumulative current usage 
of all apps in the queue plus the cumulative requested additional resources for 
all apps in the queue. Therefore if no apps are requesting additional 
resources, the demand will be equal to the usage of the AMs. Then, as soon as 
any app attempts to do anything, its demand will be greater than the AM usage, 
and the queue will become active.

I've tested this patch and it seems to have the desired effect. Going back to 
the example in the description, {{root.c}} and {{root.d}} have equal fair 
shares despite the idle applications in {{root.a}} and {{root.b}}.
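
Here is a standalone sketch of the demand bookkeeping as I've described it above (hypothetical names, memory only for brevity); the activity test at the end is just one way to express "demand greater than AM-only usage" and is an illustration, not the actual patch.

{code}
import java.util.List;

// Standalone sketch: a queue's demand is the sum over its apps of current usage
// plus outstanding requests, so apps holding only their AMs and requesting
// nothing leave the demand equal to the AM usage.
public class QueueDemandSketch {

  // Minimal stand-in for an app's scheduling state (memory only).
  static final class App {
    final long usedMb;       // current usage, including the AM container
    final long amUsedMb;     // the AM container alone
    final long requestedMb;  // outstanding, not yet allocated
    App(long usedMb, long amUsedMb, long requestedMb) {
      this.usedMb = usedMb;
      this.amUsedMb = amUsedMb;
      this.requestedMb = requestedMb;
    }
  }

  static long demandMb(List<App> apps) {
    long demand = 0;
    for (App a : apps) {
      demand += a.usedMb + a.requestedMb;
    }
    return demand;
  }

  /** The queue is considered active once any app uses or asks for more than its AM. */
  static boolean isActive(List<App> apps) {
    long amOnlyMb = 0;
    for (App a : apps) {
      amOnlyMb += a.amUsedMb;
    }
    return demandMb(apps) > amOnlyMb;
  }
}
{code}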

> definition of active queue allows idle long-running apps to distort fair 
> shares
> ---
>
> Key: YARN-6960
> URL: https://issues.apache.org/jira/browse/YARN-6960
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.1, 3.0.0-alpha4
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-6960.001.patch
>
>
> YARN-2026 introduced the notion of only considering active queues when 
> computing the fair share of each queue. The definition of an active queue is 
> a queue with at least one runnable app:
> {code}
>   public boolean isActive() {
> return getNumRunnableApps() > 0;
>   }
> {code}
> One case that this definition of activity doesn't account for is that of 
> long-running applications that scale dynamically. Such an application might 
> request many containers when jobs are running, but scale down to very few 
> containers, or only the AM container, when no jobs are running.
> Even when such an application has scaled down to a negligible amount of 
> demand and utilization, the queue that it's in is still considered to be 
> active, which defeats the purpose of YARN-2026. For example, consider this 
> scenario:
> 1. We have queues {{root.a}}, {{root.b}}, {{root.c}}, and {{root.d}}, all of 
> which have the same weight.
> 2. Queues {{root.a}} and {{root.b}} contain long-running applications that 
> currently have only one container each (the AM).
> 3. An application in queue {{root.c}} starts, and uses the whole cluster 
> except for the small amount in use by {{root.a}} and {{root.b}}. An 
> application in {{root.d}} starts, and has a high enough demand to be able to 
> use half of the cluster. Because all four queues are active, the app in 
> {{root.d}} can only preempt the app in {{root.c}} up to roughly 25% of the 
> cluster's resources, while the app in {{root.c}} keeps about 75%.
> Ideally in this example, the app in {{root.d}} would be able to preempt the 
> app in {{root.c}} up to 50% of the cluster, which would be possible if the 
> idle apps in {{root.a}} and {{root.b}} didn't cause those queues to be 
> considered active.
> One way to address this is to update the definition of an active queue to be 
> a queue containing 1 or more non-AM containers. This way if all apps in a 
> queue scale down to only the AM, other queues' fair shares aren't affected.
> The benefit of this approach is that it's quite simple. The downside is that 
> it doesn't account for apps that are idle and using almost no resources, but 
> still have at least one non-AM container.
> There are a couple of other options that seem plausible to me, but they're 
> much more complicated, and it seems to me that this proposal makes good 
> progress while adding minimal extra complexity.
> Does this seem like a reasonable change? I'm certainly open to better ideas 
> as well.
> Thanks,
> Steve



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares

2017-08-20 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-6960:
--
Attachment: YARN-6960.001.patch

> definition of active queue allows idle long-running apps to distort fair 
> shares
> ---
>
> Key: YARN-6960
> URL: https://issues.apache.org/jira/browse/YARN-6960
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.1, 3.0.0-alpha4
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-6960.001.patch
>
>
> YARN-2026 introduced the notion of only considering active queues when 
> computing the fair share of each queue. The definition of an active queue is 
> a queue with at least one runnable app:
> {code}
>   public boolean isActive() {
> return getNumRunnableApps() > 0;
>   }
> {code}
> One case that this definition of activity doesn't account for is that of 
> long-running applications that scale dynamically. Such an application might 
> request many containers when jobs are running, but scale down to very few 
> containers, or only the AM container, when no jobs are running.
> Even when such an application has scaled down to a negligible amount of 
> demand and utilization, the queue that it's in is still considered to be 
> active, which defeats the purpose of YARN-2026. For example, consider this 
> scenario:
> 1. We have queues {{root.a}}, {{root.b}}, {{root.c}}, and {{root.d}}, all of 
> which have the same weight.
> 2. Queues {{root.a}} and {{root.b}} contain long-running applications that 
> currently have only one container each (the AM).
> 3. An application in queue {{root.c}} starts, and uses the whole cluster 
> except for the small amount in use by {{root.a}} and {{root.b}}. An 
> application in {{root.d}} starts, and has a high enough demand to be able to 
> use half of the cluster. Because all four queues are active, the app in 
> {{root.d}} can only preempt the app in {{root.c}} up to roughly 25% of the 
> cluster's resources, while the app in {{root.c}} keeps about 75%.
> Ideally in this example, the app in {{root.d}} would be able to preempt the 
> app in {{root.c}} up to 50% of the cluster, which would be possible if the 
> idle apps in {{root.a}} and {{root.b}} didn't cause those queues to be 
> considered active.
> One way to address this is to update the definition of an active queue to be 
> a queue containing 1 or more non-AM containers. This way if all apps in a 
> queue scale down to only the AM, other queues' fair shares aren't affected.
> The benefit of this approach is that it's quite simple. The downside is that 
> it doesn't account for apps that are idle and using almost no resources, but 
> still have at least one non-AM container.
> There are a couple of other options that seem plausible to me, but they're 
> much more complicated, and it seems to me that this proposal makes good 
> progress while adding minimal extra complexity.
> Does this seem like a reasonable change? I'm certainly open to better ideas 
> as well.
> Thanks,
> Steve



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6956) preemption may only consider resource requests for one node

2017-08-13 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125041#comment-16125041
 ] 

Steven Rand edited comment on YARN-6956 at 8/13/17 8:30 PM:


Thanks for the clarifications. All three of those suggestions make sense to me.

I've attached a patch for considering a configurable number of RRs. It seems 
simplest to me to create separate JIRAs for prioritizing the RR(s) to check and 
honoring delay scheduling in preemption -- does that seem reasonable?

EDIT: A couple of questions I had about the patch:

* I don't have a good sense for how to pick the default number of RRs to look 
at, and the choice of 10 for {{MIN_RESOURCE_REQUESTS_FOR_PREEMPTION_DEFAULT}} 
was fairly arbitrary. Happy to change that to something more reasonable if 
someone else has better intuition there.
* If adding a new configuration point as in the patch makes sense, where should 
I add docs for it? My guess is {{yarn-default.xml}}, but I wasn't completely 
sure.
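
For illustration, here is a standalone sketch of my reading of the configurable-RR approach in the attached patch (hypothetical names, memory only): instead of returning after the first {{ResourceRequest}} that covers the starvation, keep collecting until a configured minimum number of requests has been examined, so preemption isn't pinned to a single, possibly node-specific, request.

{code}
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of collecting a configurable minimum number of resource
// requests when identifying what a starved app needs, rather than stopping at
// the first request that covers the starvation.
public class StarvedRequestsSketch {

  // Stand-in for the default discussed above; the value 10 is arbitrary.
  static final int MIN_RESOURCE_REQUESTS_FOR_PREEMPTION_DEFAULT = 10;

  // Minimal stand-in for a ResourceRequest (memory only).
  static final class Request {
    final String resourceName;  // a node, a rack, or "*"
    final long memoryMb;
    Request(String resourceName, long memoryMb) {
      this.resourceName = resourceName;
      this.memoryMb = memoryMb;
    }
  }

  static List<Request> getStarvedResourceRequests(List<Request> all,
      long starvationMb, int minRequests) {
    List<Request> result = new ArrayList<>();
    long coveredMb = 0;
    for (Request rr : all) {
      result.add(rr);
      coveredMb += rr.memoryMb;
      // Keep going until we have both covered the starvation and looked at the
      // configured minimum number of requests (or run out of requests).
      if (coveredMb >= starvationMb && result.size() >= minRequests) {
        break;
      }
    }
    return result;
  }
}
{code}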


was (Author: steven rand):
Thanks for the clarifications. All three of those suggestions make sense to me.

I've attached a patch for considering a configurable number of RRs. It seems 
simplest to me to create separate JIRAs for prioritizing the RR(s) to check and 
honoring delay scheduling in preemption -- does that seem reasonable?

> preemption may only consider resource requests for one node
> ---
>
> Key: YARN-6956
> URL: https://issues.apache.org/jira/browse/YARN-6956
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.0, 3.0.0-beta1
> Environment: CDH 5.11.0
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-6956.001.patch
>
>
> I'm observing the following series of events on a CDH 5.11.0 cluster, which 
> seem to be possible after YARN-6163:
> 1. An application is considered to be starved, so {{FSPreemptionThread}} 
> calls {{identifyContainersToPreempt}}, and that calls 
> {{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
> {{ResourceRequest}} instances that are enough to address the app's starvation.
> 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
> enough to address the app's starvation, so we break out of the loop over 
> {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
>  We return only this one {{ResourceRequest}} back to the 
> {{identifyContainersToPreempt}} method.
> 3. It turns out that this particular {{ResourceRequest}} happens to have a 
> value for {{getResourceName}} that identifies a specific node in the cluster. 
> This causes preemption to only consider containers on that node, and not the 
> rest of the cluster.
> [~kasha], does that make sense? I'm happy to submit a patch if I'm 
> understanding the problem correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6956) preemption may only consider resource requests for one node

2017-08-13 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125041#comment-16125041
 ] 

Steven Rand commented on YARN-6956:
---

Thanks for the clarifications. All three of those suggestions make sense to me.

I've attached a patch for considering a configurable number of RRs. It seems 
simplest to me to create separate JIRAs for prioritizing the RR(s) to check and 
honoring delay scheduling in preemption -- does that seem reasonable?

> preemption may only consider resource requests for one node
> ---
>
> Key: YARN-6956
> URL: https://issues.apache.org/jira/browse/YARN-6956
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.0, 3.0.0-beta1
> Environment: CDH 5.11.0
>Reporter: Steven Rand
> Attachments: YARN-6956.001.patch
>
>
> I'm observing the following series of events on a CDH 5.11.0 cluster, which 
> seem to be possible after YARN-6163:
> 1. An application is considered to be starved, so {{FSPreemptionThread}} 
> calls {{identifyContainersToPreempt}}, and that calls 
> {{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
> {{ResourceRequest}} instances that are enough to address the app's starvation.
> 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
> enough to address the app's starvation, so we break out of the loop over 
> {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
>  We return only this one {{ResourceRequest}} back to the 
> {{identifyContainersToPreempt}} method.
> 3. It turns out that this particular {{ResourceRequest}} happens to have a 
> value for {{getResourceName}} that identifies a specific node in the cluster. 
> This causes preemption to only consider containers on that node, and not the 
> rest of the cluster.
> [~kasha], does that make sense? I'm happy to submit a patch if I'm 
> understanding the problem correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6956) preemption may only consider resource requests for one node

2017-08-13 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand reassigned YARN-6956:
-

Assignee: Steven Rand

> preemption may only consider resource requests for one node
> ---
>
> Key: YARN-6956
> URL: https://issues.apache.org/jira/browse/YARN-6956
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.0, 3.0.0-beta1
> Environment: CDH 5.11.0
>Reporter: Steven Rand
>Assignee: Steven Rand
> Attachments: YARN-6956.001.patch
>
>
> I'm observing the following series of events on a CDH 5.11.0 cluster, which 
> seem to be possible after YARN-6163:
> 1. An application is considered to be starved, so {{FSPreemptionThread}} 
> calls {{identifyContainersToPreempt}}, and that calls 
> {{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
> {{ResourceRequest}} instances that are enough to address the app's starvation.
> 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
> enough to address the app's starvation, so we break out of the loop over 
> {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
>  We return only this one {{ResourceRequest}} back to the 
> {{identifyContainersToPreempt}} method.
> 3. It turns out that this particular {{ResourceRequest}} happens to have a 
> value for {{getResourceName}} that identifies a specific node in the cluster. 
> This causes preemption to only consider containers on that node, and not the 
> rest of the cluster.
> [~kasha], does that make sense? I'm happy to submit a patch if I'm 
> understanding the problem correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6956) preemption may only consider resource requests for one node

2017-08-13 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-6956:
--
Attachment: YARN-6956.001.patch

> preemption may only consider resource requests for one node
> ---
>
> Key: YARN-6956
> URL: https://issues.apache.org/jira/browse/YARN-6956
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.0, 3.0.0-beta1
> Environment: CDH 5.11.0
>Reporter: Steven Rand
> Attachments: YARN-6956.001.patch
>
>
> I'm observing the following series of events on a CDH 5.11.0 cluster, which 
> seem to be possible after YARN-6163:
> 1. An application is considered to be starved, so {{FSPreemptionThread}} 
> calls {{identifyContainersToPreempt}}, and that calls 
> {{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
> {{ResourceRequest}} instances that are enough to address the app's starvation.
> 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
> enough to address the app's starvation, so we break out of the loop over 
> {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
>  We return only this one {{ResourceRequest}} back to the 
> {{identifyContainersToPreempt}} method.
> 3. It turns out that this particular {{ResourceRequest}} happens to have a 
> value for {{getResourceName}} that identifies a specific node in the cluster. 
> This causes preemption to only consider containers on that node, and not the 
> rest of the cluster.
> [~kasha], does that make sense? I'm happy to submit a patch if I'm 
> understanding the problem correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6956) preemption may only consider resource requests for one node

2017-08-08 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119135#comment-16119135
 ] 

Steven Rand commented on YARN-6956:
---

Hi [~kasha], thanks for the suggestions. I would definitely like to contribute. 
A couple questions to make sure I understand:

* For prioritizing the RR to check, does that mean sorting the RRs for an app 
by the value of {{getPriority()}}, and checking the highest priority one first? 
And if there are multiple RRs with the same priority, the suggestion is to 
choose the one that's requesting the least number of resources? If so, how do 
we avoid preempting small amounts at a time, and taking a long time to satisfy 
starvation? Or is it the responsibility of the app to not prioritize many small 
RRs?
* Considering more than one RR definitely seems like a good idea. Is it 
reasonable to make sure to include at least one RR for which locality is 
relaxed, and/or the RR is for a rack or {{*}} in the list of RRs that we check, 
even if that means checking a lower-priority RR? (Assuming of course that there 
is at least one such RR.)
* Honoring delay scheduling for preemption makes sense -- I don't have any 
questions about that one.

> preemption may only consider resource requests for one node
> ---
>
> Key: YARN-6956
> URL: https://issues.apache.org/jira/browse/YARN-6956
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.0, 3.0.0-beta1
> Environment: CDH 5.11.0
>Reporter: Steven Rand
>
> I'm observing the following series of events on a CDH 5.11.0 cluster, which 
> seem to be possible after YARN-6163:
> 1. An application is considered to be starved, so {{FSPreemptionThread}} 
> calls {{identifyContainersToPreempt}}, and that calls 
> {{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
> {{ResourceRequest}} instances that are enough to address the app's starvation.
> 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
> enough to address the app's starvation, so we break out of the loop over 
> {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
>  We return only this one {{ResourceRequest}} back to the 
> {{identifyContainersToPreempt}} method.
> 3. It turns out that this particular {{ResourceRequest}} happens to have a 
> value for {{getResourceName}} that identifies a specific node in the cluster. 
> This causes preemption to only consider containers on that node, and not the 
> rest of the cluster.
> [~kasha], does that make sense? I'm happy to submit a patch if I'm 
> understanding the problem correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares

2017-08-08 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118146#comment-16118146
 ] 

Steven Rand commented on YARN-6960:
---

Yep, that concern is definitely valid. I wrote a patch that implements this 
definition of activity, and ran into exactly the problem you're describing 
while testing it. A new proposal then would be that a leaf queue is active if 
either of these conditions is met:

* There is at least one non-AM container running in the queue
* The cumulative demand of applications in the queue is greater than zero

That way, in the example you give above, the fair share of {{root.a}} becomes 
1/3 as soon as it attempts to run another job.

Backing up a step to the use case: we have interactive Spark applications that 
are expected to return results to the user within seconds, or at worst a few 
minutes (assuming the query is reasonable). We 
don't want to have to create a new {{SparkContext}} and upload + localize JARs 
for each query, since that would inflate query execution time, so one of these 
applications will keep the same {{SparkContext}} around indefinitely, and will 
thus be a long-running YARN application. When one of these apps isn't running 
any queries/jobs, it'll scale down its executor count to make room for other 
YARN applications. So sometimes we wind up with multiple YARN applications with 
minimal resource usage and no demand, and we've observed that this causes 
unequal distribution of resources between other running applications, even 
though they're in equally weighted queues. The example in the description is 
kind of silly/simplistic, but it's essentially what we see happen.

> definition of active queue allows idle long-running apps to distort fair 
> shares
> ---
>
> Key: YARN-6960
> URL: https://issues.apache.org/jira/browse/YARN-6960
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.1, 3.0.0-alpha4
>Reporter: Steven Rand
>Assignee: Steven Rand
>
> YARN-2026 introduced the notion of only considering active queues when 
> computing the fair share of each queue. The definition of an active queue is 
> a queue with at least one runnable app:
> {code}
>   public boolean isActive() {
> return getNumRunnableApps() > 0;
>   }
> {code}
> One case that this definition of activity doesn't account for is that of 
> long-running applications that scale dynamically. Such an application might 
> request many containers when jobs are running, but scale down to very few 
> containers, or only the AM container, when no jobs are running.
> Even when such an application has scaled down to a negligible amount of 
> demand and utilization, the queue that it's in is still considered to be 
> active, which defeats the purpose of YARN-2026. For example, consider this 
> scenario:
> 1. We have queues {{root.a}}, {{root.b}}, {{root.c}}, and {{root.d}}, all of 
> which have the same weight.
> 2. Queues {{root.a}} and {{root.b}} contain long-running applications that 
> currently have only one container each (the AM).
> 3. An application in queue {{root.c}} starts, and uses the whole cluster 
> except for the small amount in use by {{root.a}} and {{root.b}}. An 
> application in {{root.d}} starts, and has a high enough demand to be able to 
> use half of the cluster. Because all four queues are active, the app in 
> {{root.d}} can only preempt the app in {{root.c}} up to roughly 25% of the 
> cluster's resources, while the app in {{root.c}} keeps about 75%.
> Ideally in this example, the app in {{root.d}} would be able to preempt the 
> app in {{root.c}} up to 50% of the cluster, which would be possible if the 
> idle apps in {{root.a}} and {{root.b}} didn't cause those queues to be 
> considered active.
> One way to address this is to update the definition of an active queue to be 
> a queue containing 1 or more non-AM containers. This way if all apps in a 
> queue scale down to only the AM, other queues' fair shares aren't affected.
> The benefit of this approach is that it's quite simple. The downside is that 
> it doesn't account for apps that are idle and using almost no resources, but 
> still have at least one non-AM container.
> There are a couple of other options that seem plausible to me, but they're 
> much more complicated, and it seems to me that this proposal makes good 
> progress while adding minimal extra complexity.
> Does this seem like a reasonable change? I'm certainly open to better ideas 
> as well.
> Thanks,
> Steve



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Created] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares

2017-08-07 Thread Steven Rand (JIRA)
Steven Rand created YARN-6960:
-

 Summary: definition of active queue allows idle long-running apps 
to distort fair shares
 Key: YARN-6960
 URL: https://issues.apache.org/jira/browse/YARN-6960
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.0.0-alpha4, 2.8.1
Reporter: Steven Rand
Assignee: Steven Rand


YARN-2026 introduced the notion of only considering active queues when 
computing the fair share of each queue. The definition of an active queue is a 
queue with at least one runnable app:

{code}
  public boolean isActive() {
    return getNumRunnableApps() > 0;
  }
{code}

One case that this definition of activity doesn't account for is that of 
long-running applications that scale dynamically. Such an application might 
request many containers when jobs are running, but scale down to very few 
containers, or only the AM container, when no jobs are running.

Even when such an application has scaled down to a negligible amount of demand 
and utilization, the queue that it's in is still considered to be active, which 
defeats the purpose of YARN-2026. For example, consider this scenario:

1. We have queues {{root.a}}, {{root.b}}, {{root.c}}, and {{root.d}}, all of 
which have the same weight.
2. Queues {{root.a}} and {{root.b}} contain long-running applications that 
currently have only one container each (the AM).
3. An application in queue {{root.c}} starts, and uses the whole cluster except 
for the small amount in use by {{root.a}} and {{root.b}}. An application in 
{{root.d}} starts, and has a high enough demand to be able to use half of the 
cluster. Because all four queues are active, the app in {{root.d}} can only 
preempt the app in {{root.c}} up to roughly 25% of the cluster's resources, 
while the app in {{root.c}} keeps about 75%.

Ideally in this example, the app in {{root.d}} would be able to preempt the app 
in {{root.c}} up to 50% of the cluster, which would be possible if the idle 
apps in {{root.a}} and {{root.b}} didn't cause those queues to be considered 
active.

One way to address this is to update the definition of an active queue to be a 
queue containing 1 or more non-AM containers. This way if all apps in a queue 
scale down to only the AM, other queues' fair shares aren't affected.
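For illustration, that simpler check could look roughly like this, where 
{{getNumNonAMContainers()}} is an assumed helper rather than an existing 
method:

{code}
  // Rough sketch of the proposal: the queue stays active only while it holds
  // at least one non-AM container. getNumNonAMContainers() is assumed here.
  public boolean isActive() {
    return getNumNonAMContainers() > 0;
  }
{code}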

The benefit of this approach is that it's quite simple. The downside is that it 
doesn't account for apps that are idle and using almost no resources, but still 
have at least one non-AM container.

There are a couple of other options that seem plausible to me, but they're much 
more complicated, and it seems to me that this proposal makes good progress 
while adding minimal extra complexity.

Does this seem like a reasonable change? I'm certainly open to better ideas as 
well.

Thanks,
Steve



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6956) preemption may only consider resource requests for one node

2017-08-06 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115847#comment-16115847
 ] 

Steven Rand edited comment on YARN-6956 at 8/6/17 4:35 PM:
---

Hi [~dan...@cloudera.com], thanks for the quick reply and explanation. That 
concern definitely makes sense, and in general YARN-6163 seems like a good 
change.

However, what I'm seeing is that only considering RRs for one node actually 
causes some of my apps to remain starved for quite a long time. The series of 
events that happens in a loop is:

1. The app is correctly considered to be starved
2. The app has many RRs, several of which can be satisfied, but only one RR is 
actually considered for preemption as per this JIRA's description
3. That particular RR happens to be for a node on which no containers can be 
preempted for the app, so the app remains starved

Since the order of the list of RRs is the same each time through the loop, the 
same RR is always considered, no containers are preempted, and the app remains 
starved, even though it has other RRs that could be satisfied.

I haven't thought enough yet about what a solution would look like, but it 
seems like we should be able to keep the benefits of YARN-6163 while also 
avoiding this issue. I'll try to have a patch within the next few days if 
people agree that we should change the behavior.
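That said, a very rough sketch of one possible direction is below -- keep 
collecting satisfiable RRs until the starved amount is covered, rather than 
returning after the first RR that alone covers it. This is not the actual 
{{FSAppAttempt}} code; {{pendingStarvation}}, {{isSatisfiable()}}, and the 
exact accounting are assumed for illustration.

{code}
// Very rough sketch, not the actual FSAppAttempt code: keep collecting
// satisfiable ResourceRequests until the starved amount is covered, rather
// than returning after the first RR that alone covers it. pendingStarvation
// and isSatisfiable() are assumed names for this illustration.
List<ResourceRequest> selected = new ArrayList<>();
Resource remaining = Resources.clone(pendingStarvation);
for (ResourceRequest rr : appSchedulingInfo.getAllResourceRequests()) {
  if (!isSatisfiable(rr)) {
    continue;
  }
  selected.add(rr);
  Resources.subtractFrom(remaining,
      Resources.multiply(rr.getCapability(), rr.getNumContainers()));
  if (Resources.fitsIn(remaining, Resources.none())) {
    // Starvation is now covered by a set of RRs that may span multiple
    // nodes/racks, so a single unpreemptable node no longer blocks progress.
    break;
  }
}
{code}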


was (Author: steven rand):
Hi [~dan...@cloudera.com], thanks for the quick reply and explanation. That 
concern definitely makes sense, and in general YARN-6163 seems like a good 
change.

However, what I'm seeing is that only considering RRs for one node actually 
causes some of my apps to remain starved for quite a long time. The series of 
events that happens in a loop is:

1. The app is correctly considered to be starved
2. The app has many RRs, several of which can be satisfied, but only one RR is 
actually considered for preemption as per this JIRA's description
3. That particular RR happens to be for a node on which the no containers can 
be preempted for the app, so the app remains starved

Since the order of the list of RRs is the same each time through the loop, the 
same RR is always considered, no containers are preempted, and the app remains 
starved, even though it has other RRs that could be satisfied.

I haven't thought enough yet about what a solution would look like, but it 
seems like we should be able to keep the benefits of YARN-6163 while also 
avoiding this issue. I'll try to have a patch within the next few days if 
people agree that we should change the behavior.

> preemption may only consider resource requests for one node
> ---
>
> Key: YARN-6956
> URL: https://issues.apache.org/jira/browse/YARN-6956
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.0, 3.0.0-beta1
> Environment: CDH 5.11.0
>Reporter: Steven Rand
>
> I'm observing the following series of events on a CDH 5.11.0 cluster, which 
> seem to be possible after YARN-6163:
> 1. An application is considered to be starved, so {{FSPreemptionThread}} 
> calls {{identifyContainersToPreempt}}, and that calls 
> {{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
> {{ResourceRequest}} instances that are enough to address the app's starvation.
> 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
> enough to address the app's starvation, so we break out of the loop over 
> {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
>  We return only this one {{ResourceRequest}} back to the 
> {{identifyContainersToPreempt}} method.
> 3. It turns out that this particular {{ResourceRequest}} happens to have a 
> value for {{getResourceName}} that identifies a specific node in the cluster. 
> This causes preemption to only consider containers on that node, and not the 
> rest of the cluster.
> [~kasha], does that make sense? I'm happy to submit a patch if I'm 
> understanding the problem correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6956) preemption may only consider resource requests for one node

2017-08-06 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115847#comment-16115847
 ] 

Steven Rand commented on YARN-6956:
---

Hi [~dan...@cloudera.com], thanks for the quick reply and explanation. That 
concern definitely makes sense, and in general YARN-6163 seems like a good 
change.

However, what I'm seeing is that only considering RRs for one node actually 
causes some of my apps to remain starved for quite a long time. The series of 
events that happens in a loop is:

1. The app is correctly considered to be starved
2. The app has many RRs, several of which can be satisfied, but only one RR is 
actually considered for preemption as per this JIRA's description
3. That particular RR happens to be for a node on which no containers can be 
preempted for the app, so the app remains starved

Since the order of the list of RRs is the same each time through the loop, the 
same RR is always considered, no containers are preempted, and the app remains 
starved, even though it has other RRs that could be satisfied.

I haven't thought enough yet about what a solution would look like, but it 
seems like we should be able to keep the benefits of YARN-6163 while also 
avoiding this issue. I'll try to have a patch within the next few days if 
people agree that we should change the behavior.

> preemption may only consider resource requests for one node
> ---
>
> Key: YARN-6956
> URL: https://issues.apache.org/jira/browse/YARN-6956
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.0, 3.0.0-beta1
> Environment: CDH 5.11.0
>Reporter: Steven Rand
>
> I'm observing the following series of events on a CDH 5.11.0 cluster, which 
> seem to be possible after YARN-6163:
> 1. An application is considered to be starved, so {{FSPreemptionThread}} 
> calls {{identifyContainersToPreempt}}, and that calls 
> {{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
> {{ResourceRequest}} instances that are enough to address the app's starvation.
> 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
> enough to address the app's starvation, so we break out of the loop over 
> {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
>  We return only this one {{ResourceRequest}} back to the 
> {{identifyContainersToPreempt}} method.
> 3. It turns out that this particular {{ResourceRequest}} happens to have a 
> value for {{getResourceName}} that identifies a specific node in the cluster. 
> This causes preemption to only consider containers on that node, and not the 
> rest of the cluster.
> [~kasha], does that make sense? I'm happy to submit a patch if I'm 
> understanding the problem correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6956) preemption may only consider resource requests for one node

2017-08-05 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-6956:
--
Description: 
I'm observing the following series of events on a CDH 5.11.0 cluster, which 
seem to be possible after YARN-6163:

1. An application is considered to be starved, so {{FSPreemptionThread}} calls 
{{identifyContainersToPreempt}}, and that calls 
{{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
{{ResourceRequest}} instances that are enough to address the app's starvation.

2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
enough to address the app's starvation, so we break out of the loop over 
{{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
 We return only this one {{ResourceRequest}} back to the 
{{identifyContainersToPreempt}} method.

3. It turns out that this particular {{ResourceRequest}} happens to have a 
value for {{getResourceName}} that identifies a specific node in the cluster. 
This causes preemption to only consider containers on that node, and not the 
rest of the cluster.

[~kasha], does that make sense? I'm happy to submit a patch if I'm 
understanding the problem correctly.

  was:
I'm observing the following series of events on a CDH 5.11.0 cluster, which 
seem to be possible after https://issues.apache.org/jira/browse/YARN-6163:

1. An application is considered to be starved, so {{FSPreemptionThread}} calls 
{{identifyContainersToPreempt}}, and that calls 
{{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
{{ResourceRequest}} instances that are enough to address the app's starvation.

2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
enough to address the app's starvation, so we break out of the loop over 
{{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
 We return only this one {{ResourceRequest}} back to the 
{{identifyContainersToPreempt}} method.

3. It turns out that this particular {{ResourceRequest}} happens to have a 
value for {{getResourceName}} that identifies a specific node in the cluster. 
This causes preemption to only consider containers on that node, and not the 
rest of the cluster.

[~kasha], does that make sense? I'm happy to submit a patch if I'm 
understanding the problem correctly.


> preemption may only consider resource requests for one node
> ---
>
> Key: YARN-6956
> URL: https://issues.apache.org/jira/browse/YARN-6956
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.0, 3.0.0-beta1
> Environment: CDH 5.11.0
>Reporter: Steven Rand
>
> I'm observing the following series of events on a CDH 5.11.0 cluster, which 
> seem to be possible after YARN-6163:
> 1. An application is considered to be starved, so {{FSPreemptionThread}} 
> calls {{identifyContainersToPreempt}}, and that calls 
> {{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
> {{ResourceRequest}} instances that are enough to address the app's starvation.
> 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
> enough to address the app's starvation, so we break out of the loop over 
> {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
>  We return only this one {{ResourceRequest}} back to the 
> {{identifyContainersToPreempt}} method.
> 3. It turns out that this particular {{ResourceRequest}} happens to have a 
> value for {{getResourceName}} that identifies a specific node in the cluster. 
> This causes preemption to only consider containers on that node, and not the 
> rest of the cluster.
> [~kasha], does that make sense? I'm happy to submit a patch if I'm 
> understanding the problem correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6956) preemption may only consider resource requests for one node

2017-08-05 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-6956:
--
Description: 
I'm observing the following series of events on a CDH 5.11.0 cluster, which 
seem to be possible after https://issues.apache.org/jira/browse/YARN-6163:

1. An application is considered to be starved, so {{FSPreemptionThread}} calls 
{{identifyContainersToPreempt}}, and that calls 
{{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
{{ResourceRequest}} instances that are enough to address the app's starvation.

2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
enough to address the app's starvation, so we break out of the loop over 
{{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
 We return only this one {{ResourceRequest}} back to the 
{{identifyContainersToPreempt}} method.

3. It turns out that this particular {{ResourceRequest}} happens to have a 
value for {{getResourceName}} that identifies a specific node in the cluster. 
This causes preemption to only consider containers on that node, and not the 
rest of the cluster.

[~kasha], does that make sense? I'm happy to submit a patch if I'm 
understanding the problem correctly.

  was:
I'm observing the following series of events on a CDH 5.11.0 cluster, which 
seem to be possible after https://issues.apache.org/jira/browse/YARN-6163:

1. An application is considered to be starved, so {{FSPreemptionThread}} calls 
{{identifyContainersToPreempt}}, and that calls 
{{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
{{ResourceRequest}} instances that are enough to address the app's starvation.

2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
enough to address the app's starvation, so we break out of the loop over 
{{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
 We return only this one {{ResourceRequest}} back to the 
{{identifyContainersToPreempt}} method.

3. It turns out that this particular {{ResourceRequest}} happens to have a 
value for {{getResourceName}} that identifies a specific node in the cluster. 
This cause preemption to only consider containers on that node, and not the 
rest of the cluster.

[~kasha], does that make sense? I'm happy to submit a patch if I'm 
understanding the problem correctly.


> preemption may only consider resource requests for one node
> ---
>
> Key: YARN-6956
> URL: https://issues.apache.org/jira/browse/YARN-6956
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.0, 3.0.0-beta1
> Environment: CDH 5.11.0
>Reporter: Steven Rand
>
> I'm observing the following series of events on a CDH 5.11.0 cluster, which 
> seem to be possible after https://issues.apache.org/jira/browse/YARN-6163:
> 1. An application is considered to be starved, so {{FSPreemptionThread}} 
> calls {{identifyContainersToPreempt}}, and that calls 
> {{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
> {{ResourceRequest}} instances that are enough to address the app's starvation.
> 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
> enough to address the app's starvation, so we break out of the loop over 
> {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
>  We return only this one {{ResourceRequest}} back to the 
> {{identifyContainersToPreempt}} method.
> 3. It turns out that this particular {{ResourceRequest}} happens to have a 
> value for {{getResourceName}} that identifies a specific node in the cluster. 
> This causes preemption to only consider containers on that node, and not the 
> rest of the cluster.
> [~kasha], does that make sense? I'm happy to submit a patch if I'm 
> understanding the problem correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6956) preemption may only consider resource requests for one node

2017-08-05 Thread Steven Rand (JIRA)
Steven Rand created YARN-6956:
-

 Summary: preemption may only consider resource requests for one 
node
 Key: YARN-6956
 URL: https://issues.apache.org/jira/browse/YARN-6956
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.9.0, 3.0.0-beta1
 Environment: CDH 5.11.0
Reporter: Steven Rand


I'm observing the following series of events on a CDH 5.11.0 cluster, which 
seem to be possible after https://issues.apache.org/jira/browse/YARN-6163:

1. An application is considered to be starved, so {{FSPreemptionThread}} calls 
{{identifyContainersToPreempt}}, and that calls 
{{FSAppAttempt#getStarvedResourceRequests}} to get a list of 
{{ResourceRequest}} instances that are enough to address the app's starvation.

2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is 
enough to address the app's starvation, so we break out of the loop over 
{{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180.
 We return only this one {{ResourceRequest}} back to the 
{{identifyContainersToPreempt}} method.

3. It turns out that this particular {{ResourceRequest}} happens to have a 
value for {{getResourceName}} that identifies a specific node in the cluster. 
This causes preemption to only consider containers on that node, and not the 
rest of the cluster.

[~kasha], does that make sense? I'm happy to submit a patch if I'm 
understanding the problem correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2985) YARN should support to delete the aggregated logs for Non-MapReduce applications

2017-04-13 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968609#comment-15968609
 ] 

Steven Rand commented on YARN-2985:
---

[~jlowe], thanks for the thoughtful response. Based on that information, it 
seems like the most straightforward way to proceed, at least for branch-2, is 
to add a configuration option for running the deletion service in only the 
timeline server, and not the JHS. Something like 
{{yarn.log-aggregation.run-in-timeline-server}} that defaults to {{false}} for 
backcompat, but when set to {{true}}, prevents the JHS from performing 
retention, and tells the timeline server to do it instead. Does that seem 
reasonable? If so I'll update the patch to do that, but certainly open to 
alternatives if there's a better way.
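To illustrate, the Timeline Server side could gate the deletion service on the 
new flag roughly like this (sketch only -- the property doesn't exist yet, and 
the actual wiring would likely differ):

{code}
// Sketch only: the proposed property does not exist yet. The Timeline Server
// would start the deletion service only when the flag is enabled, and the
// JHS would skip retention in that case.
boolean runInTimelineServer = conf.getBoolean(
    "yarn.log-aggregation.run-in-timeline-server", false);
if (runInTimelineServer) {
  addService(new AggregatedLogDeletionService());
}
{code}

The JHS would check the same flag and skip scheduling its own retention when 
it's set.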

For trunk, I imagine it might be worth just removing retention from the JHS and 
moving it to the timeline server entirely, since my understanding is that the 
timeline server is supposed to replace the JHS, even for deployments that only 
run MR jobs, and 3.0 seems like a reasonable enough point at which to require 
the switch from JHS to timeline server. I might be misunderstanding the 
relationship between the two though, so please correct me if that doesn't make 
sense.

> YARN should support to delete the aggregated logs for Non-MapReduce 
> applications
> 
>
> Key: YARN-2985
> URL: https://issues.apache.org/jira/browse/YARN-2985
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: log-aggregation, nodemanager
>Affects Versions: 2.8.0
>Reporter: Xu Yang
>Assignee: Steven Rand
> Attachments: YARN-2985-branch-2-001.patch
>
>
> Before Hadoop 2.6, the LogAggregationService is started in the NodeManager, 
> but the AggregatedLogDeletionService is started in MapReduce's 
> JobHistoryServer. Therefore, non-MapReduce applications can aggregate their 
> logs to HDFS, but cannot delete those logs. The NodeManager needs to take 
> over the function of aggregated log deletion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-2985) YARN should support to delete the aggregated logs for Non-MapReduce applications

2017-03-30 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated YARN-2985:
--
Attachment: YARN-2985-branch-2-001.patch

Attaching a patch for branch-2. I've tested this experimentally by deploying a 
patched Timeline Server to a cluster, running a Spark job on that cluster, and 
validating that the aggregated logs disappeared from HDFS after the configured 
amount of time had elapsed. The Timeline Server's logs confirm that it 
performed the deletion.

I'm not sure how to add tests though. The existing tests in 
{{TestAggregatedLogDeletionService}} are good enough to test that the service 
works -- the more interesting thing is verifying that when a Timeline Server is 
deployed, log retention is enforced for non-MR applications. I don't know how 
to test non-MR applications from within the hadoop-yarn project's tests though.

> YARN should support to delete the aggregated logs for Non-MapReduce 
> applications
> 
>
> Key: YARN-2985
> URL: https://issues.apache.org/jira/browse/YARN-2985
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: log-aggregation, nodemanager
>Reporter: Xu Yang
>Assignee: Steven Rand
> Attachments: YARN-2985-branch-2-001.patch
>
>
> Before Hadoop 2.6, the LogAggregationService is started in the NodeManager, 
> but the AggregatedLogDeletionService is started in MapReduce's 
> JobHistoryServer. Therefore, non-MapReduce applications can aggregate their 
> logs to HDFS, but cannot delete those logs. The NodeManager needs to take 
> over the function of aggregated log deletion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-2985) YARN should support to delete the aggregated logs for Non-MapReduce applications

2017-03-08 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand reassigned YARN-2985:
-

Assignee: Steven Rand

> YARN should support to delete the aggregated logs for Non-MapReduce 
> applications
> 
>
> Key: YARN-2985
> URL: https://issues.apache.org/jira/browse/YARN-2985
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: log-aggregation, nodemanager
>Reporter: Xu Yang
>Assignee: Steven Rand
>
> Before Hadoop 2.6, the LogAggregationService is started in the NodeManager, 
> but the AggregatedLogDeletionService is started in MapReduce's 
> JobHistoryServer. Therefore, non-MapReduce applications can aggregate their 
> logs to HDFS, but cannot delete those logs. The NodeManager needs to take 
> over the function of aggregated log deletion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-6120) add retention of aggregated logs to Timeline Server

2017-03-08 Thread Steven Rand (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand resolved YARN-6120.
---
Resolution: Duplicate

I now have the ability to submit a patch for YARN-2985, so this duplicate JIRA 
is unnecessary. 

> add retention of aggregated logs to Timeline Server
> ---
>
> Key: YARN-6120
> URL: https://issues.apache.org/jira/browse/YARN-6120
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation, timelineserver
>Affects Versions: 2.7.3
>Reporter: Steven Rand
> Attachments: YARN-6120.001.patch
>
>
> The MR History Server performs retention of aggregated logs for MapReduce 
> applications. However, there is no way of enforcing retention on aggregated 
> logs for other types of applications. This JIRA proposes to add log retention 
> to the Timeline Server.
> Also, this is arguably a duplicate of 
> https://issues.apache.org/jira/browse/YARN-2985, but I could not find a way 
> to attach a patch for that issue. If someone closes this as a duplicate, 
> could you please assign that issue to me?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6308) Fix TestAMRMClient compilation errors

2017-03-08 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902269#comment-15902269
 ] 

Steven Rand commented on YARN-6308:
---

Attached a new patch to HADOOP-14062, though I think this issue should already 
have been fixed by the revert of the previous patch.

> Fix TestAMRMClient compilation errors
> -
>
> Key: YARN-6308
> URL: https://issues.apache.org/jira/browse/YARN-6308
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha3
>Reporter: Manoj Govindassamy
>
> Looks like fixes committed for HADOOP-14062 and YARN-6218 had conflicts and 
> left TestAMRMClient in a dangling state with compilation errors. 
> TestAMRMClient needs a fix.
> {code}
> [ERROR] COMPILATION ERROR : 
> [ERROR] 
> /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[145,5]
>  non-static variable yarnCluster cannot be referenced from a static context
> [ERROR] 
> /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[145,71]
>  non-static variable nodeCount cannot be referenced from a static context
> [ERROR] 
> /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[146,5]
>  non-static variable yarnCluster cannot be referenced from a static context
> ..
> ..
> [ERROR] 
> /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[204,9]
>  non-static variable attemptId cannot be referenced from a static context
> [ERROR] 
> /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[207,20]
>  non-static variable attemptId cannot be referenced from a static context
> [ERROR] 
> /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[206,13]
>  non-static variable yarnCluster cannot be referenced from a static context
> [ERROR] 
> /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[874,5]
>  cannot find symbol
> [ERROR] symbol:   method tearDown()
> [ERROR] location: class org.apache.hadoop.yarn.client.api.impl.TestAMRMClient
> [ERROR] 
> /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[876,5]
>  cannot find symbol
> [ERROR] symbol:   method startApp()
> [ERROR] location: class org.apache.hadoop.yarn.client.api.impl.TestAMRMClient
> [ERROR] 
> /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[881,5]
>  cannot find symbol
> [ERROR] symbol:   method tearDown()
> [ERROR] location: class org.apache.hadoop.yarn.client.api.impl.TestAMRMClient
> [ERROR] 
> /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[885,5]
>  cannot find symbol
> [ERROR] symbol:   method startApp()
> [ERROR] location: class org.apache.hadoop.yarn.client.api.impl.TestAMRMClient
> [ERROR] -> [Help 1]
> [ERROR] 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


