[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034876#comment-14034876 ] Bikas Saha commented on YARN-2052:
--
With 32 bits for the epoch number we have 4 billion restarts before it overflows. We are probably safe without any handling.

> ContainerId creation after work preserving restart is broken
>
> Key: YARN-2052
> URL: https://issues.apache.org/jira/browse/YARN-2052
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Tsuyoshi OZAWA
> Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch
>
> Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high churn activity the RM does not store the sequence number per app. So after restart it does not know what the new sequence number should be for new allocations.

-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034812#comment-14034812 ] Yi Tian commented on YARN-2083:
--
[~ywskycn], thanks for your advice. YARN-2083-3.patch works fine on trunk, and YARN-2083-2.patch works fine on branch-2.4.1. Is it possible to apply this patch to the YARN project?

> In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
>
> Key: YARN-2083
> URL: https://issues.apache.org/jira/browse/YARN-2083
> Project: Hadoop YARN
> Issue Type: Bug
> Components: scheduler
> Affects Versions: 2.3.0
> Reporter: Yi Tian
> Labels: assignContainer, fair, scheduler
> Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083-3.patch, YARN-2083.patch
>
> In fair scheduler, FSParentQueue and FSLeafQueue do an assignContainerPreCheck to guarantee the queue is not over its limit. But the fitsIn function in Resource.java does not return false when the usedResource equals the maxResource. I think we should create a new function "fitsInWithoutEqual" instead of "fitsIn" in this case.
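To make the fitsIn vs. fitsInWithoutEqual distinction above concrete, here is a minimal sketch; this is not the actual Hadoop Resource API (which tracks memory and vcores), and all names other than fitsIn are illustrative:

```java
// Minimal sketch of the precheck discussed above. A single long stands in
// for the real YARN Resource.
public class ResourceCheck {
    // Existing fitsIn semantics: passes when smaller <= bigger, including equality.
    static boolean fitsIn(long smaller, long bigger) {
        return smaller <= bigger;
    }

    // Proposed strict variant: equality no longer fits, so a queue whose
    // usedResource has already reached maxResource is rejected.
    static boolean fitsInWithoutEqual(long smaller, long bigger) {
        return smaller < bigger;
    }

    // assignContainerPreCheck as described in the issue: refuse to assign
    // more containers once the queue has reached its cap.
    static boolean assignContainerPreCheck(long usedResource, long maxResource) {
        return fitsInWithoutEqual(usedResource, maxResource);
    }
}
```

With the original fitsIn, a queue sitting exactly at its cap still passes the precheck and can be assigned one more container, which is the bug this issue reports.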
[jira] [Updated] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated YARN-2083:
--
Fix Version/s: (was: 2.4.1)
[jira] [Commented] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034777#comment-14034777 ] Hadoop QA commented on YARN-2083:
--
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650950/YARN-2083-3.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4019//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4019//console
This message is automatically generated.
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034746#comment-14034746 ] Tsuyoshi OZAWA commented on YARN-2052:
--
{quote}
We should make it a long in the same release as the epoch number addition so that we don't have to worry about that.
{quote}
+1 to doing this in the same release; we'll plan the improvement in another JIRA. That said, I think it's important that we decide the behavior when the overflow happens. We have two options: simply aborting the RM for now, or starting apps from a clean state after the restart. Since we're planning to make the id a long right after this JIRA, we can take the aborting approach for simplicity, to prevent unexpected behavior. [~bikassaha], [~jianhe], what do you think about this?
[jira] [Updated] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated YARN-2083:
--
Attachment: YARN-2083-3.patch

Small change to adapt to YARN-1474 (Make schedulers services).
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034732#comment-14034732 ] Bikas Saha commented on YARN-2052:
--
Ah, I did not see the rest of the comment. Yes, integer overflow is a problem. We should make it a long in the same release as the epoch number addition so that we don't have to worry about that.
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034731#comment-14034731 ] Bikas Saha commented on YARN-2052:
--
Why would ContainerId#compareTo fail? Existing containerIds should remain unchanged after RM restart. Only new container ids should have a different epoch number.
[jira] [Commented] (YARN-2144) Add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034725#comment-14034725 ] Hadoop QA commented on YARN-2144:
--
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650937/YARN-2144.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4018//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4018//console
This message is automatically generated.

> Add logs when preemption occurs
>
> Key: YARN-2144
> URL: https://issues.apache.org/jira/browse/YARN-2144
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacityscheduler
> Affects Versions: 2.5.0
> Reporter: Tassapol Athiapinya
> Assignee: Wangda Tan
> Attachments: AM-page-preemption-info.png, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch
>
> There should be easy-to-read logs when preemption does occur.
> 1. For debugging purposes, the RM should log this.
> 2. For administrative purposes, the RM webpage should have a page to show recent preemption events.
> RM logs should have the following properties:
> * Logs are retrievable while an application is still running, and often flushed.
> * Can distinguish between AM container preemption and task container preemption, with the container ID shown.
> * Should be INFO level logs.
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034722#comment-14034722 ] Tsuyoshi OZAWA commented on YARN-2052:
--
I meant starting apps from a clean state after the restart, as in RM restart phase 1. If the sequence numbers are reset to zero, some applications can behave unexpectedly because {{ContainerId#compareTo}} no longer works correctly. If the apps start from a clean state, we can avoid that situation.
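A toy example of the compareTo hazard described above; this simplified class is not the real ContainerId, but the ordering idea is the same. If the sequence counter restarts at zero, a post-restart container can receive the same (appId, sequence) pair as a pre-restart one, and ordering can no longer tell them apart:

```java
// Simplified stand-in for ContainerId: ordering is by (appId, sequence).
public class SimpleContainerId implements Comparable<SimpleContainerId> {
    final int appId;
    final int sequence;

    SimpleContainerId(int appId, int sequence) {
        this.appId = appId;
        this.sequence = sequence;
    }

    @Override
    public int compareTo(SimpleContainerId other) {
        if (appId != other.appId) {
            return Integer.compare(appId, other.appId);
        }
        // After a restart that resets sequence numbers, two distinct
        // containers can collide here and compare as equal.
        return Integer.compare(sequence, other.sequence);
    }
}
```

A container allocated before the restart with sequence 7 and a different container allocated after the restart with sequence 7 compare as equal, so any sorted structure keyed by id conflates them.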
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034716#comment-14034716 ] Jian He commented on YARN-2052:
--
bq. One simple way is to fallback to RM-restart implemented in YARN-128

Can you clarify what you mean?
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034702#comment-14034702 ] Tsuyoshi OZAWA commented on YARN-2052:
--
[~bikassaha], yes, I think it's the same.
[jira] [Updated] (YARN-2144) Add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2144:
--
Attachment: YARN-2144.patch

Rebased patch to latest trunk.
[jira] [Commented] (YARN-1373) Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps
[ https://issues.apache.org/jira/browse/YARN-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034700#comment-14034700 ] Bikas Saha commented on YARN-1373:
--
Sorry, I am not clear on how this is a dup. This JIRA is tracking new behavior in the RM that will transition a recovered RMAppImpl/RMAppAttemptImpl app (one that is still actually running) to a RUNNING state instead of a terminal recovered state. This is to ensure that the state machines are in the correct state for the running AM to resync and continue as running. This is not related to killing the app master process on the NM.

> Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps
>
> Key: YARN-1373
> URL: https://issues.apache.org/jira/browse/YARN-1373
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Bikas Saha
> Assignee: Omkar Vinit Joshi
>
> Currently the RM moves recovered app attempts to a terminal recovered state and starts a new attempt. Instead, it will have to transition the last attempt to a running state such that it can proceed as normal once the running attempt has resynced with the ApplicationMasterService (YARN-1365 and YARN-1366). If the RM had started the application container before dying then the AM would be up and trying to contact the RM. The RM may have died before launching the container. For this case, the RM should wait for the AM liveliness period and issue a kill container for the stored master container. It should transition this attempt to some RECOVER_ERROR state and proceed to start a new attempt.
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034691#comment-14034691 ] Bikas Saha commented on YARN-2052:
--
bq. Had an offline discussion with Vinod. Maybe it's still better to persist some sequence number to indicate the number of RM restarts when RM starts up.

Is this the same as the epoch number that was mentioned earlier in this JIRA? https://issues.apache.org/jira/browse/YARN-2052?focusedCommentId=13996675&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13996675. Seems to me that it's the same, with the epoch number renamed to num-rm-restarts.
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034637#comment-14034637 ] Tsuyoshi OZAWA commented on YARN-2052:
--
Basically, I agree with the approach. If we take the sequence-number approach, we should define the behavior when the sequence number overflows. One simple way is to fall back to the RM restart behavior implemented in YARN-128. After changing the containerId/appId from integer to long, it'll happen very rarely. [~jianhe], what do you think about the behavior?
[jira] [Commented] (YARN-2144) Add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034624#comment-14034624 ] Jian He commented on YARN-2144:
--
The patch needs a rebase, can you update it please? Thanks.
[jira] [Commented] (YARN-2147) client lacks delegation token exception details when application submit fails
[ https://issues.apache.org/jira/browse/YARN-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034612#comment-14034612 ] Daryn Sharp commented on YARN-2147:
--
I don't think the patch handles the use case it's designed for. If job submission failed with a bland "Read timed out", then logging all the tokens in the RM log doesn't help the end user, nor does the RM log even answer the question of which token timed out. What you really want to do is change {{DelegationTokenRenewer#handleAppSubmitEvent}} to trap exceptions from {{renewToken}}. Wrap the exception with a more descriptive exception that stringifies to the user as "Can't renew token <token>: Read timed out".

> client lacks delegation token exception details when application submit fails
>
> Key: YARN-2147
> URL: https://issues.apache.org/jira/browse/YARN-2147
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.4.0
> Reporter: Jason Lowe
> Assignee: Chen He
> Priority: Minor
> Attachments: YARN-2147-v2.patch, YARN-2147.patch
>
> When a client submits an application and the delegation token process fails, the client can lack critical details needed to understand the nature of the error. Only the message of the error exception is conveyed to the client, which sometimes isn't enough to debug.
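Daryn's suggestion can be sketched roughly as follows. This is a hypothetical standalone sketch, not the real DelegationTokenRenewer code; the Renewer interface, renewWithContext name, and string token are all illustrative stand-ins:

```java
import java.io.IOException;

// Sketch of trapping the renew failure and rethrowing with the token's
// identity, so the client sees which token failed rather than a bare
// "Read timed out".
public class TokenRenewWrapper {
    // Stand-in for the real renewToken call path.
    interface Renewer {
        void renew(String token) throws IOException;
    }

    static void renewWithContext(Renewer renewer, String token) throws IOException {
        try {
            renewer.renew(token);
        } catch (IOException e) {
            // Wrap with a descriptive message that names the offending token,
            // preserving the original exception as the cause.
            throw new IOException("Can't renew token " + token + ": " + e.getMessage(), e);
        }
    }
}
```

The key point is that the wrapped message, not just the RM log, is what travels back to the submitting client.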
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034588#comment-14034588 ] Hadoop QA commented on YARN-1341:
--
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650914/YARN-1341v5.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4017//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4017//console
This message is automatically generated.

> Recover NMTokens upon nodemanager restart
>
> Key: YARN-1341
> URL: https://issues.apache.org/jira/browse/YARN-1341
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034541#comment-14034541 ] Jian He commented on YARN-2052:
--
One more problem with the randomId approach: if the user wants to kill a container, the user has to be aware of the random ID.

Had an offline discussion with Vinod. Maybe it's still better to persist a sequence number indicating the number of RM restarts when the RM starts up. Today containerId#id is an int (32 bits); we reserve some bits in the front for the number of RM restarts, e.g. the 32 bits divided as 8 bits for the number of RM restarts and 24 bits for the number of containers. Each time the RM restarts, we increase the RM sequence number. We should also have a follow-up JIRA to change the containerId/appId from integer to long and deprecate the old one. [~ozawa], do you agree?
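The 8/24 bit split proposed above can be sketched like this (illustrative only; the actual field layout chosen on this JIRA may differ):

```java
// 32-bit container id: the high 8 bits carry the RM restart (epoch) count,
// the low 24 bits carry the per-application container sequence number.
public class EpochContainerId {
    static final int SEQ_BITS = 24;
    static final int SEQ_MASK = (1 << SEQ_BITS) - 1; // 0x00FFFFFF

    static int pack(int epoch, int sequence) {
        return (epoch << SEQ_BITS) | (sequence & SEQ_MASK);
    }

    static int epochOf(int id) {
        return id >>> SEQ_BITS;
    }

    static int sequenceOf(int id) {
        return id & SEQ_MASK;
    }
}
```

With this layout, ids minted after a restart always compare greater than any id from an earlier epoch, which is what keeps ContainerId ordering consistent; the cost is capping the scheme at 2^8 restarts and 2^24 containers per app per epoch, hence the follow-up to widen the field to a long.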
[jira] [Updated] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1341:
--
Attachment: YARN-1341v5.patch

Thanks for taking a look, Junping! I've updated the patch to trunk.
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034474#comment-14034474 ] Tsuyoshi OZAWA commented on YARN-2052:
--
Vinod, OK. I'll create a new JIRA to address it.
{quote}
Another question is how are we going to show the containerId string? Specifically, the toString() method. If we just say "original containerId string + UUID", it'll be inconvenient for debugging as the UUID has no meaning.
{quote}
From a developer's point of view, you're right. One idea is to show the RM_ID instead of a UUID, validating the RM_ID at startup and confirming it does not include an underscore. One concern with this approach is that we'd break backward compatibility of yarn-site.xml. If we can accept that, it's the better approach.
[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034456#comment-14034456 ] Hadoop QA commented on YARN-2171: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650880/YARN-2171v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4016//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4016//console This message is automatically generated. 
> AMs block on the CapacityScheduler lock during allocate() > - > > Key: YARN-2171 > URL: https://issues.apache.org/jira/browse/YARN-2171 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2171.patch, YARN-2171v2.patch > > > When AMs heartbeat into the RM via the allocate() call they are blocking on > the CapacityScheduler lock when trying to get the number of nodes in the > cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
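The fix direction implied by this issue, letting allocate() read the node count without taking the scheduler lock, can be sketched as follows. This is an illustration, not the actual CapacityScheduler code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: getNumClusterNodes() need not contend on the scheduler lock if the
// node count lives in an atomic counter updated wherever nodes are added or
// removed (those paths already hold the scheduler lock).
public class NodeTracker {
    private final AtomicInteger numNodes = new AtomicInteger();

    public void addNode()    { numNodes.incrementAndGet(); }
    public void removeNode() { numNodes.decrementAndGet(); }

    // Safe to call from AM heartbeats without blocking on the scheduler lock.
    public int getNumClusterNodes() { return numNodes.get(); }
}
```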
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034452#comment-14034452 ] Jian He commented on YARN-2052: --- Another question is how are we going to show the containerId string? specifically the toString() method. If we just say "original containerId string+UUID", it'll be inconvenient for debugging as the UUID has no meaning. > ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034448#comment-14034448 ] Vinod Kumar Vavilapalli commented on YARN-2052: --- bq. BTW, I think we should update CheckpointAMPreemptionPolicy after this JIRA. Ideally this should be container-allocation timestamp and we should depend on that instead of comparing container-IDs. IAC, let's fix it separately.. > ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
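Vinod's point above, that the preemption policy should depend on container-allocation timestamps rather than comparing container IDs, can be sketched roughly like this (illustrative names only; this is not the actual CheckpointAMPreemptionPolicy code):

```java
import java.util.Comparator;

// Sketch: once ids carry restart information they stop being a reliable
// allocation order, so order containers for preemption by an explicit
// allocation timestamp instead of by id.
public class AllocationOrder {
    public static class ContainerInfo {
        final long containerId;
        final long allocatedAtMillis;
        public ContainerInfo(long containerId, long allocatedAtMillis) {
            this.containerId = containerId;
            this.allocatedAtMillis = allocatedAtMillis;
        }
    }

    // Newest allocations first, a common choice when picking preemption victims.
    public static final Comparator<ContainerInfo> NEWEST_FIRST =
        Comparator.comparingLong((ContainerInfo c) -> c.allocatedAtMillis).reversed();
}
```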
[jira] [Updated] (YARN-2173) Enabling HTTPS for the reader REST APIs of TimelineServer
[ https://issues.apache.org/jira/browse/YARN-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2173: -- Summary: Enabling HTTPS for the reader REST APIs of TimelineServer (was: Enabling HTTPS for the reader REST APIs) > Enabling HTTPS for the reader REST APIs of TimelineServer > - > > Key: YARN-2173 > URL: https://issues.apache.org/jira/browse/YARN-2173 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2174) Enabling HTTPs for the writer REST API of TimelineServer
[ https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2174: -- Summary: Enabling HTTPs for the writer REST API of TimelineServer (was: Enabling HTTPs for the writer REST API) > Enabling HTTPs for the writer REST API of TimelineServer > > > Key: YARN-2174 > URL: https://issues.apache.org/jira/browse/YARN-2174 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > Since we'd like to allow the application to put the timeline data at the > client, the AM and even the containers, we need to provide the way to > distribute the keystore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1373) Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps
[ https://issues.apache.org/jira/browse/YARN-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-1373. --- Resolution: Duplicate Assignee: Omkar Vinit Joshi (was: Anubhav Dhoot) bq. Currently the RM moves recovered app attempts to a terminal recovered state and starts a new attempt. This is no longer an issue - hasn't been since YARN-1210. Even in non-work-preserving RM restart, the RM explicitly never kills the AMs, it's the nodes that kill all containers - this was done in YARN-1210. The state-machines are already set up correctly and so no changes are needed here. Closing as duplicate of YARN-1210. > Transition RMApp and RMAppAttempt state to RUNNING after restart for > recovered running apps > --- > > Key: YARN-1373 > URL: https://issues.apache.org/jira/browse/YARN-1373 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Omkar Vinit Joshi > > Currently the RM moves recovered app attempts to a terminal recovered > state and starts a new attempt. Instead, it will have to transition the last > attempt to a running state such that it can proceed as normal once the > running attempt has resynced with the ApplicationMasterService (YARN-1365 and > YARN-1366). If the RM had started the application container before dying then > the AM would be up and trying to contact the RM. The RM may have died > before launching the container. For this case, the RM should wait for the AM > liveliness period and issue a kill for the stored master container. > It should transition this attempt to some RECOVER_ERROR state and proceed to > start a new attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers
[ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034405#comment-14034405 ] Anubhav Dhoot commented on YARN-1367: - I am still working on it and will have it ready soon. > After restart NM should resync with the RM without killing containers > - > > Key: YARN-1367 > URL: https://issues.apache.org/jira/browse/YARN-1367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1367.prototype.patch > > > After RM restart, the RM sends a resync response to NMs that heartbeat to it. > Upon receiving the resync response, the NM kills all containers and > re-registers with the RM. The NM should be changed to not kill the container > and instead inform the RM about all currently running containers including > their allocations etc. After the re-register, the NM should send all pending > container completions to the RM as usual. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2176) CapacityScheduler loops over all running applications rather than actively requesting apps
Jason Lowe created YARN-2176: Summary: CapacityScheduler loops over all running applications rather than actively requesting apps Key: YARN-2176 URL: https://issues.apache.org/jira/browse/YARN-2176 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.4.0 Reporter: Jason Lowe The capacity scheduler performance is primarily dominated by LeafQueue.assignContainers, and that currently loops over all applications that are running in the queue. It would be more efficient if we looped over just the applications that are actively asking for resources rather than all applications, as there could be thousands of applications running but only a few hundred that are currently asking for resources. -- This message was sent by Atlassian JIRA (v6.2#6252)
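The proposed improvement can be sketched as follows (illustrative, not actual CapacityScheduler code): keep a separate collection of applications with outstanding asks, and have assignContainers iterate only that set instead of every running application in the queue.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch: track only applications with non-empty resource requests so the
// scheduling loop touches a few hundred "active" apps rather than thousands
// of running ones. Insertion order is preserved for rough fairness.
public class ActiveAppTracker {
    private final Set<String> activeApps = new LinkedHashSet<>();

    // Called when an app's outstanding ask becomes non-empty / empty.
    public void markActive(String appId)   { activeApps.add(appId); }
    public void markInactive(String appId) { activeApps.remove(appId); }

    public Iterable<String> appsToSchedule() { return activeApps; }
    public int size() { return activeApps.size(); }
}
```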
[jira] [Assigned] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-2175: --- Assignee: Anubhav Dhoot > Container localization has no timeouts and tasks can be stuck there for a > long time > --- > > Key: YARN-2175 > URL: https://issues.apache.org/jira/browse/YARN-2175 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > There are no timeouts that can be used to limit the time taken by various > container startup operations. Localization, for example, could take a long time, > and there is no way to kill a task if it's stuck in these states. These may > have nothing to do with the task itself and could be an issue within the > platform. > Ideally there should be configurable limits for the various states within the > NodeManager. The RM does not care about most of these; > it's only between the AM and the NM. We can start by making these global > configurable defaults, and in the future we can make it fancier by letting the AM > override them in the start container request. > This jira will be used to limit localization time, and we can open others if we > feel we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2175: Affects Version/s: 2.4.0 > Container localization has no timeouts and tasks can be stuck there for a > long time > --- > > Key: YARN-2175 > URL: https://issues.apache.org/jira/browse/YARN-2175 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot > > There are no timeouts that can be used to limit the time taken by various > container startup operations. Localization, for example, could take a long time, > and there is no way to kill a task if it's stuck in these states. These may > have nothing to do with the task itself and could be an issue within the > platform. > Ideally there should be configurable limits for the various states within the > NodeManager. The RM does not care about most of these; > it's only between the AM and the NM. We can start by making these global > configurable defaults, and in the future we can make it fancier by letting the AM > override them in the start container request. > This jira will be used to limit localization time, and we can open others if we > feel we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
Anubhav Dhoot created YARN-2175: --- Summary: Container localization has no timeouts and tasks can be stuck there for a long time Key: YARN-2175 URL: https://issues.apache.org/jira/browse/YARN-2175 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot There are no timeouts that can be used to limit the time taken by various container startup operations. Localization, for example, could take a long time, and there is no way to kill a task if it's stuck in these states. These may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for the various states within the NodeManager. The RM does not care about most of these; it's only between the AM and the NM. We can start by making these global configurable defaults, and in the future we can make it fancier by letting the AM override them in the start container request. This jira will be used to limit localization time, and we can open others if we feel we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
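The configurable per-state limit described above might look roughly like this; the names and the millisecond units are illustrative, not actual YARN configuration:

```java
// Sketch: each container state records when it was entered, and a monitor
// periodically compares the elapsed time against a configurable limit to
// decide whether a stuck operation (e.g. localization) should be killed.
public class StateDeadline {
    private final long limitMillis;  // hypothetical configurable default
    private final long enteredAtMillis;

    public StateDeadline(long limitMillis, long enteredAtMillis) {
        this.limitMillis = limitMillis;
        this.enteredAtMillis = enteredAtMillis;
    }

    // true once the container has been stuck in this state past the limit
    public boolean expired(long nowMillis) {
        return nowMillis - enteredAtMillis > limitMillis;
    }
}
```

A monitor thread in the NodeManager could then sweep live containers and fire a kill event for any whose current state has expired.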
[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2171: - Attachment: YARN-2171v2.patch The point of the unit test was to catch regressions at a high level. If anyone changes the code such that calling allocate() will grab the scheduler lock then the test will fail, whether that's a regression in this particular method or some new method that's added that ApplicationMasterService or CapacityScheduler itself calls and grabs the lock. I added a separate unit test to exercise the getNumClusterNodes method. The AHS unit test failure seems unrelated, and it passes for me locally even with this change. > AMs block on the CapacityScheduler lock during allocate() > - > > Key: YARN-2171 > URL: https://issues.apache.org/jira/browse/YARN-2171 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2171.patch, YARN-2171v2.patch > > > When AMs heartbeat into the RM via the allocate() call they are blocking on > the CapacityScheduler lock when trying to get the number of nodes in the > cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers
[ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034359#comment-14034359 ] Anubhav Dhoot commented on YARN-1367: - I am still working on it. Will have an update soon > After restart NM should resync with the RM without killing containers > - > > Key: YARN-1367 > URL: https://issues.apache.org/jira/browse/YARN-1367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1367.prototype.patch > > > After RM restart, the RM sends a resync response to NMs that heartbeat to it. > Upon receiving the resync response, the NM kills all containers and > re-registers with the RM. The NM should be changed to not kill the container > and instead inform the RM about all currently running containers including > their allocations etc. After the re-register, the NM should send all pending > container completions to the RM as usual. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034268#comment-14034268 ] Vinod Kumar Vavilapalli commented on YARN-1972: --- That looks fine. I was suggesting we create one more document at hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/. You can create that doc and add it to the patch together with addressing my review in the last comment. Tx again for working on this, it's almost there.. > Implement secure Windows Container Executor > --- > > Key: YARN-1972 > URL: https://issues.apache.org/jira/browse/YARN-1972 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1972.1.patch, YARN-1972.2.patch > > > h1. Windows Secure Container Executor (WCE) > YARN-1063 adds the necessary infrastructure to launch a process as a domain > user as a solution for the problem of having a security boundary between > processes executed in YARN containers and the Hadoop services. The WCE is a > container executor that leverages the winutils capabilities introduced in > YARN-1063 and launches containers as an OS process running as the job > submitter user. A description of the S4U infrastructure used by YARN-1063 and the > alternatives considered can be read on that JIRA. > The WCE is based on the DefaultContainerExecutor. It relies on the DCE to > drive the flow of execution, but it overrides some methods to the effect of: > * changes the DCE-created user cache directories to be owned by the job user > and by the nodemanager group. > * changes the actual container run command to use the 'createAsUser' command > of the winutils task instead of 'create' > * runs the localization as a standalone process instead of an in-process Java > method call. This in turn relies on the winutils createAsUser feature to run > the localization as the job user. 
> > When compared to LinuxContainerExecutor (LCE), the WCE has some minor > differences: > * it does not delegate the creation of the user cache directories to the > native implementation. > * it does not require special handling to be able to delete user files > The approach to the WCE came from practical trial and error. I had > to iron out some issues around the Windows script shell limitations (command > line length) to get it to work, the biggest issue being the huge CLASSPATH > that is commonplace in Hadoop environment container executions. The job > container itself is already dealing with this via a so-called 'classpath > jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch > as a separate container the same issue had to be resolved, and I used the same > 'classpath jar' approach. > h2. Deployment Requirements > To use the WCE one needs to set > `yarn.nodemanager.container-executor.class` to > `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` > and set `yarn.nodemanager.windows-secure-container-executor.group` to a > Windows security group that the nodemanager service principal is a > member of (equivalent of the LCE > `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE > does not require any configuration outside of Hadoop's own yarn-site.xml. > For the WCE to work the nodemanager must run as a service principal that is a > member of the local Administrators group or LocalSystem. This is derived from > the need to invoke the LoadUserProfile API, which mentions these requirements in > its specification. This is in addition to the SE_TCB privilege mentioned in > YARN-1063, but this requirement automatically implies that the SE_TCB > privilege is held by the nodemanager. For the Linux speakers in the audience, > the requirement is basically to run the NM as root. > h2. 
Dedicated high privilege Service > Due to the high privilege required by the WCE we had discussed the need to > isolate the high privilege operations into a separate process, an 'executor' > service that is solely responsible for starting the containers (including the > localizer). The NM would have to authenticate, authorize and communicate with > this service via an IPC mechanism and use this service to launch the > containers. I still believe we'll end up deploying such a service, but the > effort to onboard such a new platform-specific service onto the project is > not trivial. -- This message was sent by Atlassian JIRA (v6.2#6252)
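For reference, the two settings named in the Deployment Requirements section above would appear in yarn-site.xml roughly as follows; the group value is a placeholder for a real Windows security group, not a suggested name:

```xml
<!-- Sketch of the WCE settings described in the issue text.
     "yarn-executors" is a placeholder for an actual Windows security group
     that the nodemanager service principal is a member of. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.windows-secure-container-executor.group</name>
  <value>yarn-executors</value>
</property>
```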
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034201#comment-14034201 ] Jian He commented on YARN-1365: --- bq. allocateresponse would also use exceptions instead of AM commands. right, please open a new jira for that. For my other comment "My point was we can do the same for both addApplication and addApplicationAttempt to not send dup events", I can open a new jira for this too. We can keep this patch minimal. > ApplicationMasterService to allow Register and Unregister of an app that was > running before restart > --- > > Key: YARN-1365 > URL: https://issues.apache.org/jira/browse/YARN-1365 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1365.001.patch, YARN-1365.002.patch, > YARN-1365.003.patch, YARN-1365.004.patch, YARN-1365.005.patch, > YARN-1365.005.patch, YARN-1365.initial.patch > > > For an application that was running before restart, the > ApplicationMasterService currently throws an exception when the app tries to > make the initial register or final unregister call. These should succeed and > the RMApp state machine should transition to completed like normal. > Unregistration should succeed for an app that the RM considers complete since > the RM may have died after saving completion in the store but before > notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers
[ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034186#comment-14034186 ] Jian He commented on YARN-1367: --- [~adhoot], mind updating the patch please? I'm happy to work on it if you are busy. > After restart NM should resync with the RM without killing containers > - > > Key: YARN-1367 > URL: https://issues.apache.org/jira/browse/YARN-1367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1367.prototype.patch > > > After RM restart, the RM sends a resync response to NMs that heartbeat to it. > Upon receiving the resync response, the NM kills all containers and > re-registers with the RM. The NM should be changed to not kill the container > and instead inform the RM about all currently running containers including > their allocations etc. After the re-register, the NM should send all pending > container completions to the RM as usual. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034179#comment-14034179 ] Remus Rusanu commented on YARN-1972: Thanks for the update Vinod. I have updated the item description to act as documentation. Do you think anything more is needed? > Implement secure Windows Container Executor > --- > > Key: YARN-1972 > URL: https://issues.apache.org/jira/browse/YARN-1972 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1972.1.patch, YARN-1972.2.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034160#comment-14034160 ] Vinod Kumar Vavilapalli commented on YARN-1972: --- bq. All in all a very high privilege required for NM. We are considering a future iteration in which we extract the privileged operations into a dedicated NT service (=daemon) and bestow the high privileges only to this service. Thanks. Let's document this in a Windows-specific docs page. bq. You are launching so many commands for every container - to chown files, to copy files etc. bq. We'll measure. [..] I don't think that moving the localization into native code would result in much benefit over a proper Java implementation. I'd file an investigation ticket to track this. bq. DCE and WCE no longer create user file cache, this is done solely by the localizer initDirs. DCE Test modified to reflect this. DCE.createUserCacheDirs renamed to createUserAppCacheDirs accordingly The division of responsibility between launching multiple commands before starting the localizer and the stuff that happens inside the localizer: Unfortunately, this still isn't ideal. Having userCache created by the ContainerExecutor but not file-cache is asymmetric and confusing. I propose that we split this refactoring into a separate JIRA and stick to your original code. Apologies for the back-and-forth on this one. bq. There is more feedback to address (DRY between LCE and WCE localization launch, proper place for localization classpath jar). So, you will work on them here itself, right? Looks fine otherwise, except for the above comments and a request for some basic documentation. 
> Implement secure Windows Container Executor > --- > > Key: YARN-1972 > URL: https://issues.apache.org/jira/browse/YARN-1972 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1972.1.patch, YARN-1972.2.patch > > > h1. Windows Secure Container Executor (WCE) > YARN-1063 adds the necessary infrasturcture to launch a process as a domain > user as a solution for the problem of having a security boundary between > processes executed in YARN containers and the Hadoop services. The WCE is a > container executor that leverages the winutils capabilities introduced in > YARN-1063 and launches containers as an OS process running as the job > submitter user. A description of the S4U infrastructure used by YARN-1063 > alternatives considered can be read on that JIRA. > The WCE is based on the DefaultContainerExecutor. It relies on the DCE to > drive the flow of execution, but it overwrrides some emthods to the effect of: > * change the DCE created user cache directories to be owned by the job user > and by the nodemanager group. > * changes the actual container run command to use the 'createAsUser' command > of winutils task instead of 'create' > * runs the localization as standalone process instead of an in-process Java > method call. This in turn relies on the winutil createAsUser feature to run > the localization as the job user. > > When compared to LinuxContainerExecutor (LCE), the WCE has some minor > differences: > * it does no delegate the creation of the user cache directories to the > native implementation. > * it does no require special handling to be able to delete user files > The approach on the WCE came from a practical trial-and-error approach. 
I had > to iron out some issues around the Windows script shell limitations (command > line length) to get it to work, the biggest issue being the huge CLASSPATH > that is commonplace in Hadoop container executions. The job > container itself already deals with this via a so-called 'classpath > jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch > as a separate container the same issue had to be resolved, and I used the same > 'classpath jar' approach. > h2. Deployment Requirements > To use the WCE one needs to set > `yarn.nodemanager.container-executor.class` to > `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` > and set `yarn.nodemanager.windows-secure-container-executor.group` to a > Windows security group name that the nodemanager service principal is a > member of (the equivalent of the LCE > `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE > does not require any configuration outside of Hadoop's own yarn-site.xml. > For the WCE to work the nodemanager must run as a service principal that is a > member of the local Administrators group or LocalSystem. This is derived from > the need to invoke the LoadUserProfile API, which mentions these re
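The deployment requirements above boil down to two yarn-site.xml properties. A minimal sketch, using the property names and executor class quoted in the description; the group name `hadoop-nodes` is a placeholder for whatever Windows security group the nodemanager service principal actually belongs to:

```xml
<!-- Minimal yarn-site.xml fragment for the WCE, per the requirements
     above; "hadoop-nodes" is a placeholder group name. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.windows-secure-container-executor.group</name>
  <value>hadoop-nodes</value>
</property>
```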
[jira] [Commented] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034154#comment-14034154 ] Hadoop QA commented on YARN-2083: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650834/YARN-2083-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSQueue {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4015//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4015//console This message is automatically generated. 
> In fair scheduler, Queue should not been assigned more containers when its > usedResource had reach the maxResource limit > --- > > Key: YARN-2083 > URL: https://issues.apache.org/jira/browse/YARN-2083 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.3.0 >Reporter: Yi Tian > Labels: assignContainer, fair, scheduler > Fix For: 2.4.1 > > Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083.patch > > > In fair scheduler, FSParentQueue and FSLeafQueue do an > assignContainerPreCheck to guarantee this queue is not over its limit. > But the fitsIn function in Resource.java does not return false when > usedResource equals maxResource. > I think we should create a new function "fitsInWithoutEqual" instead of > "fitsIn" in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
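The fitsIn vs. fitsInWithoutEqual distinction this issue describes can be sketched with a simplified resource model; the plain memory/vcores class below is a stand-in, not Hadoop's actual Resource/Resources utilities:

```java
// Sketch of the strict-fit check proposed in YARN-2083, using a
// simplified Resource model (memory + vcores) rather than the real
// Hadoop classes.
public class FitsInSketch {
    static final class Resource {
        final long memory; final int vcores;
        Resource(long memory, int vcores) { this.memory = memory; this.vcores = vcores; }
    }

    // Existing semantics: equal usage still "fits", so a queue whose
    // usedResource has reached maxResource still passes the pre-check.
    static boolean fitsIn(Resource smaller, Resource bigger) {
        return smaller.memory <= bigger.memory && smaller.vcores <= bigger.vcores;
    }

    // Proposed strict variant: a queue at its limit no longer passes.
    static boolean fitsInWithoutEqual(Resource smaller, Resource bigger) {
        return smaller.memory < bigger.memory && smaller.vcores < bigger.vcores;
    }

    public static void main(String[] args) {
        Resource used = new Resource(8192, 8);
        Resource max  = new Resource(8192, 8);
        System.out.println(fitsIn(used, max));             // true: old check still admits containers
        System.out.println(fitsInWithoutEqual(used, max)); // false: queue is at its limit
    }
}
```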
[jira] [Updated] (YARN-365) Each NM heartbeat should not generate an event for the Scheduler
[ https://issues.apache.org/jira/browse/YARN-365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-365: Attachment: YARN-365.branch-0.23.patch Patch for branch-0.23. RM unit tests pass, and I manually tested it as well on a single-node cluster forcing the scheduler to run slower than the heartbeat interval. > Each NM heartbeat should not generate an event for the Scheduler > > > Key: YARN-365 > URL: https://issues.apache.org/jira/browse/YARN-365 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Affects Versions: 0.23.5 >Reporter: Siddharth Seth >Assignee: Xuan Gong > Fix For: 2.1.0-beta > > Attachments: Prototype2.txt, Prototype3.txt, YARN-365.1.patch, > YARN-365.10.patch, YARN-365.2.patch, YARN-365.3.patch, YARN-365.4.patch, > YARN-365.5.patch, YARN-365.6.patch, YARN-365.7.patch, YARN-365.8.patch, > YARN-365.9.patch, YARN-365.branch-0.23.patch > > > Follow up from YARN-275 > https://issues.apache.org/jira/secure/attachment/12567075/Prototype.txt -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034060#comment-14034060 ] Vinod Kumar Vavilapalli commented on YARN-2171: --- The code changes look fine enough to me. The test is not so useful beyond validating this ticket, but that's okay. I see that we don't have any test validating the number of nodes itself explicitly; shall we add that here? > AMs block on the CapacityScheduler lock during allocate() > - > > Key: YARN-2171 > URL: https://issues.apache.org/jira/browse/YARN-2171 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2171.patch > > > When AMs heartbeat into the RM via the allocate() call they are blocking on > the CapacityScheduler lock when trying to get the number of nodes in the > cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-868) YarnClient should set the service address in tokens returned by getRMDelegationToken()
[ https://issues.apache.org/jira/browse/YARN-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-868: - Target Version/s: 2.5.0 (was: 2.1.0-beta) > YarnClient should set the service address in tokens returned by > getRMDelegationToken() > -- > > Key: YARN-868 > URL: https://issues.apache.org/jira/browse/YARN-868 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Hitesh Shah > > Either the client should set this information into the token or the client > layer should expose an api that returns the service address. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated YARN-2083: -- Attachment: YARN-2083-2.patch Moved test code to TestFSQueue.java > In fair scheduler, Queue should not been assigned more containers when its > usedResource had reach the maxResource limit > --- > > Key: YARN-2083 > URL: https://issues.apache.org/jira/browse/YARN-2083 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.3.0 >Reporter: Yi Tian > Labels: assignContainer, fair, scheduler > Fix For: 2.4.1 > > Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083.patch > > > In fair scheduler, FSParentQueue and FSLeafQueue do an > assignContainerPreCheck to guarantee this queue is not over its limit. > But the fitsIn function in Resource.java does not return false when > usedResource equals maxResource. > I think we should create a new function "fitsInWithoutEqual" instead of > "fitsIn" in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034034#comment-14034034 ] Hadoop QA commented on YARN-2171: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650819/YARN-2171.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4014//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4014//console This message is automatically generated. 
> AMs block on the CapacityScheduler lock during allocate() > - > > Key: YARN-2171 > URL: https://issues.apache.org/jira/browse/YARN-2171 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2171.patch > > > When AMs heartbeat into the RM via the allocate() call they are blocking on > the CapacityScheduler lock when trying to get the number of nodes in the > cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2102) More generalized timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2102: -- Description: We need to differentiate the access controls of reading and writing operations, and we need to think about cross-entity access control. For example, if we are executing a workflow of MR jobs, which writes the timeline data of this workflow, we don't want other users to pollute the timeline data of the workflow by putting something under it. (was: Like ApplicationACLsManager, we should also allow configured user/group to access the timeline data.) > More generalized timeline ACLs > -- > > Key: YARN-2102 > URL: https://issues.apache.org/jira/browse/YARN-2102 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > We need to differentiate the access controls of reading and writing > operations, and we need to think about cross-entity access control. For > example, if we are executing a workflow of MR jobs, which writes the > timeline data of this workflow, we don't want other users to pollute the > timeline data of the workflow by putting something under it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2102) More generalized timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2102: -- Summary: More generalized timeline ACLs (was: Extend access control for configured user/group list) > More generalized timeline ACLs > -- > > Key: YARN-2102 > URL: https://issues.apache.org/jira/browse/YARN-2102 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > Like ApplicationACLsManager, we should also allow configured user/group to > access the timeline data. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034021#comment-14034021 ] Junping Du commented on YARN-1341: -- [~jlowe], Thanks for the patch here. I am currently reviewing it, and it looks like some of the code, such as LeveldbIterator and NMStateStoreService, has already been committed in other patches. Would you resync the patch here against trunk? Thanks! > Recover NMTokens upon nodemanager restart > - > > Key: YARN-1341 > URL: https://issues.apache.org/jira/browse/YARN-1341 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, > YARN-1341v4-and-YARN-1987.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2174) Enabling HTTPs for the writer REST API
[ https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2174: -- Description: Since we'd like to allow the application to put the timeline data at the client, the AM and even the containers, we need to provide a way to distribute the keystore. > Enabling HTTPs for the writer REST API > -- > > Key: YARN-2174 > URL: https://issues.apache.org/jira/browse/YARN-2174 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > Since we'd like to allow the application to put the timeline data at the > client, the AM and even the containers, we need to provide a way to > distribute the keystore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2174) Enabling HTTPs for the writer REST API
[ https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-2174: - Assignee: Zhijie Shen > Enabling HTTPs for the writer REST API > -- > > Key: YARN-2174 > URL: https://issues.apache.org/jira/browse/YARN-2174 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2162) Fair Scheduler :ability to optionally configure minResources and maxResources in terms of percentage
[ https://issues.apache.org/jira/browse/YARN-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034019#comment-14034019 ] Ashwin Shankar commented on YARN-2162: -- [~maysamyabandeh], yes that was the intention. Changed title and description to make it clear. > Fair Scheduler :ability to optionally configure minResources and maxResources > in terms of percentage > > > Key: YARN-2162 > URL: https://issues.apache.org/jira/browse/YARN-2162 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar > Labels: scheduler > > minResources and maxResources in fair scheduler configs are expressed in > terms of absolute numbers X mb, Y vcores. > As a result, when we expand or shrink our hadoop cluster, we need to > recalculate and change minResources/maxResources accordingly, which is pretty > inconvenient. > We can circumvent this problem if we can optionally configure these > properties in terms of percentage of cluster capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2174) Enabling HTTPs for the writer REST API
Zhijie Shen created YARN-2174: - Summary: Enabling HTTPs for the writer REST API Key: YARN-2174 URL: https://issues.apache.org/jira/browse/YARN-2174 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2162) Fair Scheduler :ability to optionally configure minResources and maxResources in terms of percentage
[ https://issues.apache.org/jira/browse/YARN-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2162: - Summary: Fair Scheduler :ability to optionally configure minResources and maxResources in terms of percentage (was: Fair Scheduler :ability to configure minResources and maxResources in terms of percentage) > Fair Scheduler :ability to optionally configure minResources and maxResources > in terms of percentage > > > Key: YARN-2162 > URL: https://issues.apache.org/jira/browse/YARN-2162 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar > Labels: scheduler > > minResources and maxResources in fair scheduler configs are expressed in > terms of absolute numbers X mb, Y vcores. > As a result, when we expand or shrink our hadoop cluster, we need to > recalculate and change minResources/maxResources accordingly, which is pretty > inconvenient. > We can circumvent this problem if we can optionally configure these > properties in terms of percentage of cluster capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2173) Enabling HTTPS for the reader REST APIs
Zhijie Shen created YARN-2173: - Summary: Enabling HTTPS for the reader REST APIs Key: YARN-2173 URL: https://issues.apache.org/jira/browse/YARN-2173 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2162) Fair Scheduler :ability to configure minResources and maxResources in terms of percentage
[ https://issues.apache.org/jira/browse/YARN-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2162: - Description: minResources and maxResources in fair scheduler configs are expressed in terms of absolute numbers X mb, Y vcores. As a result, when we expand or shrink our hadoop cluster, we need to recalculate and change minResources/maxResources accordingly, which is pretty inconvenient. We can circumvent this problem if we can optionally configure these properties in terms of percentage of cluster capacity. was: minResources and maxResources in fair scheduler configs are expressed in terms of absolute numbers X mb, Y vcores. As a result, when we expand or shrink our hadoop cluster, we need to recalculate and change minResources/maxResources accordingly, which is pretty inconvenient. We can circumvent this problem if we can (optionally) configure these properties in terms of percentage of cluster capacity. > Fair Scheduler :ability to configure minResources and maxResources in terms > of percentage > - > > Key: YARN-2162 > URL: https://issues.apache.org/jira/browse/YARN-2162 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar > Labels: scheduler > > minResources and maxResources in fair scheduler configs are expressed in > terms of absolute numbers X mb, Y vcores. > As a result, when we expand or shrink our hadoop cluster, we need to > recalculate and change minResources/maxResources accordingly, which is pretty > inconvenient. > We can circumvent this problem if we can optionally configure these > properties in terms of percentage of cluster capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-409) Allow apps to be killed via the RM REST API
[ https://issues.apache.org/jira/browse/YARN-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved YARN-409. - Resolution: Duplicate > Allow apps to be killed via the RM REST API > --- > > Key: YARN-409 > URL: https://issues.apache.org/jira/browse/YARN-409 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > The RM REST API currently allows getting information about running > applications. Adding the capability to kill applications would allow systems > like Hue to perform their functions over HTTP. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-409) Allow apps to be killed via the RM REST API
[ https://issues.apache.org/jira/browse/YARN-409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033950#comment-14033950 ] Sandy Ryza commented on YARN-409: - definitely. will close this because there seems to be more activity there. > Allow apps to be killed via the RM REST API > --- > > Key: YARN-409 > URL: https://issues.apache.org/jira/browse/YARN-409 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > The RM REST API currently allows getting information about running > applications. Adding the capability to kill applications would allow systems > like Hue to perform their functions over HTTP. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2171: - Attachment: YARN-2171.patch Patch to use AtomicInteger for the number of nodes so we can avoid grabbing the lock to access the value. I also added a unit test to verify allocate doesn't try to grab the capacity scheduler lock. > AMs block on the CapacityScheduler lock during allocate() > - > > Key: YARN-2171 > URL: https://issues.apache.org/jira/browse/YARN-2171 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2171.patch > > > When AMs heartbeat into the RM via the allocate() call they are blocking on > the CapacityScheduler lock when trying to get the number of nodes in the > cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
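The locking fix described in this patch, an atomically maintained node count so allocate() never needs the scheduler lock, can be sketched as follows; the class and method names below are illustrative, not the actual CapacityScheduler code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of tracking the cluster node count in an AtomicInteger so that
// read-only callers (the AM allocate() path) never take the scheduler
// lock just to read the value.
public class NodeCountSketch {
    private final AtomicInteger numNodes = new AtomicInteger();

    // Called on the (already synchronized) node add/remove paths.
    public void addNode()    { numNodes.incrementAndGet(); }
    public void removeNode() { numNodes.decrementAndGet(); }

    // Hot path: a plain atomic read, no lock acquisition.
    public int getNumClusterNodes() { return numNodes.get(); }

    public static void main(String[] args) {
        NodeCountSketch s = new NodeCountSketch();
        s.addNode(); s.addNode(); s.removeNode();
        System.out.println(s.getNumClusterNodes()); // 1
    }
}
```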
[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033909#comment-14033909 ] Hudson commented on YARN-1339: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/]) YARN-1339. Recover DeletionService state upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603036) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DeletionService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDeletionService.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java > Recover DeletionService state upon nodemanager restart > -- > > Key: YARN-1339 > URL: https://issues.apache.org/jira/browse/YARN-1339 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: 2.5.0 > > Attachments: YARN-1339.patch, YARN-1339v2.patch, > YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch, YARN-1339v5.patch, > YARN-1339v6.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2159) Better logging in SchedulerNode#allocateContainer
[ https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033905#comment-14033905 ] Hudson commented on YARN-2159: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/]) YARN-2159. Better logging in SchedulerNode#allocateContainer. (Ray Chiang via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603003) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java > Better logging in SchedulerNode#allocateContainer > - > > Key: YARN-2159 > URL: https://issues.apache.org/jira/browse/YARN-2159 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Ray Chiang >Assignee: Ray Chiang >Priority: Trivial > Labels: newbie, supportability > Fix For: 2.5.0 > > Attachments: YARN2159-01.patch > > > This bit of code: > {quote} > LOG.info("Assigned container " + container.getId() + " of capacity " > + container.getResource() + " on host " + rmNode.getNodeAddress() > + ", which currently has " + numContainers + " containers, " > + getUsedResource() + " used and " + getAvailableResource() > + " available"); > {quote} > results in a line like: > {quote} > 2014-05-30 16:17:43,573 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: > Assigned container container_14000_0009_01_00 of capacity > on host machine.host.domain.com:8041, which currently > has 18 containers, used and > available > {quote} > That message is fine in most cases, but looks pretty bad after the last > available allocation, since it says something like "vCores:0 available". 
> Here is one suggested phrasing > - "which has 18 containers, used and > available after allocation" -- This message was sent by Atlassian JIRA (v6.2#6252)
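The suggested rephrasing can be illustrated with a plain string builder in place of the scheduler's actual log call; all values below are made up:

```java
// Illustrative rendering of the suggested log phrasing ("... available
// after allocation"), with plain parameters standing in for the
// scheduler's fields.
public class AllocationLogSketch {
    static String assignedMessage(String containerId, String capacity, String host,
                                  int numContainers, String used, String available) {
        return "Assigned container " + containerId + " of capacity " + capacity
            + " on host " + host + ", which has " + numContainers + " containers, "
            + used + " used and " + available + " available after allocation";
    }

    public static void main(String[] args) {
        System.out.println(assignedMessage("container_x", "<memory:1024, vCores:1>",
            "machine.host.domain.com:8041", 18,
            "<memory:18432, vCores:18>", "<memory:0, vCores:0>"));
    }
}
```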
[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033911#comment-14033911 ] Hudson commented on YARN-1885: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/]) YARN-1885. Fixed a bug that RM may not send application-clean-up signal to NMs where the completed applications previously ran in case of RM restart. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603028) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestResourceTrackerOnHA.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestProtocolRecords.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestRegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRunningOnNodeEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/event/RMAppAttemptContainerAcquiredEvent.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java * /hadoop/common/trunk/hadoop-yarn-project/hadoo
[jira] [Commented] (YARN-2167) LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block
[ https://issues.apache.org/jira/browse/YARN-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033906#comment-14033906 ] Hudson commented on YARN-2167: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/]) YARN-2167. LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block. Contributed by Junping Du (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603039) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java > LeveldbIterator should get closed in > NMLeveldbStateStoreService#loadLocalizationState() within finally block > > > Key: YARN-2167 > URL: https://issues.apache.org/jira/browse/YARN-2167 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du > Fix For: 3.0.0, 2.5.0 > > Attachments: YARN-2167.patch > > > In NMLeveldbStateStoreService#loadLocalizationState(), we use a > LeveldbIterator to read the NM's localization state, but it is not closed in a > finally block. We should close this connection to the DB as a common practice. -- This message was sent by Atlassian JIRA (v6.2#6252)
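The fix described above is the classic close-in-finally pattern: an iterator backed by a native resource must be released on every exit path, including when the scan throws mid-way. A minimal sketch, where StubIterator is a hypothetical stand-in for LeveldbIterator (not the actual Hadoop class):

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class FinallyCloseSketch {
    // Hypothetical stand-in for a leveldb-backed iterator holding a native handle.
    static class StubIterator implements Closeable, Iterator<String> {
        private final Iterator<String> delegate;
        boolean closed = false;
        StubIterator(List<String> data) { this.delegate = data.iterator(); }
        public boolean hasNext() { return delegate.hasNext(); }
        public String next() { return delegate.next(); }
        public void close() { closed = true; }
    }

    // Returns true if the iterator was closed even though the scan failed partway.
    static boolean scanClosesIterator(List<String> keys) {
        StubIterator iter = new StubIterator(keys);
        try {
            while (iter.hasNext()) {
                String key = iter.next();
                if (key.startsWith("bad")) {
                    // Simulates a corrupt entry aborting the scan.
                    throw new IOException("corrupt entry: " + key);
                }
            }
        } catch (IOException e) {
            // Swallowed for the sketch; real code would rethrow after cleanup.
        } finally {
            iter.close();  // runs on every exit path, so the handle never leaks
        }
        return iter.closed;
    }

    public static void main(String[] args) {
        System.out.println(scanClosesIterator(Arrays.asList("ok1", "bad2"))); // prints "true"
    }
}
```

Without the finally block, the IOException would skip the close() call and leak the underlying DB handle, which is exactly what the patch guards against.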
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it is working in a rather solid way. was: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. 
> Suspend/Resume Hadoop Jobs > -- > > Key: YARN-2172 > URL: https://issues.apache.org/jira/browse/YARN-2172 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager, webapp >Affects Versions: 2.2.0 > Environment: CentOS 6.5, Hadoop 2.2.0 >Reporter: Richard Chen > Labels: hadoop, jobs, resume, suspend > Fix For: 2.2.0 > > Original Estimate: 336h > Remaining Estimate: 336h > > In a multi-application cluster environment, jobs running inside Hadoop YARN > may be of lower-priority than jobs running outside Hadoop YARN like HBase. To > give way to other higher-priority jobs inside Hadoop, a user or some > cluster-level resource scheduling service should be able to suspend and/or > resume some particular jobs within Hadoop YARN. > When target jobs inside Hadoop are suspended, those already allocated and > running task containers will continue to run until their completion or active > preemption by other ways. But no more new containers would be allocated to > the target jobs. In contrast, when suspended jobs are put into resume mode, > they will continue to run from the previous job progress and have new task > containers allocated to complete the rest of the jobs. > My team has completed its implementation and our tests showed it is working > in a rather solid way. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it works in a rather solid way. was: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it is working in a rather solid way. 
> Suspend/Resume Hadoop Jobs > -- > > Key: YARN-2172 > URL: https://issues.apache.org/jira/browse/YARN-2172 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager, webapp >Affects Versions: 2.2.0 > Environment: CentOS 6.5, Hadoop 2.2.0 >Reporter: Richard Chen > Labels: hadoop, jobs, resume, suspend > Fix For: 2.2.0 > > Original Estimate: 336h > Remaining Estimate: 336h > > In a multi-application cluster environment, jobs running inside Hadoop YARN > may be of lower-priority than jobs running outside Hadoop YARN like HBase. To > give way to other higher-priority jobs inside Hadoop, a user or some > cluster-level resource scheduling service should be able to suspend and/or > resume some particular jobs within Hadoop YARN. > When target jobs inside Hadoop are suspended, those already allocated and > running task containers will continue to run until their completion or active > preemption by other ways. But no more new containers would be allocated to > the target jobs. In contrast, when suspended jobs are put into resume mode, > they will continue to run from the previous job progress and have new task > containers allocated to complete the rest of the jobs. > My team has completed its implementation and our tests showed it works in a > rather solid way. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop application. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. was: In a multi-application cluster environment, jobs running inside Hadoop application may be of lower-priority than jobs running inside other applications like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop application. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. 
> Suspend/Resume Hadoop Jobs > -- > > Key: YARN-2172 > URL: https://issues.apache.org/jira/browse/YARN-2172 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager, webapp >Affects Versions: 2.2.0 > Environment: CentOS 6.5, Hadoop 2.2.0 >Reporter: Richard Chen > Labels: hadoop, jobs, resume, suspend > Fix For: 2.2.0 > > Original Estimate: 336h > Remaining Estimate: 336h > > In a multi-application cluster environment, jobs running inside Hadoop YARN > may be of lower-priority than jobs running outside Hadoop YARN like HBase. To > give way to other higher-priority jobs inside Hadoop, a user or some > cluster-level resource scheduling service should be able to suspend and/or > resume some particular jobs within Hadoop application. > When target jobs inside Hadoop are suspended, those already allocated and > running task containers will continue to run until their completion or active > preemption by other ways. But no more new containers would be allocated to > the target jobs. In contrast, when suspended jobs are put into resume mode, > they will continue to run from the previous job progress and have new task > containers allocated to complete the rest of the jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. was: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop application. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. 
> Suspend/Resume Hadoop Jobs > -- > > Key: YARN-2172 > URL: https://issues.apache.org/jira/browse/YARN-2172 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager, webapp >Affects Versions: 2.2.0 > Environment: CentOS 6.5, Hadoop 2.2.0 >Reporter: Richard Chen > Labels: hadoop, jobs, resume, suspend > Fix For: 2.2.0 > > Original Estimate: 336h > Remaining Estimate: 336h > > In a multi-application cluster environment, jobs running inside Hadoop YARN > may be of lower-priority than jobs running outside Hadoop YARN like HBase. To > give way to other higher-priority jobs inside Hadoop, a user or some > cluster-level resource scheduling service should be able to suspend and/or > resume some particular jobs within Hadoop YARN. > When target jobs inside Hadoop are suspended, those already allocated and > running task containers will continue to run until their completion or active > preemption by other ways. But no more new containers would be allocated to > the target jobs. In contrast, when suspended jobs are put into resume mode, > they will continue to run from the previous job progress and have new task > containers allocated to complete the rest of the jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2172) Suspend/Resume Hadoop Jobs
Richard Chen created YARN-2172: -- Summary: Suspend/Resume Hadoop Jobs Key: YARN-2172 URL: https://issues.apache.org/jira/browse/YARN-2172 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Affects Versions: 2.2.0 Environment: CentOS 6.5, Hadoop 2.2.0 Reporter: Richard Chen Fix For: 2.2.0 In a multi-application cluster environment, jobs running inside Hadoop application may be of lower-priority than jobs running inside other applications like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop application. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-409) Allow apps to be killed via the RM REST API
[ https://issues.apache.org/jira/browse/YARN-409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033874#comment-14033874 ] Romain Rigaux commented on YARN-409: dup of https://issues.apache.org/jira/browse/YARN-1702? > Allow apps to be killed via the RM REST API > --- > > Key: YARN-409 > URL: https://issues.apache.org/jira/browse/YARN-409 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > The RM REST API currently allows getting information about running > applications. Adding the capability to kill applications would allow systems > like Hue to perform their functions over HTTP. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033864#comment-14033864 ] Jason Lowe commented on YARN-2171: -- When the CapacityScheduler scheduler thread is running full-time due to a constant stream of events (e.g.: large number of running applications with a large number of cluster nodes) then the CapacityScheduler lock is held by that scheduler loop most of the time. As AMs heartbeat into the RM to try to get their resources, the capacity scheduler code goes out of its way to try to avoid having the AMs grab the scheduler lock. Unfortunately this call was missed, and it takes the lock just to read a single integer value. Therefore the AMs end up piling up on the scheduler lock, filling all of the IPC handlers of the ApplicationMasterService, and the rest back up on the call queue. Once the scheduler releases the lock it will quickly try to grab it again, so only a few AMs end up getting through the "gate" and the IPC handlers fill again with the next batch of AMs blocking on the scheduler lock. This causes the average RPC response times to skyrocket for AMs. AMs experience large delays getting their allocations, which in turn leads to lower cluster utilization and increased application runtimes. > AMs block on the CapacityScheduler lock during allocate() > - > > Key: YARN-2171 > URL: https://issues.apache.org/jira/browse/YARN-2171 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > > When AMs heartbeat into the RM via the allocate() call they are blocking on > the CapacityScheduler lock when trying to get the number of nodes in the > cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
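The contention pattern described above can be sketched as follows: if a frequently polled value (here, the cluster node count) is only readable under the same lock the scheduler loop holds, every AM heartbeat queues behind the scheduler. Publishing the value as a volatile field lets readers skip the lock entirely. This is a minimal sketch of the general technique, with illustrative names, not the actual CapacityScheduler fields or the committed fix:

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockFreeReadSketch {
    private final ReentrantLock schedulerLock = new ReentrantLock();
    // volatile guarantees readers see the latest write without locking
    private volatile int numClusterNodes = 0;

    // Mutations still happen under the scheduler lock, as part of the
    // larger scheduling state update.
    public void addNode() {
        schedulerLock.lock();
        try {
            numClusterNodes++;
        } finally {
            schedulerLock.unlock();
        }
    }

    // Heartbeat path: a plain volatile read, no lock acquisition,
    // so AMs never block behind the scheduler loop for this value.
    public int getNumClusterNodes() {
        return numClusterNodes;
    }

    public static void main(String[] args) {
        LockFreeReadSketch s = new LockFreeReadSketch();
        s.addNode();
        s.addNode();
        System.out.println(s.getNumClusterNodes()); // prints "2"
    }
}
```

The trade-off is that the read may be momentarily stale relative to in-flight scheduling, which is acceptable for a monotonic count like cluster size.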
[jira] [Created] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
Jason Lowe created YARN-2171: Summary: AMs block on the CapacityScheduler lock during allocate() Key: YARN-2171 URL: https://issues.apache.org/jira/browse/YARN-2171 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.4.0, 0.23.10 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical When AMs heartbeat into the RM via the allocate() call they are blocking on the CapacityScheduler lock when trying to get the number of nodes in the cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033831#comment-14033831 ] Hudson commented on YARN-1885: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/]) YARN-1885. Fixed a bug that RM may not send application-clean-up signal to NMs where the completed applications previously ran in case of RM restart. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603028) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestResourceTrackerOnHA.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestProtocolRecords.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestRegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRunningOnNodeEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/event/RMAppAttemptContainerAcquiredEvent.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/had
[jira] [Commented] (YARN-2159) Better logging in SchedulerNode#allocateContainer
[ https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033825#comment-14033825 ] Hudson commented on YARN-2159: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/]) YARN-2159. Better logging in SchedulerNode#allocateContainer. (Ray Chiang via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603003) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java > Better logging in SchedulerNode#allocateContainer > - > > Key: YARN-2159 > URL: https://issues.apache.org/jira/browse/YARN-2159 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Ray Chiang >Assignee: Ray Chiang >Priority: Trivial > Labels: newbie, supportability > Fix For: 2.5.0 > > Attachments: YARN2159-01.patch > > > This bit of code: > {quote} > LOG.info("Assigned container " + container.getId() + " of capacity " > + container.getResource() + " on host " + rmNode.getNodeAddress() > + ", which currently has " + numContainers + " containers, " > + getUsedResource() + " used and " + getAvailableResource() > + " available"); > {quote} > results in a line like: > {quote} > 2014-05-30 16:17:43,573 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: > Assigned container container_14000_0009_01_00 of capacity > on host machine.host.domain.com:8041, which currently > has 18 containers, used and > available > {quote} > That message is fine in most cases, but looks pretty bad after the last > available allocation, since it says something like "vCores:0 available". 
> Here is one suggested phrasing > - "which has 18 containers, used and > available after allocation" -- This message was sent by Atlassian JIRA (v6.2#6252)
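The suggested rephrasing above can be sketched as a message builder: reporting the node's state "after allocation" makes an exhausted resource ("vCores:0 available") read naturally. The method name and arguments here are illustrative, not the actual SchedulerNode API:

```java
public class AllocationLogSketch {
    // Builds the log line in the phrasing suggested in the JIRA comment.
    static String buildAllocationLog(String containerId, String host,
            int numContainers, String used, String available) {
        return "Assigned container " + containerId + " on host " + host
            + ", which has " + numContainers + " containers, " + used
            + " used and " + available + " available after allocation";
    }

    public static void main(String[] args) {
        // Hypothetical values for illustration only.
        System.out.println(buildAllocationLog("container_1", "host.example.com:8041",
            18, "<memory:18432, vCores:18>", "<memory:0, vCores:0>"));
    }
}
```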
[jira] [Commented] (YARN-2167) LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block
[ https://issues.apache.org/jira/browse/YARN-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033826#comment-14033826 ] Hudson commented on YARN-2167: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/]) YARN-2167. LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block. Contributed by Junping Du (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603039) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java > LeveldbIterator should get closed in > NMLeveldbStateStoreService#loadLocalizationState() within finally block > > > Key: YARN-2167 > URL: https://issues.apache.org/jira/browse/YARN-2167 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du > Fix For: 3.0.0, 2.5.0 > > Attachments: YARN-2167.patch > > > In NMLeveldbStateStoreService#loadLocalizationState(), we have > LeveldbIterator to read NM's localization state but it is not get closed in > finally block. We should close this connection to DB as a common practice. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033829#comment-14033829 ] Hudson commented on YARN-1339: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/]) YARN-1339. Recover DeletionService state upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603036) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DeletionService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDeletionService.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java > Recover DeletionService state upon nodemanager restart > -- > > Key: YARN-1339 > URL: https://issues.apache.org/jira/browse/YARN-1339 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: 2.5.0 > > Attachments: YARN-1339.patch, YARN-1339v2.patch, > YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch, YARN-1339v5.patch, > YARN-1339v6.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2170) Fix components' version information in the web page 'About the Cluster'
[ https://issues.apache.org/jira/browse/YARN-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2170: --- Attachment: YARN-2170.patch > Fix components' version information in the web page 'About the Cluster' > --- > > Key: YARN-2170 > URL: https://issues.apache.org/jira/browse/YARN-2170 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jun Gong >Priority: Minor > Attachments: YARN-2170.patch > > > In the web page 'About the Cluster', a YARN component's build version (e.g. > ResourceManager) is currently shown as the Hadoop version. This is caused by > mistakenly calling getVersion() instead of _getVersion() in VersionInfo.java. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2170) Fix components' version information in the web page 'About the Cluster'
Jun Gong created YARN-2170: -- Summary: Fix components' version information in the web page 'About the Cluster' Key: YARN-2170 URL: https://issues.apache.org/jira/browse/YARN-2170 Project: Hadoop YARN Issue Type: Bug Reporter: Jun Gong Priority: Minor In the web page 'About the Cluster', a YARN component's build version (e.g. ResourceManager) is currently shown as the Hadoop version. This is caused by mistakenly calling getVersion() instead of _getVersion() in VersionInfo.java. -- This message was sent by Atlassian JIRA (v6.2#6252)
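The bug class described above can be illustrated in miniature: a static getVersion() pinned to the common build versus an instance _getVersion() that resolves the component's own properties. Calling the static one from component code makes every component report the common version. This sketch uses invented version strings and a simplified VersionInfo; it is not the actual Hadoop VersionInfo implementation:

```java
public class VersionInfoSketch {
    static class VersionInfo {
        private final String component;
        VersionInfo(String component) { this.component = component; }

        // Instance lookup: resolves against this component's own info
        // (hard-coded here for illustration).
        protected String _getVersion() {
            return component.equals("yarn") ? "2.4.0-yarn-build" : "2.4.0";
        }

        // Static lookup: always resolves against the common Hadoop build,
        // regardless of which component page is being rendered.
        static String getVersion() {
            return new VersionInfo("common")._getVersion();
        }
    }

    public static void main(String[] args) {
        VersionInfo yarn = new VersionInfo("yarn");
        System.out.println(VersionInfo.getVersion()); // common version: wrong on a component page
        System.out.println(yarn._getVersion());       // component's own version: the intended call
    }
}
```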
[jira] [Updated] (YARN-2169) NMSimulator of sls should catch more Exception
[ https://issues.apache.org/jira/browse/YARN-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Beckham007 updated YARN-2169: - Attachment: YARN-2169.patch > NMSimulator of sls should catch more Exception > -- > > Key: YARN-2169 > URL: https://issues.apache.org/jira/browse/YARN-2169 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Beckham007 > Attachments: YARN-2169.patch > > > In the method middleStep() of NMSimulator, sending a heartbeat may cause an > InterruptedException or other Exception when the load is heavy. If these exceptions > are not handled, the NMSimulator task cannot be added back to the executor queue, > so the NM will be lost. > In my situation, the pool size is 4000, the NM count is 2000, and the AM count is 1500. Some > NMs are lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2169) NMSimulator of sls should catch more Exception
Beckham007 created YARN-2169: Summary: NMSimulator of sls should catch more Exception Key: YARN-2169 URL: https://issues.apache.org/jira/browse/YARN-2169 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Beckham007 In the method middleStep() of NMSimulator, sending a heartbeat may cause an InterruptedException or other Exception when the load is heavy. If these exceptions are not handled, the NMSimulator task cannot be added back to the executor queue, so the NM will be lost. In my situation, the pool size is 4000, the NM count is 2000, and the AM count is 1500. Some NMs are lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
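The failure mode described in YARN-2169 matches the documented behavior of ScheduledExecutorService#scheduleAtFixedRate: if any run of the task throws, all subsequent runs are suppressed. A minimal, self-contained sketch of the guard the report asks for (the names here are illustrative, not the actual SLS code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class GuardedHeartbeat {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService exec = Executors.newScheduledThreadPool(1);
        AtomicInteger beats = new AtomicInteger();

        // scheduleAtFixedRate cancels all future executions if one run
        // throws. Catching Exception inside the task body keeps the
        // periodic heartbeat alive, which is the fix the report asks
        // for in NMSimulator#middleStep().
        exec.scheduleAtFixedRate(() -> {
            try {
                int n = beats.incrementAndGet();
                if (n == 1) {
                    // simulate a heartbeat failure under heavy load
                    throw new RuntimeException("heartbeat failed");
                }
            } catch (Exception e) {
                // log and swallow so the task is rescheduled
            }
        }, 0, 10, TimeUnit.MILLISECONDS);

        Thread.sleep(150);
        exec.shutdownNow();
        // The task survived its first failure and kept running.
        System.out.println(beats.get() > 1);
    }
}
```

Without the try/catch, the first exception would silently kill the heartbeat task and the simulated NM would eventually be marked lost.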
[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033677#comment-14033677 ] Hudson commented on YARN-1339: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/586/]) YARN-1339. Recover DeletionService state upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603036) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DeletionService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDeletionService.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java > Recover DeletionService state upon nodemanager restart > -- > > Key: YARN-1339 > URL: https://issues.apache.org/jira/browse/YARN-1339 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: 2.5.0 > > Attachments: YARN-1339.patch, YARN-1339v2.patch, > YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch, YARN-1339v5.patch, > YARN-1339v6.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033679#comment-14033679 ] Hudson commented on YARN-1885: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/586/]) YARN-1885. Fixed a bug that RM may not send application-clean-up signal to NMs where the completed applications previously ran in case of RM restart. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603028) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestResourceTrackerOnHA.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestProtocolRecords.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestRegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRunningOnNodeEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/event/RMAppAttemptContainerAcquiredEvent.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoo
[jira] [Commented] (YARN-2159) Better logging in SchedulerNode#allocateContainer
[ https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033673#comment-14033673 ] Hudson commented on YARN-2159: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/586/]) YARN-2159. Better logging in SchedulerNode#allocateContainer. (Ray Chiang via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603003) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java > Better logging in SchedulerNode#allocateContainer > - > > Key: YARN-2159 > URL: https://issues.apache.org/jira/browse/YARN-2159 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Ray Chiang >Assignee: Ray Chiang >Priority: Trivial > Labels: newbie, supportability > Fix For: 2.5.0 > > Attachments: YARN2159-01.patch > > > This bit of code: > {quote} > LOG.info("Assigned container " + container.getId() + " of capacity " > + container.getResource() + " on host " + rmNode.getNodeAddress() > + ", which currently has " + numContainers + " containers, " > + getUsedResource() + " used and " + getAvailableResource() > + " available"); > {quote} > results in a line like: > {quote} > 2014-05-30 16:17:43,573 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: > Assigned container container_14000_0009_01_00 of capacity > on host machine.host.domain.com:8041, which currently > has 18 containers, used and > available > {quote} > That message is fine in most cases, but looks pretty bad after the last > available allocation, since it says something like "vCores:0 available". 
> Here is one suggested phrasing > - "which has 18 containers, used and > available after allocation" -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2167) LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block
[ https://issues.apache.org/jira/browse/YARN-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033674#comment-14033674 ] Hudson commented on YARN-2167: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/586/]) YARN-2167. LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block. Contributed by Junping Du (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603039) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java > LeveldbIterator should get closed in > NMLeveldbStateStoreService#loadLocalizationState() within finally block > > > Key: YARN-2167 > URL: https://issues.apache.org/jira/browse/YARN-2167 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du > Fix For: 3.0.0, 2.5.0 > > Attachments: YARN-2167.patch > > > In NMLeveldbStateStoreService#loadLocalizationState(), we have > LeveldbIterator to read NM's localization state but it is not get closed in > finally block. We should close this connection to DB as a common practice. -- This message was sent by Atlassian JIRA (v6.2#6252)
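The YARN-2167 fix is the standard close-in-finally idiom for DB cursors. A minimal sketch with a hypothetical stand-in class (FakeIterator is not the real LeveldbIterator API; it only illustrates the pattern):

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseInFinally {
    // Stand-in for LeveldbIterator: any Closeable cursor over DB entries.
    static class FakeIterator implements Closeable {
        boolean closed = false;
        private int pos = 0;
        boolean hasNext() { return pos < 3; }
        int next() { return pos++; }
        @Override public void close() { closed = true; }
    }

    // The pattern from the fix: iterate inside try, and guarantee
    // close() runs even if iteration throws mid-way.
    static int sumEntries(FakeIterator iter) throws IOException {
        try {
            int sum = 0;
            while (iter.hasNext()) {
                sum += iter.next();
            }
            return sum;
        } finally {
            iter.close();
        }
    }

    public static void main(String[] args) throws IOException {
        FakeIterator iter = new FakeIterator();
        int sum = sumEntries(iter);  // 0 + 1 + 2
        System.out.println(sum + " " + iter.closed);
    }
}
```

On Java 7+ the same guarantee can be had with try-with-resources, since LeveldbIterator would only need to implement Closeable.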
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033662#comment-14033662 ] Hadoop QA commented on YARN-2052: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650774/YARN-2052.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4013//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4013//console This message is automatically generated. 
> ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033654#comment-14033654 ] Tsuyoshi OZAWA commented on YARN-2088: -- LGTM (non-binding). > Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder > > > Key: YARN-2088 > URL: https://issues.apache.org/jira/browse/YARN-2088 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Binglin Chang >Assignee: Binglin Chang > Attachments: YARN-2088.v1.patch > > > Some fields (sets, lists) are added to proto builders multiple times; we need to > clear those fields before adding, otherwise the resulting proto contains extra > contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033640#comment-14033640 ] Tsuyoshi OZAWA commented on YARN-2052: -- {quote} BTW, I think we should update CheckpointAMPreemptionPolicy after this JIRA. {quote} It means that we should use {{compareTo}} instead of calculating the value directly. > ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2052: - Attachment: YARN-2052.3.patch [~jianhe], thank you for the comment. {code} Application itself may possibly use Container.getId to differentiate the containers, two containers allocated by two RMs may have the same id integer, then the application logic will break. will this be fine? {code} Good point. Added docs to {{ContainerId#getId}}. In addition, implemented {{compareTo}} and {{equals}} to distinguish containers. I think this alternative is acceptable. What do you think? BTW, I think we should update CheckpointAMPreemptionPolicy after this JIRA. {code} Collections.sort(listOfCont, new Comparator<Container>() { @Override public int compare(final Container o1, final Container o2) { return o2.getId().getId() - o1.getId().getId(); } }); {code} > ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
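The subtraction-based comparator quoted in the CheckpointAMPreemptionPolicy snippet above is exactly the pattern {{compareTo}} is meant to replace: subtracting two ints overflows when the ids are far apart, yielding the wrong sign. A minimal, self-contained illustration (plain ints standing in for container ids, not Hadoop code):

```java
public class ComparatorOverflow {
    public static void main(String[] args) {
        // For these two values the difference wraps around to a
        // positive number, so a subtraction-based comparator would
        // wrongly report a > b.
        int a = Integer.MIN_VALUE + 1;  // a very small id
        int b = 2;                      // a small positive id

        System.out.println(a - b > 0);                 // overflow: wrongly claims a > b
        System.out.println(Integer.compare(a, b) > 0); // correct: a < b
    }
}
```

Using {{Integer.compare}} (or delegating to the type's own {{compareTo}}) avoids the overflow entirely, which is why the comment suggests updating CheckpointAMPreemptionPolicy once ContainerId gains {{compareTo}}.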
[jira] [Commented] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033586#comment-14033586 ] Binglin Chang commented on YARN-2088: - Hi [~djp], could you help review this patch? I am working on YARN-2051, which depends on this code; without it the test fails. > Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder > > > Key: YARN-2088 > URL: https://issues.apache.org/jira/browse/YARN-2088 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Binglin Chang >Assignee: Binglin Chang > Attachments: YARN-2088.v1.patch > > > Some fields (sets, lists) are added to proto builders multiple times; we need to > clear those fields before adding, otherwise the resulting proto contains extra > contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033570#comment-14033570 ] Hadoop QA commented on YARN-2142: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650759/trust.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4012//console This message is automatically generated. > Add one service to check the nodes' TRUST status > - > > Key: YARN-2142 > URL: https://issues.apache.org/jira/browse/YARN-2142 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager, scheduler >Affects Versions: 2.2.0 > Environment: OS:Ubuntu 13.04; > JAVA:OpenJDK 7u51-2.4.4-0 >Reporter: anders >Priority: Minor > Labels: patch > Fix For: 2.2.0 > > Attachments: trust.patch, trust.patch, trust.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > Because of our critical computing environment, we must check every node's TRUST > status in the cluster (we can get the TRUST status via the API of the OAT > server), so I added this feature to Hadoop's scheduling. > Through the TRUST check service, a node can obtain its own TRUST status and > then send it to the ResourceManager via the heartbeat for > scheduling. > In the scheduling step, if a node's TRUST status is 'false', it will be > skipped until its TRUST status turns to 'true'. > ***The logic of this feature is similar to the node's health check service. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] anders updated YARN-2142: - Attachment: trust.patch Test whether this patch can work > Add one service to check the nodes' TRUST status > - > > Key: YARN-2142 > URL: https://issues.apache.org/jira/browse/YARN-2142 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager, scheduler >Affects Versions: 2.2.0 > Environment: OS:Ubuntu 13.04; > JAVA:OpenJDK 7u51-2.4.4-0 >Reporter: anders >Priority: Minor > Labels: patch > Fix For: 2.2.0 > > Attachments: trust.patch, trust.patch, trust.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > Because of our critical computing environment, we must check every node's TRUST > status in the cluster (we can get the TRUST status via the API of the OAT > server), so I added this feature to Hadoop's scheduling. > Through the TRUST check service, a node can obtain its own TRUST status and > then send it to the ResourceManager via the heartbeat for > scheduling. > In the scheduling step, if a node's TRUST status is 'false', it will be > skipped until its TRUST status turns to 'true'. > ***The logic of this feature is similar to the node's health check service. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] anders updated YARN-2142: - Attachment: trust.patch Test whether this patch can work > Add one service to check the nodes' TRUST status > - > > Key: YARN-2142 > URL: https://issues.apache.org/jira/browse/YARN-2142 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager, scheduler >Affects Versions: 2.2.0 > Environment: OS:Ubuntu 13.04; > JAVA:OpenJDK 7u51-2.4.4-0 >Reporter: anders >Priority: Minor > Labels: patch > Fix For: 2.2.0 > > Attachments: trust.patch, trust.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > Because of our critical computing environment, we must check every node's TRUST > status in the cluster (we can get the TRUST status via the API of the OAT > server), so I added this feature to Hadoop's scheduling. > Through the TRUST check service, a node can obtain its own TRUST status and > then send it to the ResourceManager via the heartbeat for > scheduling. > In the scheduling step, if a node's TRUST status is 'false', it will be > skipped until its TRUST status turns to 'true'. > ***The logic of this feature is similar to the node's health check service. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033531#comment-14033531 ] Hadoop QA commented on YARN-2074: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650742/YARN-2074.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4011//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4011//console This message is automatically generated. 
> Preemption of AM containers shouldn't count towards AM failures > --- > > Key: YARN-2074 > URL: https://issues.apache.org/jira/browse/YARN-2074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Jian He > Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, > YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, > YARN-2074.7.patch, YARN-2074.7.patch > > > One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM > containers getting preempted shouldn't count towards AM failures and thus > shouldn't eventually fail applications. > We should explicitly handle AM container preemption/kill as a separate issue > and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1480) RM web services getApps() accepts many more filters than ApplicationCLI "list" command
[ https://issues.apache.org/jira/browse/YARN-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033502#comment-14033502 ] Zhijie Shen commented on YARN-1480: --- Hi [~kj-ki], thanks for the patch. Here are some meta comments on it: 1. I looked into the current RMWebServices#getApps(), and below is the list of missing options in ApplicationCLI. "queue" (the current "queue" option is for the "movetoqueue" command) and "tags" are not covered in the patch. If it's not a big addition, is it better to include these two options in the option list? {code} @QueryParam("finalStatus") String finalStatusQuery, @QueryParam("user") String userQuery, @QueryParam("queue") String queueQuery, @QueryParam("limit") String count, @QueryParam("startedTimeBegin") String startedBegin, @QueryParam("startedTimeEnd") String startedEnd, @QueryParam("finishedTimeBegin") String finishBegin, @QueryParam("finishedTimeEnd") String finishEnd, @QueryParam("applicationTags") Set<String> applicationTags {code} 2. ApplicationClientProtocol#getApplications already supports the full filters, while YarnClient does not seem to support the full set of options yet. IMHO, the right way here is to make YarnClient support the full filters and have ApplicationCLI simply call that API. Pulling a long app list from the RM and filtering it locally is inefficient. > RM web services getApps() accepts many more filters than ApplicationCLI > "list" command > -- > > Key: YARN-1480 > URL: https://issues.apache.org/jira/browse/YARN-1480 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Kenji Kikushima > Attachments: YARN-1480-2.patch, YARN-1480-3.patch, YARN-1480-4.patch, > YARN-1480-5.patch, YARN-1480.patch > > > Nowadays RM web services getApps() accepts many more filters than > ApplicationCLI "list" command, which only accepts "state" and "type". IMHO, > ideally, different interfaces should provide consistent functionality. Is it > better to allow more filters in ApplicationCLI? 
-- This message was sent by Atlassian JIRA (v6.2#6252)