[jira] [Updated] (YARN-1885) RM may not send the finished signal to some nodes where the application ran after RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-1885: - Attachment: YARN-1885.patch Attached a new patch addressing the discussions above: 1) Included integration tests 2) Removed ContainerAcquiredEvent in RMAppAttempt 3) Added NodeAddedEvent in RMApp RM may not send the finished signal to some nodes where the application ran after RM restarts - Key: YARN-1885 URL: https://issues.apache.org/jira/browse/YARN-1885 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Wangda Tan Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch During our HA testing we have seen cases where yarn application logs are not available through the CLI but I can look at AM logs through the UI. RM was also being restarted in the background as the application was running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984151#comment-13984151 ] Devaraj K commented on YARN-1408: - bq. So in some race conditions, it is possible that a container can get KILLED by preemption even before it reaches the RUNNING state. This scenario can be avoided if we can skip such containers which didn't reach the RUNNING state during preemption. Maybe in the following cycles this container will reach the RUNNING state and can then be considered for preemption. I think we don't need to wait for the container to move to the RUNNING state before preempting it if it is eligible. If the container is eligible for preemption, the resources can be released in the current preemption cycle instead of waiting for the next preemption cycle for the container state to change to RUNNING, which saves the waste of launching and then killing the container. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Fix For: 2.5.0 Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable=true * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA to queue a which uses the full cluster capacity Step 2: Submit a jobB to queue b which would use less than 20% of the cluster capacity JobA tasks which use queue b's capacity are preempted and killed. This caused the problem below: 1. A new container got allocated for jobA in Queue A as per a node update from an NM. 2. This container was preempted immediately by preemption. An ACQUIRED at KILLED invalid state exception came when the next AM heartbeat reached the RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the task to hit a 30-minute timeout, as this container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
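As a rough illustration of the quoted suggestion above (skipping containers that have not yet reached RUNNING when selecting preemption candidates), a minimal sketch follows; the class and method names are hypothetical and not taken from any attached patch.
{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer;
import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerState;

// Hypothetical helper: keep only containers that have actually started running,
// leaving ALLOCATED/ACQUIRED containers for a later preemption cycle.
public class PreemptionCandidateFilter {
  public static List<RMContainer> skipNonRunning(List<RMContainer> candidates) {
    List<RMContainer> eligible = new ArrayList<RMContainer>();
    for (RMContainer c : candidates) {
      if (c.getState() == RMContainerState.RUNNING) {
        eligible.add(c);
      }
    }
    return eligible;
  }
}
{code}
Devaraj's counter-point above is the opposite trade-off: preempting such a container immediately avoids launching it only to kill it.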
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984152#comment-13984152 ] Devaraj K commented on YARN-1408: - Correction to the above comment. Sorry for the delay Sunil. {quote} So in some race conditions, it is possible that a container can get KILLED by preemption even before it reaches the RUNNING state. This scenario can be avoided if we can skip such containers which didn't reach the RUNNING state during preemption. Maybe in the following cycles this container will reach the RUNNING state and can then be considered for preemption. {quote} I think we don't need to wait for the container to move to the RUNNING state before preempting it if it is eligible. If the container is eligible for preemption, the resources can be released in the current preemption cycle instead of waiting for the next preemption cycle for the container state to change to RUNNING, which saves the waste of launching and then killing the container. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Fix For: 2.5.0 Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable=true * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA to queue a which uses the full cluster capacity Step 2: Submit a jobB to queue b which would use less than 20% of the cluster capacity JobA tasks which use queue b's capacity are preempted and killed. This caused the problem below: 1. A new container got allocated for jobA in Queue A as per a node update from an NM. 2. This container was preempted immediately by preemption. An ACQUIRED at KILLED invalid state exception came when the next AM heartbeat reached the RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the task to hit a 30-minute timeout, as this container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.
Gera Shegalov created YARN-1996: --- Summary: Provide alternative policies for UNHEALTHY nodes. Key: YARN-1996 URL: https://issues.apache.org/jira/browse/YARN-1996 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, scheduler Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs, as demonstrated by MAPREDUCE-5817, and degrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster, because the current node is declared unusable and all its containers are killed and rescheduled on different nodes. To mitigate this, we are experimenting with a patch that allows containers already running on a node that turns UNHEALTHY to complete (drain), while no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of an NM. To this end, we have to write a health script such that it can deterministically report UNHEALTHY. For example with {code}
if [ -e $1 ] ; then
  echo ERROR Node decommissioning via health script hack
fi
{code} In the current version of the patch, the behavior is controlled by a boolean property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile policies are possible in future work. Currently, the health state of a node is determined in a binary fashion based on the disk checker and the health script ERROR outputs. However, we could also interpret the health script output similarly to java logging levels (one of which is ERROR), such as WARN, FATAL. Each level can then be treated differently. E.g., - FATAL: unusable like today - ERROR: drain - WARN: halve the node capacity. complemented with some equivalence rules such as 3 WARN messages == ERROR, 2*ERROR == FATAL, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
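For reference, enabling the drain behavior described above would presumably amount to a yarn-site.xml entry like the following; the property name is taken verbatim from the description (including its spelling) and the value shown is an assumption, not a documented default.
{code}
<property>
  <!-- Proposed (not yet standard) switch: let running containers drain on a node
       that turns UNHEALTHY instead of killing and rescheduling them immediately. -->
  <name>yarn.nodemanager.unheathy.drain.containers</name>
  <value>true</value>
</property>
{code}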
[jira] [Updated] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.
[ https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-1996: Attachment: YARN-1996.v01.patch Provide alternative policies for UNHEALTHY nodes. - Key: YARN-1996 URL: https://issues.apache.org/jira/browse/YARN-1996 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, scheduler Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1996.v01.patch Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs as demonstrated by MAPREDUCE-5817, and downgrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster because the current node is declared unusable and all its containers are killed and rescheduled on different nodes. To mitigate this, we experiment with a patch that allows containers already running on a node turning UNHEALTHY to complete (drain) whereas no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of NM. To this end, we have to write a health script such that it can deterministically report UNHEALTHY. For example with {code} if [ -e $1 ] ; then echo ERROR Node decommmissioning via health script hack fi {code} In the current version patch, the behavior is controlled by a boolean property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile policies are possible in the future work. Currently, the health state of a node is binary determined based on the disk checker and the health script ERROR outputs. However, we can as well interpret health script output similar to java logging levels (one of which is ERROR) such as WARN, FATAL. Each level can then be treated differently. E.g., - FATAL: unusable like today - ERROR: drain - WARN: halve the node capacity. complimented with some equivalence rules such as 3 WARN messages == ERROR, 2*ERROR == FATAL, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.
[ https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984167#comment-13984167 ] Steve Loughran commented on YARN-1996: -- This sounds good - not just as failure handling, but for cluster management. This may be a duplicate of YARN-914 (graceful decommission of NM) and/or YARN-671. For long-lived services:
# it'd be nice to have a notification from the NM to the AM that it's draining and that they should react: YARN-1394
# the drain process must have a (configurable?) timeout, then kill all outstanding containers - without adding them as any kind of failure (i.e. the container loss event from the NM to the AM should indicate this)
# the AM itself needs to receive a your own node is being drained event and do any best-effort pre-restart operations (e.g. transition to passive), and the RM should not count the AM termination/restart as an AM failure
Provide alternative policies for UNHEALTHY nodes. - Key: YARN-1996 URL: https://issues.apache.org/jira/browse/YARN-1996 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, scheduler Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1996.v01.patch Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs, as demonstrated by MAPREDUCE-5817, and degrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster, because the current node is declared unusable and all its containers are killed and rescheduled on different nodes. To mitigate this, we are experimenting with a patch that allows containers already running on a node that turns UNHEALTHY to complete (drain), while no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of an NM. To this end, we have to write a health script such that it can deterministically report UNHEALTHY. For example with {code}
if [ -e $1 ] ; then
  echo ERROR Node decommissioning via health script hack
fi
{code} In the current version of the patch, the behavior is controlled by a boolean property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile policies are possible in future work. Currently, the health state of a node is determined in a binary fashion based on the disk checker and the health script ERROR outputs. However, we could also interpret the health script output similarly to java logging levels (one of which is ERROR), such as WARN, FATAL. Each level can then be treated differently. E.g., - FATAL: unusable like today - ERROR: drain - WARN: halve the node capacity. complemented with some equivalence rules such as 3 WARN messages == ERROR, 2*ERROR == FATAL, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1980) Possible NPE in KillAMPreemptionPolicy related to ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984188#comment-13984188 ] Devaraj K commented on YARN-1980: - The changes are in the mapreduce project; moving this to mapreduce. Possible NPE in KillAMPreemptionPolicy related to ProportionalCapacityPreemptionPolicy -- Key: YARN-1980 URL: https://issues.apache.org/jira/browse/YARN-1980 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Sunil G Attachments: Yarn-1980.1.patch I configured KillAMPreemptionPolicy for my Application Master and tried to check preemption of queues. In one scenario I have seen the below NPE in my AM: 2014-04-24 15:11:08,860 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM. java.lang.NullPointerException at org.apache.hadoop.mapreduce.v2.app.rm.preemption.KillAMPreemptionPolicy.preempt(KillAMPreemptionPolicy.java:57) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:662) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:246) at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:267) at java.lang.Thread.run(Thread.java:662) I was using 2.2.0 and merged MAPREDUCE-5189 to see how AM preemption works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1997) input split size
rrim created YARN-1997: -- Summary: input split size Key: YARN-1997 URL: https://issues.apache.org/jira/browse/YARN-1997 Project: Hadoop YARN Issue Type: Test Components: api Affects Versions: 2.2.0 Reporter: rrim Hi, I am using hadoop 2.2 and don't know how to set the max input split size. I would like to decrease this value in order to create more mappers. I tried updating yarn-site.xml, but it does not work; indeed, hadoop 2.2 / yarn does not pick up any of the following settings:
<property><name>mapreduce.input.fileinputformat.split.minsize</name><value>1</value></property>
<property><name>mapreduce.input.fileinputformat.split.maxsize</name><value>16777216</value></property>
<property><name>mapred.min.split.size</name><value>1</value></property>
<property><name>mapred.max.split.size</name><value>16777216</value></property>
best, -- This message was sent by Atlassian JIRA (v6.2#6252)
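Note that the split-size properties above are per-job MapReduce settings, typically placed in mapred-site.xml or on the job configuration rather than in yarn-site.xml. A hedged sketch of setting them programmatically with the new MapReduce API (mapper class and input/output paths omitted):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "small-splits");
    // Cap each split at 16 MB so that more map tasks are created.
    FileInputFormat.setMinInputSplitSize(job, 1L);
    FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);
    // ... configure mapper/reducer and input/output paths, then submit the job.
  }
}
{code}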
[jira] [Commented] (YARN-1982) Rename the daemon name to timelineserver
[ https://issues.apache.org/jira/browse/YARN-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984268#comment-13984268 ] Junping Du commented on YARN-1982: -- +1. Patch looks good to me. Rename the daemon name to timelineserver Key: YARN-1982 URL: https://issues.apache.org/jira/browse/YARN-1982 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Labels: cli Attachments: YARN-1982.1.patch Nowadays, it's confusing that we call the new component the timeline server, but we use {code}
yarn historyserver
yarn-daemon.sh start historyserver
{code} to start the daemon. Before the confusion propagates further, we'd better modify the command line asap. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1912) ResourceLocalizer started without any jvm memory control
[ https://issues.apache.org/jira/browse/YARN-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-1912: --- Attachment: YARN-1912-1.patch Updated the patch based on findbugs warnings. ResourceLocalizer started without any jvm memory control Key: YARN-1912 URL: https://issues.apache.org/jira/browse/YARN-1912 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: stanley shi Attachments: YARN-1912-0.patch, YARN-1912-1.patch In LinuxContainerExecutor.java#startLocalizer, the command does not specify any -Xmx configuration, which causes the ResourceLocalizer to be started with the default memory settings. On server-class hardware, it will use 25% of the system memory as the max heap size, which will cause memory issues in some cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
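A minimal sketch of the kind of change being discussed - adding an explicit heap limit to the localizer's java command line. The configuration key and default below are hypothetical illustrations, not the names used in the attached patches:
{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;

public class LocalizerHeapOption {
  // Hypothetical configuration key and default (in MB) for the localizer heap size.
  static final String LOCALIZER_HEAP_MB = "yarn.nodemanager.localizer.heap-memory-mb";
  static final int DEFAULT_LOCALIZER_HEAP_MB = 256;

  /** Build a java command prefix for the ResourceLocalizer with an explicit -Xmx. */
  static List<String> javaCommandPrefix(Configuration conf) {
    int heapMb = conf.getInt(LOCALIZER_HEAP_MB, DEFAULT_LOCALIZER_HEAP_MB);
    List<String> command = new ArrayList<String>();
    command.add("java");
    // Cap the heap explicitly instead of relying on the JVM's 25%-of-RAM default.
    command.add("-Xmx" + heapMb + "m");
    return command;
  }
}
{code}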
[jira] [Updated] (YARN-1987) Wrapper for leveldb DBIterator to aid in handling database exceptions
[ https://issues.apache.org/jira/browse/YARN-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1987: - Attachment: YARN-1987.patch Wrapper for leveldb DBIterator to aid in handling database exceptions - Key: YARN-1987 URL: https://issues.apache.org/jira/browse/YARN-1987 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1987.patch Per discussions in YARN-1984 and MAPREDUCE-5652, it would be nice to have a utility wrapper around leveldb's DBIterator to translate the raw RuntimeExceptions it can throw into DBExceptions to make it easier to handle database errors while iterating. -- This message was sent by Atlassian JIRA (v6.2#6252)
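Roughly what such a wrapper could look like; this is only a sketch under the assumption of the org.iq80.leveldb API, not the class from the attached patch, and it covers just two of the iterator methods:
{code}
import java.util.Map;

import org.iq80.leveldb.DBException;
import org.iq80.leveldb.DBIterator;

/** Sketch: translate raw RuntimeExceptions thrown by DBIterator into DBExceptions. */
public class WrappedDBIterator {
  private final DBIterator iter;

  public WrappedDBIterator(DBIterator iter) {
    this.iter = iter;
  }

  public boolean hasNext() {
    try {
      return iter.hasNext();
    } catch (DBException e) {
      throw e;                                   // already the typed exception
    } catch (RuntimeException e) {
      throw new DBException(e.getMessage(), e);  // wrap raw runtime errors
    }
  }

  public Map.Entry<byte[], byte[]> next() {
    try {
      return iter.next();
    } catch (DBException e) {
      throw e;
    } catch (RuntimeException e) {
      throw new DBException(e.getMessage(), e);
    }
  }
}
{code}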
[jira] [Commented] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.
[ https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984432#comment-13984432 ] Hadoop QA commented on YARN-1996: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642433/YARN-1996.v01.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3654//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3654//console This message is automatically generated. Provide alternative policies for UNHEALTHY nodes. - Key: YARN-1996 URL: https://issues.apache.org/jira/browse/YARN-1996 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, scheduler Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1996.v01.patch Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs as demonstrated by MAPREDUCE-5817, and downgrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster because the current node is declared unusable and all its containers are killed and rescheduled on different nodes. To mitigate this, we experiment with a patch that allows containers already running on a node turning UNHEALTHY to complete (drain) whereas no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of NM. To this end, we have to write a health script such that it can deterministically report UNHEALTHY. For example with {code} if [ -e $1 ] ; then echo ERROR Node decommmissioning via health script hack fi {code} In the current version patch, the behavior is controlled by a boolean property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile policies are possible in the future work. Currently, the health state of a node is binary determined based on the disk checker and the health script ERROR outputs. However, we can as well interpret health script output similar to java logging levels (one of which is ERROR) such as WARN, FATAL. Each level can then be treated differently. E.g., - FATAL: unusable like today - ERROR: drain - WARN: halve the node capacity. 
complemented with some equivalence rules such as 3 WARN messages == ERROR, 2*ERROR == FATAL, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1987) Wrapper for leveldb DBIterator to aid in handling database exceptions
[ https://issues.apache.org/jira/browse/YARN-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984490#comment-13984490 ] Hadoop QA commented on YARN-1987: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642479/YARN-1987.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3657//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3657//console This message is automatically generated. Wrapper for leveldb DBIterator to aid in handling database exceptions - Key: YARN-1987 URL: https://issues.apache.org/jira/browse/YARN-1987 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1987.patch Per discussions in YARN-1984 and MAPREDUCE-5652, it would be nice to have a utility wrapper around leveldb's DBIterator to translate the raw RuntimeExceptions it can throw into DBExceptions to make it easier to handle database errors while iterating. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1912) ResourceLocalizer started without any jvm memory control
[ https://issues.apache.org/jira/browse/YARN-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984485#comment-13984485 ] Hadoop QA commented on YARN-1912: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642478/YARN-1912-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3656//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3656//console This message is automatically generated. ResourceLocalizer started without any jvm memory control Key: YARN-1912 URL: https://issues.apache.org/jira/browse/YARN-1912 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: stanley shi Attachments: YARN-1912-0.patch, YARN-1912-1.patch In the LinuxContainerExecutor.java#startLocalizer, it does not specify any -Xmx configurations in the command, this caused the ResourceLocalizer to be started with default memory setting. In an server-level hardware, it will use 25% of the system memory as the max heap size, this will cause memory issue in some cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-738) TestClientRMTokens is failing irregularly while running all yarn tests
[ https://issues.apache.org/jira/browse/YARN-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984587#comment-13984587 ] Hudson commented on YARN-738: - SUCCESS: Integrated in Hadoop-trunk-Commit #5584 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5584/]) YARN-738. TestClientRMTokens is failing irregularly while running all yarn tests. Contributed by Ming Ma (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1591030) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestClientRMTokens.java TestClientRMTokens is failing irregularly while running all yarn tests -- Key: YARN-738 URL: https://issues.apache.org/jira/browse/YARN-738 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Ming Ma Fix For: 3.0.0, 2.5.0 Attachments: YARN-738.patch Running org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 16.787 sec FAILURE! testShortCircuitRenewCancel(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) Time elapsed: 186 sec ERROR! java.lang.RuntimeException: getProxy at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens$YarnBadRPC.getProxy(TestClientRMTokens.java:334) at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:157) at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.renew(RMDelegationTokenIdentifier.java:102) at org.apache.hadoop.security.token.Token.renew(Token.java:372) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:306) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancel(TestClientRMTokens.java:240) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141) at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189) at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165) at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.
[ https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984664#comment-13984664 ] Gera Shegalov commented on YARN-1996: - [~ste...@apache.org] thanks for pointing out the JIRA about decommissioning. I'll link them to this JIRA. The main point of this JIRA is to gracefully deal with the UNHEALTHY state determined by the health script. Provide alternative policies for UNHEALTHY nodes. - Key: YARN-1996 URL: https://issues.apache.org/jira/browse/YARN-1996 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, scheduler Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1996.v01.patch Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs as demonstrated by MAPREDUCE-5817, and downgrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster because the current node is declared unusable and all its containers are killed and rescheduled on different nodes. To mitigate this, we experiment with a patch that allows containers already running on a node turning UNHEALTHY to complete (drain) whereas no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of NM. To this end, we have to write a health script such that it can deterministically report UNHEALTHY. For example with {code} if [ -e $1 ] ; then echo ERROR Node decommmissioning via health script hack fi {code} In the current version patch, the behavior is controlled by a boolean property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile policies are possible in the future work. Currently, the health state of a node is binary determined based on the disk checker and the health script ERROR outputs. However, we can as well interpret health script output similar to java logging levels (one of which is ERROR) such as WARN, FATAL. Each level can then be treated differently. E.g., - FATAL: unusable like today - ERROR: drain - WARN: halve the node capacity. complimented with some equivalence rules such as 3 WARN messages == ERROR, 2*ERROR == FATAL, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart
[ https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1362: - Attachment: YARN-1362.patch Small patch that enhances the NM context to provide a get/set for a decommission flag. This allows code to query whether the NM has been told to decommission and act accordingly during shutdown. Distinguish between nodemanager shutdown for decommission vs shutdown for restart - Key: YARN-1362 URL: https://issues.apache.org/jira/browse/YARN-1362 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Attachments: YARN-1362.patch When a nodemanager shuts down it needs to determine if it is likely to be restarted. If a restart is likely then it needs to preserve container directories, logs, distributed cache entries, etc. If it is being shut down more permanently (e.g. a decommission) then the nodemanager should clean up directories and logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
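In spirit, the context change amounts to something like the following sketch; the method names are illustrative and not necessarily those used in the attached patch:
{code}
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch of a decommission flag along the lines of the one added to the NM context. */
public class DecommissionFlag {
  private final AtomicBoolean decommissioned = new AtomicBoolean(false);

  public void setDecommissioned(boolean value) {
    decommissioned.set(value);
  }

  // Shutdown code can branch on this: if decommissioned, delete local dirs and logs;
  // otherwise preserve state for a likely restart.
  public boolean getDecommissioned() {
    return decommissioned.get();
  }
}
{code}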
[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.
[ https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984699#comment-13984699 ] Vinod Kumar Vavilapalli commented on YARN-1929: --- Seems 'fine' to me. It is one of those fine-for-now-but-not-sure-if-anything-else-is-broken. OTOH, we aren't getting rid of the remaining locking in CompositeService. Something that we should fix separately. Don't want this patch to blow up more. The test looks fine except for the 1second sleep. I can see that causing issues on VMs but let's see. Checking this in. DeadLock in RM when automatic failover is enabled. -- Key: YARN-1929 URL: https://issues.apache.org/jira/browse/YARN-1929 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Yarn HA cluster Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1929-1.patch, yarn-1929-2.patch Dead lock detected in RM when automatic failover is enabled. {noformat} Found one Java-level deadlock: = Thread-2: waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a org.apache.hadoop.ha.ActiveStandbyElector), which is held by main-EventThread main-EventThread: waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService), which is held by Thread-2 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1612) Change Fair Scheduler to not disable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1612: -- Attachment: YARN-1612-v2.patch Change Fair Scheduler to not disable delay scheduling by default Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1885) RM may not send the finished signal to some nodes where the application ran after RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984722#comment-13984722 ] Jian He commented on YARN-1885: --- Thanks for the update!
- some places exceed the 80 column limit, like the RMAppImpl transitions.
- app.isAppFinalStateStored(): better to use isAppInFinalState instead?
- sleeping for a fixed amount of time is not deterministic; the test may fail randomly. It's better to do it in a while loop with heartbeats, and exit the loop once the condition is met.
{code}
// sleep for a while before doing the next heartbeat
Thread.sleep(1000);
NodeHeartbeatResponse response = nm1.nodeHeartbeat(true);
{code}
- timeout = 60, the timeout is too long.
- these two transitions cannot happen? Generally, we should not add events to states where the transitions can never happen; that'll hide bugs.
{code}
.addTransition(RMAppState.NEW, RMAppState.NEW,
    RMAppEventType.NODE_ADDED, new NodeAddedTransition())
.addTransition(RMAppState.NEW_SAVING, RMAppState.NEW_SAVING,
    RMAppEventType.NODE_ADDED, new NodeAddedTransition())
{code}
- These two loops may block the register RPC call for a while; I think we may send them as the payload of RMNodeStartEvent and handle them in RMNodeAddTransition?
{code}
// Handle container statuses reported by NM
if (!request.getContainerStatuses().isEmpty()) {
  LOG.info("received container statuses on node manager register :"
      + request.getContainerStatuses());
  for (ContainerStatus containerStatus : request.getContainerStatuses()) {
    handleContainerStatus(containerStatus);
  }
}
// Handle running applications reported by NM
if (null != request.getRunningApplications()) {
  for (ApplicationId appId : request.getRunningApplications()) {
    handleRunningAppOnNode(appId, request.getNodeId());
  }
}
{code}
RM may not send the finished signal to some nodes where the application ran after RM restarts - Key: YARN-1885 URL: https://issues.apache.org/jira/browse/YARN-1885 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Wangda Tan Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch During our HA testing we have seen cases where yarn application logs are not available through the CLI but I can look at AM logs through the UI. RM was also being restarted in the background as the application was running. -- This message was sent by Atlassian JIRA (v6.2#6252)
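A sketch of the heartbeat-and-poll pattern suggested above in place of the fixed sleep; checkCondition is a hypothetical helper standing in for whatever the test asserts, and nm1 is the MockNM from the quoted test code:
{code}
// Poll with short sleeps instead of a single long fixed sleep, exiting as soon
// as the expected condition holds; fail if it never holds within the bound.
boolean done = false;
for (int i = 0; i < 20 && !done; i++) {
  NodeHeartbeatResponse response = nm1.nodeHeartbeat(true);
  done = checkCondition(response);
  if (!done) {
    Thread.sleep(200);
  }
}
Assert.assertTrue("expected condition not reached before timeout", done);
{code}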
[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.
[ https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984843#comment-13984843 ] Hudson commented on YARN-1929: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5585 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5585/]) YARN-1929. Fixed a deadlock in ResourceManager that occurs when failover happens right at the time of shutdown. Contributed by Karthik Kambatla. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1591071) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/service/CompositeService.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/EmbeddedElectorService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMEmbeddedElector.java DeadLock in RM when automatic failover is enabled. -- Key: YARN-1929 URL: https://issues.apache.org/jira/browse/YARN-1929 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Yarn HA cluster Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Fix For: 2.4.1 Attachments: yarn-1929-1.patch, yarn-1929-2.patch Dead lock detected in RM when automatic failover is enabled. {noformat} Found one Java-level deadlock: = Thread-2: waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a org.apache.hadoop.ha.ActiveStandbyElector), which is held by main-EventThread main-EventThread: waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService), which is held by Thread-2 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart
[ https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984884#comment-13984884 ] Hadoop QA commented on YARN-1362: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642514/YARN-1362.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3659//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3659//console This message is automatically generated. Distinguish between nodemanager shutdown for decommission vs shutdown for restart - Key: YARN-1362 URL: https://issues.apache.org/jira/browse/YARN-1362 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1362.patch When a nodemanager shuts down it needs to determine if it is likely to be restarted. If a restart is likely then it needs to preserve container directories, logs, distributed cache entries, etc. If it is being shutdown more permanently (e.g.: like a decommission) then the nodemanager should cleanup directories and logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1612) Change Fair Scheduler to not disable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984896#comment-13984896 ] Hadoop QA commented on YARN-1612: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642521/YARN-1612-v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3658//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3658//console This message is automatically generated. Change Fair Scheduler to not disable delay scheduling by default Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1885) RM may not send the finished signal to some nodes where the application ran after RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984945#comment-13984945 ] Wangda Tan commented on YARN-1885: -- [~jianhe], Thanks for your review!
bq. some places exceed the 80 column limit, like the RMAppImpl transitions.
Will correct this later.
bq. app.isAppFinalStateStored(): better to use isAppInFinalState instead?
Agree, it's a bug to use isAppFinalStateStored().
bq. sleeping for a fixed amount of time is not deterministic; the test may fail randomly. It's better to do it in a while loop with heartbeats, and exit the loop once the condition is met.
Agree.
bq. timeout = 60, the timeout is too long.
Sorry for this typo :)
bq. these two transitions cannot happen? Generally, we should not add events to states where the transitions can never happen; that'll hide bugs.
Agree, and I think SUBMITTED also cannot happen, because an app in the SUBMITTED state doesn't launch any container, so NMs will not have the app in their runningApplication list. Do you agree?
bq. These two loops may block the register RPC call for a while; I think we may send them as the payload of RMNodeStartEvent and handle them in RMNodeAddTransition?
IMO, this shouldn't be a big problem, because there are no blocking calls in handleRunningAppOnNode/handleContainerStatus, so the additional microseconds of latency (just looping over an array) should be fine. Is that OK?
Attached new patch. RM may not send the finished signal to some nodes where the application ran after RM restarts - Key: YARN-1885 URL: https://issues.apache.org/jira/browse/YARN-1885 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Wangda Tan Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch During our HA testing we have seen cases where yarn application logs are not available through the CLI but I can look at AM logs through the UI. RM was also being restarted in the background as the application was running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1885) RM may not send the finished signal to some nodes where the application ran after RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-1885: - Attachment: YARN-1885.patch RM may not send the finished signal to some nodes where the application ran after RM restarts - Key: YARN-1885 URL: https://issues.apache.org/jira/browse/YARN-1885 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Wangda Tan Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch During our HA testing we have seen cases where yarn application logs are not available through the cli but i can look at AM logs through the UI. RM was also being restarted in the background as the application was running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1696) Document RM HA
[ https://issues.apache.org/jira/browse/YARN-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1696: -- Attachment: YARN-1696.6.patch Same patch as before but with a few edits to make it better. Will check this in once Jenkins says okay. Document RM HA -- Key: YARN-1696 URL: https://issues.apache.org/jira/browse/YARN-1696 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Tsuyoshi OZAWA Priority: Blocker Attachments: YARN-1676.5.patch, YARN-1696-3.patch, YARN-1696.2.patch, YARN-1696.4.patch, YARN-1696.6.patch, rm-ha-overview.png, rm-ha-overview.svg, yarn-1696-1.patch Add documentation for RM HA. Marking this a blocker for 2.4 as this is required to call RM HA Stable and ready for public consumption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1696) Document RM HA
[ https://issues.apache.org/jira/browse/YARN-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985067#comment-13985067 ] Hadoop QA commented on YARN-1696: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642567/YARN-1696.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3660//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3660//console This message is automatically generated. Document RM HA -- Key: YARN-1696 URL: https://issues.apache.org/jira/browse/YARN-1696 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Tsuyoshi OZAWA Priority: Blocker Attachments: YARN-1676.5.patch, YARN-1696-3.patch, YARN-1696.2.patch, YARN-1696.4.patch, YARN-1696.6.patch, rm-ha-overview.png, rm-ha-overview.svg, yarn-1696-1.patch Add documentation for RM HA. Marking this a blocker for 2.4 as this is required to call RM HA Stable and ready for public consumption. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1998) Change the time zone on the Yarn UI to the local time zone
Fengdong Yu created YARN-1998: - Summary: Change the time zone on the Yarn UI to the local time zone Key: YARN-1998 URL: https://issues.apache.org/jira/browse/YARN-1998 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Fengdong Yu Assignee: Fengdong Yu Priority: Minor It shows the GMT time zone for 'startTime' and 'finishTime' on the RM web UI; we should show the local time zone. -- This message was sent by Atlassian JIRA (v6.2#6252)
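For illustration only, rendering an epoch-millisecond timestamp in the server's local time zone in Java looks roughly like the following; the actual rendering path in the RM web UI may differ:
{code}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class LocalTimeRender {
  public static String render(long epochMillis) {
    SimpleDateFormat fmt = new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy");
    fmt.setTimeZone(TimeZone.getDefault());  // local zone instead of GMT
    return fmt.format(new Date(epochMillis));
  }

  public static void main(String[] args) {
    System.out.println(render(System.currentTimeMillis()));
  }
}
{code}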
[jira] [Updated] (YARN-1998) Change the time zone on the Yarn UI to the local time zone
[ https://issues.apache.org/jira/browse/YARN-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengdong Yu updated YARN-1998: -- Attachment: YARN-1998.patch Change the time zone on the Yarn UI to the local time zone -- Key: YARN-1998 URL: https://issues.apache.org/jira/browse/YARN-1998 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Fengdong Yu Assignee: Fengdong Yu Priority: Minor Attachments: YARN-1998.patch It shows GMT time zone for 'startTime' and 'finishTime' on the RM web UI, we should show the local time zone. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1998) Change the time zone on the RM web UI to the local time zone
[ https://issues.apache.org/jira/browse/YARN-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengdong Yu updated YARN-1998: -- Summary: Change the time zone on the RM web UI to the local time zone (was: Change the time zone on the Yarn UI to the local time zone) Change the time zone on the RM web UI to the local time zone Key: YARN-1998 URL: https://issues.apache.org/jira/browse/YARN-1998 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Fengdong Yu Assignee: Fengdong Yu Priority: Minor Attachments: YARN-1998.patch It shows GMT time zone for 'startTime' and 'finishTime' on the RM web UI, we should show the local time zone. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1999) Move HistoryServerRest.apt.vm into the Mapreduce section
[ https://issues.apache.org/jira/browse/YARN-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash updated YARN-1999: --- Affects Version/s: 2.4.0 Move HistoryServerRest.apt.vm into the Mapreduce section Key: YARN-1999 URL: https://issues.apache.org/jira/browse/YARN-1999 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.4.0 Reporter: Ravi Prakash Now that we have the YARN HistoryServer, perhaps we should move HistoryServerRest.apt.vm into the MapReduce section where it really belongs? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1999) Move HistoryServerRest.apt.vm into the Mapreduce section
[ https://issues.apache.org/jira/browse/YARN-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash updated YARN-1999: --- Component/s: documentation Move HistoryServerRest.apt.vm into the Mapreduce section Key: YARN-1999 URL: https://issues.apache.org/jira/browse/YARN-1999 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.4.0 Reporter: Ravi Prakash Now that we have the YARN HistoryServer, perhaps we should move HistoryServerRest.apt.vm into the MapReduce section where it really belongs? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1999) Move HistoryServerRest.apt.vm into the Mapreduce section
[ https://issues.apache.org/jira/browse/YARN-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash updated YARN-1999: --- Target Version/s: 2.5.0 Move HistoryServerRest.apt.vm into the Mapreduce section Key: YARN-1999 URL: https://issues.apache.org/jira/browse/YARN-1999 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.4.0 Reporter: Ravi Prakash Now that we have the YARN HistoryServer, perhaps we should move HistoryServerRest.apt.vm into the MapReduce section where it really belongs? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1999) Move HistoryServerRest.apt.vm into the Mapreduce section
Ravi Prakash created YARN-1999: -- Summary: Move HistoryServerRest.apt.vm into the Mapreduce section Key: YARN-1999 URL: https://issues.apache.org/jira/browse/YARN-1999 Project: Hadoop YARN Issue Type: Bug Reporter: Ravi Prakash Now that we have the YARN HistoryServer, perhaps we should move HistoryServerRest.apt.vm into the MapReduce section where it really belongs? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1885) RM may not send the finished signal to some nodes where the application ran after RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985083#comment-13985083 ] Hadoop QA commented on YARN-1885: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642551/YARN-1885.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3661//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3661//console This message is automatically generated. RM may not send the finished signal to some nodes where the application ran after RM restarts - Key: YARN-1885 URL: https://issues.apache.org/jira/browse/YARN-1885 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Wangda Tan Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch, YARN-1885.patch During our HA testing we have seen cases where yarn application logs are not available through the cli but i can look at AM logs through the UI. RM was also being restarted in the background as the application was running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1998) Change the time zone on the RM web UI to the local time zone
[ https://issues.apache.org/jira/browse/YARN-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985105#comment-13985105 ] Hadoop QA commented on YARN-1998: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642581/YARN-1998.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3662//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3662//console This message is automatically generated. Change the time zone on the RM web UI to the local time zone Key: YARN-1998 URL: https://issues.apache.org/jira/browse/YARN-1998 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Fengdong Yu Assignee: Fengdong Yu Priority: Minor Attachments: YARN-1998.patch It shows GMT time zone for 'startTime' and 'finishTime' on the RM web UI, we should show the local time zone. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2001) Persist NMs info for RM restart
Jian He created YARN-2001: - Summary: Persist NMs info for RM restart Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He RM should not accept allocate requests from AMs until all the NMs have registered with RM. For that, RM needs to remember the previous NMs and wait for all the NMs to register. This is also useful for remembering decommissioned nodes across restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
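To make the intent of YARN-2001 concrete, a registration gate of roughly the following shape would cover the behaviour described above. This is only a sketch under assumptions: the class name, the NM-id strings, and the idea of loading the expected node set from the state store are illustrative, not the actual RMStateStore or ResourceTrackerService changes this sub-task will make.
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical gate: hold (or empty-answer) AM allocate calls until every
// previously known NM has re-registered after an RM restart.
class NodeRegistrationGate {
    private final Set<String> expectedNodes;                          // NM ids remembered across restart
    private final Set<String> registeredNodes = ConcurrentHashMap.newKeySet();

    NodeRegistrationGate(Set<String> expectedNodes) {
        this.expectedNodes = expectedNodes;
    }

    void onNodeRegistered(String nodeId) {
        registeredNodes.add(nodeId);
    }

    // Allocate requests from AMs would only be honoured once this returns true.
    boolean canAcceptAllocateRequests() {
        return registeredNodes.containsAll(expectedNodes);
    }
}
{code}
A timeout for NMs that never come back (for example decommissioned or dead nodes) would be needed on top of such a gate.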
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985148#comment-13985148 ] Jian He commented on YARN-556: -- Hi Anubhav, I looked at the prototype patch. Regarding the approach, it's better to have a scheduler-agnostic recovery mechanism with no or minimal scheduler-specific changes, instead of implementing recovery separately for each scheduler. YARN-1368 can be renamed to accommodate the necessary common changes for all schedulers. Also, adding the cluster timestamp to the container ID doesn't seem right, and it would break compatibility. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1368: -- Summary: Common work to re-populate containers’ state into scheduler (was: RM should populate running container allocation information from NM resync) Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Anubhav Dhoot YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985151#comment-13985151 ] Jian He commented on YARN-1368: --- Hi [~adhoot], mind if I take this over? I have a preliminary patch that does the bulk of the work and can upload it very soon. Thanks. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Anubhav Dhoot YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985160#comment-13985160 ] Sunil G commented on YARN-1963: --- We have done some analysis and implemented support for application priority. I would like to share the thoughts here; kindly review them. Design thoughts: 1. Configuration part: We plan to reuse the existing priority configuration given below to set a job's priority. a. JobConf.getJobPriority() and Job.setPriority(JobPriority priority) b. We can also use the configuration mapreduce.job.priority. The priority values can be VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW. 2. Scheduler side: If a Capacity Scheduler queue has multiple applications (jobs) with different priorities, CS will allocate containers for the highest-priority application first, then for the next priority, and so on. When multiple queues are configured with different capacities, this priority ordering works internally to each queue. For this, we plan to add a priority comparison in the data structure below: Comparator<FiCaSchedulerApp> applicationComparator. We added a priority check in compare() of applicationComparator while selecting applications. The updated ordering is: 1. Check priority first; if the priorities differ, pick the highest-priority job. 2. Otherwise continue with the existing logic, such as the application ID and timestamp comparison. With these changes, the highest-priority job gets preference within a queue. NB: In addition, we added a preemption module so that high-priority jobs obtain resources quickly by preempting lower-priority ones. I will upload a patch if this approach looks fine. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Arun C Murthy It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.2#6252)
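For readers skimming the comment above: the scheduler-side change is essentially prepending a priority check to the comparator the CapacityScheduler already uses to order applications. A minimal sketch follows; PriorityApp, its fields, and the "larger value wins" convention are stand-ins invented for the example, not the real FiCaSchedulerApp API or the contents of any attached patch.
{code:java}
import java.util.Comparator;

// Illustrative application handle; not a real CapacityScheduler class.
class PriorityApp {
    final int priority;        // assumed: larger value = higher priority
    final long applicationId;  // assumed: smaller id = submitted earlier

    PriorityApp(int priority, long applicationId) {
        this.priority = priority;
        this.applicationId = applicationId;
    }
}

class PriorityThenIdComparator implements Comparator<PriorityApp> {
    @Override
    public int compare(PriorityApp a, PriorityApp b) {
        // 1. Check priority first: higher-priority applications sort earlier.
        if (a.priority != b.priority) {
            return Integer.compare(b.priority, a.priority);
        }
        // 2. Otherwise fall back to the existing ordering (application id / submit time).
        return Long.compare(a.applicationId, b.applicationId);
    }
}
{code}
Plugged into the queue's application ordering, a comparator of this shape keeps the current FIFO behaviour for equal priorities while letting higher-priority applications be offered containers first.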
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985164#comment-13985164 ] Jian He commented on YARN-1368: --- It's good to have a scheduler-agnostic way to recover the containers and all the other scheduler state for apps/attempts. I have renamed the title to reflect this. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Anubhav Dhoot YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
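To illustrate what "send this information to the schedulers along with the NODE_ADDED_EVENT" could look like in a scheduler-agnostic layer, here is a rough sketch. All of the types and method names below are invented for the example; they are not the actual YARN events, scheduler interfaces, or the patch being prepared here.
{code:java}
import java.util.List;

// Hypothetical view of a container the NM reports as still running when it registers.
interface RunningContainerReport {
    String getApplicationId();
    int getMemoryMb();
    int getVcores();
}

// Hypothetical hook a scheduler would implement once, instead of each scheduler
// re-implementing the whole recovery walk.
interface RecoverableScheduler {
    void recoverContainer(String appId, int memoryMb, int vcores, String nodeId);
}

class NodeAddedRecovery {
    private final RecoverableScheduler scheduler;

    NodeAddedRecovery(RecoverableScheduler scheduler) {
        this.scheduler = scheduler;
    }

    // Invoked when an NM re-registers after an RM restart and reports its live containers.
    void onNodeAdded(String nodeId, List<RunningContainerReport> containers) {
        for (RunningContainerReport c : containers) {
            // The common layer translates the NM report into per-application accounting,
            // so FIFO/Capacity/Fair schedulers only need to implement recoverContainer().
            scheduler.recoverContainer(c.getApplicationId(), c.getMemoryMb(), c.getVcores(), nodeId);
        }
    }
}
{code}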
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985193#comment-13985193 ] Sandy Ryza commented on YARN-1963: -- Thanks for picking this up Sunil. Can we separate this into a couple JIRAs? One for the ResourceManager and protocol changes, one for the MapReduce changes, and one for the Capacity Scheduler changes. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Arun C Murthy It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.2#6252)