[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035910#comment-15035910 ] Junping Du commented on YARN-4403: -- Agree. LivelinessMonitor is more critical as it affects the lifecycle of all YARN daemons/containers, so I prefer we get this in first. Later, we can file two separate JIRAs: one for YARN and the other for MapReduce to address the other places. I am sure there are many places to change, as all timeouts could be affected, and we should be careful. The Hadoop/HDFS projects should have already adopted this earlier. > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
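To illustrate the monotonic-clock point discussed above, here is a minimal sketch (illustrative only, not the YARN-4403 patch itself) of an expiry check built on Hadoop's {{Time.monotonicNow()}}; a wall-clock based version of the same check can report bogus durations if the system time is stepped by settimeofday/NTP:
{code}
import org.apache.hadoop.util.Time;

// Minimal sketch of a liveliness/expiry check. The class and field names are
// illustrative; only Time.monotonicNow() is a real Hadoop API.
class ExpiryCheck {
  private final long expireIntervalMs;
  private long lastHeartbeatMs;

  ExpiryCheck(long expireIntervalMs) {
    this.expireIntervalMs = expireIntervalMs;
    // System.currentTimeMillis() can jump backwards or forwards when the wall
    // clock is reset, which corrupts "now - last" durations. Time.monotonicNow()
    // is derived from System.nanoTime() and only moves forward.
    this.lastHeartbeatMs = Time.monotonicNow();
  }

  void onHeartbeat() {
    lastHeartbeatMs = Time.monotonicNow();
  }

  boolean isExpired() {
    return Time.monotonicNow() - lastHeartbeatMs > expireIntervalMs;
  }
}
{code}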
[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers
[ https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035886#comment-15035886 ] Junping Du commented on YARN-4408: -- [~rkanter], have we seen this happen in a real deployment? I don't quite understand how a container can reach EXITED_WITH_SUCCESS without getting launched. It sounds only theoretically possible for containers with the life cycle Localized -> Killing -> EXITED_WITH_SUCCESS, as the only place that sends the CONTAINER_EXITED_WITH_SUCCESS event is ContainerLaunch (or RecoveredContainerLaunch). Isn't it? Am I missing something here? > NodeManager still reports negative running containers > - > > Key: YARN-4408 > URL: https://issues.apache.org/jira/browse/YARN-4408 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: YARN-4408.001.patch > > > YARN-1697 fixed a problem where the NodeManager metrics could report a > negative number of running containers. However, it missed a rare case where > this can still happen. > YARN-1697 added a flag to indicate if the container was actually launched > ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which > is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to > {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge > if we actually ran the container and incremented the gauge . However, this > flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to > {{DONE}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
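For readers following the state-machine discussion above, a stripped-down sketch (illustrative names, not the actual YARN-1697/YARN-4408 code) of the guard being debated: the running-containers gauge is only decremented on a terminal transition if the container was actually launched, i.e. if the gauge was ever incremented:
{code}
// Illustrative sketch of guarding a metrics gauge with a "wasLaunched" flag.
class ContainerMetricsGuard {
  private boolean wasLaunched = false;
  private int runningContainers = 0;

  // LOCALIZED -> RUNNING: the only transition that increments the gauge.
  void onLaunched() {
    wasLaunched = true;
    runningContainers++;
  }

  // Terminal transitions to DONE (EXITED_WITH_SUCCESS, EXITED_WITH_FAILURE,
  // CONTAINER_CLEANEDUP_AFTER_KILL): decrement only if we ever incremented.
  void onDone() {
    if (wasLaunched) {
      runningContainers--;
    }
  }

  int getRunningContainers() {
    return runningContainers;
  }
}
{code}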
[jira] [Updated] (YARN-4304) AM max resource configuration per partition to be displayed/updated correctly in UI and in various partition related metrics
[ https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-4304: -- Attachment: REST_and_UI.zip > AM max resource configuration per partition to be displayed/updated correctly > in UI and in various partition related metrics > > > Key: YARN-4304 > URL: https://issues.apache.org/jira/browse/YARN-4304 > Project: Hadoop YARN > Issue Type: Sub-task > Components: webapp >Affects Versions: 2.7.1 >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-4304.patch, 0002-YARN-4304.patch, > 0003-YARN-4304.patch, REST_and_UI.zip > > > As we are supporting per-partition level max AM resource percentage > configuration, UI and various metrics also need to display correct > configurations related to same. > For eg: Current UI still shows am-resource percentage per queue level. This > is to be updated correctly when label config is used. > - Display max-am-percentage per-partition in Scheduler UI (label also) and in > ClusterMetrics page > - Update queue/partition related metrics w.r.t per-partition > am-resource-percentage -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035950#comment-15035950 ] Junping Du commented on YARN-4403: -- bq. For YARN/MR, I could also definitely help in getting it in shape once this is in. Sure. Feel free to create/assign a JIRA and work on it. I will help review it. Thanks! > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4411) ResourceManager IllegalArgumentException error
[ https://issues.apache.org/jira/browse/YARN-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035895#comment-15035895 ] Junping Du commented on YARN-4411: -- Hi [~yarntime], I just assign the JIRA to you. > ResourceManager IllegalArgumentException error > -- > > Key: YARN-4411 > URL: https://issues.apache.org/jira/browse/YARN-4411 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: yarntime >Assignee: yarntime > > in version 2.7.1, line 1914 may cause IllegalArgumentException in > RMAppAttemptImpl: > YarnApplicationAttemptState.valueOf(this.getState().toString()) > cause by this.getState() returns type RMAppAttemptState which may not be > converted to YarnApplicationAttemptState. > {noformat} > java.lang.IllegalArgumentException: No enum constant > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.LAUNCHED_UNMANAGED_SAVING > at java.lang.Enum.valueOf(Enum.java:236) > at > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:1870) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttemptReport(ClientRMService.java:355) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationAttemptReport(ApplicationClientProtocolPBServiceImpl.java:355) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
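As background to the stack trace above: RMAppAttemptImpl's internal state enum contains transitional states (such as LAUNCHED_UNMANAGED_SAVING) that have no counterpart in the public YarnApplicationAttemptState enum, so converting by name with Enum.valueOf throws. Below is a hedged sketch of the problem and of one possible mapping-based workaround, using stand-in enums rather than the real YARN classes:
{code}
// Stand-in enums for illustration only; not the real YARN types.
enum InternalAttemptState { NEW, LAUNCHED_UNMANAGED_SAVING, LAUNCHED, RUNNING }
enum PublicAttemptState { NEW, LAUNCHED, RUNNING }

class AttemptStateConverter {
  // The failing pattern: PublicAttemptState.valueOf(internal.toString())
  // throws IllegalArgumentException for internal-only states.
  //
  // One possible workaround (not necessarily the committed fix): map
  // internal-only states to the closest public state explicitly.
  static PublicAttemptState convert(InternalAttemptState internal) {
    switch (internal) {
      case LAUNCHED_UNMANAGED_SAVING:
        return PublicAttemptState.LAUNCHED; // pick a sensible public state
      default:
        return PublicAttemptState.valueOf(internal.name());
    }
  }
}
{code}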
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035871#comment-15035871 ] Sunil G commented on YARN-4403: --- Hi [~djp], thanks for the patch. Yes, it's better to use {{MonotonicClock}} in general. We also use SystemClock in the proportional preemption policy; as you mentioned, I feel a general YARN ticket can handle all of this together. > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4304) AM max resource configuration per partition to be displayed/updated correctly in UI and in various partition related metrics
[ https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-4304: -- Attachment: (was: REST_and_UI.zip) > AM max resource configuration per partition to be displayed/updated correctly > in UI and in various partition related metrics > > > Key: YARN-4304 > URL: https://issues.apache.org/jira/browse/YARN-4304 > Project: Hadoop YARN > Issue Type: Sub-task > Components: webapp >Affects Versions: 2.7.1 >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-4304.patch, 0002-YARN-4304.patch, > 0003-YARN-4304.patch > > > As we are supporting per-partition level max AM resource percentage > configuration, UI and various metrics also need to display correct > configurations related to same. > For eg: Current UI still shows am-resource percentage per queue level. This > is to be updated correctly when label config is used. > - Display max-am-percentage per-partition in Scheduler UI (label also) and in > ClusterMetrics page > - Update queue/partition related metrics w.r.t per-partition > am-resource-percentage -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4401) A failed app recovery should not prevent the RM from starting
[ https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035892#comment-15035892 ] Daniel Templeton commented on YARN-4401: I suppose I posed my proposal a little naively. Let's try again. The reason for configuring HA is to prevent an outage. It should be possible to tell the standby to come up regardless of recovery failures, in effect performing automatically the operation that [~sunilg] described or failing the bad app(s) or whatever. The app resource issue I offered was just the first example I (thought I) found while skimming the code. Rather than having to hunt down every possible way to throw an exception (checked or unchecked) during recovery, it would be convenient to have recovery catch any exception, log it, and do something sensible so that the RM can come up for cases where RM availability is a priority. > A failed app recovery should not prevent the RM from starting > - > > Key: YARN-4401 > URL: https://issues.apache.org/jira/browse/YARN-4401 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Critical > Attachments: YARN-4401.001.patch > > > There are many different reasons why an app recovery could fail with an > exception, causing the RM start to be aborted. If that happens the RM will > fail to start. Presumably, the reason the RM is trying to do a recovery is > that it's the standby trying to fill in for the active. Failing to come up > defeats the purpose of the HA configuration. Instead of preventing the RM > from starting, a failed app recovery should log an error and skip the > application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
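A rough sketch of the behaviour proposed above (this is not the attached YARN-4401 patch; the ApplicationState type and recoverApplication method are hypothetical stand-ins): catch per-application failures during recovery, log them, and continue so the RM can still come up.
{code}
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of best-effort application recovery; illustrative only.
class BestEffortRecovery {
  private static final Logger LOG =
      LoggerFactory.getLogger(BestEffortRecovery.class);

  interface ApplicationState { String getAppId(); }

  void recoverAll(Map<String, ApplicationState> appStates) {
    for (ApplicationState appState : appStates.values()) {
      try {
        recoverApplication(appState);
      } catch (Exception e) {
        // Log and skip the bad app instead of aborting RM startup, so an HA
        // standby can still become active when availability matters most.
        LOG.error("Failed to recover application " + appState.getAppId()
            + "; skipping it", e);
      }
    }
  }

  private void recoverApplication(ApplicationState appState) throws Exception {
    // real recovery logic would go here
  }
}
{code}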
[jira] [Updated] (YARN-4411) ResourceManager IllegalArgumentException error
[ https://issues.apache.org/jira/browse/YARN-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4411: - Assignee: yarntime > ResourceManager IllegalArgumentException error > -- > > Key: YARN-4411 > URL: https://issues.apache.org/jira/browse/YARN-4411 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: yarntime >Assignee: yarntime > > in version 2.7.1, line 1914 may cause IllegalArgumentException in > RMAppAttemptImpl: > YarnApplicationAttemptState.valueOf(this.getState().toString()) > cause by this.getState() returns type RMAppAttemptState which may not be > converted to YarnApplicationAttemptState. > {noformat} > java.lang.IllegalArgumentException: No enum constant > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.LAUNCHED_UNMANAGED_SAVING > at java.lang.Enum.valueOf(Enum.java:236) > at > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:1870) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttemptReport(ClientRMService.java:355) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationAttemptReport(ApplicationClientProtocolPBServiceImpl.java:355) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4309) Add debug information to application logs when a container fails
[ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-4309: Attachment: YARN-4309.005.patch Uploaded a new version of the patch with Windows support. > Add debug information to application logs when a container fails > > > Key: YARN-4309 > URL: https://issues.apache.org/jira/browse/YARN-4309 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: YARN-4309.001.patch, YARN-4309.002.patch, > YARN-4309.003.patch, YARN-4309.004.patch, YARN-4309.005.patch > > > Sometimes when a container fails, it can be pretty hard to figure out why it > failed. > My proposal is that if a container fails, we collect information about the > container local dir and dump it into the container log dir. Ideally, I'd like > to tar up the directory entirely, but I'm not sure of the security and space > implications of such an approach. At the very least, we can list all the files > in the container local dir, and dump the contents of launch_container.sh (into > the container log dir). > When log aggregation occurs, all this information will automatically get > collected and make debugging such failures much easier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4304) AM max resource configuration per partition to be displayed/updated correctly in UI and in various partition related metrics
[ https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-4304: -- Attachment: 0004-YARN-4304.patch Attaching an updated version of the patch addressing the comments. Also attached screenshots and REST outputs. [~leftnoteasy], please help to review the same. > AM max resource configuration per partition to be displayed/updated correctly > in UI and in various partition related metrics > > > Key: YARN-4304 > URL: https://issues.apache.org/jira/browse/YARN-4304 > Project: Hadoop YARN > Issue Type: Sub-task > Components: webapp >Affects Versions: 2.7.1 >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-4304.patch, 0002-YARN-4304.patch, > 0003-YARN-4304.patch, 0004-YARN-4304.patch, REST_and_UI.zip > > > As we are supporting per-partition level max AM resource percentage > configuration, UI and various metrics also need to display correct > configurations related to same. > For eg: Current UI still shows am-resource percentage per queue level. This > is to be updated correctly when label config is used. > - Display max-am-percentage per-partition in Scheduler UI (label also) and in > ClusterMetrics page > - Update queue/partition related metrics w.r.t per-partition > am-resource-percentage -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035923#comment-15035923 ] Sunil G commented on YARN-4403: --- Yes. That sounds good. +1 for the patch. For YARN/MR, I could also definitely help in getting it in shape once this is in. > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4371) "yarn application -kill" should take multiple application ids
[ https://issues.apache.org/jira/browse/YARN-4371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036056#comment-15036056 ] Sunil G commented on YARN-4371: --- Hi [~ozawa], could you please help to review the patch? > "yarn application -kill" should take multiple application ids > - > > Key: YARN-4371 > URL: https://issues.apache.org/jira/browse/YARN-4371 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Tsuyoshi Ozawa >Assignee: Sunil G > Attachments: 0001-YARN-4371.patch, 0002-YARN-4371.patch > > > Currently we cannot pass multiple applications to "yarn application -kill" > command. The command should take multiple application ids at the same time. > Each entry should be separated with whitespace like: > {code} > yarn application -kill application_1234_0001 application_1234_0007 > application_1234_0012 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4304) AM max resource configuration per partition to be displayed/updated correctly in UI and in various partition related metrics
[ https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036136#comment-15036136 ] Sunil G commented on YARN-4304: --- Test case failures and findbugs are related. I will address these in next patch. > AM max resource configuration per partition to be displayed/updated correctly > in UI and in various partition related metrics > > > Key: YARN-4304 > URL: https://issues.apache.org/jira/browse/YARN-4304 > Project: Hadoop YARN > Issue Type: Sub-task > Components: webapp >Affects Versions: 2.7.1 >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-4304.patch, 0002-YARN-4304.patch, > 0003-YARN-4304.patch, 0004-YARN-4304.patch, REST_and_UI.zip > > > As we are supporting per-partition level max AM resource percentage > configuration, UI and various metrics also need to display correct > configurations related to same. > For eg: Current UI still shows am-resource percentage per queue level. This > is to be updated correctly when label config is used. > - Display max-am-percentage per-partition in Scheduler UI (label also) and in > ClusterMetrics page > - Update queue/partition related metrics w.r.t per-partition > am-resource-percentage -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart
[ https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036203#comment-15036203 ] Daniel Templeton commented on YARN-4406: Now that I've had a chance to look at the web UI code, I see that my theory was close, but not quite. The number of decommissioned nodes is taken from {{ClusterMetrics.getMetrics().getDecomissionedNMs()}}, which is just the count of nodes in the excludes list. The list of decommissioned nodes comes from {{ResourceManager.getRMContext().getInactiveRMNodes()}}, which contains only nodes that have been decommissioned since the last restart. > RM Web UI continues to show decommissioned nodes even after RM restart > -- > > Key: YARN-4406 > URL: https://issues.apache.org/jira/browse/YARN-4406 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ray Chiang >Priority: Minor > > If you start up a cluster, decommission a NodeManager, and restart the RM, > the decommissioned node list will still show a positive number (1 in the case > of 1 node) and if you click on the list, it will be empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036317#comment-15036317 ] Wangda Tan commented on YARN-4225: -- [~eepayne], Thanks for working on the patch, a few comments: 1) bq. public abstract Boolean getPreemptionDisabled(); Do you think it is better to return boolean? I'd prefer to return a default value (false) instead of returning null. 2) For QueueCLI, is it better to print "preemption is disabled/enabled" instead of "preemption status: disabled/enabled"? 3) Is it possible to add a simple test to verify the end-to-end behavior? > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers
[ https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036345#comment-15036345 ] Robert Kanter commented on YARN-4408: - I haven't been able to reproduce this issue, and I agree that it's not a common occurrence; but we have seen the number of running containers go negative internally on two different clusters and also on a customer's cluster. So I went through the code and the state machine looking for ways we could decrement the gauge without first incrementing it. As far as I can tell, this is the only way it can happen, because we don't check {{container.wasLaunched}} as in the other two places where we decrement the gauge. > NodeManager still reports negative running containers > - > > Key: YARN-4408 > URL: https://issues.apache.org/jira/browse/YARN-4408 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: YARN-4408.001.patch > > > YARN-1697 fixed a problem where the NodeManager metrics could report a > negative number of running containers. However, it missed a rare case where > this can still happen. > YARN-1697 added a flag to indicate if the container was actually launched > ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which > is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to > {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge > if we actually ran the container and incremented the gauge . However, this > flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to > {{DONE}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart
[ https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036336#comment-15036336 ] Kuhu Shukla commented on YARN-4406: --- Yes, that is right; the issue is present on trunk. During {{serviceInit}}, we could populate this metric with the number of decommissioned nodes in the inactive list, since AFAIK we don't care about nodes that were decommissioned before the last restart. At present:
{code}
private void setDecomissionedNMsMetrics() {
  Set<String> excludeList = hostsReader.getExcludedHosts();
  ClusterMetrics.getMetrics().setDecommisionedNMs(excludeList.size());
}
{code}
To:
{code}
private void setDecomissionedNMsMetrics() {
  int numDecommissioned = 0;
  for (RMNode rmNode : rmContext.getInactiveRMNodes().values()) {
    if (rmNode.getState() == NodeState.DECOMMISSIONED) {
      numDecommissioned++;
    }
  }
  ClusterMetrics.getMetrics().setDecommisionedNMs(numDecommissioned);
}
{code}
> RM Web UI continues to show decommissioned nodes even after RM restart > -- > > Key: YARN-4406 > URL: https://issues.apache.org/jira/browse/YARN-4406 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ray Chiang >Priority: Minor > > If you start up a cluster, decommission a NodeManager, and restart the RM, > the decommissioned node list will still show a positive number (1 in the case > of 1 node) and if you click on the list, it will be empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart
[ https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036186#comment-15036186 ] Kuhu Shukla commented on YARN-4406: --- Thank you [~Naganarasimha]. Asking [~rchiang] if it's alright for me to work on it. I am currently working in that code base for YARN-4311. > RM Web UI continues to show decommissioned nodes even after RM restart > -- > > Key: YARN-4406 > URL: https://issues.apache.org/jira/browse/YARN-4406 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ray Chiang >Priority: Minor > > If you start up a cluster, decommission a NodeManager, and restart the RM, > the decommissioned node list will still show a positive number (1 in the case > of 1 node) and if you click on the list, it will be empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart
[ https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang resolved YARN-4406. -- Resolution: Duplicate > RM Web UI continues to show decommissioned nodes even after RM restart > -- > > Key: YARN-4406 > URL: https://issues.apache.org/jira/browse/YARN-4406 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ray Chiang >Priority: Minor > > If you start up a cluster, decommission a NodeManager, and restart the RM, > the decommissioned node list will still show a positive number (1 in the case > of 1 node) and if you click on the list, it will be empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover
[ https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036299#comment-15036299 ] Xuan Gong commented on YARN-4392: - Thanks for the comments, [~Naganarasimha]. bq. So actually in the patch I had followed the approach such that for finish events I had sent a synchronous push on the ATS side; in this way we are sure that the AppFinish event is sent out before we store the state of the app in the RM state store. But yes, this approach looks a little shaky, though I thought it might solve the issue. Let us *not* send the ATS event synchronously. Otherwise, it would depend on the ATS. It is always good to make sure that we send the ATS event "exactly once", but that would make things complicated (such as sending ATS events synchronously) and would add an additional, unnecessary dependency. Currently, we are using an "at least once" approach. Since all the information is the same for duplicate events (after applying the patch), I think that is fine. What is your opinion? > ApplicationCreatedEvent event time resets after RM restart/failover > --- > > Key: YARN-4392 > URL: https://issues.apache.org/jira/browse/YARN-4392 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0 >Reporter: Xuan Gong >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch, > YARN-4392.2.patch > > > {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - > Finished time 1437453994768 is ahead of started time 1440308399674 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437454008244 is ahead of started time 1440308399676 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444305171 is ahead of started time 1440308399653 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444293115 is ahead of started time 1440308399647 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444379645 is ahead of started time 1440308399656 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444361234 is ahead of started time 1440308399655 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444342029 is ahead of started time 1440308399654 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444323447 is ahead of started time 1440308399654 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 143730006 is ahead of started time 1440308399660 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 143715698 is ahead of started time 1440308399659 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 143719060 is ahead of started time 1440308399658 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444393931 is ahead of started time 1440308399657 > {code} . > From ATS logs, we would see a large amount of 'stale alerts' messages > periodically -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4389) "yarn.am.blacklisting.enabled" and "yarn.am.blacklisting.disable-failure-threshold" should be app specific rather than a setting for whole YARN cluster
[ https://issues.apache.org/jira/browse/YARN-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-4389: -- Attachment: 0002-YARN-4389.patch Attaching an updated patch correcting the test case failures. Also fixed a few checkstyle and javadoc problems. [~djp], could you please help to review the patch? > "yarn.am.blacklisting.enabled" and > "yarn.am.blacklisting.disable-failure-threshold" should be app specific > rather than a setting for whole YARN cluster > --- > > Key: YARN-4389 > URL: https://issues.apache.org/jira/browse/YARN-4389 > Project: Hadoop YARN > Issue Type: Bug > Components: applications >Reporter: Junping Du >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4389.patch, 0002-YARN-4389.patch > > > "yarn.am.blacklisting.enabled" and > "yarn.am.blacklisting.disable-failure-threshold" should be application > specific rather than a setting at the cluster level, or we shouldn't maintain > amBlacklistingEnabled and blacklistDisableThreshold at the per-rmApp level. We > should allow each AM to override this config, i.e. via submissionContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-4311: -- Attachment: YARN-4311-v2.patch This patch addresses the graceful and other versions of refreshNodes, and also adds a timestamp-based check, run every {{RM_NODE_REMOVAL_CHK_INTERVAL_MSEC}}, for nodes in the inactive list that should be untracked, removing them based on {{RM_NODE_REMOVAL_TIMEOUT_MSEC}}. A decommissioned node is not transitioned to shutdown, but the timer acts on it just as it would on a shutdown node. A decommissioning node will transition to shutdown if it is found to be 'untracked'. The unit test tries out several scenarios to check that the metrics and node lists are correct. I can break it into more tests if the idea behind it looks acceptable. > Removing nodes from include and exclude lists will not remove them from > decommissioned nodes list > - > > Key: YARN-4311 > URL: https://issues.apache.org/jira/browse/YARN-4311 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: YARN-4311-v1.patch, YARN-4311-v2.patch > > > In order to fully forget about a node, removing the node from the include and > exclude lists is not sufficient. The RM lists it under Decomm-ed nodes. The > tricky part that [~jlowe] pointed out was the case when include lists are not > used; in that case, we don't want the nodes to fall off if they are not active. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
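To make the description above more concrete, a simplified sketch of the timer-driven cleanup (the constant names are taken from the comment; the surrounding types are stand-ins, not the actual patch): every check interval, inactive nodes that are 'untracked' and older than the removal timeout are dropped.
{code}
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, illustrative sketch of removing untracked inactive nodes.
class InactiveNodeReaper {
  static final long RM_NODE_REMOVAL_TIMEOUT_MSEC = 3600_000L;

  static class InactiveNode {
    final boolean untracked;       // not present in include or exclude lists
    final long becameInactiveAtMs; // monotonic timestamp when it went inactive
    InactiveNode(boolean untracked, long becameInactiveAtMs) {
      this.untracked = untracked;
      this.becameInactiveAtMs = becameInactiveAtMs;
    }
  }

  final Map<String, InactiveNode> inactiveNodes = new ConcurrentHashMap<>();

  // Invoked by a timer thread every RM_NODE_REMOVAL_CHK_INTERVAL_MSEC.
  void removeUntrackedNodes(long nowMs) {
    Iterator<Map.Entry<String, InactiveNode>> it =
        inactiveNodes.entrySet().iterator();
    while (it.hasNext()) {
      InactiveNode node = it.next().getValue();
      if (node.untracked
          && nowMs - node.becameInactiveAtMs > RM_NODE_REMOVAL_TIMEOUT_MSEC) {
        it.remove(); // forget the node so it no longer shows up in UI/metrics
      }
    }
  }
}
{code}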
[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart
[ https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036214#comment-15036214 ] Ray Chiang commented on YARN-4406: -- Thanks [~Naganarasimha]. I'll close up this JIRA as a duplicate. As for fixing it, I'll leave that up to you and [~templedf]. It looks like you two are further ahead than I am. > RM Web UI continues to show decommissioned nodes even after RM restart > -- > > Key: YARN-4406 > URL: https://issues.apache.org/jira/browse/YARN-4406 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ray Chiang >Priority: Minor > > If you start up a cluster, decommission a NodeManager, and restart the RM, > the decommissioned node list will still show a positive number (1 in the case > of 1 node) and if you click on the list, it will be empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)
[ https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036257#comment-15036257 ] Varun Saxena commented on YARN-3840: [~jianhe], kindly review. [~mohdshahidkhan], perhaps you can have a look as well. > Resource Manager web ui issue when sorting application by id (with > application having id > 9999) > > > Key: YARN-3840 > URL: https://issues.apache.org/jira/browse/YARN-3840 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: LINTE >Assignee: Varun Saxena > Fix For: 2.8.0, 2.7.3 > > Attachments: RMApps.png, RMApps_Sorted.png, YARN-3840-1.patch, > YARN-3840-2.patch, YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch, > YARN-3840-6.patch, YARN-3840.reopened.001.patch, yarn-3840-7.patch > > > On the WEBUI, the global main view page > http://resourcemanager:8088/cluster/apps doesn't display applications over > 9999. > With command line it works (# yarn application -list). > Regards, > Alexandre -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: (was: YARN-4225.002.patch) > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4304) AM max resource configuration per partition to be displayed/updated correctly in UI and in various partition related metrics
[ https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036127#comment-15036127 ] Hadoop QA commented on YARN-4304: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 55s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 12s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 33s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 14s {color} | {color:red} Patch generated 20 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager (total was 210, now 219). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 39s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 31s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager introduced 4 new FindBugs issues. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 36s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 4s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 22s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 149m 40s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | | Dead store to a in org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(ResponseInfo, String) At
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: YARN-4225.002.patch Attaching {{YARN-4225.002.patch}}, which changes {{getPreemptionDisabled()}} to return a {{Boolean}}; {{QueueCLI#printQueueInfo}} now checks for non-null before printing out the queue status. The patch applies cleanly to trunk, branch-2, and branch-2.8. {quote} In general, what is the Hadoop policy when a newer client talks to an older server and the protobuf output is different from what is expected? Should we expose some form of the has method, or should we overload the get method as I described here? I would appreciate any additional feedback from the community in general (Vinod Kumar Vavilapalli, do you have any thoughts?) {quote} [~vinodkv], did you have a chance to think about this? [~jlowe], do you have any additional thoughts? > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
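For what it is worth, a minimal sketch of the client-side pattern discussed above: treat the preemption flag as a nullable {{Boolean}} so that a null (an older server that never populated the field) simply produces no output. The interface and method names here are illustrative, not the exact CLI code.
{code}
import java.io.PrintWriter;

// Illustrative sketch of printing an optional field that an older server may
// never have set.
class QueueStatusPrinter {
  interface QueueInfoLike {
    // Boolean rather than boolean so "absent" can be represented as null.
    Boolean getPreemptionDisabled();
  }

  void printQueueInfo(PrintWriter writer, QueueInfoLike queueInfo) {
    Boolean preemptionDisabled = queueInfo.getPreemptionDisabled();
    if (preemptionDisabled != null) {
      writer.println("Preemption : "
          + (preemptionDisabled ? "disabled" : "enabled"));
    }
    // else: older server, print nothing rather than guessing a default
  }
}
{code}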
[jira] [Created] (YARN-4412) Create ClusterManager to compute ordered list of preferred NMs for QUEUEABLE containers
Arun Suresh created YARN-4412: - Summary: Create ClusterManager to compute ordered list of preferred NMs for QUEUEABLE containers Key: YARN-4412 URL: https://issues.apache.org/jira/browse/YARN-4412 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun Suresh Assignee: Arun Suresh Introduce a Cluster Manager that aggregates Load and Policy information from individual Node Managers and computes an ordered list of preferred Node managers to be used as target Nodes for QUEUEABLE container allocations. This list can be pushed out to the Node Manager (specifically the AMRMProxy running on the Node) via the Allocate Response. This will be used to make local Scheduling decisions -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: YARN-4225.003.patch Sorry, I mis-named the patch. Should have been {{YARN-4225.003.patch}} > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4309) Add debug information to application logs when a container fails
[ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036112#comment-15036112 ] Hadoop QA commented on YARN-4309: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 44s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 38s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 24s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 30s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 37s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 5s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 35s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 52s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 17s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 24s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 24s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 30s {color} | {color:red} Patch generated 4 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 357, now 358). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 39s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s {color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 40s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 3s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 29s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 10s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 2s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 28s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 24s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 23s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_85. {color} | |
[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart
[ https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036241#comment-15036241 ] Sunil G commented on YARN-4406: --- YARN-3226, which is a subtask of YARN-914, will be splitting the cluster metrics into two tables (adding a separate node metrics table), as we have to show decommissioning nodes too. A patch is already available there for the same. However, this particular case is not handled there. As progress is made here, please also keep an eye on the progress in YARN-3226. > RM Web UI continues to show decommissioned nodes even after RM restart > -- > > Key: YARN-4406 > URL: https://issues.apache.org/jira/browse/YARN-4406 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ray Chiang >Priority: Minor > > If you start up a cluster, decommission a NodeManager, and restart the RM, > the decommissioned node list will still show a positive number (1 in the case > of 1 node) and if you click on the list, it will be empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036240#comment-15036240 ] Hadoop QA commented on YARN-4403: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 55s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 59s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 23s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 31s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 11s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 41s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 56s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 3s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 7s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 5s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 17s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 53s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 53s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 1s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 59s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 64m 3s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 14s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_85. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 42s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 22s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 168m 35s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK
[jira] [Updated] (YARN-3458) CPU resource monitoring in Windows
[ https://issues.apache.org/jira/browse/YARN-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Inigo Goiri updated YARN-3458: -- Attachment: YARN-3458-9.patch Rebased to trunk. > CPU resource monitoring in Windows > -- > > Key: YARN-3458 > URL: https://issues.apache.org/jira/browse/YARN-3458 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.7.0 > Environment: Windows >Reporter: Inigo Goiri >Assignee: Inigo Goiri >Priority: Minor > Labels: BB2015-05-TBR, containers, metrics, windows > Attachments: YARN-3458-1.patch, YARN-3458-2.patch, YARN-3458-3.patch, > YARN-3458-4.patch, YARN-3458-5.patch, YARN-3458-6.patch, YARN-3458-7.patch, > YARN-3458-8.patch, YARN-3458-9.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > The current implementation of getCpuUsagePercent() for > WindowsBasedProcessTree is left as unavailable. Attached a proposal of how to > do it. I reused the CpuTimeTracker using 1 jiffy=1ms. > This was left open by YARN-3122. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100
[ https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036381#comment-15036381 ] Jian He commented on YARN-4398: --- [~iceberg565], added you to the contributor list. Assigned this to you. You can also now assign jira to yourself. Committing this. > Yarn recover functionality causes the cluster running slowly and the cluster > usage rate is far below 100 > > > Key: YARN-4398 > URL: https://issues.apache.org/jira/browse/YARN-4398 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: NING DING >Assignee: NING DING > Attachments: YARN-4398.2.patch, YARN-4398.3.patch, YARN-4398.4.patch > > > In my hadoop cluster, the resourceManager recover functionality is enabled > with FileSystemRMStateStore. > I found this cause the yarn cluster running slowly and cluster usage rate is > just 50 even there are many pending Apps. > The scenario is below. > In thread A, the RMAppImpl$RMAppNewlySavingTransition is calling > storeNewApplication method defined in RMStateStore. This storeNewApplication > method is synchronized. > {code:title=RMAppImpl.java|borderStyle=solid} > private static final class RMAppNewlySavingTransition extends > RMAppTransition { > @Override > public void transition(RMAppImpl app, RMAppEvent event) { > // If recovery is enabled then store the application information in a > // non-blocking call so make sure that RM has stored the information > // needed to restart the AM after RM restart without further client > // communication > LOG.info("Storing application with id " + app.applicationId); > app.rmContext.getStateStore().storeNewApplication(app); > } > } > {code} > {code:title=RMStateStore.java|borderStyle=solid} > public synchronized void storeNewApplication(RMApp app) { > ApplicationSubmissionContext context = app > > .getApplicationSubmissionContext(); > assert context instanceof ApplicationSubmissionContextPBImpl; > ApplicationStateData appState = > ApplicationStateData.newInstance( > app.getSubmitTime(), app.getStartTime(), context, app.getUser()); > dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState)); > } > {code} > In thread B, the FileSystemRMStateStore is calling > storeApplicationStateInternal method. It's also synchronized. > This storeApplicationStateInternal method saves an ApplicationStateData into > HDFS and it normally costs 90~300 milliseconds in my hadoop cluster. > {code:title=FileSystemRMStateStore.java|borderStyle=solid} > public synchronized void storeApplicationStateInternal(ApplicationId appId, > ApplicationStateData appStateDataPB) throws Exception { > Path appDirPath = getAppDir(rmAppRoot, appId); > mkdirsWithRetries(appDirPath); > Path nodeCreatePath = getNodePath(appDirPath, appId.toString()); > LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath); > byte[] appStateData = appStateDataPB.getProto().toByteArray(); > try { > // currently throw all exceptions. May need to respond differently for > HA > // based on whether we have lost the right to write to FS > writeFileWithRetries(nodeCreatePath, appStateData, true); > } catch (Exception e) { > LOG.info("Error storing info for app: " + appId, e); > throw e; > } > } > {code} > Think thread B firstly comes into > FileSystemRMStateStore.storeApplicationStateInternal method, then thread A > will be blocked for a while because of synchronization. In ResourceManager > there is only one RMStateStore instance. In my cluster it's > FileSystemRMStateStore type. 
> Debug the RMAppNewlySavingTransition.transition method, the thread stack > shows it's called form AsyncDispatcher.dispatch method. This method code is > as below. > {code:title=AsyncDispatcher.java|borderStyle=solid} > protected void dispatch(Event event) { > //all events go thru this loop > if (LOG.isDebugEnabled()) { > LOG.debug("Dispatching the event " + event.getClass().getName() + "." > + event.toString()); > } > Class type = event.getType().getDeclaringClass(); > try{ > EventHandler handler = eventDispatchers.get(type); > if(handler != null) { > handler.handle(event); > } else { > throw new Exception("No handler for registered for " + type); > } > } catch (Throwable t) { > //TODO Maybe log the state of the queue > LOG.fatal("Error in dispatcher thread", t); > // If serviceStop is called, we should exit this thread gracefully. > if
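To make the bottleneck described above concrete: the committed fix (see the Hudson comment further down) removes the unnecessary synchronization in RMStateStore, so the dispatcher-facing call no longer shares a lock with the slow file-system write. The following is a minimal, self-contained sketch of that general pattern, not the actual patch: the fast path only enqueues work, and the 90-300 ms write happens on the store's own thread.
{code:title=NonBlockingStoreSketch.java|borderStyle=solid}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Sketch of keeping the dispatcher thread unblocked: storeNewApplication()
 * never does I/O and holds no lock shared with the slow write below.
 */
public class NonBlockingStoreSketch {
  private final BlockingQueue<byte[]> pending = new LinkedBlockingQueue<byte[]>();

  // Fast path, called from the dispatcher thread: enqueue only.
  public void storeNewApplication(byte[] appState) {
    pending.add(appState);
  }

  // Slow path, run on the store's own thread: the expensive write happens here.
  public void storeLoop() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      byte[] appState = pending.take();
      writeToFileSystem(appState); // stand-in for the HDFS write with retries
    }
  }

  private void writeToFileSystem(byte[] data) {
    // Placeholder: in the real store this is where the state file is written.
  }
}
{code}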
[jira] [Assigned] (YARN-3102) Decommisioned Nodes not listed in Web UI
[ https://issues.apache.org/jira/browse/YARN-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla reassigned YARN-3102: - Assignee: Kuhu Shukla (was: Naganarasimha G R) > Decommisioned Nodes not listed in Web UI > > > Key: YARN-3102 > URL: https://issues.apache.org/jira/browse/YARN-3102 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 > Environment: 2 Node Manager and 1 Resource Manager >Reporter: Bibin A Chundatt >Assignee: Kuhu Shukla >Priority: Minor > > Configure yarn.resourcemanager.nodes.exclude-path in yarn-site.xml to point to the > yarn.exclude file on the RM1 machine > Add the NM1 host name to yarn.exclude > Start the nodes as listed below: NM1, NM2, Resource Manager > Now check the decommissioned nodes in /cluster/nodes > The number of decommissioned nodes is listed as 1, but the table is empty in > /cluster/nodes/decommissioned (details of the decommissioned node are not shown) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart
[ https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036410#comment-15036410 ] Daniel Templeton commented on YARN-4406: That's the simplest resolution, but I was actually leaning the other direction: making the list of decommissioned nodes include the full excludes list. I guess it comes down to how we define decommissioned in the UI. I interpret the excludes list as the canonical list of decommissioned nodes. > RM Web UI continues to show decommissioned nodes even after RM restart > -- > > Key: YARN-4406 > URL: https://issues.apache.org/jira/browse/YARN-4406 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ray Chiang >Assignee: Kuhu Shukla >Priority: Minor > > If you start up a cluster, decommission a NodeManager, and restart the RM, > the decommissioned node list will still show a positive number (1 in the case > of 1 node) and if you click on the list, it will be empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4408) NodeManager still reports negative running containers
[ https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-4408: Attachment: YARN-4408.002.patch The 002 patch adds the debug message. > NodeManager still reports negative running containers > - > > Key: YARN-4408 > URL: https://issues.apache.org/jira/browse/YARN-4408 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: YARN-4408.001.patch, YARN-4408.002.patch > > > YARN-1697 fixed a problem where the NodeManager metrics could report a > negative number of running containers. However, it missed a rare case where > this can still happen. > YARN-1697 added a flag to indicate if the container was actually launched > ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which > is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to > {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge > if we actually ran the container and incremented the gauge . However, this > flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to > {{DONE}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers
[ https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036367#comment-15036367 ] Junping Du commented on YARN-4408: -- Thanks Robert for the reply. If we don't understand how it happens in the short term, shall we add a warning log message when a container finishes with success but was never launched? I think that could be helpful for debugging in the future, or this fix could cover something unusual we are missing. The rest looks fine to me. > NodeManager still reports negative running containers > - > > Key: YARN-4408 > URL: https://issues.apache.org/jira/browse/YARN-4408 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: YARN-4408.001.patch > > > YARN-1697 fixed a problem where the NodeManager metrics could report a > negative number of running containers. However, it missed a rare case where > this can still happen. > YARN-1697 added a flag to indicate if the container was actually launched > ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which > is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to > {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge > if we actually ran the container and incremented the gauge. However, this > flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to > {{DONE}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
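To illustrate what such a guard plus warning could look like (a hedged sketch only, not the actual YARN-4408 patch; RunningGauge and the method name are hypothetical stand-ins):
{code:title=ExitedWithSuccessGuardSketch.java|borderStyle=solid}
import java.util.logging.Logger;

/**
 * Sketch: decrement the running-containers gauge only when the container was
 * really launched, and warn otherwise so the unusual path is visible at the
 * default log level.
 */
public class ExitedWithSuccessGuardSketch {
  private static final Logger LOG =
      Logger.getLogger(ExitedWithSuccessGuardSketch.class.getName());

  /** Hypothetical stand-in for the NodeManager metrics gauge. */
  interface RunningGauge {
    void decrement();
  }

  // wasLaunched is the flag YARN-1697 added on the LOCALIZED -> RUNNING edge.
  void onExitedWithSuccessToDone(String containerId, boolean wasLaunched,
      RunningGauge runningContainers) {
    if (wasLaunched) {
      runningContainers.decrement();
    } else {
      LOG.warning("Container " + containerId + " reported EXITED_WITH_SUCCESS"
          + " but was never launched; skipping running-containers decrement");
    }
  }
}
{code}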
[jira] [Updated] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart
[ https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-4406: - Assignee: Kuhu Shukla > RM Web UI continues to show decommissioned nodes even after RM restart > -- > > Key: YARN-4406 > URL: https://issues.apache.org/jira/browse/YARN-4406 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ray Chiang >Assignee: Kuhu Shukla >Priority: Minor > > If you start up a cluster, decommission a NodeManager, and restart the RM, > the decommissioned node list will still show a positive number (1 in the case > of 1 node) and if you click on the list, it will be empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100
[ https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036444#comment-15036444 ] Hudson commented on YARN-4398: -- FAILURE: Integrated in Hadoop-trunk-Commit #8910 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8910/]) YARN-4398. Remove unnecessary synchronization in RMStateStore. (jianhe: rev 6b9a5beb2b2f9589ef86670f2d763e8488ee5e90) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * hadoop-yarn-project/CHANGES.txt > Yarn recover functionality causes the cluster running slowly and the cluster > usage rate is far below 100 > > > Key: YARN-4398 > URL: https://issues.apache.org/jira/browse/YARN-4398 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: NING DING >Assignee: NING DING > Fix For: 2.7.3 > > Attachments: YARN-4398.2.patch, YARN-4398.3.patch, YARN-4398.4.patch > > > In my hadoop cluster, the resourceManager recover functionality is enabled > with FileSystemRMStateStore. > I found this cause the yarn cluster running slowly and cluster usage rate is > just 50 even there are many pending Apps. > The scenario is below. > In thread A, the RMAppImpl$RMAppNewlySavingTransition is calling > storeNewApplication method defined in RMStateStore. This storeNewApplication > method is synchronized. > {code:title=RMAppImpl.java|borderStyle=solid} > private static final class RMAppNewlySavingTransition extends > RMAppTransition { > @Override > public void transition(RMAppImpl app, RMAppEvent event) { > // If recovery is enabled then store the application information in a > // non-blocking call so make sure that RM has stored the information > // needed to restart the AM after RM restart without further client > // communication > LOG.info("Storing application with id " + app.applicationId); > app.rmContext.getStateStore().storeNewApplication(app); > } > } > {code} > {code:title=RMStateStore.java|borderStyle=solid} > public synchronized void storeNewApplication(RMApp app) { > ApplicationSubmissionContext context = app > > .getApplicationSubmissionContext(); > assert context instanceof ApplicationSubmissionContextPBImpl; > ApplicationStateData appState = > ApplicationStateData.newInstance( > app.getSubmitTime(), app.getStartTime(), context, app.getUser()); > dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState)); > } > {code} > In thread B, the FileSystemRMStateStore is calling > storeApplicationStateInternal method. It's also synchronized. > This storeApplicationStateInternal method saves an ApplicationStateData into > HDFS and it normally costs 90~300 milliseconds in my hadoop cluster. > {code:title=FileSystemRMStateStore.java|borderStyle=solid} > public synchronized void storeApplicationStateInternal(ApplicationId appId, > ApplicationStateData appStateDataPB) throws Exception { > Path appDirPath = getAppDir(rmAppRoot, appId); > mkdirsWithRetries(appDirPath); > Path nodeCreatePath = getNodePath(appDirPath, appId.toString()); > LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath); > byte[] appStateData = appStateDataPB.getProto().toByteArray(); > try { > // currently throw all exceptions. 
May need to respond differently for > HA > // based on whether we have lost the right to write to FS > writeFileWithRetries(nodeCreatePath, appStateData, true); > } catch (Exception e) { > LOG.info("Error storing info for app: " + appId, e); > throw e; > } > } > {code} > Think thread B firstly comes into > FileSystemRMStateStore.storeApplicationStateInternal method, then thread A > will be blocked for a while because of synchronization. In ResourceManager > there is only one RMStateStore instance. In my cluster it's > FileSystemRMStateStore type. > Debug the RMAppNewlySavingTransition.transition method, the thread stack > shows it's called form AsyncDispatcher.dispatch method. This method code is > as below. > {code:title=AsyncDispatcher.java|borderStyle=solid} > protected void dispatch(Event event) { > //all events go thru this loop > if (LOG.isDebugEnabled()) { > LOG.debug("Dispatching the event " + event.getClass().getName() + "." > + event.toString()); > } > Class type = event.getType().getDeclaringClass(); > try{ >
[jira] [Commented] (YARN-4389) "yarn.am.blacklisting.enabled" and "yarn.am.blacklisting.disable-failure-threshold" should be app specific rather than a setting for whole YARN cluster
[ https://issues.apache.org/jira/browse/YARN-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036511#comment-15036511 ] Hadoop QA commented on YARN-4389: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 59s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 1s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 15s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 41s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 41s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 2s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 32s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 55s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 37s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 6s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 6s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 6s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 18s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 18s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 18s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 29s {color} | {color:red} Patch generated 5 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 160, now 165). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 40s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 40s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 40s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 35s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 57s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 9s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 9s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 27s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 14s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_85. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 13s {color} | {color:red}
[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers
[ https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036557#comment-15036557 ] Junping Du commented on YARN-4408: -- +1 on 003 patch. Will commit it shortly if no further comments from others. > NodeManager still reports negative running containers > - > > Key: YARN-4408 > URL: https://issues.apache.org/jira/browse/YARN-4408 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: YARN-4408.001.patch, YARN-4408.002.patch, > YARN-4408.003.patch > > > YARN-1697 fixed a problem where the NodeManager metrics could report a > negative number of running containers. However, it missed a rare case where > this can still happen. > YARN-1697 added a flag to indicate if the container was actually launched > ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which > is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to > {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge > if we actually ran the container and incremented the gauge . However, this > flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to > {{DONE}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100
[ https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-4398: -- Assignee: NING DING > Yarn recover functionality causes the cluster running slowly and the cluster > usage rate is far below 100 > > > Key: YARN-4398 > URL: https://issues.apache.org/jira/browse/YARN-4398 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: NING DING >Assignee: NING DING > Attachments: YARN-4398.2.patch, YARN-4398.3.patch, YARN-4398.4.patch > > > In my hadoop cluster, the resourceManager recover functionality is enabled > with FileSystemRMStateStore. > I found this cause the yarn cluster running slowly and cluster usage rate is > just 50 even there are many pending Apps. > The scenario is below. > In thread A, the RMAppImpl$RMAppNewlySavingTransition is calling > storeNewApplication method defined in RMStateStore. This storeNewApplication > method is synchronized. > {code:title=RMAppImpl.java|borderStyle=solid} > private static final class RMAppNewlySavingTransition extends > RMAppTransition { > @Override > public void transition(RMAppImpl app, RMAppEvent event) { > // If recovery is enabled then store the application information in a > // non-blocking call so make sure that RM has stored the information > // needed to restart the AM after RM restart without further client > // communication > LOG.info("Storing application with id " + app.applicationId); > app.rmContext.getStateStore().storeNewApplication(app); > } > } > {code} > {code:title=RMStateStore.java|borderStyle=solid} > public synchronized void storeNewApplication(RMApp app) { > ApplicationSubmissionContext context = app > > .getApplicationSubmissionContext(); > assert context instanceof ApplicationSubmissionContextPBImpl; > ApplicationStateData appState = > ApplicationStateData.newInstance( > app.getSubmitTime(), app.getStartTime(), context, app.getUser()); > dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState)); > } > {code} > In thread B, the FileSystemRMStateStore is calling > storeApplicationStateInternal method. It's also synchronized. > This storeApplicationStateInternal method saves an ApplicationStateData into > HDFS and it normally costs 90~300 milliseconds in my hadoop cluster. > {code:title=FileSystemRMStateStore.java|borderStyle=solid} > public synchronized void storeApplicationStateInternal(ApplicationId appId, > ApplicationStateData appStateDataPB) throws Exception { > Path appDirPath = getAppDir(rmAppRoot, appId); > mkdirsWithRetries(appDirPath); > Path nodeCreatePath = getNodePath(appDirPath, appId.toString()); > LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath); > byte[] appStateData = appStateDataPB.getProto().toByteArray(); > try { > // currently throw all exceptions. May need to respond differently for > HA > // based on whether we have lost the right to write to FS > writeFileWithRetries(nodeCreatePath, appStateData, true); > } catch (Exception e) { > LOG.info("Error storing info for app: " + appId, e); > throw e; > } > } > {code} > Think thread B firstly comes into > FileSystemRMStateStore.storeApplicationStateInternal method, then thread A > will be blocked for a while because of synchronization. In ResourceManager > there is only one RMStateStore instance. In my cluster it's > FileSystemRMStateStore type. > Debug the RMAppNewlySavingTransition.transition method, the thread stack > shows it's called form AsyncDispatcher.dispatch method. 
This method code is > as below. > {code:title=AsyncDispatcher.java|borderStyle=solid} > protected void dispatch(Event event) { > //all events go thru this loop > if (LOG.isDebugEnabled()) { > LOG.debug("Dispatching the event " + event.getClass().getName() + "." > + event.toString()); > } > Class type = event.getType().getDeclaringClass(); > try{ > EventHandler handler = eventDispatchers.get(type); > if(handler != null) { > handler.handle(event); > } else { > throw new Exception("No handler for registered for " + type); > } > } catch (Throwable t) { > //TODO Maybe log the state of the queue > LOG.fatal("Error in dispatcher thread", t); > // If serviceStop is called, we should exit this thread gracefully. > if (exitOnDispatchException > && (ShutdownHookManager.get().isShutdownInProgress()) == false > && stopped == false) { > Thread shutDownThread = new
[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers
[ https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036485#comment-15036485 ] Junping Du commented on YARN-4408: -- Thanks Robert for updating the patch. Can we make log messages here in WARN level given this is unusual case and our log level is only enabled for INFO or above by default? > NodeManager still reports negative running containers > - > > Key: YARN-4408 > URL: https://issues.apache.org/jira/browse/YARN-4408 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: YARN-4408.001.patch, YARN-4408.002.patch > > > YARN-1697 fixed a problem where the NodeManager metrics could report a > negative number of running containers. However, it missed a rare case where > this can still happen. > YARN-1697 added a flag to indicate if the container was actually launched > ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which > is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to > {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge > if we actually ran the container and incremented the gauge . However, this > flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to > {{DONE}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers
[ https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036531#comment-15036531 ] Hadoop QA commented on YARN-4408: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 23s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 12s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 3s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 9s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 54s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 16s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 35m 57s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12775359/YARN-4408.002.patch | | JIRA Issue | YARN-4408 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 98a4b6fe7980 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 6b9a5be | | findbugs | v3.0.0 | | JDK v1.7.0_85 Test Results |
[jira] [Commented] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover
[ https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036580#comment-15036580 ] Xuan Gong commented on YARN-4392: - [~Naganarasimha] bq. there is no limit on number of running apps in state store and finished apps are restricted to a configurable number. In such cases would not there be many created events in a larger cluster on recovery? This is a good point given the performance of ATS v1 is not that scalable. Will it cause any issue if the APP_CREATED event is missing ? If that only cause the missing related information in ATS webui/webservice, I am OK with not re-sending the ATS events on recovery. [~jlowe] What is your opinion ? > ApplicationCreatedEvent event time resets after RM restart/failover > --- > > Key: YARN-4392 > URL: https://issues.apache.org/jira/browse/YARN-4392 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0 >Reporter: Xuan Gong >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch, > YARN-4392.2.patch > > > {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - > Finished time 1437453994768 is ahead of started time 1440308399674 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437454008244 is ahead of started time 1440308399676 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444305171 is ahead of started time 1440308399653 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444293115 is ahead of started time 1440308399647 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444379645 is ahead of started time 1440308399656 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444361234 is ahead of started time 1440308399655 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444342029 is ahead of started time 1440308399654 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444323447 is ahead of started time 1440308399654 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 143730006 is ahead of started time 1440308399660 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 143715698 is ahead of started time 1440308399659 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 143719060 is ahead of started time 1440308399658 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444393931 is ahead of started time 1440308399657 > {code} . > From ATS logs, we would see a large amount of 'stale alerts' messages > periodically -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart
[ https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036389#comment-15036389 ] Ray Chiang commented on YARN-4406: -- That looks good to me. > RM Web UI continues to show decommissioned nodes even after RM restart > -- > > Key: YARN-4406 > URL: https://issues.apache.org/jira/browse/YARN-4406 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ray Chiang >Priority: Minor > > If you start up a cluster, decommission a NodeManager, and restart the RM, > the decommissioned node list will still show a positive number (1 in the case > of 1 node) and if you click on the list, it will be empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4408) NodeManager still reports negative running containers
[ https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-4408: Attachment: YARN-4408.003.patch Sure. The 003 patch uses warn level. > NodeManager still reports negative running containers > - > > Key: YARN-4408 > URL: https://issues.apache.org/jira/browse/YARN-4408 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: YARN-4408.001.patch, YARN-4408.002.patch, > YARN-4408.003.patch > > > YARN-1697 fixed a problem where the NodeManager metrics could report a > negative number of running containers. However, it missed a rare case where > this can still happen. > YARN-1697 added a flag to indicate if the container was actually launched > ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which > is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to > {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge > if we actually ran the container and incremented the gauge . However, this > flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to > {{DONE}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover
[ https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036530#comment-15036530 ] Naganarasimha G R commented on YARN-4392: - [~xgong], Yes you are right, it would not be good to depend on ATS that it will send certain events synchronously. but IIUC there is no limit on number of running apps in state store and finished apps are restricted to a configurable number. In such cases would not there be many created events in a larger cluster on recovery? my 2 cents would be atleast to avoid for app created event but if its not a great deal, then fine with the current fix. :) Thanks for assigning it to me, i can get the test case failure corrected as it was already handled in YARN-3127. > ApplicationCreatedEvent event time resets after RM restart/failover > --- > > Key: YARN-4392 > URL: https://issues.apache.org/jira/browse/YARN-4392 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0 >Reporter: Xuan Gong >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch, > YARN-4392.2.patch > > > {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - > Finished time 1437453994768 is ahead of started time 1440308399674 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437454008244 is ahead of started time 1440308399676 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444305171 is ahead of started time 1440308399653 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444293115 is ahead of started time 1440308399647 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444379645 is ahead of started time 1440308399656 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444361234 is ahead of started time 1440308399655 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444342029 is ahead of started time 1440308399654 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444323447 is ahead of started time 1440308399654 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 143730006 is ahead of started time 1440308399660 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 143715698 is ahead of started time 1440308399659 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 143719060 is ahead of started time 1440308399658 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444393931 is ahead of started time 1440308399657 > {code} . > From ATS logs, we would see a large amount of 'stale alerts' messages > periodically -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036560#comment-15036560 ] Eric Payne commented on YARN-4225: -- Thanks [~leftnoteasy], for your helpful comments. bq. Do you think is it better to return boolean? I'd prefer to return a default value (false) instead of return null This is the nature of the question that I have about the more general Hadoop policy, and which [~jlowe] and I were discussing in the comments above. Basically, the use case is a newer client is querying an older server, and so some of the newer protobuf entries that the client expects may not exist. In that case, we have 2 options that I can see: # The client exposes both the {{get}} protobuf method and the {{has}} protobuf method for the structure in question # We overload the {{get}} protobuf method to do the {{has}} checking internally and return NULL if the field doesn't exist. I actually prefer the second option because it exposes only one method. But, I would like to know the opinion of others and if there is already a precedence for this use case. > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
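A hedged sketch of the two options described above, using hypothetical names (QueueInfoProtoOrBuilder, hasPreemptionDisabled, and getPreemptionDisabled are illustrative stand-ins, not the exact generated protobuf API):
{code:title=PreemptionStatusAccessorSketch.java|borderStyle=solid}
public class PreemptionStatusAccessorSketch {

  /** Minimal stand-in for the generated protobuf interface. */
  public interface QueueInfoProtoOrBuilder {
    boolean hasPreemptionDisabled();
    boolean getPreemptionDisabled();
  }

  // Option 1: expose both accessors and let callers test presence themselves.
  public boolean hasPreemptionStatus(QueueInfoProtoOrBuilder p) {
    return p.hasPreemptionDisabled();
  }

  // Option 2: fold the has-check into get and return null when an older server
  // never populated the field.
  public Boolean getPreemptionStatus(QueueInfoProtoOrBuilder p) {
    return p.hasPreemptionDisabled()
        ? Boolean.valueOf(p.getPreemptionDisabled()) : null;
  }
}
{code}
With option 2, a newer client talking to an older server simply sees null and can fall back to a default, which is the trade-off being discussed above.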
[jira] [Created] (YARN-4410) hadoop
qeko created YARN-4410: -- Summary: hadoop Key: YARN-4410 URL: https://issues.apache.org/jira/browse/YARN-4410 Project: Hadoop YARN Issue Type: Bug Reporter: qeko -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4401) A failed app recovery should not prevent the RM from starting
[ https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035492#comment-15035492 ] Sunil G commented on YARN-4401: --- Hi [~templedf], I am not very sure about the use case here. However, I feel that if such a case occurs, we will have enough information in the logs to get the app-id. Then we can use the command below to clear such apps if necessary, rather than forcefully clearing them from the RMContext. {noformat} Usage: yarn resourcemanager [-format-state-store] [-remove-application-from-state-store <ApplicationId>] {noformat} > A failed app recovery should not prevent the RM from starting > - > > Key: YARN-4401 > URL: https://issues.apache.org/jira/browse/YARN-4401 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Critical > Attachments: YARN-4401.001.patch > > > There are many different reasons why an app recovery could fail with an > exception, causing the RM start to be aborted. If that happens the RM will > fail to start. Presumably, the reason the RM is trying to do a recovery is > that it's the standby trying to fill in for the active. Failing to come up > defeats the purpose of the HA configuration. Instead of preventing the RM > from starting, a failed app recovery should log an error and skip the > application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4309) Add debug information to application logs when a container fails
[ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-4309: Attachment: YARN-4309.004.patch Thanks for the review [~sidharta-s]. bq. Could you clarify why the debugging information gathering in DockerContainerExecutor.writeLaunchEnv is not guarded by a config check? Good catch. The config check should be present in DockerContainerExecutor as well. Fixed. bq. There seem to be minor inconsistent line spacing issues in the new test function in TestContainerLaunch.java Fixed. I've changed the find command in the latest version to not use the xtype option which seems to be Linux only. I've also renamed the scriptbuilder functions to indicate that they're meant for debugging purposes. > Add debug information to application logs when a container fails > > > Key: YARN-4309 > URL: https://issues.apache.org/jira/browse/YARN-4309 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: YARN-4309.001.patch, YARN-4309.002.patch, > YARN-4309.003.patch, YARN-4309.004.patch > > > Sometimes when a container fails, it can be pretty hard to figure out why it > failed. > My proposal is that if a container fails, we collect information about the > container local dir and dump it into the container log dir. Ideally, I'd like > to tar up the directory entirely, but I'm not sure of the security and space > implications of such a approach. At the very least, we can list all the files > in the container local dir, and dump the contents of launch_container.sh(into > the container log dir). > When log aggregation occurs, all this information will automatically get > collected and make debugging such failures much easier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
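For illustration only (not the attached YARN-4309 patch): the kind of portable shell commands an executor could append to the launch script when the debug-information option is enabled, built here as plain strings. containerLocalDir and containerLogDir are hypothetical placeholders, and plain -type is used because -xtype, as noted above, is Linux-only.
{code:title=ContainerDebugInfoSketch.java|borderStyle=solid}
import java.util.ArrayList;
import java.util.List;

/** Sketch of building debug commands for a container launch script. */
public class ContainerDebugInfoSketch {

  static List<String> buildDebugCommands(String containerLocalDir,
      String containerLogDir) {
    List<String> cmds = new ArrayList<String>();
    // Record every file that was localized for the container.
    cmds.add("find \"" + containerLocalDir + "\" -type f > \""
        + containerLogDir + "/directory.info\"");
    // Preserve the generated launch script alongside the container logs.
    cmds.add("cp \"" + containerLocalDir + "/launch_container.sh\" \""
        + containerLogDir + "/\"");
    return cmds;
  }
}
{code}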
[jira] [Commented] (YARN-4410) hadoop
[ https://issues.apache.org/jira/browse/YARN-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035468#comment-15035468 ] Naganarasimha G R commented on YARN-4410: - What is this JIRA for? If it was raised by mistake, please resolve it as Invalid. > hadoop > -- > > Key: YARN-4410 > URL: https://issues.apache.org/jira/browse/YARN-4410 > Project: Hadoop YARN > Issue Type: Bug >Reporter: qeko > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100
[ https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035471#comment-15035471 ] NING DING commented on YARN-4398: - I uploaded a new patch that removed useless whitespace. The current test cases can cover the modified code in this patch. This patch resolved performance issue. So no new unit test cases. > Yarn recover functionality causes the cluster running slowly and the cluster > usage rate is far below 100 > > > Key: YARN-4398 > URL: https://issues.apache.org/jira/browse/YARN-4398 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: NING DING > Attachments: YARN-4398.2.patch, YARN-4398.3.patch, YARN-4398.4.patch > > > In my hadoop cluster, the resourceManager recover functionality is enabled > with FileSystemRMStateStore. > I found this cause the yarn cluster running slowly and cluster usage rate is > just 50 even there are many pending Apps. > The scenario is below. > In thread A, the RMAppImpl$RMAppNewlySavingTransition is calling > storeNewApplication method defined in RMStateStore. This storeNewApplication > method is synchronized. > {code:title=RMAppImpl.java|borderStyle=solid} > private static final class RMAppNewlySavingTransition extends > RMAppTransition { > @Override > public void transition(RMAppImpl app, RMAppEvent event) { > // If recovery is enabled then store the application information in a > // non-blocking call so make sure that RM has stored the information > // needed to restart the AM after RM restart without further client > // communication > LOG.info("Storing application with id " + app.applicationId); > app.rmContext.getStateStore().storeNewApplication(app); > } > } > {code} > {code:title=RMStateStore.java|borderStyle=solid} > public synchronized void storeNewApplication(RMApp app) { > ApplicationSubmissionContext context = app > > .getApplicationSubmissionContext(); > assert context instanceof ApplicationSubmissionContextPBImpl; > ApplicationStateData appState = > ApplicationStateData.newInstance( > app.getSubmitTime(), app.getStartTime(), context, app.getUser()); > dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState)); > } > {code} > In thread B, the FileSystemRMStateStore is calling > storeApplicationStateInternal method. It's also synchronized. > This storeApplicationStateInternal method saves an ApplicationStateData into > HDFS and it normally costs 90~300 milliseconds in my hadoop cluster. > {code:title=FileSystemRMStateStore.java|borderStyle=solid} > public synchronized void storeApplicationStateInternal(ApplicationId appId, > ApplicationStateData appStateDataPB) throws Exception { > Path appDirPath = getAppDir(rmAppRoot, appId); > mkdirsWithRetries(appDirPath); > Path nodeCreatePath = getNodePath(appDirPath, appId.toString()); > LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath); > byte[] appStateData = appStateDataPB.getProto().toByteArray(); > try { > // currently throw all exceptions. May need to respond differently for > HA > // based on whether we have lost the right to write to FS > writeFileWithRetries(nodeCreatePath, appStateData, true); > } catch (Exception e) { > LOG.info("Error storing info for app: " + appId, e); > throw e; > } > } > {code} > Think thread B firstly comes into > FileSystemRMStateStore.storeApplicationStateInternal method, then thread A > will be blocked for a while because of synchronization. In ResourceManager > there is only one RMStateStore instance. 
In my cluster it's > FileSystemRMStateStore type. > Debug the RMAppNewlySavingTransition.transition method, the thread stack > shows it's called form AsyncDispatcher.dispatch method. This method code is > as below. > {code:title=AsyncDispatcher.java|borderStyle=solid} > protected void dispatch(Event event) { > //all events go thru this loop > if (LOG.isDebugEnabled()) { > LOG.debug("Dispatching the event " + event.getClass().getName() + "." > + event.toString()); > } > Class type = event.getType().getDeclaringClass(); > try{ > EventHandler handler = eventDispatchers.get(type); > if(handler != null) { > handler.handle(event); > } else { > throw new Exception("No handler for registered for " + type); > } > } catch (Throwable t) { > //TODO Maybe log the state of the queue > LOG.fatal("Error in dispatcher thread", t); > // If serviceStop is called, we should exit this thread gracefully. >
[jira] [Created] (YARN-4413) Nodes in the includes list should not be listed as decommissioned in the UI
Daniel Templeton created YARN-4413: -- Summary: Nodes in the includes list should not be listed as decommissioned in the UI Key: YARN-4413 URL: https://issues.apache.org/jira/browse/YARN-4413 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.7.1 Reporter: Daniel Templeton Assignee: Daniel Templeton If I decommission a node and then move it from the excludes list back to the includes list, but I don't restart the node, the node will still be listed by the web UI as decommissioned until either the NM or RM is restarted. Ideally, removing the node from the excludes list and putting it back into the includes list should cause the node to be reported as shutdown instead. CC [~kshukla] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4292) ResourceUtilization should be a part of NodeInfo REST API
[ https://issues.apache.org/jira/browse/YARN-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036619#comment-15036619 ] Wangda Tan commented on YARN-4292: -- Looks good, +1, will commit in a few days if no opposite opinions. Thanks [~sunilg]. > ResourceUtilization should be a part of NodeInfo REST API > - > > Key: YARN-4292 > URL: https://issues.apache.org/jira/browse/YARN-4292 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Sunil G > Attachments: 0001-YARN-4292.patch, 0002-YARN-4292.patch, > 0003-YARN-4292.patch, 0004-YARN-4292.patch, 0005-YARN-4292.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3623) We should have a config to indicate the Timeline Service version
[ https://issues.apache.org/jira/browse/YARN-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036654#comment-15036654 ] Li Lu commented on YARN-3623: - Hi [~Naganarasimha] sorry for the late reply (just came back from a vacation). Plan sounds good to me. One thing is that we may not need to mark this JIRA as ATS v1.5 since the fix here is a rather general one: from v1.5 on we need to handle this configuration in ATS correctly, thus I think it'll be a general JIRA and not attached with any specific version of ATS. Right now we have the patch for this JIRA so we can proceed with the rest of the plan? It will certainly be helpful if anyone has any concerns on Naga's plan. > We should have a config to indicate the Timeline Service version > > > Key: YARN-3623 > URL: https://issues.apache.org/jira/browse/YARN-3623 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Naganarasimha G R > Attachments: YARN-3623-2015-11-19.1.patch > > > So far RM, MR AM, DA AM added/changed new config to enable the feature to > write the timeline data to v2 server. It's good to have a YARN > timeline-service.version config like timeline-service.enable to indicate the > version of the running timeline service with the given YARN cluster. It's > beneficial for users to more smoothly move from v1 to v2, as they don't need > to change the existing config, but switch this config from v1 to v2. And each > framework doesn't need to have their own v1/v2 config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
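As a sketch of how a single version knob could be consumed once it exists, the snippet below branches on a float-valued setting. The exact key name ({{yarn.timeline-service.version}}) and the 1.0 default are assumptions for illustration, not something confirmed in this thread.
{code:title=TimelineVersionSketch.java|borderStyle=solid}
import org.apache.hadoop.conf.Configuration;

/** Sketch: pick a timeline writer based on one version configuration value. */
public class TimelineVersionSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Assumed key and default; frameworks would read this instead of
    // maintaining their own per-version switches.
    float version = conf.getFloat("yarn.timeline-service.version", 1.0f);
    if (version >= 2.0f) {
      System.out.println("Use the ATS v2 writer");
    } else if (version >= 1.5f) {
      System.out.println("Use the ATS v1.5 entity-file writer");
    } else {
      System.out.println("Use the ATS v1 writer");
    }
  }
}
{code}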
[jira] [Updated] (YARN-4405) Support node label store in non-appendable file system
[ https://issues.apache.org/jira/browse/YARN-4405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4405: - Attachment: YARN-4405.2.patch Attached ver.2 patch fixed test failures and findbug warnings. > Support node label store in non-appendable file system > -- > > Key: YARN-4405 > URL: https://issues.apache.org/jira/browse/YARN-4405 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4405.1.patch, YARN-4405.2.patch > > > Existing node label file system store implementation uses append to write > edit logs. However, some file system doesn't support append, we need add an > implementation to support such non-appendable file systems as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
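A minimal sketch of the pattern a non-appendable store can use instead of appending edit-log records: serialize the full node-label state to a temporary file and replace the previous mirror in one step. This is only the general idea under stated assumptions, not the YARN-4405 implementation, and it uses the local file system API purely for brevity.
{code:title=NonAppendableLabelStoreSketch.java|borderStyle=solid}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Sketch: rewrite a full mirror of the label state on every mutation. */
public class NonAppendableLabelStoreSketch {
  private final Path mirror;

  public NonAppendableLabelStoreSketch(String storeDir) {
    this.mirror = Paths.get(storeDir, "nodelabel.mirror");
  }

  public void writeFullState(byte[] serializedLabels) throws IOException {
    Path tmp = mirror.resolveSibling(mirror.getFileName() + ".writing");
    Files.write(tmp, serializedLabels);
    // Replace the old mirror in one step so readers never see a partial file.
    Files.move(tmp, mirror, StandardCopyOption.REPLACE_EXISTING);
  }
}
{code}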
[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036629#comment-15036629 ] Hadoop QA commented on YARN-4311: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 2s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 19s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 28s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 47s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 47s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 44s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 29s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 4s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 43s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 10m 43s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 54s {color} | {color:green} the patch passed with JDK v {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 10m 54s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 13s {color} | {color:red} Patch generated 5 new checkstyle issues in root (total was 396, now 400). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 43s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 47s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 22s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 40s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 26s {color} | {color:red} hadoop-yarn-api in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 47s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 54s {color} | {color:green} hadoop-sls in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 25s {color} | {color:red} hadoop-yarn-api in the patch failed with JDK v1.7.0_85. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 54s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 59s {color} | {color:green} hadoop-sls in the patch passed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 218m 41s
[jira] [Commented] (YARN-4413) Nodes in the includes list should not be listed as decommissioned in the UI
[ https://issues.apache.org/jira/browse/YARN-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036643#comment-15036643 ] Kuhu Shukla commented on YARN-4413: --- Thanks for reporting this [~templedf]. Was a node refresh done after the file change ? If yes then I think, since this metric is updated during AddNodeTransition (which updates rejoined metrics) , there is no transition that takes care of this until the node tries to register/heartbeat (as it is absent from all RMNodeImpl lists). One way could be to do this check in {{refreshNodes}}. > Nodes in the includes list should not be listed as decommissioned in the UI > --- > > Key: YARN-4413 > URL: https://issues.apache.org/jira/browse/YARN-4413 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Daniel Templeton > > If I decommission a node and then move it from the excludes list back to the > includes list, but I don't restart the node, the node will still be listed by > the web UI as decomissioned until either the NM or RM is restarted. Ideally, > removing the node from the excludes list and putting it back into the > includes list should cause the node to be reported as shutdown instead. > CC [~kshukla] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
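As a rough illustration of the {{refreshNodes}}-time check suggested above, the sketch below uses stand-in names (it is not the actual RM code): on refresh, any host that is back in the includes list but still counted as decommissioned has the metric decremented.
{code:title=RefreshNodesSketch.java|borderStyle=solid}
import java.util.Set;

/** Stand-in sketch of the check discussed above; names are illustrative only. */
class RefreshNodesSketch {
  interface ClusterMetricsStub {
    void decrDecommissionedNMs();
  }

  /** Hosts currently counted as decommissioned by the metrics. */
  private final Set<String> decommissionedHosts;
  private final ClusterMetricsStub metrics;

  RefreshNodesSketch(Set<String> decommissionedHosts, ClusterMetricsStub metrics) {
    this.decommissionedHosts = decommissionedHosts;
    this.metrics = metrics;
  }

  /** Called after the include/exclude files are re-read. */
  void onRefresh(Set<String> includeList, Set<String> excludeList) {
    for (String host : includeList) {
      // A host that moved back to the includes list should no longer be reported
      // as decommissioned, even though it never re-registered or heartbeated.
      if (!excludeList.contains(host) && decommissionedHosts.remove(host)) {
        metrics.decrDecommissionedNMs();
      }
    }
  }
}
{code}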
[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart
[ https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036616#comment-15036616 ] Kuhu Shukla commented on YARN-4406: --- I agree. I was thinking about that too. During {{registerwithRM()}} we throw a YarnException while on the ResourceTrackerService side we just send NodeAction as SHUTDOWN. We could in fact update InactiveRMNode list with this node, so that it is consistent. Let me know what you think. I will put up a patch soon. > RM Web UI continues to show decommissioned nodes even after RM restart > -- > > Key: YARN-4406 > URL: https://issues.apache.org/jira/browse/YARN-4406 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ray Chiang >Assignee: Kuhu Shukla >Priority: Minor > > If you start up a cluster, decommission a NodeManager, and restart the RM, > the decommissioned node list will still show a positive number (1 in the case > of 1 node) and if you click on the list, it will be empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4293) ResourceUtilization should be a part of yarn node CLI
[ https://issues.apache.org/jira/browse/YARN-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036630#comment-15036630 ] Wangda Tan commented on YARN-4293: -- Thanks [~sunilg]. [~kasha], I found the biggest change in this patch is moving the ResourceUtilization class from server.api to api. Do you think ResourceUtilization should be a part of the user-facing API? > ResourceUtilization should be a part of yarn node CLI > - > > Key: YARN-4293 > URL: https://issues.apache.org/jira/browse/YARN-4293 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Sunil G > Attachments: 0001-YARN-4293.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4413) Nodes in the includes list should not be listed as decommissioned in the UI
[ https://issues.apache.org/jira/browse/YARN-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036645#comment-15036645 ] Daniel Templeton commented on YARN-4413: That's what I was thinking. > Nodes in the includes list should not be listed as decommissioned in the UI > --- > > Key: YARN-4413 > URL: https://issues.apache.org/jira/browse/YARN-4413 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Daniel Templeton > > If I decommission a node and then move it from the excludes list back to the > includes list, but I don't restart the node, the node will still be listed by > the web UI as decomissioned until either the NM or RM is restarted. Ideally, > removing the node from the excludes list and putting it back into the > includes list should cause the node to be reported as shutdown instead. > CC [~kshukla] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4293) ResourceUtilization should be a part of yarn node CLI
[ https://issues.apache.org/jira/browse/YARN-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036771#comment-15036771 ] Hadoop QA commented on YARN-4293: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 4s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 26s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 7s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 47s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 50s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 7m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 57s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 33s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 6s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 9m 6s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 21m 42s {color} | {color:red} root-jdk1.8.0_66 with JDK v1.8.0_66 generated 1 new issues (was 751, now 751). {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 9m 6s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 27s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 9m 27s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 31m 9s {color} | {color:red} root-jdk1.7.0_85 with JDK v1.7.0_85 generated 1 new issues (was 745, now 745). {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 9m 27s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 5s {color} | {color:red} Patch generated 8 new checkstyle issues in root (total was 254, now 261). 
{color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 46s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 49s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 8m 47s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 59s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 37s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 25s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 2m 1s {color} | {color:red} hadoop-yarn-common in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 25s {color} | {color:green} hadoop-yarn-server-common in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 53s {color} | {color:green}
[jira] [Commented] (YARN-4413) Nodes in the includes list should not be listed as decommissioned in the UI
[ https://issues.apache.org/jira/browse/YARN-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036644#comment-15036644 ] Daniel Templeton commented on YARN-4413: Yes. The refresh marks nodes newly added to the excludes list as decommissioned, but it doesn't do anything for nodes newly added to the includes list. > Nodes in the includes list should not be listed as decommissioned in the UI > --- > > Key: YARN-4413 > URL: https://issues.apache.org/jira/browse/YARN-4413 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Daniel Templeton > > If I decommission a node and then move it from the excludes list back to the > includes list, but I don't restart the node, the node will still be listed by > the web UI as decomissioned until either the NM or RM is restarted. Ideally, > removing the node from the excludes list and putting it back into the > includes list should cause the node to be reported as shutdown instead. > CC [~kshukla] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)
[ https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036708#comment-15036708 ] Jian He commented on YARN-3840: --- Latest patch looks good to me, thanks [~varun_saxena]. > Resource Manager web ui issue when sorting application by id (with > application having id > 9999) > > > Key: YARN-3840 > URL: https://issues.apache.org/jira/browse/YARN-3840 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: LINTE >Assignee: Varun Saxena > Fix For: 2.8.0, 2.7.3 > > Attachments: RMApps.png, RMApps_Sorted.png, YARN-3840-1.patch, > YARN-3840-2.patch, YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch, > YARN-3840-6.patch, YARN-3840.reopened.001.patch, yarn-3840-7.patch > > > On the WEBUI, the global main view page : > http://resourcemanager:8088/cluster/apps doesn't display applications over > 9999. > With command line it works (# yarn application -list). > Regards, > Alexandre -- This message was sent by Atlassian JIRA (v6.3.4#6332)
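Purely to illustrate the sorting symptom described here, which is consistent with the application ids being compared as strings: once the sequence number passes four digits, lexicographic order diverges from numeric order. The snippet below is a stand-alone demonstration, not code from the patch.
{code:title=AppIdSortDemo.java|borderStyle=solid}
import java.util.Arrays;
import java.util.Comparator;

public class AppIdSortDemo {
  public static void main(String[] args) {
    String[] ids = {
        "application_1449000000000_10000",
        "application_1449000000000_9999",
        "application_1449000000000_123"
    };

    // Plain lexicographic order puts 10000 before 123 and 9999.
    String[] lexical = ids.clone();
    Arrays.sort(lexical);
    System.out.println(Arrays.toString(lexical));

    // Sorting on the numeric suffix restores the expected order.
    String[] numeric = ids.clone();
    Arrays.sort(numeric, Comparator.comparingLong(
        (String id) -> Long.parseLong(id.substring(id.lastIndexOf('_') + 1))));
    System.out.println(Arrays.toString(numeric));
  }
}
{code}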
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036821#comment-15036821 ] Hadoop QA commented on YARN-4225: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 58s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 0s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 22s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 29s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 57s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 45s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 49s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 8s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 57s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 5s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 5s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 18s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 18s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 18s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 29s {color} | {color:red} Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 50, now 50). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 6s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 54s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 34s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common introduced 1 new FindBugs issues. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 51s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 9s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 27s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 4s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 64m 25s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 49m 30s {color} | {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 26s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} unit {color} |
[jira] [Updated] (YARN-2575) Consider creating separate ACLs for Reservation create/update/delete ops
[ https://issues.apache.org/jira/browse/YARN-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2575: - Assignee: Sean Po (was: Subru Krishnan) > Consider creating separate ACLs for Reservation create/update/delete ops > > > Key: YARN-2575 > URL: https://issues.apache.org/jira/browse/YARN-2575 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Subru Krishnan >Assignee: Sean Po > > YARN-1051 introduces the ReservationSystem and in the current implementation > anyone who can submit applications can also submit reservations. This JIRA is > to evaluate creating separate ACLs for Reservation create/update/delete ops. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4411) ResourceManager IllegalArgumentException error
[ https://issues.apache.org/jira/browse/YARN-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037003#comment-15037003 ] yarntime commented on YARN-4411: Hi Naganarasimha G R, thank you very much. > ResourceManager IllegalArgumentException error > -- > > Key: YARN-4411 > URL: https://issues.apache.org/jira/browse/YARN-4411 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: yarntime >Assignee: yarntime > > in version 2.7.1, line 1914 may cause IllegalArgumentException in > RMAppAttemptImpl: > YarnApplicationAttemptState.valueOf(this.getState().toString()) > cause by this.getState() returns type RMAppAttemptState which may not be > converted to YarnApplicationAttemptState. > {noformat} > java.lang.IllegalArgumentException: No enum constant > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.LAUNCHED_UNMANAGED_SAVING > at java.lang.Enum.valueOf(Enum.java:236) > at > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:1870) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttemptReport(ClientRMService.java:355) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationAttemptReport(ApplicationClientProtocolPBServiceImpl.java:355) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4411) ResourceManager IllegalArgumentException error
[ https://issues.apache.org/jira/browse/YARN-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037008#comment-15037008 ] yarntime commented on YARN-4411: OK, thank you very much. > ResourceManager IllegalArgumentException error > -- > > Key: YARN-4411 > URL: https://issues.apache.org/jira/browse/YARN-4411 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: yarntime >Assignee: yarntime > > in version 2.7.1, line 1914 may cause IllegalArgumentException in > RMAppAttemptImpl: > YarnApplicationAttemptState.valueOf(this.getState().toString()) > cause by this.getState() returns type RMAppAttemptState which may not be > converted to YarnApplicationAttemptState. > {noformat} > java.lang.IllegalArgumentException: No enum constant > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.LAUNCHED_UNMANAGED_SAVING > at java.lang.Enum.valueOf(Enum.java:236) > at > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:1870) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttemptReport(ClientRMService.java:355) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationAttemptReport(ApplicationClientProtocolPBServiceImpl.java:355) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
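To make the failure mode concrete, here is a small self-contained sketch with stand-in enums (not the real YARN types): calling valueOf on the other enum's toString throws exactly the kind of IllegalArgumentException shown in the stack trace, and an explicit mapping with a fallback avoids it.
{code:title=EnumConversionSketch.java|borderStyle=solid}
public class EnumConversionSketch {
  /** Stand-in for the internal RM attempt state. */
  enum AttemptState { LAUNCHED_UNMANAGED_SAVING, RUNNING, FINISHED }
  /** Stand-in for the report-facing state; has no LAUNCHED_UNMANAGED_SAVING constant. */
  enum ReportState { SUBMITTED, RUNNING, FINISHED }

  public static void main(String[] args) {
    AttemptState internal = AttemptState.LAUNCHED_UNMANAGED_SAVING;
    try {
      // Mirrors ReportState.valueOf(internal.toString()): the constant only exists
      // in the internal enum, so valueOf throws IllegalArgumentException.
      System.out.println(ReportState.valueOf(internal.toString()));
    } catch (IllegalArgumentException e) {
      System.out.println("valueOf failed: " + e.getMessage());
    }
    // A defensive alternative: map every internal state explicitly.
    System.out.println(toReportState(internal));
  }

  static ReportState toReportState(AttemptState s) {
    switch (s) {
      case RUNNING:  return ReportState.RUNNING;
      case FINISHED: return ReportState.FINISHED;
      default:       return ReportState.SUBMITTED; // fallback for internal-only states
    }
  }
}
{code}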
[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100
[ https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036968#comment-15036968 ] Hudson commented on YARN-4398: -- ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #658 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/658/]) YARN-4398. Remove unnecessary synchronization in RMStateStore. (jianhe: rev 6b9a5beb2b2f9589ef86670f2d763e8488ee5e90) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java > Yarn recover functionality causes the cluster running slowly and the cluster > usage rate is far below 100 > > > Key: YARN-4398 > URL: https://issues.apache.org/jira/browse/YARN-4398 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: NING DING >Assignee: NING DING > Fix For: 2.7.3 > > Attachments: YARN-4398.2.patch, YARN-4398.3.patch, YARN-4398.4.patch > > > In my hadoop cluster, the resourceManager recover functionality is enabled > with FileSystemRMStateStore. > I found this cause the yarn cluster running slowly and cluster usage rate is > just 50 even there are many pending Apps. > The scenario is below. > In thread A, the RMAppImpl$RMAppNewlySavingTransition is calling > storeNewApplication method defined in RMStateStore. This storeNewApplication > method is synchronized. > {code:title=RMAppImpl.java|borderStyle=solid} > private static final class RMAppNewlySavingTransition extends > RMAppTransition { > @Override > public void transition(RMAppImpl app, RMAppEvent event) { > // If recovery is enabled then store the application information in a > // non-blocking call so make sure that RM has stored the information > // needed to restart the AM after RM restart without further client > // communication > LOG.info("Storing application with id " + app.applicationId); > app.rmContext.getStateStore().storeNewApplication(app); > } > } > {code} > {code:title=RMStateStore.java|borderStyle=solid} > public synchronized void storeNewApplication(RMApp app) { > ApplicationSubmissionContext context = app > > .getApplicationSubmissionContext(); > assert context instanceof ApplicationSubmissionContextPBImpl; > ApplicationStateData appState = > ApplicationStateData.newInstance( > app.getSubmitTime(), app.getStartTime(), context, app.getUser()); > dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState)); > } > {code} > In thread B, the FileSystemRMStateStore is calling > storeApplicationStateInternal method. It's also synchronized. > This storeApplicationStateInternal method saves an ApplicationStateData into > HDFS and it normally costs 90~300 milliseconds in my hadoop cluster. > {code:title=FileSystemRMStateStore.java|borderStyle=solid} > public synchronized void storeApplicationStateInternal(ApplicationId appId, > ApplicationStateData appStateDataPB) throws Exception { > Path appDirPath = getAppDir(rmAppRoot, appId); > mkdirsWithRetries(appDirPath); > Path nodeCreatePath = getNodePath(appDirPath, appId.toString()); > LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath); > byte[] appStateData = appStateDataPB.getProto().toByteArray(); > try { > // currently throw all exceptions. 
May need to respond differently for > HA > // based on whether we have lost the right to write to FS > writeFileWithRetries(nodeCreatePath, appStateData, true); > } catch (Exception e) { > LOG.info("Error storing info for app: " + appId, e); > throw e; > } > } > {code} > Think thread B firstly comes into > FileSystemRMStateStore.storeApplicationStateInternal method, then thread A > will be blocked for a while because of synchronization. In ResourceManager > there is only one RMStateStore instance. In my cluster it's > FileSystemRMStateStore type. > Debug the RMAppNewlySavingTransition.transition method, the thread stack > shows it's called form AsyncDispatcher.dispatch method. This method code is > as below. > {code:title=AsyncDispatcher.java|borderStyle=solid} > protected void dispatch(Event event) { > //all events go thru this loop > if (LOG.isDebugEnabled()) { > LOG.debug("Dispatching the event " + event.getClass().getName() + "." > + event.toString()); > } > Class type = event.getType().getDeclaringClass(); >
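A minimal sketch of the idea behind the committed change (keeping the slow file-system write off the lock the dispatcher path needs); the class below is a stand-in, not the actual RMStateStore code.
{code:title=StateStoreSketch.java|borderStyle=solid}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Stand-in sketch: enqueuing a "store this app" event must not share a lock with
 * the 90-300 ms write path, otherwise the dispatcher thread blocks on every write.
 */
class StateStoreSketch {
  private final BlockingQueue<byte[]> pendingWrites = new LinkedBlockingQueue<byte[]>();
  private final Object writeLock = new Object();

  /** Called from the dispatcher thread: only enqueues, never waits on I/O. */
  void storeNewApplication(byte[] appState) {
    pendingWrites.add(appState);           // no shared lock with the writer
  }

  /** Called from a dedicated store thread: the slow path holds its own lock. */
  void drainAndWrite() throws InterruptedException {
    byte[] next = pendingWrites.take();
    synchronized (writeLock) {
      writeToBackingStore(next);           // placeholder for the HDFS write
    }
  }

  private void writeToBackingStore(byte[] data) {
    // Intentionally empty in this sketch.
  }
}
{code}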
[jira] [Commented] (YARN-4309) Add debug information to application logs when a container fails
[ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036979#comment-15036979 ] Sidharta Seethana commented on YARN-4309: - hi [~vvasudev], I am using the find command that you have in the patch against broken symlinks - it is not clear to me how broken symlink info is captured (please see below). Could you please clarify? {code} q (19:50:35) ~/symlink-test$ ls -l total 0 q (19:50:47) ~/symlink-test$ ln -s world hello q (19:51:03) ~/symlink-test$ find -L . -maxdepth 5 -type l -ls 21492794320 lrwxrwxrwx 1 sseethana sseethana5 Dec 2 19:51 ./hello -> world q (19:51:15) ~/symlink-test$ echo $? 0 q (19:51:52) ~/symlink-test$ uname -a Linux q 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux {code} > Add debug information to application logs when a container fails > > > Key: YARN-4309 > URL: https://issues.apache.org/jira/browse/YARN-4309 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: YARN-4309.001.patch, YARN-4309.002.patch, > YARN-4309.003.patch, YARN-4309.004.patch, YARN-4309.005.patch > > > Sometimes when a container fails, it can be pretty hard to figure out why it > failed. > My proposal is that if a container fails, we collect information about the > container local dir and dump it into the container log dir. Ideally, I'd like > to tar up the directory entirely, but I'm not sure of the security and space > implications of such a approach. At the very least, we can list all the files > in the container local dir, and dump the contents of launch_container.sh(into > the container log dir). > When log aggregation occurs, all this information will automatically get > collected and make debugging such failures much easier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
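For reference, broken links are exactly what {{find -L ... -type l}} reports: with -L every resolvable symlink is classified by its target's type, so only links whose targets are missing (or loops) still test as type l. Below is a stand-alone Java sketch of the same check; it is illustrative only and not what the patch does.
{code:title=BrokenSymlinkScan.java|borderStyle=solid}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** List symlinks under a directory whose targets do not resolve. */
public class BrokenSymlinkScan {
  public static void main(String[] args) throws IOException {
    Path root = Paths.get(args.length > 0 ? args[0] : ".");
    try (Stream<Path> paths = Files.walk(root, 5)) {   // depth 5, like -maxdepth 5
      paths.filter(Files::isSymbolicLink)
           // Files.exists follows the link by default, so a broken link reports false.
           .filter(p -> !Files.exists(p))
           .forEach(p -> System.out.println("broken symlink: " + p));
    }
  }
}
{code}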
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037185#comment-15037185 ] Xianyin Xin commented on YARN-4403: --- And I will provide a new patch for YARN-4177 once this is in. > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4389) "yarn.am.blacklisting.enabled" and "yarn.am.blacklisting.disable-failure-threshold" should be app specific rather than a setting for whole YARN cluster
[ https://issues.apache.org/jira/browse/YARN-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037217#comment-15037217 ] Sunil G commented on YARN-4389: --- Test case failures are known ones and have separate tickets to handle them. > "yarn.am.blacklisting.enabled" and > "yarn.am.blacklisting.disable-failure-threshold" should be app specific > rather than a setting for whole YARN cluster > --- > > Key: YARN-4389 > URL: https://issues.apache.org/jira/browse/YARN-4389 > Project: Hadoop YARN > Issue Type: Bug > Components: applications >Reporter: Junping Du >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4389.patch, 0002-YARN-4389.patch > > > "yarn.am.blacklisting.enabled" and > "yarn.am.blacklisting.disable-failure-threshold" should be application > specific rather than a setting at cluster level, or we shouldn't maintain > amBlacklistingEnabled and blacklistDisableThreshold at per-rmApp level. We > should allow each AM to override this config, i.e. via submissionContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2885) Create AMRMProxy request interceptor for distributed scheduling decisions for queueable containers
[ https://issues.apache.org/jira/browse/YARN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037424#comment-15037424 ] Konstantinos Karanasos commented on YARN-2885: -- Adding some more on point #2 (I agree with the rest)... First I agree that the AM should not know whether a container came from the RM or from a distributed scheduler. Regarding the AllocateRequest, I don't think it is currently used in the code, so it can be removed. However, it is used in the RegisterAMRequest to make sure that both the NM and the RM have distributed scheduling enabled when setting some of the parameters related to the dist scheduling. If we assume that all nodes have dist scheduling enabled as long as it is enabled by the RM, then keeping the isDistributedScheduling boolean in the RegisterRequest is not needed either. After all it is only for setting a few parameters (even if we want to disable dist scheduling in a particular NM, that NM can simply discard these parameters). That said, I am not sure if it is required to create a wrapper at this point for the AM protocol. > Create AMRMProxy request interceptor for distributed scheduling decisions for > queueable containers > -- > > Key: YARN-2885 > URL: https://issues.apache.org/jira/browse/YARN-2885 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Arun Suresh > Attachments: YARN-2885-yarn-2877.001.patch > > > We propose to add a Local ResourceManager (LocalRM) to the NM in order to > support distributed scheduling decisions. > Architecturally we leverage the RMProxy, introduced in YARN-2884. > The LocalRM makes distributed decisions for queuable containers requests. > Guaranteed-start requests are still handled by the central RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4361) Total resource count mistake:NodeRemovedSchedulerEvent in ReconnectNodeTransition will reduce the newNode.getTotalCapability() in Multi-thread model
[ https://issues.apache.org/jira/browse/YARN-4361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037226#comment-15037226 ] jialei weng commented on YARN-4361: --- Yes, I checked the patch; it can also address the issue. Thanks. > Total resource count mistake:NodeRemovedSchedulerEvent in > ReconnectNodeTransition will reduce the newNode.getTotalCapability() in > Multi-thread model > > > Key: YARN-4361 > URL: https://issues.apache.org/jira/browse/YARN-4361 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.2 >Reporter: jialei weng > Labels: patch > Attachments: YARN-4361v1.patch > > > Total resource count mistake: > NodeRemovedSchedulerEvent in ReconnectNodeTransition will reduce the > newNode.getTotalCapability() in Multi-thread model. Since the RMNode and the > scheduler use different queues, it cannot guarantee the remove-update-add > operations run in sequence. Sometimes the total resource will be reduced by > newNode.getTotalCapability() when handling NodeRemovedSchedulerEvent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037284#comment-15037284 ] Xianyin Xin commented on YARN-4403: --- Thanks, [~sunilg]. > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100
[ https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037381#comment-15037381 ] NING DING commented on YARN-4398: - [~jianhe], thank you. > Yarn recover functionality causes the cluster running slowly and the cluster > usage rate is far below 100 > > > Key: YARN-4398 > URL: https://issues.apache.org/jira/browse/YARN-4398 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: NING DING >Assignee: NING DING > Fix For: 2.7.3 > > Attachments: YARN-4398.2.patch, YARN-4398.3.patch, YARN-4398.4.patch > > > In my hadoop cluster, the resourceManager recover functionality is enabled > with FileSystemRMStateStore. > I found this cause the yarn cluster running slowly and cluster usage rate is > just 50 even there are many pending Apps. > The scenario is below. > In thread A, the RMAppImpl$RMAppNewlySavingTransition is calling > storeNewApplication method defined in RMStateStore. This storeNewApplication > method is synchronized. > {code:title=RMAppImpl.java|borderStyle=solid} > private static final class RMAppNewlySavingTransition extends > RMAppTransition { > @Override > public void transition(RMAppImpl app, RMAppEvent event) { > // If recovery is enabled then store the application information in a > // non-blocking call so make sure that RM has stored the information > // needed to restart the AM after RM restart without further client > // communication > LOG.info("Storing application with id " + app.applicationId); > app.rmContext.getStateStore().storeNewApplication(app); > } > } > {code} > {code:title=RMStateStore.java|borderStyle=solid} > public synchronized void storeNewApplication(RMApp app) { > ApplicationSubmissionContext context = app > > .getApplicationSubmissionContext(); > assert context instanceof ApplicationSubmissionContextPBImpl; > ApplicationStateData appState = > ApplicationStateData.newInstance( > app.getSubmitTime(), app.getStartTime(), context, app.getUser()); > dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState)); > } > {code} > In thread B, the FileSystemRMStateStore is calling > storeApplicationStateInternal method. It's also synchronized. > This storeApplicationStateInternal method saves an ApplicationStateData into > HDFS and it normally costs 90~300 milliseconds in my hadoop cluster. > {code:title=FileSystemRMStateStore.java|borderStyle=solid} > public synchronized void storeApplicationStateInternal(ApplicationId appId, > ApplicationStateData appStateDataPB) throws Exception { > Path appDirPath = getAppDir(rmAppRoot, appId); > mkdirsWithRetries(appDirPath); > Path nodeCreatePath = getNodePath(appDirPath, appId.toString()); > LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath); > byte[] appStateData = appStateDataPB.getProto().toByteArray(); > try { > // currently throw all exceptions. May need to respond differently for > HA > // based on whether we have lost the right to write to FS > writeFileWithRetries(nodeCreatePath, appStateData, true); > } catch (Exception e) { > LOG.info("Error storing info for app: " + appId, e); > throw e; > } > } > {code} > Think thread B firstly comes into > FileSystemRMStateStore.storeApplicationStateInternal method, then thread A > will be blocked for a while because of synchronization. In ResourceManager > there is only one RMStateStore instance. In my cluster it's > FileSystemRMStateStore type. 
> Debug the RMAppNewlySavingTransition.transition method, the thread stack > shows it's called form AsyncDispatcher.dispatch method. This method code is > as below. > {code:title=AsyncDispatcher.java|borderStyle=solid} > protected void dispatch(Event event) { > //all events go thru this loop > if (LOG.isDebugEnabled()) { > LOG.debug("Dispatching the event " + event.getClass().getName() + "." > + event.toString()); > } > Class type = event.getType().getDeclaringClass(); > try{ > EventHandler handler = eventDispatchers.get(type); > if(handler != null) { > handler.handle(event); > } else { > throw new Exception("No handler for registered for " + type); > } > } catch (Throwable t) { > //TODO Maybe log the state of the queue > LOG.fatal("Error in dispatcher thread", t); > // If serviceStop is called, we should exit this thread gracefully. > if (exitOnDispatchException > && (ShutdownHookManager.get().isShutdownInProgress()) == false
[jira] [Commented] (YARN-2885) Create AMRMProxy request interceptor for distributed scheduling decisions for queueable containers
[ https://issues.apache.org/jira/browse/YARN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037404#comment-15037404 ] Konstantinos Karanasos commented on YARN-2885: -- Thank you for the patch, [~asuresh]. Adding some more comments to this first version: # Given that the list of nodes to be used for distributed scheduling ("top-k nodes") is ordered, we need to send the whole list at each AllocateResponse (it will become complicated to do so by sending just the delta of the list in the form of new/removed nodes). # Given the above point, we will not need to have a node list in the RegisterApplicationMasterResponse. # I suggest to remove the two parameters for setting limits to the number of QUEUEABLE containers from this JIRA, since YARN-2889 targets this functionality. # I propose to remove the support for locality from this first version of the JIRA. Getting it right requires more work (given that each LocalRM only sees a subset of the cluster's nodes), and should probably be the objective of a separate sub-JIRA. # When creating the Interceptor chain in the AMRMProxyService, make sure the DistSchedulerRequestInterceptor is always placed in the beginning of the chain. # We could make DistSchedulerParameters a subclass to the DistSchedulerRequestInterceptor rather than a separate class. > Create AMRMProxy request interceptor for distributed scheduling decisions for > queueable containers > -- > > Key: YARN-2885 > URL: https://issues.apache.org/jira/browse/YARN-2885 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Arun Suresh > Attachments: YARN-2885-yarn-2877.001.patch > > > We propose to add a Local ResourceManager (LocalRM) to the NM in order to > support distributed scheduling decisions. > Architecturally we leverage the RMProxy, introduced in YARN-2884. > The LocalRM makes distributed decisions for queuable containers requests. > Guaranteed-start requests are still handled by the central RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4309) Add debug information to application logs when a container fails
[ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037418#comment-15037418 ] Varun Vasudev commented on YARN-4309: - bq. do we need to worry about -L following links outside of the current directory? find will follow links outside the current directory up to the maxdepth. This is useful because we symlink to resources outside the work dir from the container work dir (like the mapreduce jar, the job conf, etc). > Add debug information to application logs when a container fails > > > Key: YARN-4309 > URL: https://issues.apache.org/jira/browse/YARN-4309 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: YARN-4309.001.patch, YARN-4309.002.patch, > YARN-4309.003.patch, YARN-4309.004.patch, YARN-4309.005.patch > > > Sometimes when a container fails, it can be pretty hard to figure out why it > failed. > My proposal is that if a container fails, we collect information about the > container local dir and dump it into the container log dir. Ideally, I'd like > to tar up the directory entirely, but I'm not sure of the security and space > implications of such a approach. At the very least, we can list all the files > in the container local dir, and dump the contents of launch_container.sh(into > the container log dir). > When log aggregation occurs, all this information will automatically get > collected and make debugging such failures much easier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4413) Nodes in the includes list should not be listed as decommissioned in the UI
[ https://issues.apache.org/jira/browse/YARN-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037232#comment-15037232 ] Sunil G commented on YARN-4413: --- Hi [~templedf], thank you for raising this ticket. As you mentioned, I could see that when a node is moved from the exclude to the include list and {{-refreshNodes}} is performed, some counts are still displayed in the UI; a restart clears the metrics. One point to note: I do not think we can remove or reset this decommissioned count just by looking at the include list. There can be cases where we have done {{graceful decommissioning}}, which can add a few nodes to the decommissioned list that are not one-to-one mapped with the exclude list. So I feel we should look at both lists upon refresh and remove/add nodes based on the entries in both files and in memory. > Nodes in the includes list should not be listed as decommissioned in the UI > --- > > Key: YARN-4413 > URL: https://issues.apache.org/jira/browse/YARN-4413 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Daniel Templeton > > If I decommission a node and then move it from the excludes list back to the > includes list, but I don't restart the node, the node will still be listed by > the web UI as decomissioned until either the NM or RM is restarted. Ideally, > removing the node from the excludes list and putting it back into the > includes list should cause the node to be reported as shutdown instead. > CC [~kshukla] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover
[ https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037280#comment-15037280 ] Naganarasimha G R commented on YARN-4392: - [~xgong], bq, Will it cause any issue if the APP_CREATED event is missing ? If that only cause the missing related information in ATS webui/webservice, I am OK with not re-sending the ATS events on recovery. IMO even if it causes any issue we need to correct it, as there is another scenario when RM is started much before the ATS server., then there is possibility that ATS will miss the App start events but might receive the App finish events. > ApplicationCreatedEvent event time resets after RM restart/failover > --- > > Key: YARN-4392 > URL: https://issues.apache.org/jira/browse/YARN-4392 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0 >Reporter: Xuan Gong >Assignee: Naganarasimha G R >Priority: Critical > Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch, > YARN-4392.2.patch > > > {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - > Finished time 1437453994768 is ahead of started time 1440308399674 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437454008244 is ahead of started time 1440308399676 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444305171 is ahead of started time 1440308399653 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444293115 is ahead of started time 1440308399647 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444379645 is ahead of started time 1440308399656 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444361234 is ahead of started time 1440308399655 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444342029 is ahead of started time 1440308399654 > 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444323447 is ahead of started time 1440308399654 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 143730006 is ahead of started time 1440308399660 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 143715698 is ahead of started time 1440308399659 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 143719060 is ahead of started time 1440308399658 > 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished > time 1437444393931 is ahead of started time 1440308399657 > {code} . > From ATS logs, we would see a large amount of 'stale alerts' messages > periodically -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4397) if this addAll() function`s params is fault? @NodeListManager#getUnusableNodes()
[ https://issues.apache.org/jira/browse/YARN-4397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Yuan resolved YARN-4397. - Resolution: Not A Problem > if this addAll() function`s params is fault? > @NodeListManager#getUnusableNodes() > > > Key: YARN-4397 > URL: https://issues.apache.org/jira/browse/YARN-4397 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.6.0 >Reporter: Feng Yuan > Fix For: 2.8.0 > > > code in NodeListManager#144L: > /** >* Provides the currently unusable nodes. Copies it into provided > collection. >* @param unUsableNodes >* Collection to which the unusable nodes are added >* @return number of unusable nodes added >*/ > public int getUnusableNodes(Collection unUsableNodes) { > unUsableNodes.addAll(unusableRMNodesConcurrentSet); > return unusableRMNodesConcurrentSet.size(); > } > unUsableNodes and unusableRMNodesConcurrentSet's sequence is wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
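The Not-A-Problem resolution follows from the java.util.Collection.addAll contract: the receiver is the destination and the argument is the source, so the snippet in the description copies the unusable-node set into the caller-provided collection, exactly as the javadoc above it says. A tiny stand-alone illustration:
{code:title=AddAllDirectionDemo.java|borderStyle=solid}
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

public class AddAllDirectionDemo {
  public static void main(String[] args) {
    Set<String> unusableNodes = new LinkedHashSet<String>();
    unusableNodes.add("nm-host-1");
    unusableNodes.add("nm-host-2");

    // Caller-provided collection, as in getUnusableNodes(Collection).
    Collection<String> out = new ArrayList<String>();
    out.addAll(unusableNodes);   // destination.addAll(source): copies INTO 'out'

    System.out.println(out);                    // [nm-host-1, nm-host-2]
    System.out.println(unusableNodes.size());   // 2 (the source set is unchanged)
  }
}
{code}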
[jira] [Commented] (YARN-2885) Create AMRMProxy request interceptor for distributed scheduling decisions for queueable containers
[ https://issues.apache.org/jira/browse/YARN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037386#comment-15037386 ] Arun Suresh commented on YARN-2885: --- Thank you for the review, [~leftnoteasy]!! Let me try to clarify your concerns.. [~kkaranasos], correct me if I'm wrong.. bq. I'm not sure if it is possible that queueable resource requests could also be sent to the RM with this implementation. What we were aiming for is to not send any Queueable resource requests to the RM; they are handled by the Local RM, the core functionality of which is now encapsulated in the DistSchedulerRequestInterceptor class. As [~sriramrao] had mentioned, we do plan to enforce policies around how the Distributed Scheduling is actually done on the NM. In the first cut (this JIRA), these policies, which WILL be pushed down from the RM, would be stuff like *Maximum resource capability of containers allocated* or *set of nodes on which to target Queueable containers*. These would be computed at the RM and sent back as part of the AllocateResponse and the RegisterResponse. The plan is to have that actual computation happen in the Coordinator running in the RM, which we plan to tackle as part of YARN-4412. bq. I'm not quite sure why isDistributedSchedulingEnabled is required for AM's AllocateRequest and RegisterRequest I totally agree that the AM should not be bothered with this.. But if you notice, it is actually not set by the AM; it is set by the DistSchedulerRequestInterceptor when it proxies the AM calls. Also, to further your point, I am not really happy with putting stuff in the Allocate/Register response that can be seen by the AM but is only relevant to the DistScheduler framework. Again, I'm not really happy with this either… I was thinking of the following alternative: # creating a Wrapper Protocol (Distributed Scheduling AM Protocol) over the AM protocol, which basically wraps each request/response with additional info which will be seen only by the DistScheduler running on the NM # having a Distributed Scheduler AM Service running on the RM if DS is enabled. This will implement the new protocol (it will delegate all the AMProtocol stuff to the AMService and will handle DistScheduler-specific stuff) # instead of having the DSReqInterceptor at the beginning of the AMRMProxy pipeline, adding it to the end (or replacing the DefaultReqInterceptor) and having it talk the new DistSchedulerAMProtocol (which wraps the Allocate/Register requests with the extra DS stuff) What do you think? Will take a crack at this in the next patch. Regarding #3, I just wanted a conf to specify that Dist Scheduling has been 'turned on'.. which, if set to false, will revert to the default behavior of sending even the Queueable requests to the RM. I think most of #4 will be taken care of if we create a Wrapper protocol as I mentioned earlier.. .. w.r.t getContainerIdStart, technically, the containerId for each app starts from the RM epoch.. which is what I wanted to pass on to the NM.. .. agreed, will change the name of getNodeList .. w.r.t containerTokenExpiryInterval.. this gets sent from the RM and signifies the token expiry for allocated queueable containers.. don't think it would vary per NM .. w.r.t getMin/MaxAllocatableCapability.. we wanted this to be something that is specific to the Queueable containers and that is policy driven (or decided by the Dist coordinator).. I agree, we can change its name. Regarding #5, agreed, will make the changes to public APIs in a separate JIRA. Hope this makes sense? 
> Create AMRMProxy request interceptor for distributed scheduling decisions for > queueable containers > -- > > Key: YARN-2885 > URL: https://issues.apache.org/jira/browse/YARN-2885 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Arun Suresh > Attachments: YARN-2885-yarn-2877.001.patch > > > We propose to add a Local ResourceManager (LocalRM) to the NM in order to > support distributed scheduling decisions. > Architecturally we leverage the RMProxy, introduced in YARN-2884. > The LocalRM makes distributed decisions for queuable containers requests. > Guaranteed-start requests are still handled by the central RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
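To make the interceptor-chain idea discussed above concrete, here is a heavily simplified stand-in (none of these types are the actual AMRMProxy interfaces): the interceptor peels off the queueable asks to handle locally and forwards only the guaranteed asks downstream, so the AM never needs to know which scheduler produced a container.
{code:title=DistSchedulingInterceptorSketch.java|borderStyle=solid}
import java.util.ArrayList;
import java.util.List;

/** Stand-in types only; the real AMRMProxy interceptor API differs. */
class DistSchedulingInterceptorSketch {
  interface Interceptor { Response allocate(Request request); }

  static class Ask {
    final boolean queueable;
    Ask(boolean queueable) { this.queueable = queueable; }
  }
  static class Request  { final List<Ask> asks = new ArrayList<Ask>(); }
  static class Response { final List<String> containers = new ArrayList<String>(); }

  /** Splits queueable asks for local scheduling; forwards the rest to the next interceptor. */
  static class QueueableSplitter implements Interceptor {
    private final Interceptor next;   // e.g. the default RM-facing interceptor
    QueueableSplitter(Interceptor next) { this.next = next; }

    @Override
    public Response allocate(Request request) {
      Request forward = new Request();
      List<Ask> local = new ArrayList<Ask>();
      for (Ask ask : request.asks) {
        if (ask.queueable) { local.add(ask); } else { forward.asks.add(ask); }
      }
      // Guaranteed asks still go to the central RM.
      Response response = next.allocate(forward);
      // Locally scheduled containers are merged into the same response, so the AM
      // cannot tell which scheduler allocated them.
      for (int i = 0; i < local.size(); i++) {
        response.containers.add("local-container-" + i);
      }
      return response;
    }
  }
}
{code}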
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037180#comment-15037180 ] Xianyin Xin commented on YARN-4403: --- hi [~djp], this is a good suggestion, and YARN-4177 provides some discussion on this, so link it. > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4002) make ResourceTrackerService.nodeHeartbeat more concurrent
[ https://issues.apache.org/jira/browse/YARN-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brook Zhou updated YARN-4002: - Attachment: YARN-4002-v0.patch Added a patch for this. > make ResourceTrackerService.nodeHeartbeat more concurrent > - > > Key: YARN-4002 > URL: https://issues.apache.org/jira/browse/YARN-4002 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Hong Zhiguo >Assignee: Hong Zhiguo >Priority: Critical > Attachments: YARN-4002-v0.patch > > > We have multiple RPC threads to handle NodeHeartbeatRequest from NMs. By > design the method ResourceTrackerService.nodeHeartbeat should be concurrent > enough to scale for large clusters. > But we have a "BIG" lock in NodesListManager.isValidNode which I think is > unnecessary. > First, the fields "includes" and "excludes" of HostsFileReader are only > updated on "refresh nodes". All RPC threads handling node heartbeats are > only readers. So an RWLock could be used to allow concurrent access by RPC > threads. > Second, since the fields "includes" and "excludes" of HostsFileReader are > always updated by "reference assignment", which is atomic in Java, the reader > side lock could just be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
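[Editor's note] As a rough illustration of the lock-free option described in the issue above, the sketch below publishes new include/exclude sets through a single volatile reference, so heartbeat-handling RPC threads read without taking any lock; the RWLock variant would instead guard the swap and the reads. The class and method names are invented for this sketch and are not the actual HostsFileReader/NodesListManager code.

{code:java}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Illustrative holder; not the real HostsFileReader/NodesListManager. */
class HostLists {
  /** Immutable snapshot of include/exclude lists, swapped atomically. */
  private static final class Snapshot {
    final Set<String> includes;
    final Set<String> excludes;
    Snapshot(Set<String> includes, Set<String> excludes) {
      this.includes = Collections.unmodifiableSet(includes);
      this.excludes = Collections.unmodifiableSet(excludes);
    }
  }

  // A volatile reference assignment is atomic in Java, so heartbeat
  // handler threads can read without taking any lock.
  private volatile Snapshot current = new Snapshot(new HashSet<>(), new HashSet<>());

  /** Called only on "refresh nodes"; builds a new snapshot and swaps it in. */
  void refresh(Set<String> newIncludes, Set<String> newExcludes) {
    current = new Snapshot(new HashSet<>(newIncludes), new HashSet<>(newExcludes));
  }

  /** Called concurrently by RPC threads handling node heartbeats. */
  boolean isValidNode(String host) {
    Snapshot s = current; // single read gives a consistent includes/excludes pair
    return (s.includes.isEmpty() || s.includes.contains(host))
        && !s.excludes.contains(host);
  }
}
{code}

Publishing both sets inside one immutable snapshot also avoids a reader seeing a new includes list paired with an old excludes list, which is a risk if the two fields are assigned separately.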
[jira] [Commented] (YARN-4405) Support node label store in non-appendable file system
[ https://issues.apache.org/jira/browse/YARN-4405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037246#comment-15037246 ] Hadoop QA commented on YARN-4405: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 1s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 1s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 15s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 29s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 2s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 29s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 53s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 58s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 58s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 14s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 14s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 29s {color} | {color:red} Patch generated 7 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 264, now 267). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 41s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 41s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 14 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s {color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 55s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 24s {color} | {color:red} hadoop-yarn-api in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 59s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 59m 59s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 26s {color} | {color:red} hadoop-yarn-api in the patch failed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 14s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_85. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 60m 50s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_85. {color} | |
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037335#comment-15037335 ] Konstantinos Karanasos commented on YARN-2877: -- Thank you for the detailed comments, [~leftnoteasy]. Regarding #1: - Indeed the AM-LocalRM communication should be much more frequent than the LocalRM-RM (and subsequently AM-RM) communication, in order to achieve millisecond-latency allocations. We are planning to address this by having smaller heartbeat intervals in the AM-LocalRM communication when compared to the LocalRM-RM. For instance, the AM-LocalRM heartbeat interval can be set to 50ms, while the LocalRM-RM interval can be set to 200ms (in other words, we will only propagate one in every four heartbeats to the RM). We will soon create a sub-JIRA for this. - Each NM will periodically estimate its expected queue wait time (YARN-2886). This can simply be based on the number of tasks currently in its queue, or (even better) based on the estimated execution times of those tasks (in case they are available). Then, this expected queue wait time is pushed through the NM-RM heartbeats to the ClusterMonitor (YARN-4412) that is running as a service in the RM. The ClusterMonitor gathers this information from all nodes, periodically computes the least loaded nodes (i.e., with the smallest queue wait times), and adds that list to the heartbeat response, so that all nodes (and in turn LocalRMs) get the list. This list is then used during scheduling in the LocalRM. Note that simpler solutions (such as the power of two choices used in Sparrow) could be employed, but our experiments have shown that the above "top-k node list" leads to considerably better placement (and thus load balancing), especially when task durations are heterogeneous. Regarding #2: This is a valid concern. The best way to minimize preemption is through the "top-k node list" technique described above. As the LocalRM will be placing the QUEUEABLE containers on the least loaded nodes, preemption will be minimized. More techniques can be used to further mitigate the problem. For instance, we can "promote" a QUEUEABLE container to a GUARANTEED one in case it has been preempted more than k times. Moreover, we can dynamically set limits on the number of QUEUEABLE containers accepted by a node in case of excessive load due to GUARANTEED containers. That said, as you also mention, QUEUEABLE containers are more suitable for short-running tasks, where the probability of a container being preempted is smaller. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks on otherwise > idle resources on individual machines. > 2. Reduce allocation latency for tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
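[Editor's note] The "top-k node list" computation described in the comment above can be sketched roughly as follows: a monitor in the RM keeps the latest estimated queue wait time per node (reported on NM-RM heartbeats) and periodically publishes the k least-loaded nodes. The class below is a simplified illustration written for this digest, not the actual ClusterMonitor of YARN-4412; names and signatures are assumptions.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified sketch of the top-k computation; not the actual ClusterMonitor. */
class TopKNodeSelector {
  // nodeId -> last reported estimated queue wait time (ms), updated on NM heartbeats
  private final Map<String, Long> queueWaitMs = new ConcurrentHashMap<>();

  void onNodeHeartbeat(String nodeId, long estimatedQueueWaitMs) {
    queueWaitMs.put(nodeId, estimatedQueueWaitMs);
  }

  /** Periodically computed and attached to heartbeat responses for all NMs/LocalRMs. */
  List<String> leastLoadedNodes(int k) {
    List<Map.Entry<String, Long>> entries = new ArrayList<>(queueWaitMs.entrySet());
    entries.sort((a, b) -> Long.compare(a.getValue(), b.getValue()));
    List<String> topK = new ArrayList<>();
    for (int i = 0; i < Math.min(k, entries.size()); i++) {
      topK.add(entries.get(i).getKey());
    }
    return topK;
  }
}
{code}

Each LocalRM would then pick placement targets for QUEUEABLE containers from this list, which is what minimizes the preemption discussed in point #2.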
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037208#comment-15037208 ] Sunil G commented on YARN-4403: --- Thanks [~xinxianyin] for updating this; I missed it somehow. It seems you are handling the monotonic clock in the various places in YARN code that use a clock, so it can be made a general ticket in YARN. In the meantime I will raise an MR ticket to handle this for MapReduce; MAPREDUCE-6562 is linked for the same. > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor uses the current system time to > calculate the expiry period, which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4361) Total resource count mistake:NodeRemovedSchedulerEvent in ReconnectNodeTransition will reduce the newNode.getTotalCapability() in Multi-thread model
[ https://issues.apache.org/jira/browse/YARN-4361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jialei weng resolved YARN-4361. --- Resolution: Duplicate > Total resource count mistake:NodeRemovedSchedulerEvent in > ReconnectNodeTransition will reduce the newNode.getTotalCapability() in > Multi-thread model > > > Key: YARN-4361 > URL: https://issues.apache.org/jira/browse/YARN-4361 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.2 >Reporter: jialei weng > Labels: patch > Attachments: YARN-4361v1.patch > > > Total resource count mistake: > NodeRemovedSchedulerEvent in ReconnectNodeTransition will reduce the > newNode.getTotalCapability() in the multi-threaded model. Since the RMNode and the > scheduler are on different event queues, the remove-update-add > operation is not guaranteed to run in sequence. Sometimes the total resource is reduced by > newNode.getTotalCapability() when handling the NodeRemovedSchedulerEvent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4340) Add "list" API to reservation system
[ https://issues.apache.org/jira/browse/YARN-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036873#comment-15036873 ] Subru Krishnan commented on YARN-4340: -- Thinking more about it, I am not sure we should have the user in the interface. We should automatically pick up the user from the context. If we want to allow specifying a user, then it should be an Admin API and not a Client API, IMHO. > Add "list" API to reservation system > > > Key: YARN-4340 > URL: https://issues.apache.org/jira/browse/YARN-4340 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Carlo Curino >Assignee: Sean Po > Attachments: YARN-4340.v1.patch, YARN-4340.v2.patch, > YARN-4340.v3.patch, YARN-4340.v4.patch > > > This JIRA tracks changes to the APIs of the reservation system, and enables > querying the reservation system on which reservations exist by "time-range, > reservation-id, username". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling
[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036916#comment-15036916 ] Wangda Tan commented on YARN-2877: -- Thanks [~kkaranasos], [~asuresh], I just caught up with the latest design doc; my 2 cents: There are two major purposes of the distributed RM: 1) get better allocation latency, 2) leverage idle resources. #1 will be achieved when - AM -> LocalRM communication can be done within a single RPC call (not heartbeat-based like normal AM-RM allocation); otherwise it will be hard to achieve millisecond-level latency. - LocalRM has enough information to allocate resources on an NM which could be directly used without waiting. I think a stochastic approach plus caching some information from other LocalRMs could solve the problem. #2 can be achieved, but since the distributed RM solution doesn't have a global picture of resources and guaranteed containers can always preempt queueable containers, this could lead to excessive preemption of queueable containers. If the RM can decide where to allocate queueable containers, it could avoid a lot of such preemptions (instead of allocating on a node that has lots of queueable containers, allocate on a node with "real" idle resources). To me, this becomes a bigger issue if an application wants to use opportunistic resources to run normal containers (such as a 10-minute MR task). How to guarantee the RM doesn't allocate more resources to a LocalRM for a long time is a problem. IMO the distributed RM is more suitable for short-lived (a few seconds) and low-latency tasks. > Extend YARN to support distributed scheduling > - > > Key: YARN-2877 > URL: https://issues.apache.org/jira/browse/YARN-2877 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Sriram Rao >Assignee: Konstantinos Karanasos > Attachments: distributed-scheduling-design-doc_v1.pdf > > > This is an umbrella JIRA that proposes to extend YARN to support distributed > scheduling. Briefly, some of the motivations for distributed scheduling are > the following: > 1. Improve cluster utilization by opportunistically executing tasks on otherwise > idle resources on individual machines. > 2. Reduce allocation latency for tasks where the scheduling time dominates > (i.e., task execution time is much less compared to the time required for > obtaining a container from the RM). > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers
[ https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036857#comment-15036857 ] Robert Kanter commented on YARN-4408: - Test failure looks unrelated. > NodeManager still reports negative running containers > - > > Key: YARN-4408 > URL: https://issues.apache.org/jira/browse/YARN-4408 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: YARN-4408.001.patch, YARN-4408.002.patch, > YARN-4408.003.patch > > > YARN-1697 fixed a problem where the NodeManager metrics could report a > negative number of running containers. However, it missed a rare case where > this can still happen. > YARN-1697 added a flag to indicate if the container was actually launched > ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which > is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to > {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge > if we actually ran the container and incremented the gauge . However, this > flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to > {{DONE}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036864#comment-15036864 ] Wangda Tan commented on YARN-4225: -- Thanks [~eepayne], bq. The use case is a newer client is querying an older server... I'm wondering if this is a valid use case: IMHO, rolling upgrade should always be server-first. If we plan to support a newer client talking to an older server, we may experience many issues AND we need to add this to Hadoop's code compatibility policy. > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4409) Fix javadoc and checkstyle issues in timelineservice code
[ https://issues.apache.org/jira/browse/YARN-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-4409: --- Description: There are a large number of javadoc and checkstyle issues currently open in timelineservice code. We need to fix them before we merge it into trunk. Refer to https://issues.apache.org/jira/browse/YARN-3862?focusedCommentId=15035267=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15035267 We still have 94 open checkstyle issues and javadocs failing for Java 8. was:There are a large number of javadoc and checkstyle issues currently open in timelineservice code. We need to fix them before we merge it into trunk. > Fix javadoc and checkstyle issues in timelineservice code > - > > Key: YARN-4409 > URL: https://issues.apache.org/jira/browse/YARN-4409 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Varun Saxena >Assignee: Varun Saxena > > There are a large number of javadoc and checkstyle issues currently open in > timelineservice code. We need to fix them before we merge it into trunk. > Refer to > https://issues.apache.org/jira/browse/YARN-3862?focusedCommentId=15035267=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15035267 > We still have 94 open checkstyle issues and javadocs failing for Java 8. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4309) Add debug information to application logs when a container fails
[ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035588#comment-15035588 ] Hadoop QA commented on YARN-4309: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 1s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 16s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 30s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 33s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 44s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 29s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 46s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 0s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 0s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 21s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 21s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 30s {color} | {color:red} Patch generated 3 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 356, now 356). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 33s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s {color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 50s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 59s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 56s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 26s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 16s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 22s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_85. {color} | |
[jira] [Resolved] (YARN-4410) hadoop
[ https://issues.apache.org/jira/browse/YARN-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S resolved YARN-4410. - Resolution: Invalid It looks like it was created by mistake. Closing as invalid. > hadoop > -- > > Key: YARN-4410 > URL: https://issues.apache.org/jira/browse/YARN-4410 > Project: Hadoop YARN > Issue Type: Bug >Reporter: qeko > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4411) ResourceManager IllegalArgumentException error
yarntime created YARN-4411: -- Summary: ResourceManager IllegalArgumentException error Key: YARN-4411 URL: https://issues.apache.org/jira/browse/YARN-4411 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.1 Reporter: yarntime In version 2.7.1, line 1914 of RMAppAttemptImpl may cause an IllegalArgumentException: YarnApplicationAttemptState.valueOf(this.getState().toString()) fails because this.getState() returns an RMAppAttemptState, which may not be convertible to a YarnApplicationAttemptState. java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.LAUNCHED_UNMANAGED_SAVING at java.lang.Enum.valueOf(Enum.java:236) at org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:1870) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttemptReport(ClientRMService.java:355) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationAttemptReport(ApplicationClientProtocolPBServiceImpl.java:355) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
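[Editor's note] One way to avoid the valueOf failure reported above is to map internal attempt states to public attempt states through an explicit switch instead of relying on name-based valueOf. The sketch below is illustrative only: the two enums are simplified stand-ins (not the real RMAppAttemptState/YarnApplicationAttemptState constants), and the targets chosen for the transient internal states are assumptions, not the fix actually committed for this issue.

{code:java}
/** Simplified stand-ins for the two enums; illustrative only. */
enum InternalAttemptState { NEW, LAUNCHED_UNMANAGED_SAVING, FINAL_SAVING, RUNNING, FINISHED }
enum PublicAttemptState { NEW, LAUNCHED, RUNNING, FINISHED }

final class AttemptStates {
  private AttemptStates() {}

  /**
   * Explicit mapping instead of PublicAttemptState.valueOf(internal.toString()),
   * which throws IllegalArgumentException for internal-only states such as
   * LAUNCHED_UNMANAGED_SAVING. The chosen targets are assumptions for this sketch.
   */
  static PublicAttemptState toPublic(InternalAttemptState s) {
    switch (s) {
      case NEW:
        return PublicAttemptState.NEW;
      case LAUNCHED_UNMANAGED_SAVING:   // transient internal state
        return PublicAttemptState.LAUNCHED;
      case FINAL_SAVING:                // transient internal state
        return PublicAttemptState.RUNNING;
      case RUNNING:
        return PublicAttemptState.RUNNING;
      case FINISHED:
        return PublicAttemptState.FINISHED;
      default:
        throw new IllegalStateException("Unexpected state: " + s);
    }
  }
}
{code}

The point of the switch is that every internal-only state gets an explicit, deliberate public equivalent, so adding a new internal state produces a compile-time or review-time decision rather than a runtime IllegalArgumentException.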