[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period

2015-12-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035910#comment-15035910
 ] 

Junping Du commented on YARN-4403:
--

Agree. LivelinessMonitor is more critical, as it affects the lifecycle of all YARN 
daemons and containers, so I prefer we get this in first. 
Later, we can file two separate JIRAs, one for YARN and the other for MapReduce, 
to address the other places. I am sure there are many places to change, as every 
timeout could be affected, and we should be careful. The Hadoop Common/HDFS 
projects should have adopted this already.

> (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating 
> period
> 
>
> Key: YARN-4403
> URL: https://issues.apache.org/jira/browse/YARN-4403
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-4403.patch
>
>
> Currently, (AM/NM/Container)LivelinessMonitor uses the current system time to 
> calculate the expiry duration, which can be broken by settimeofday. We 
> should use Time.monotonicNow() instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers

2015-12-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035886#comment-15035886
 ] 

Junping Du commented on YARN-4408:
--

[~rkanter], have we seen this happen in a real deployment? I don't quite 
understand how a container that never got launched can end up in 
EXITED_WITH_SUCCESS. It sounds only theoretically possible for containers with 
the life cycle Localized -> Killing -> EXITED_WITH_SUCCESS, since the only places 
that send the CONTAINER_EXITED_WITH_SUCCESS event are ContainerLaunch (and 
RecoveredContainerLaunch). Isn't it? Am I missing something here?

> NodeManager still reports negative running containers
> -
>
> Key: YARN-4408
> URL: https://issues.apache.org/jira/browse/YARN-4408
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-4408.001.patch
>
>
> YARN-1697 fixed a problem where the NodeManager metrics could report a 
> negative number of running containers.  However, it missed a rare case where 
> this can still happen.
> YARN-1697 added a flag to indicate whether the container was actually launched 
> ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which 
> is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to 
> {{DONE}} and from {{EXITED_WITH_FAILURE}} to {{DONE}}, so that the gauge is only 
> decremented if we actually ran the container and incremented the gauge.  However, this 
> flag is not checked when transitioning from {{EXITED_WITH_SUCCESS}} to 
> {{DONE}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4304) AM max resource configuration per partition to be displayed/updated correctly in UI and in various partition related metrics

2015-12-02 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-4304:
--
Attachment: REST_and_UI.zip

> AM max resource configuration per partition to be displayed/updated correctly 
> in UI and in various partition related metrics
> 
>
> Key: YARN-4304
> URL: https://issues.apache.org/jira/browse/YARN-4304
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-4304.patch, 0002-YARN-4304.patch, 
> 0003-YARN-4304.patch, REST_and_UI.zip
>
>
> As we now support per-partition max AM resource percentage 
> configuration, the UI and various metrics also need to display the correct 
> configuration for it. 
> For example, the current UI still shows the am-resource percentage at the queue level. This 
> should be updated correctly when a label configuration is used.
> - Display max-am-percentage per partition in the Scheduler UI (including labels) and on the 
> ClusterMetrics page
> - Update queue/partition related metrics w.r.t. per-partition 
> am-resource-percentage



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period

2015-12-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035950#comment-15035950
 ] 

Junping Du commented on YARN-4403:
--

bq. For YARN/MR, I could also definitely help in getting it in shape once this 
is in.
Sure. Feel free to create/assign a JIRA and work on it. I will help review it. 
Thanks! 

> (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating 
> period
> 
>
> Key: YARN-4403
> URL: https://issues.apache.org/jira/browse/YARN-4403
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-4403.patch
>
>
> Currently, (AM/NM/Container)LivelinessMonitor uses the current system time to 
> calculate the expiry duration, which can be broken by settimeofday. We 
> should use Time.monotonicNow() instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4411) ResourceManager IllegalArgumentException error

2015-12-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035895#comment-15035895
 ] 

Junping Du commented on YARN-4411:
--

Hi [~yarntime], I just assigned the JIRA to you.

> ResourceManager IllegalArgumentException error
> --
>
> Key: YARN-4411
> URL: https://issues.apache.org/jira/browse/YARN-4411
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: yarntime
>Assignee: yarntime
>
> In version 2.7.1, line 1914 may cause an IllegalArgumentException in 
> RMAppAttemptImpl:
>   YarnApplicationAttemptState.valueOf(this.getState().toString())
> caused by this.getState() returning type RMAppAttemptState, which may not be 
> convertible to YarnApplicationAttemptState.
> {noformat}
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.LAUNCHED_UNMANAGED_SAVING
> at java.lang.Enum.valueOf(Enum.java:236)
> at 
> org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:1870)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttemptReport(ClientRMService.java:355)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationAttemptReport(ApplicationClientProtocolPBServiceImpl.java:355)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> {noformat}
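A minimal sketch of one way to avoid the direct conversion (the helper below is hypothetical and only illustrates the idea; it is not the actual fix): map internal-only RMAppAttemptState values, which have no matching public enum constant, onto a reasonable YarnApplicationAttemptState instead of calling valueOf() on the raw state name. Other internal states may need similar explicit handling.

{code}
// Hypothetical sketch, not the committed patch: internal RM states such as
// LAUNCHED_UNMANAGED_SAVING have no matching YarnApplicationAttemptState
// constant, so map them explicitly instead of relying on valueOf() alone.
private static YarnApplicationAttemptState toYarnAttemptState(RMAppAttemptState state) {
  switch (state) {
    case LAUNCHED_UNMANAGED_SAVING:
      return YarnApplicationAttemptState.LAUNCHED;
    default:
      return YarnApplicationAttemptState.valueOf(state.toString());
  }
}
{code}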



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period

2015-12-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035871#comment-15035871
 ] 

Sunil G commented on YARN-4403:
---

Hi [~djp]
Thanks for the patch. Yes, it's better to use {{MonotonicClock}} in general.
We also use SystemClock in the proportional preemption policy; as you 
mentioned, I feel a general YARN ticket can handle all of this together.
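To make the intent concrete, here is a minimal sketch of the kind of change being discussed (illustrative only; {{lastPing}} and {{expireIntervalMs}} are placeholder names, not the actual AbstractLivelinessMonitor fields):

{code}
// Sketch: compute the expiry interval from a monotonic clock so that
// settimeofday/NTP jumps can neither expire tokens early nor keep them alive.
long lastPing = Time.monotonicNow();           // recorded on every ping

// later, in the expiry check:
if (Time.monotonicNow() - lastPing > expireIntervalMs) {
  // expire the AM/NM/container
}
{code}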

> (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating 
> period
> 
>
> Key: YARN-4403
> URL: https://issues.apache.org/jira/browse/YARN-4403
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-4403.patch
>
>
> Currently, (AM/NM/Container)LivelinessMonitor uses the current system time to 
> calculate the expiry duration, which can be broken by settimeofday. We 
> should use Time.monotonicNow() instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4304) AM max resource configuration per partition to be displayed/updated correctly in UI and in various partition related metrics

2015-12-02 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-4304:
--
Attachment: (was: REST_and_UI.zip)

> AM max resource configuration per partition to be displayed/updated correctly 
> in UI and in various partition related metrics
> 
>
> Key: YARN-4304
> URL: https://issues.apache.org/jira/browse/YARN-4304
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-4304.patch, 0002-YARN-4304.patch, 
> 0003-YARN-4304.patch
>
>
> As we now support per-partition max AM resource percentage 
> configuration, the UI and various metrics also need to display the correct 
> configuration for it. 
> For example, the current UI still shows the am-resource percentage at the queue level. This 
> should be updated correctly when a label configuration is used.
> - Display max-am-percentage per partition in the Scheduler UI (including labels) and on the 
> ClusterMetrics page
> - Update queue/partition related metrics w.r.t. per-partition 
> am-resource-percentage



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4401) A failed app recovery should not prevent the RM from starting

2015-12-02 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035892#comment-15035892
 ] 

Daniel Templeton commented on YARN-4401:


I suppose I posed my proposal a little naively.  Let's try again.

The reason for configuring HA is to prevent an outage.  It should be possible 
to tell the standby to come up regardless of recovery failures, in effect 
automatically performing the operation that [~sunilg] described, failing the 
bad app(s), or whatever.

The app resource issue I offered was just the first example I (thought I) found 
while skimming the code.  Rather than having to hunt down every possible way to 
throw an exception (checked or unchecked) during recovery, it would be 
convenient to have recovery catch any exception, log it, and do something 
sensible so that the RM can come up for cases where RM availability is a 
priority.
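A rough sketch of the "catch, log, and skip" behaviour being proposed (purely illustrative; the loop, method, and type names below are placeholders rather than the real RM recovery code):

{code}
// Hypothetical recovery loop: a single unreadable/bad application no longer
// aborts RM startup; it is logged and skipped so the standby can still come up.
for (ApplicationStateData appState : appStates) {
  try {
    recoverApplication(appState);     // placeholder for the per-app recovery step
  } catch (Exception e) {
    LOG.error("Skipping recovery of application " + appState + " due to error", e);
  }
}
{code}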

> A failed app recovery should not prevent the RM from starting
> -
>
> Key: YARN-4401
> URL: https://issues.apache.org/jira/browse/YARN-4401
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
> Attachments: YARN-4401.001.patch
>
>
> There are many different reasons why an app recovery could fail with an 
> exception, causing the RM start to be aborted.  If that happens the RM will 
> fail to start.  Presumably, the reason the RM is trying to do a recovery is 
> that it's the standby trying to fill in for the active.  Failing to come up 
> defeats the purpose of the HA configuration.  Instead of preventing the RM 
> from starting, a failed app recovery should log an error and skip the 
> application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4411) ResourceManager IllegalArgumentException error

2015-12-02 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-4411:
-
Assignee: yarntime

> ResourceManager IllegalArgumentException error
> --
>
> Key: YARN-4411
> URL: https://issues.apache.org/jira/browse/YARN-4411
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: yarntime
>Assignee: yarntime
>
> In version 2.7.1, line 1914 may cause an IllegalArgumentException in 
> RMAppAttemptImpl:
>   YarnApplicationAttemptState.valueOf(this.getState().toString())
> caused by this.getState() returning type RMAppAttemptState, which may not be 
> convertible to YarnApplicationAttemptState.
> {noformat}
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.LAUNCHED_UNMANAGED_SAVING
> at java.lang.Enum.valueOf(Enum.java:236)
> at 
> org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:1870)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttemptReport(ClientRMService.java:355)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationAttemptReport(ApplicationClientProtocolPBServiceImpl.java:355)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4309) Add debug information to application logs when a container fails

2015-12-02 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-4309:

Attachment: YARN-4309.005.patch

Uploaded a new version of the patch with Windows support.

> Add debug information to application logs when a container fails
> 
>
> Key: YARN-4309
> URL: https://issues.apache.org/jira/browse/YARN-4309
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: YARN-4309.001.patch, YARN-4309.002.patch, 
> YARN-4309.003.patch, YARN-4309.004.patch, YARN-4309.005.patch
>
>
> Sometimes when a container fails, it can be pretty hard to figure out why it 
> failed.
> My proposal is that if a container fails, we collect information about the 
> container local dir and dump it into the container log dir. Ideally, I'd like 
> to tar up the directory entirely, but I'm not sure of the security and space 
> implications of such an approach. At the very least, we can list all the files 
> in the container local dir and dump the contents of launch_container.sh (into 
> the container log dir).
> When log aggregation occurs, all this information will automatically get 
> collected and make debugging such failures much easier.
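As a rough illustration of the proposal (a sketch under assumed names; {{containerLocalDir}}, {{containerLogDir}}, and the file names are placeholders, and the real patch may capture different information):

{code}
// Hypothetical sketch: write a listing of the container's local dir and a copy
// of launch_container.sh into the container log dir so that log aggregation
// picks them up along with the normal container logs.
private void dumpContainerDebugInfo(File containerLocalDir, File containerLogDir)
    throws IOException {
  try (PrintWriter out = new PrintWriter(new File(containerLogDir, "directory.info"))) {
    for (File f : containerLocalDir.listFiles()) {   // null/permission handling omitted
      out.println(f.getAbsolutePath());
    }
  }
  Files.copy(new File(containerLocalDir, "launch_container.sh").toPath(),
      new File(containerLogDir, "launch_container.sh").toPath());
}
{code}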



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4304) AM max resource configuration per partition to be displayed/updated correctly in UI and in various partition related metrics

2015-12-02 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-4304:
--
Attachment: 0004-YARN-4304.patch

Attaching an updated version of the patch addressing the comments. Also attached 
screenshots and REST outputs.

[~leftnoteasy], please help review it.

> AM max resource configuration per partition to be displayed/updated correctly 
> in UI and in various partition related metrics
> 
>
> Key: YARN-4304
> URL: https://issues.apache.org/jira/browse/YARN-4304
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-4304.patch, 0002-YARN-4304.patch, 
> 0003-YARN-4304.patch, 0004-YARN-4304.patch, REST_and_UI.zip
>
>
> As we now support per-partition max AM resource percentage 
> configuration, the UI and various metrics also need to display the correct 
> configuration for it. 
> For example, the current UI still shows the am-resource percentage at the queue level. This 
> should be updated correctly when a label configuration is used.
> - Display max-am-percentage per partition in the Scheduler UI (including labels) and on the 
> ClusterMetrics page
> - Update queue/partition related metrics w.r.t. per-partition 
> am-resource-percentage



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period

2015-12-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035923#comment-15035923
 ] 

Sunil G commented on YARN-4403:
---

Yes. That sounds good. +1 for the patch.
For YARN/MR, I could also definitely help in getting it in shape once this is 
in.

> (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating 
> period
> 
>
> Key: YARN-4403
> URL: https://issues.apache.org/jira/browse/YARN-4403
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-4403.patch
>
>
> Currently, (AM/NM/Container)LivelinessMonitor uses the current system time to 
> calculate the expiry duration, which can be broken by settimeofday. We 
> should use Time.monotonicNow() instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4371) "yarn application -kill" should take multiple application ids

2015-12-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036056#comment-15036056
 ] 

Sunil G commented on YARN-4371:
---

Hi [~ozawa],
Could you please help review the patch?

> "yarn application -kill" should take multiple application ids
> -
>
> Key: YARN-4371
> URL: https://issues.apache.org/jira/browse/YARN-4371
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Tsuyoshi Ozawa
>Assignee: Sunil G
> Attachments: 0001-YARN-4371.patch, 0002-YARN-4371.patch
>
>
> Currently we cannot pass multiple applications to the "yarn application -kill" 
> command. The command should take multiple application ids at the same time. 
> Entries should be separated by whitespace, like:
> {code}
> yarn application -kill application_1234_0001 application_1234_0007 
> application_1234_0012
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4304) AM max resource configuration per partition to be displayed/updated correctly in UI and in various partition related metrics

2015-12-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036136#comment-15036136
 ] 

Sunil G commented on YARN-4304:
---

The test case failures and findbugs warnings are related to this patch. 
I will address them in the next patch.

> AM max resource configuration per partition to be displayed/updated correctly 
> in UI and in various partition related metrics
> 
>
> Key: YARN-4304
> URL: https://issues.apache.org/jira/browse/YARN-4304
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-4304.patch, 0002-YARN-4304.patch, 
> 0003-YARN-4304.patch, 0004-YARN-4304.patch, REST_and_UI.zip
>
>
> As we now support per-partition max AM resource percentage 
> configuration, the UI and various metrics also need to display the correct 
> configuration for it. 
> For example, the current UI still shows the am-resource percentage at the queue level. This 
> should be updated correctly when a label configuration is used.
> - Display max-am-percentage per partition in the Scheduler UI (including labels) and on the 
> ClusterMetrics page
> - Update queue/partition related metrics w.r.t. per-partition 
> am-resource-percentage



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-02 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036203#comment-15036203
 ] 

Daniel Templeton commented on YARN-4406:


Now that I've had a chance to look at the web UI code, I see that my theory was 
close, but not quite.  The number of decommissioned nodes is taken from 
{{ClusterMetrics.getMetrics().getDecomissionedNMs()}}, which is just the count 
of nodes in the excludes list.  The list of decommissioned nodes comes from 
{{ResourceManager.getRMContext().getInactiveRMNodes()}}, which contains only 
nodes that have been decommissioned since the last restart.

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler

2015-12-02 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036317#comment-15036317
 ] 

Wangda Tan commented on YARN-4225:
--

[~eepayne],

Thanks for working on the patch; a few comments:

1)
bq. public abstract Boolean getPreemptionDisabled();
Do you think it is better to return a primitive boolean? I'd prefer to return a 
default value (false) instead of returning null.

2) 
For QueueCLI, is it better to print "preemption is disabled/enabled" instead of 
"preemption status: disabled/enabled"?

3)
Is it possible to add a simple test to verify end-to-end behavior?
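
In other words, regarding point 1 (a small illustrative sketch of the suggestion, not the final API):

{code}
// Sketch: a primitive boolean with a safe default avoids null handling in callers;
// an absent/unset value would simply be reported as false.
public abstract boolean getPreemptionDisabled();
{code}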

> Add preemption status to yarn queue -status for capacity scheduler
> --
>
> Key: YARN-4225
> URL: https://issues.apache.org/jira/browse/YARN-4225
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-4225.001.patch, YARN-4225.002.patch, 
> YARN-4225.003.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers

2015-12-02 Thread Robert Kanter (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036345#comment-15036345
 ] 

Robert Kanter commented on YARN-4408:
-

I haven't been able to reproduce this issue, and I agree that it's not a common 
occurrence; but we have seen the number of running containers go negative 
internally on two different clusters and also on a customer's cluster.  So I 
traced through the code and state machine looking for how we could decrement the 
gauge without first incrementing it.  As far as I can tell, this is the only 
place where that can happen, because we don't check {{container.wasLaunched}} 
like we do in the other two places where we decrement the gauge.  
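
For illustration, the kind of guard applied on the other two transitions but missing on {{EXITED_WITH_SUCCESS}} to {{DONE}} looks roughly like this (sketch only; the metrics call name is an assumption, not necessarily the exact NodeManagerMetrics API):

{code}
// Sketch: only decrement the running-containers gauge if this container was
// actually launched, i.e. if the gauge was incremented for it in the first place.
if (container.wasLaunched) {
  container.metrics.endRunningContainer();   // assumed gauge-decrement helper
}
{code}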

> NodeManager still reports negative running containers
> -
>
> Key: YARN-4408
> URL: https://issues.apache.org/jira/browse/YARN-4408
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-4408.001.patch
>
>
> YARN-1697 fixed a problem where the NodeManager metrics could report a 
> negative number of running containers.  However, it missed a rare case where 
> this can still happen.
> YARN-1697 added a flag to indicate whether the container was actually launched 
> ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which 
> is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to 
> {{DONE}} and from {{EXITED_WITH_FAILURE}} to {{DONE}}, so that the gauge is only 
> decremented if we actually ran the container and incremented the gauge.  However, this 
> flag is not checked when transitioning from {{EXITED_WITH_SUCCESS}} to 
> {{DONE}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-02 Thread Kuhu Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036336#comment-15036336
 ] 

Kuhu Shukla commented on YARN-4406:
---

Yes, that is right, the issue is present on trunk. During {{serviceInit}} we 
could populate this metric with the number of decommissioned nodes in 
the inactive list, since AFAIK we don't care about nodes that were decommissioned 
before the last restart. 

At present:
{code}
  private void setDecomissionedNMsMetrics() {
    Set<String> excludeList = hostsReader.getExcludedHosts();
    ClusterMetrics.getMetrics().setDecommisionedNMs(excludeList.size());
  }
{code}

To:
{code}
  private void setDecomissionedNMsMetrics() {
    int numDecommissioned = 0;
    for (RMNode rmNode : rmContext.getInactiveRMNodes().values()) {
      if (rmNode.getState() == NodeState.DECOMMISSIONED) {
        numDecommissioned++;
      }
    }
    ClusterMetrics.getMetrics().setDecommisionedNMs(numDecommissioned);
  }
{code}


> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-02 Thread Kuhu Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036186#comment-15036186
 ] 

Kuhu Shukla commented on YARN-4406:
---

Thank you [~Naganarasimha]. Asking [~rchiang] if it's alright for me to work on 
it. I am currently working in that part of the code base for YARN-4311.

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-02 Thread Ray Chiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Chiang resolved YARN-4406.
--
Resolution: Duplicate

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover

2015-12-02 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036299#comment-15036299
 ] 

Xuan Gong commented on YARN-4392:
-

Thanks for the comments, [~Naganarasimha]
bq. So actually in the patch i had followed the approach such that for finish 
events i had sent synchronous push in the ATS side, in this way we are sure 
that AppFinish event is sent out before we store the state of the app in the RM 
state store. But yes this approach looks little shaky but thought it might 
solve the issue.

Let us *not* send the ATS event *synchronously*. Otherwise, it would depend on 
the ATS. 
It is always good to make sure that we send the ATS event "exactly once", 
but that would complicate things, such as requiring ATS events to be sent 
synchronously, and would add an additional, unnecessary dependency.  
Currently, we are using the "at least once" approach. Since all the information 
is the same for duplicate events (after applying the patch), I think 
that is fine. 

What is your opinion?

> ApplicationCreatedEvent event time resets after RM restart/failover
> ---
>
> Key: YARN-4392
> URL: https://issues.apache.org/jira/browse/YARN-4392
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Xuan Gong
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch, 
> YARN-4392.2.patch
>
>
> {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - 
> Finished time 1437453994768 is ahead of started time 1440308399674 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437454008244 is ahead of started time 1440308399676 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444305171 is ahead of started time 1440308399653 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444293115 is ahead of started time 1440308399647 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444379645 is ahead of started time 1440308399656 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444361234 is ahead of started time 1440308399655 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444342029 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444323447 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143730006 is ahead of started time 1440308399660 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143715698 is ahead of started time 1440308399659 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143719060 is ahead of started time 1440308399658 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444393931 is ahead of started time 1440308399657
> {code}
> From the ATS logs, we would see a large number of 'stale alert' messages 
> periodically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4389) "yarn.am.blacklisting.enabled" and "yarn.am.blacklisting.disable-failure-threshold" should be app specific rather than a setting for whole YARN cluster

2015-12-02 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-4389:
--
Attachment: 0002-YARN-4389.patch

Attaching an updated patch correcting the test case failures. Also fixed a few 
checkstyle and javadoc problems.
[~djp], could you please help review the patch?

> "yarn.am.blacklisting.enabled" and 
> "yarn.am.blacklisting.disable-failure-threshold" should be app specific 
> rather than a setting for whole YARN cluster
> ---
>
> Key: YARN-4389
> URL: https://issues.apache.org/jira/browse/YARN-4389
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Reporter: Junping Du
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-4389.patch, 0002-YARN-4389.patch
>
>
> "yarn.am.blacklisting.enabled" and 
> "yarn.am.blacklisting.disable-failure-threshold" should be application 
> specific rather than a setting in cluster level, or we should't maintain 
> amBlacklistingEnabled and blacklistDisableThreshold in per rmApp level. We 
> should allow each am to override this config, i.e. via submissionContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2015-12-02 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-4311:
--
Attachment: YARN-4311-v2.patch

This patch addresses the graceful and other versions of refreshNodes. It also adds 
a timestamp-based check, run every {{RM_NODE_REMOVAL_CHK_INTERVAL_MSEC}}, for nodes in 
the inactive list that should be untracked, and removes such nodes based on 
{{RM_NODE_REMOVAL_TIMEOUT_MSEC}}. A decommissioned node is not transitioned to 
shutdown, but the timer acts on it just as it would on a shutdown node.

A decommissioning node will transition to shutdown if it is found to be 
'untracked'. 

The unit test tries out several scenarios to check whether the metrics and node 
lists are correct. I can break it into more tests if the idea behind it looks 
acceptable.
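
A rough sketch of the periodic check described above (helper names and fields are placeholders; only the two configuration key names come from this comment):

{code}
// Hypothetical timer task, run every RM_NODE_REMOVAL_CHK_INTERVAL_MSEC:
// remove inactive nodes that are no longer listed in the include/exclude files
// once they have been untracked for longer than RM_NODE_REMOVAL_TIMEOUT_MSEC.
void removeUntrackedNodes(long nowMs) {
  List<NodeId> toRemove = new ArrayList<>();
  for (Map.Entry<NodeId, RMNode> entry : rmContext.getInactiveRMNodes().entrySet()) {
    if (isUntracked(entry.getValue())                       // not in include/exclude lists
        && nowMs - untrackedSince(entry.getValue()) > removalTimeoutMs) {
      toRemove.add(entry.getKey());
    }
  }
  for (NodeId id : toRemove) {
    rmContext.getInactiveRMNodes().remove(id);
  }
}
{code}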

> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-4311-v1.patch, YARN-4311-v2.patch
>
>
> In order to fully forget about a node, removing the node from include and 
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The 
> tricky part that [~jlowe] pointed out was the case when include lists are not 
> used, in that case we don't want the nodes to fall off if they are not active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-02 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036214#comment-15036214
 ] 

Ray Chiang commented on YARN-4406:
--

Thanks [~Naganarasimha].  I'll close up this JIRA as a duplicate.

As for fixing it, I'll leave that up to you and [~templedf].  It looks like you 
two are further ahead than I am.

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)

2015-12-02 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036257#comment-15036257
 ] 

Varun Saxena commented on YARN-3840:


[~jianhe], kindly review.
[~mohdshahidkhan], probably you can have a look as well.

> Resource Manager web ui issue when sorting application by id (with 
> application having id > 9999)
> 
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: LINTE
>Assignee: Varun Saxena
> Fix For: 2.8.0, 2.7.3
>
> Attachments: RMApps.png, RMApps_Sorted.png, YARN-3840-1.patch, 
> YARN-3840-2.patch, YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch, 
> YARN-3840-6.patch, YARN-3840.reopened.001.patch, yarn-3840-7.patch
>
>
> On the web UI, the global main view page 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> 9999.
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler

2015-12-02 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-4225:
-
Attachment: (was: YARN-4225.002.patch)

> Add preemption status to yarn queue -status for capacity scheduler
> --
>
> Key: YARN-4225
> URL: https://issues.apache.org/jira/browse/YARN-4225
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-4225.001.patch, YARN-4225.002.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4304) AM max resource configuration per partition to be displayed/updated correctly in UI and in various partition related metrics

2015-12-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036127#comment-15036127
 ] 

Hadoop QA commented on YARN-4304:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
55s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
12s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 39s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
14s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
36s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 14s 
{color} | {color:red} Patch generated 20 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 (total was 210, now 219). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 39s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 31s 
{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 introduced 4 new FindBugs issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 36s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 4s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
22s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 149m 40s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | 
module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
|  |  Dead store to a in 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(ResponseInfo,
 String)  At 

[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler

2015-12-02 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-4225:
-
Attachment: YARN-4225.002.patch

Attaching {{YARN-4225.002.patch}}, which implements {{getPreemptionDisabled()}} 
to return a {{Boolean}}, and {{QueueCLI#printQueueInfo}} will check for 
non-null before printing out queue status. Patch applies cleanly to trunk, 
branch-2, and branch-2.8.
{quote}
In General, what is the Hadoop policy when a newer client talks to an older 
server and the protobuf output is different than expected. Should we expose 
some form of the has method, or should we overload the get method as I 
described here?

I would appreciate any additional feedback from the community in general (Vinod 
Kumar Vavilapalli, do you have any thoughts?)
{quote}
[~vinodkv], did you have a chance to think about this? [~jlowe], do you have 
any additional thoughts?

> Add preemption status to yarn queue -status for capacity scheduler
> --
>
> Key: YARN-4225
> URL: https://issues.apache.org/jira/browse/YARN-4225
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-4225.001.patch, YARN-4225.002.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4412) Create ClusterManager to compute ordered list of preferred NMs for QUEUEABLE containers

2015-12-02 Thread Arun Suresh (JIRA)
Arun Suresh created YARN-4412:
-

 Summary: Create ClusterManager to compute ordered list of 
preferred NMs for QUEUEABLE containers
 Key: YARN-4412
 URL: https://issues.apache.org/jira/browse/YARN-4412
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun Suresh
Assignee: Arun Suresh


Introduce a Cluster Manager that aggregates load and policy information from 
individual Node Managers and computes an ordered list of preferred Node 
Managers to be used as target nodes for QUEUEABLE container allocations. 

This list can be pushed out to the Node Manager (specifically the AMRMProxy 
running on the node) via the Allocate Response. It will be used to make local 
scheduling decisions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler

2015-12-02 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-4225:
-
Attachment: YARN-4225.003.patch

Sorry, I misnamed the patch. It should have been {{YARN-4225.003.patch}}.

> Add preemption status to yarn queue -status for capacity scheduler
> --
>
> Key: YARN-4225
> URL: https://issues.apache.org/jira/browse/YARN-4225
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-4225.001.patch, YARN-4225.002.patch, 
> YARN-4225.003.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4309) Add debug information to application logs when a container fails

2015-12-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036112#comment-15036112
 ] 

Hadoop QA commented on YARN-4309:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
44s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 38s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 24s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
30s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 37s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
40s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 5s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 35s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 52s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 17s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 17s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 24s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 30s 
{color} | {color:red} Patch generated 4 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 357, now 358). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 39s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
39s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 40s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 3s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 29s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 10s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 2s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 28s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_85. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 24s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 23s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.7.0_85. {color} |
| 

[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036241#comment-15036241
 ] 

Sunil G commented on YARN-4406:
---

YARN-3226, which is a subtask of YARN-914, will be splitting the cluster metrics 
into two tables (one being a node metrics table), as we have to show 
Decommissioning nodes too. 
A patch is already available there. However, this particular case is not 
handled there. As progress is made on this issue, please also keep an eye on the 
progress in YARN-3226.



> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period

2015-12-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036240#comment-15036240
 ] 

Hadoop QA commented on YARN-4403:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
55s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 59s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 23s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
31s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 11s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
28s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
41s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 56s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 3s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
7s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 5s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 17s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 17s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 11s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
28s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
53s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 53s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 1s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 59s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 64m 3s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 14s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_85. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 42s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
22s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 168m 35s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK 

[jira] [Updated] (YARN-3458) CPU resource monitoring in Windows

2015-12-02 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-3458:
--
Attachment: YARN-3458-9.patch

Rebased to trunk.
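
For readers following along, the quoted description below mentions reusing the CpuTimeTracker with 1 jiffy = 1 ms. A minimal standalone sketch of that idea (illustrative class and method names, not the attached patch) is:

{code:title=CpuUsageSketch.java|borderStyle=solid}
import java.math.BigInteger;

/**
 * Sketch only: derive CPU usage from the delta of cumulative CPU time
 * (reported in milliseconds, i.e. 1 jiffy = 1 ms) between two samples,
 * divided by the elapsed wall-clock time.
 */
public class CpuUsageSketch {
  private BigInteger lastCumulativeCpuMs = null;
  private long lastSampleTimeMs = -1;

  /**
   * @param cumulativeCpuMs total CPU time (kernel + user) consumed so far, in ms
   * @param nowMs           current wall-clock time in ms
   * @return CPU usage as a percentage of one core (may exceed 100 on
   *         multi-core machines), or -1 until two samples are available
   */
  public float update(BigInteger cumulativeCpuMs, long nowMs) {
    float usagePercent = -1f;
    if (lastCumulativeCpuMs != null && nowMs > lastSampleTimeMs) {
      BigInteger cpuDeltaMs = cumulativeCpuMs.subtract(lastCumulativeCpuMs);
      long wallDeltaMs = nowMs - lastSampleTimeMs;
      usagePercent = cpuDeltaMs.floatValue() * 100f / wallDeltaMs;
    }
    lastCumulativeCpuMs = cumulativeCpuMs;
    lastSampleTimeMs = nowMs;
    return usagePercent;
  }
}
{code}

The point is simply that Windows reports cumulative CPU time in milliseconds, so a usage percentage falls out of two consecutive samples.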

> CPU resource monitoring in Windows
> --
>
> Key: YARN-3458
> URL: https://issues.apache.org/jira/browse/YARN-3458
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.7.0
> Environment: Windows
>Reporter: Inigo Goiri
>Assignee: Inigo Goiri
>Priority: Minor
>  Labels: BB2015-05-TBR, containers, metrics, windows
> Attachments: YARN-3458-1.patch, YARN-3458-2.patch, YARN-3458-3.patch, 
> YARN-3458-4.patch, YARN-3458-5.patch, YARN-3458-6.patch, YARN-3458-7.patch, 
> YARN-3458-8.patch, YARN-3458-9.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of getCpuUsagePercent() for 
> WindowsBasedProcessTree is left as unavailable. Attached a proposal of how to 
> do it. I reused the CpuTimeTracker using 1 jiffy=1ms.
> This was left open by YARN-3122.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100

2015-12-02 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036381#comment-15036381
 ] 

Jian He commented on YARN-4398:
---

[~iceberg565], I added you to the contributor list and assigned this to you. You 
can now also assign jiras to yourself. 
Committing this.



> Yarn recover functionality causes the cluster running slowly and the cluster 
> usage rate is far below 100
> 
>
> Key: YARN-4398
> URL: https://issues.apache.org/jira/browse/YARN-4398
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: NING DING
>Assignee: NING DING
> Attachments: YARN-4398.2.patch, YARN-4398.3.patch, YARN-4398.4.patch
>
>
> In my hadoop cluster, the resourceManager recover functionality is enabled 
> with FileSystemRMStateStore.
> I found this cause the yarn cluster running slowly and cluster usage rate is 
> just 50 even there are many pending Apps. 
> The scenario is below.
> In thread A, the RMAppImpl$RMAppNewlySavingTransition is calling 
> storeNewApplication method defined in RMStateStore. This storeNewApplication 
> method is synchronized.
> {code:title=RMAppImpl.java|borderStyle=solid}
>   private static final class RMAppNewlySavingTransition extends 
> RMAppTransition {
> @Override
> public void transition(RMAppImpl app, RMAppEvent event) {
>   // If recovery is enabled then store the application information in a
>   // non-blocking call so make sure that RM has stored the information
>   // needed to restart the AM after RM restart without further client
>   // communication
>   LOG.info("Storing application with id " + app.applicationId);
>   app.rmContext.getStateStore().storeNewApplication(app);
> }
>   }
> {code}
> {code:title=RMStateStore.java|borderStyle=solid}
> public synchronized void storeNewApplication(RMApp app) {
> ApplicationSubmissionContext context = app
> 
> .getApplicationSubmissionContext();
> assert context instanceof ApplicationSubmissionContextPBImpl;
> ApplicationStateData appState =
> ApplicationStateData.newInstance(
> app.getSubmitTime(), app.getStartTime(), context, app.getUser());
> dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
>   }
> {code}
> In thread B, the FileSystemRMStateStore is calling 
> storeApplicationStateInternal method. It's also synchronized.
> This storeApplicationStateInternal method saves an ApplicationStateData into 
> HDFS and it normally costs 90~300 milliseconds in my hadoop cluster.
> {code:title=FileSystemRMStateStore.java|borderStyle=solid}
> public synchronized void storeApplicationStateInternal(ApplicationId appId,
>   ApplicationStateData appStateDataPB) throws Exception {
> Path appDirPath = getAppDir(rmAppRoot, appId);
> mkdirsWithRetries(appDirPath);
> Path nodeCreatePath = getNodePath(appDirPath, appId.toString());
> LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath);
> byte[] appStateData = appStateDataPB.getProto().toByteArray();
> try {
>   // currently throw all exceptions. May need to respond differently for 
> HA
>   // based on whether we have lost the right to write to FS
>   writeFileWithRetries(nodeCreatePath, appStateData, true);
> } catch (Exception e) {
>   LOG.info("Error storing info for app: " + appId, e);
>   throw e;
> }
>   }
> {code}
> Think thread B firstly comes into 
> FileSystemRMStateStore.storeApplicationStateInternal method, then thread A 
> will be blocked for a while because of synchronization. In ResourceManager 
> there is only one RMStateStore instance. In my cluster it's 
> FileSystemRMStateStore type.
> Debug the RMAppNewlySavingTransition.transition method, the thread stack 
> shows it's called form AsyncDispatcher.dispatch method. This method code is 
> as below. 
> {code:title=AsyncDispatcher.java|borderStyle=solid}
>   protected void dispatch(Event event) {
> //all events go thru this loop
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Dispatching the event " + event.getClass().getName() + "."
>   + event.toString());
> }
> Class type = event.getType().getDeclaringClass();
> try{
>   EventHandler handler = eventDispatchers.get(type);
>   if(handler != null) {
> handler.handle(event);
>   } else {
> throw new Exception("No handler for registered for " + type);
>   }
> } catch (Throwable t) {
>   //TODO Maybe log the state of the queue
>   LOG.fatal("Error in dispatcher thread", t);
>   // If serviceStop is called, we should exit this thread gracefully.
>   if 

[jira] [Assigned] (YARN-3102) Decommisioned Nodes not listed in Web UI

2015-12-02 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla reassigned YARN-3102:
-

Assignee: Kuhu Shukla  (was: Naganarasimha G R)

> Decommisioned Nodes not listed in Web UI
> 
>
> Key: YARN-3102
> URL: https://issues.apache.org/jira/browse/YARN-3102
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
> Environment: 2 Node Manager and 1 Resource Manager 
>Reporter: Bibin A Chundatt
>Assignee: Kuhu Shukla
>Priority: Minor
>
> Configure yarn.resourcemanager.nodes.exclude-path in yarn-site.xml to point 
> to the yarn.exclude file on the RM1 machine.
> Add NM1's host name to yarn.exclude.
> Start the nodes listed below: NM1, NM2, and the ResourceManager.
> Now check the decommissioned nodes in /cluster/nodes.
> The number of decommissioned nodes is listed as 1, but the table is empty in 
> /cluster/nodes/decommissioned (details of the decommissioned node are not shown).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-02 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036410#comment-15036410
 ] 

Daniel Templeton commented on YARN-4406:


That's the simplest resolution, but I was actually leaning the other direction: 
making the list of decommissioned nodes include the full excludes list.  I 
guess it comes down to how we define decommissioned in the UI.  I interpret the 
excludes list as the canonical list of decommissioned nodes.

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Assignee: Kuhu Shukla
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4408) NodeManager still reports negative running containers

2015-12-02 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-4408:

Attachment: YARN-4408.002.patch

The 002 patch adds the debug message.

> NodeManager still reports negative running containers
> -
>
> Key: YARN-4408
> URL: https://issues.apache.org/jira/browse/YARN-4408
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-4408.001.patch, YARN-4408.002.patch
>
>
> YARN-1697 fixed a problem where the NodeManager metrics could report a 
> negative number of running containers.  However, it missed a rare case where 
> this can still happen.
> YARN-1697 added a flag to indicate if the container was actually launched 
> ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which 
> is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to 
> {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge 
> if we actually ran the container and incremented the gauge .  However, this 
> flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to 
> {{DONE}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers

2015-12-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036367#comment-15036367
 ] 

Junping Du commented on YARN-4408:
--

Thanks Robert for the reply. If we don't understand how it happens in the short 
term, shall we add a warn log message when a container finishes with success but 
was never launched? I think that could be helpful for debugging in the future, or 
this fix could cover something unusual we were missing. The rest looks 
fine to me.
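
To make the suggestion concrete, here is a minimal sketch of the kind of guard plus WARN message being discussed (class, field, and method names are illustrative, not the actual ContainerImpl code):

{code:title=GuardedGaugeSketch.java|borderStyle=solid}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Logger;

/**
 * Illustrative sketch: decrement the running-containers gauge on a successful
 * exit only if the container was really launched, and warn when a
 * never-launched container reports success.
 */
public class GuardedGaugeSketch {
  private static final Logger LOG =
      Logger.getLogger(GuardedGaugeSketch.class.getName());

  private final AtomicInteger runningContainersGauge = new AtomicInteger();
  private boolean wasLaunched;                  // set on LOCALIZED -> RUNNING

  public void onLaunched() {
    wasLaunched = true;
    runningContainersGauge.incrementAndGet();   // gauge goes up only here
  }

  public void onExitedWithSuccess(String containerId) {
    if (wasLaunched) {
      runningContainersGauge.decrementAndGet(); // matches the increment above
    } else {
      // Unusual case discussed above: finished with success without a launch.
      LOG.warning("Container " + containerId
          + " exited with success but was never launched; "
          + "not decrementing the running-containers gauge.");
    }
  }

  public int running() {
    return runningContainersGauge.get();
  }
}
{code}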

> NodeManager still reports negative running containers
> -
>
> Key: YARN-4408
> URL: https://issues.apache.org/jira/browse/YARN-4408
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-4408.001.patch
>
>
> YARN-1697 fixed a problem where the NodeManager metrics could report a 
> negative number of running containers.  However, it missed a rare case where 
> this can still happen.
> YARN-1697 added a flag to indicate if the container was actually launched 
> ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which 
> is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to 
> {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge 
> if we actually ran the container and incremented the gauge .  However, this 
> flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to 
> {{DONE}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-02 Thread Ray Chiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Chiang updated YARN-4406:
-
Assignee: Kuhu Shukla

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Assignee: Kuhu Shukla
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100

2015-12-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036444#comment-15036444
 ] 

Hudson commented on YARN-4398:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8910 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8910/])
YARN-4398. Remove unnecessary synchronization in RMStateStore. (jianhe: rev 
6b9a5beb2b2f9589ef86670f2d763e8488ee5e90)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java
* hadoop-yarn-project/CHANGES.txt
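
The commit message above summarizes the fix as removing unnecessary synchronization. As a generic, self-contained illustration of why that is safe (this is not the YARN code itself): when the public store entry point only hands work to a single worker/dispatcher thread, that thread already serializes the slow writes, so a method-level lock on the entry point adds blocking without adding safety.

{code:title=AsyncStoreSketch.java|borderStyle=solid}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Generic illustration only (not YARN code): the public "store" method just
 * enqueues work, so it needs no store-wide lock; the single worker thread
 * serializes the slow writes, and callers are never blocked behind them.
 */
public class AsyncStoreSketch {
  private final ExecutorService storeWorker =
      Executors.newSingleThreadExecutor();       // serializes all writes

  // No "synchronized" needed here: we only hand off the work.
  public void storeNewRecord(String record) {
    storeWorker.submit(() -> writeToBackingStore(record));
  }

  // The slow part (e.g. an HDFS write taking 90-300 ms) runs only on the
  // single worker thread, so writes are still applied one at a time.
  private void writeToBackingStore(String record) {
    try {
      Thread.sleep(100);                          // simulate a slow store write
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    System.out.println("stored: " + record);
  }

  public void shutdown() {
    storeWorker.shutdown();
  }

  public static void main(String[] args) {
    AsyncStoreSketch store = new AsyncStoreSketch();
    store.storeNewRecord("app_0001");             // returns immediately
    store.storeNewRecord("app_0002");
    store.shutdown();
  }
}
{code}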


> Yarn recover functionality causes the cluster running slowly and the cluster 
> usage rate is far below 100
> 
>
> Key: YARN-4398
> URL: https://issues.apache.org/jira/browse/YARN-4398
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: NING DING
>Assignee: NING DING
> Fix For: 2.7.3
>
> Attachments: YARN-4398.2.patch, YARN-4398.3.patch, YARN-4398.4.patch
>
>
> In my hadoop cluster, the resourceManager recover functionality is enabled 
> with FileSystemRMStateStore.
> I found this cause the yarn cluster running slowly and cluster usage rate is 
> just 50 even there are many pending Apps. 
> The scenario is below.
> In thread A, the RMAppImpl$RMAppNewlySavingTransition is calling 
> storeNewApplication method defined in RMStateStore. This storeNewApplication 
> method is synchronized.
> {code:title=RMAppImpl.java|borderStyle=solid}
>   private static final class RMAppNewlySavingTransition extends 
> RMAppTransition {
> @Override
> public void transition(RMAppImpl app, RMAppEvent event) {
>   // If recovery is enabled then store the application information in a
>   // non-blocking call so make sure that RM has stored the information
>   // needed to restart the AM after RM restart without further client
>   // communication
>   LOG.info("Storing application with id " + app.applicationId);
>   app.rmContext.getStateStore().storeNewApplication(app);
> }
>   }
> {code}
> {code:title=RMStateStore.java|borderStyle=solid}
> public synchronized void storeNewApplication(RMApp app) {
> ApplicationSubmissionContext context = app
> 
> .getApplicationSubmissionContext();
> assert context instanceof ApplicationSubmissionContextPBImpl;
> ApplicationStateData appState =
> ApplicationStateData.newInstance(
> app.getSubmitTime(), app.getStartTime(), context, app.getUser());
> dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
>   }
> {code}
> In thread B, the FileSystemRMStateStore is calling 
> storeApplicationStateInternal method. It's also synchronized.
> This storeApplicationStateInternal method saves an ApplicationStateData into 
> HDFS and it normally costs 90~300 milliseconds in my hadoop cluster.
> {code:title=FileSystemRMStateStore.java|borderStyle=solid}
> public synchronized void storeApplicationStateInternal(ApplicationId appId,
>   ApplicationStateData appStateDataPB) throws Exception {
> Path appDirPath = getAppDir(rmAppRoot, appId);
> mkdirsWithRetries(appDirPath);
> Path nodeCreatePath = getNodePath(appDirPath, appId.toString());
> LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath);
> byte[] appStateData = appStateDataPB.getProto().toByteArray();
> try {
>   // currently throw all exceptions. May need to respond differently for 
> HA
>   // based on whether we have lost the right to write to FS
>   writeFileWithRetries(nodeCreatePath, appStateData, true);
> } catch (Exception e) {
>   LOG.info("Error storing info for app: " + appId, e);
>   throw e;
> }
>   }
> {code}
> Think thread B firstly comes into 
> FileSystemRMStateStore.storeApplicationStateInternal method, then thread A 
> will be blocked for a while because of synchronization. In ResourceManager 
> there is only one RMStateStore instance. In my cluster it's 
> FileSystemRMStateStore type.
> Debug the RMAppNewlySavingTransition.transition method, the thread stack 
> shows it's called form AsyncDispatcher.dispatch method. This method code is 
> as below. 
> {code:title=AsyncDispatcher.java|borderStyle=solid}
>   protected void dispatch(Event event) {
> //all events go thru this loop
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Dispatching the event " + event.getClass().getName() + "."
>   + event.toString());
> }
> Class type = event.getType().getDeclaringClass();
> try{
>

[jira] [Commented] (YARN-4389) "yarn.am.blacklisting.enabled" and "yarn.am.blacklisting.disable-failure-threshold" should be app specific rather than a setting for whole YARN cluster

2015-12-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036511#comment-15036511
 ] 

Hadoop QA commented on YARN-4389:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
59s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 1s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 15s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
28s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 41s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
41s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 2s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 32s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 55s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 6s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 6s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 6s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 18s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 18s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 18s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 29s 
{color} | {color:red} Patch generated 5 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 160, now 165). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 40s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
40s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
40s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 35s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 57s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 9s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 9s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 27s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_85. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 14s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_85. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 13s {color} 
| {color:red} 

[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers

2015-12-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036557#comment-15036557
 ] 

Junping Du commented on YARN-4408:
--

+1 on the 003 patch. I will commit it shortly if there are no further comments from others.

> NodeManager still reports negative running containers
> -
>
> Key: YARN-4408
> URL: https://issues.apache.org/jira/browse/YARN-4408
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-4408.001.patch, YARN-4408.002.patch, 
> YARN-4408.003.patch
>
>
> YARN-1697 fixed a problem where the NodeManager metrics could report a 
> negative number of running containers.  However, it missed a rare case where 
> this can still happen.
> YARN-1697 added a flag to indicate if the container was actually launched 
> ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which 
> is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to 
> {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge 
> if we actually ran the container and incremented the gauge .  However, this 
> flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to 
> {{DONE}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100

2015-12-02 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-4398:
--
Assignee: NING DING

> Yarn recover functionality causes the cluster running slowly and the cluster 
> usage rate is far below 100
> 
>
> Key: YARN-4398
> URL: https://issues.apache.org/jira/browse/YARN-4398
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: NING DING
>Assignee: NING DING
> Attachments: YARN-4398.2.patch, YARN-4398.3.patch, YARN-4398.4.patch
>
>
> In my hadoop cluster, the resourceManager recover functionality is enabled 
> with FileSystemRMStateStore.
> I found this cause the yarn cluster running slowly and cluster usage rate is 
> just 50 even there are many pending Apps. 
> The scenario is below.
> In thread A, the RMAppImpl$RMAppNewlySavingTransition is calling 
> storeNewApplication method defined in RMStateStore. This storeNewApplication 
> method is synchronized.
> {code:title=RMAppImpl.java|borderStyle=solid}
>   private static final class RMAppNewlySavingTransition extends 
> RMAppTransition {
> @Override
> public void transition(RMAppImpl app, RMAppEvent event) {
>   // If recovery is enabled then store the application information in a
>   // non-blocking call so make sure that RM has stored the information
>   // needed to restart the AM after RM restart without further client
>   // communication
>   LOG.info("Storing application with id " + app.applicationId);
>   app.rmContext.getStateStore().storeNewApplication(app);
> }
>   }
> {code}
> {code:title=RMStateStore.java|borderStyle=solid}
> public synchronized void storeNewApplication(RMApp app) {
> ApplicationSubmissionContext context = app
> 
> .getApplicationSubmissionContext();
> assert context instanceof ApplicationSubmissionContextPBImpl;
> ApplicationStateData appState =
> ApplicationStateData.newInstance(
> app.getSubmitTime(), app.getStartTime(), context, app.getUser());
> dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
>   }
> {code}
> In thread B, the FileSystemRMStateStore is calling 
> storeApplicationStateInternal method. It's also synchronized.
> This storeApplicationStateInternal method saves an ApplicationStateData into 
> HDFS and it normally costs 90~300 milliseconds in my hadoop cluster.
> {code:title=FileSystemRMStateStore.java|borderStyle=solid}
> public synchronized void storeApplicationStateInternal(ApplicationId appId,
>   ApplicationStateData appStateDataPB) throws Exception {
> Path appDirPath = getAppDir(rmAppRoot, appId);
> mkdirsWithRetries(appDirPath);
> Path nodeCreatePath = getNodePath(appDirPath, appId.toString());
> LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath);
> byte[] appStateData = appStateDataPB.getProto().toByteArray();
> try {
>   // currently throw all exceptions. May need to respond differently for 
> HA
>   // based on whether we have lost the right to write to FS
>   writeFileWithRetries(nodeCreatePath, appStateData, true);
> } catch (Exception e) {
>   LOG.info("Error storing info for app: " + appId, e);
>   throw e;
> }
>   }
> {code}
> Think thread B firstly comes into 
> FileSystemRMStateStore.storeApplicationStateInternal method, then thread A 
> will be blocked for a while because of synchronization. In ResourceManager 
> there is only one RMStateStore instance. In my cluster it's 
> FileSystemRMStateStore type.
> Debug the RMAppNewlySavingTransition.transition method, the thread stack 
> shows it's called form AsyncDispatcher.dispatch method. This method code is 
> as below. 
> {code:title=AsyncDispatcher.java|borderStyle=solid}
>   protected void dispatch(Event event) {
> //all events go thru this loop
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Dispatching the event " + event.getClass().getName() + "."
>   + event.toString());
> }
> Class type = event.getType().getDeclaringClass();
> try{
>   EventHandler handler = eventDispatchers.get(type);
>   if(handler != null) {
> handler.handle(event);
>   } else {
> throw new Exception("No handler for registered for " + type);
>   }
> } catch (Throwable t) {
>   //TODO Maybe log the state of the queue
>   LOG.fatal("Error in dispatcher thread", t);
>   // If serviceStop is called, we should exit this thread gracefully.
>   if (exitOnDispatchException
>   && (ShutdownHookManager.get().isShutdownInProgress()) == false
>   && stopped == false) {
> Thread shutDownThread = new 

[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers

2015-12-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036485#comment-15036485
 ] 

Junping Du commented on YARN-4408:
--

Thanks Robert for updating the patch. Can we log this message at WARN level, 
given that this is an unusual case and the default log level only enables INFO 
and above?

> NodeManager still reports negative running containers
> -
>
> Key: YARN-4408
> URL: https://issues.apache.org/jira/browse/YARN-4408
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-4408.001.patch, YARN-4408.002.patch
>
>
> YARN-1697 fixed a problem where the NodeManager metrics could report a 
> negative number of running containers.  However, it missed a rare case where 
> this can still happen.
> YARN-1697 added a flag to indicate if the container was actually launched 
> ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which 
> is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to 
> {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge 
> if we actually ran the container and incremented the gauge .  However, this 
> flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to 
> {{DONE}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers

2015-12-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036531#comment-15036531
 ] 

Hadoop QA commented on YARN-4408:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
23s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
12s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 3s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
28s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 28s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
10s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 9s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 54s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 16s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
25s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 35m 57s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12775359/YARN-4408.002.patch |
| JIRA Issue | YARN-4408 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux 98a4b6fe7980 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 6b9a5be |
| findbugs | v3.0.0 |
| JDK v1.7.0_85  Test Results | 

[jira] [Commented] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover

2015-12-02 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036580#comment-15036580
 ] 

Xuan Gong commented on YARN-4392:
-

[~Naganarasimha]

bq. there is no limit on the number of running apps in the state store, and 
finished apps are restricted to a configurable number. In such cases, would 
there not be many created events in a larger cluster on recovery?

This is a good point, given that the performance of ATS v1 is not that scalable. 

Would it cause any issue if the APP_CREATED event is missing? If it only means 
that related information is missing from the ATS web UI/web service, I am OK 
with not re-sending the ATS events on recovery. 

[~jlowe] What is your opinion?

> ApplicationCreatedEvent event time resets after RM restart/failover
> ---
>
> Key: YARN-4392
> URL: https://issues.apache.org/jira/browse/YARN-4392
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Xuan Gong
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch, 
> YARN-4392.2.patch
>
>
> {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - 
> Finished time 1437453994768 is ahead of started time 1440308399674 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437454008244 is ahead of started time 1440308399676 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444305171 is ahead of started time 1440308399653 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444293115 is ahead of started time 1440308399647 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444379645 is ahead of started time 1440308399656 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444361234 is ahead of started time 1440308399655 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444342029 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444323447 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143730006 is ahead of started time 1440308399660 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143715698 is ahead of started time 1440308399659 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143719060 is ahead of started time 1440308399658 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444393931 is ahead of started time 1440308399657
> {code} . 
> From ATS logs, we would see a large amount of 'stale alerts' messages 
> periodically



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-02 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036389#comment-15036389
 ] 

Ray Chiang commented on YARN-4406:
--

That looks good to me.

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4408) NodeManager still reports negative running containers

2015-12-02 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-4408:

Attachment: YARN-4408.003.patch

Sure. The 003 patch uses WARN level.

> NodeManager still reports negative running containers
> -
>
> Key: YARN-4408
> URL: https://issues.apache.org/jira/browse/YARN-4408
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-4408.001.patch, YARN-4408.002.patch, 
> YARN-4408.003.patch
>
>
> YARN-1697 fixed a problem where the NodeManager metrics could report a 
> negative number of running containers.  However, it missed a rare case where 
> this can still happen.
> YARN-1697 added a flag to indicate if the container was actually launched 
> ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which 
> is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to 
> {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge 
> if we actually ran the container and incremented the gauge .  However, this 
> flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to 
> {{DONE}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover

2015-12-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036530#comment-15036530
 ] 

Naganarasimha G R commented on YARN-4392:
-

[~xgong], 
Yes, you are right, it would not be good to depend on ATS sending certain 
events synchronously.
But IIUC there is no limit on the number of running apps in the state store, and 
finished apps are restricted to a configurable number. In such cases, would 
there not be many created events in a larger cluster on recovery? My two cents 
would be to at least avoid re-sending the app created event, but if it's not a 
big deal, I'm fine with the current fix. :) 
Thanks for assigning it to me; I can get the test case failure corrected, as it 
was already handled in YARN-3127.

> ApplicationCreatedEvent event time resets after RM restart/failover
> ---
>
> Key: YARN-4392
> URL: https://issues.apache.org/jira/browse/YARN-4392
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Xuan Gong
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch, 
> YARN-4392.2.patch
>
>
> {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - 
> Finished time 1437453994768 is ahead of started time 1440308399674 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437454008244 is ahead of started time 1440308399676 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444305171 is ahead of started time 1440308399653 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444293115 is ahead of started time 1440308399647 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444379645 is ahead of started time 1440308399656 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444361234 is ahead of started time 1440308399655 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444342029 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444323447 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143730006 is ahead of started time 1440308399660 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143715698 is ahead of started time 1440308399659 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143719060 is ahead of started time 1440308399658 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444393931 is ahead of started time 1440308399657
> {code} . 
> From ATS logs, we would see a large amount of 'stale alerts' messages 
> periodically



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler

2015-12-02 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036560#comment-15036560
 ] 

Eric Payne commented on YARN-4225:
--

Thanks [~leftnoteasy], for your helpful comments.
bq. Do you think is it better to return boolean? I'd prefer to return a default 
value (false) instead of return null
This is the nature of the question I have about the more general Hadoop 
policy, which [~jlowe] and I were discussing in the comments above.
Basically, the use case is a newer client querying an older server, so some of 
the newer protobuf entries that the client expects may not exist. In that case, 
we have two options that I can see:
# The client exposes both the {{get}} protobuf method and the {{has}} protobuf 
method for the structure in question.
# We overload the {{get}} protobuf method to do the {{has}} checking internally 
and return null if the field doesn't exist.
I actually prefer the second option because it exposes only one method, but I 
would like to know the opinion of others and whether there is already a 
precedent for this use case.
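
To make option 2 concrete, here is a self-contained sketch (the class and field names are hypothetical, not the real YARN protos): the wrapper getter performs the {{has}} check internally and returns null when an older server never set the field.

{code:title=PreemptionStatusSketch.java|borderStyle=solid}
/**
 * Hypothetical sketch of option 2 above (names are made up): a PBImpl-style
 * wrapper does the has-check internally and returns null when an older server
 * never populated the field, so clients only ever call one getter.
 */
public class PreemptionStatusSketch {

  /** Stand-in for a generated protobuf message with has/get accessors. */
  static class QueueInfoProtoStub {
    private final Boolean preemptionDisabled;    // null = field not set

    QueueInfoProtoStub(Boolean preemptionDisabled) {
      this.preemptionDisabled = preemptionDisabled;
    }
    boolean hasPreemptionDisabled() { return preemptionDisabled != null; }
    boolean getPreemptionDisabled() { return preemptionDisabled; }
  }

  /** Option 2: a single getter that hides the has-check from callers. */
  static Boolean getPreemptionDisabled(QueueInfoProtoStub proto) {
    if (!proto.hasPreemptionDisabled()) {
      return null;                               // older server: field absent
    }
    return proto.getPreemptionDisabled();
  }

  public static void main(String[] args) {
    System.out.println(getPreemptionDisabled(new QueueInfoProtoStub(null)));  // null
    System.out.println(getPreemptionDisabled(new QueueInfoProtoStub(true)));  // true
  }
}
{code}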


> Add preemption status to yarn queue -status for capacity scheduler
> --
>
> Key: YARN-4225
> URL: https://issues.apache.org/jira/browse/YARN-4225
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-4225.001.patch, YARN-4225.002.patch, 
> YARN-4225.003.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4410) hadoop

2015-12-02 Thread qeko (JIRA)
qeko created YARN-4410:
--

 Summary: hadoop
 Key: YARN-4410
 URL: https://issues.apache.org/jira/browse/YARN-4410
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: qeko






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4401) A failed app recovery should not prevent the RM from starting

2015-12-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035492#comment-15035492
 ] 

Sunil G commented on YARN-4401:
---

Hi [~templedf]
I am not very sure about the use case here. However, I feel that if such a case 
occurs, we will have enough information in the logs to get the app-id.
Then we can use the command below to clear such apps if necessary, rather than 
forcefully clearing them from the RMContext.
{noformat}
Usage: yarn resourcemanager [-format-state-store]
                            [-remove-application-from-state-store <ApplicationId>]
{noformat}

> A failed app recovery should not prevent the RM from starting
> -
>
> Key: YARN-4401
> URL: https://issues.apache.org/jira/browse/YARN-4401
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
> Attachments: YARN-4401.001.patch
>
>
> There are many different reasons why an app recovery could fail with an 
> exception, causing the RM start to be aborted.  If that happens the RM will 
> fail to start.  Presumably, the reason the RM is trying to do a recovery is 
> that it's the standby trying to fill in for the active.  Failing to come up 
> defeats the purpose of the HA configuration.  Instead of preventing the RM 
> from starting, a failed app recovery should log an error and skip the 
> application.
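
To illustrate the proposal in the description (illustrative names; not the actual RMAppManager code), the recovery loop would catch per-application failures, log them, and continue:

{code:title=RecoverySkipSketch.java|borderStyle=solid}
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Illustrative sketch of the proposed behavior: recover applications one at a
 * time, and if a single application's recovery throws, log the error and move
 * on instead of aborting the whole RM start.
 */
public class RecoverySkipSketch {
  private static final Logger LOG =
      Logger.getLogger(RecoverySkipSketch.class.getName());

  interface AppState {
    String appId();
    void recover() throws Exception;   // may fail for many different reasons
  }

  public void recoverAll(List<AppState> storedApps) {
    for (AppState app : storedApps) {
      try {
        app.recover();
      } catch (Exception e) {
        // Proposed behavior: don't let one bad app take down the standby RM.
        LOG.log(Level.SEVERE,
            "Failed to recover application " + app.appId() + ", skipping it",
            e);
      }
    }
  }
}
{code}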



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4309) Add debug information to application logs when a container fails

2015-12-02 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-4309:

Attachment: YARN-4309.004.patch

Thanks for the review [~sidharta-s].

bq. Could you clarify why the debugging information gathering in 
DockerContainerExecutor.writeLaunchEnv is not guarded by a config check?

Good catch. The config check should be present in DockerContainerExecutor as 
well. Fixed.

bq. There seem to be minor inconsistent line spacing issues in the new test 
function in TestContainerLaunch.java

Fixed.

I've changed the find command in the latest version to not use the xtype option 
which seems to be Linux only. I've also renamed the scriptbuilder functions to 
indicate that they're meant for debugging purposes.
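
For anyone reading along, a rough standalone sketch of the overall idea (not the patch itself; the file names and config flag are illustrative): when enabled, dump a listing of the container's local directory and a copy of launch_container.sh into the container log dir so log aggregation picks them up.

{code:title=ContainerDebugInfoSketch.java|borderStyle=solid}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/**
 * Illustrative sketch: on container failure, record a listing of the
 * container's local dir and a copy of launch_container.sh in the container
 * log dir, guarded by a (hypothetical) debug-info config flag.
 */
public class ContainerDebugInfoSketch {

  public static void dumpDebugInfo(boolean debugInfoEnabled,
      Path containerLocalDir, Path containerLogDir) throws IOException {
    if (!debugInfoEnabled) {           // guard everything behind the flag
      return;
    }
    // 1. Write a listing of everything under the container's local dir.
    try (Stream<Path> files = Files.walk(containerLocalDir)) {
      String listing = files.map(Path::toString)
          .collect(Collectors.joining(System.lineSeparator()));
      Files.write(containerLogDir.resolve("directory.info"),
          listing.getBytes(StandardCharsets.UTF_8));
    }
    // 2. Keep a copy of the launch script for post-mortem debugging.
    Path launchScript = containerLocalDir.resolve("launch_container.sh");
    if (Files.exists(launchScript)) {
      Files.copy(launchScript,
          containerLogDir.resolve("launch_container.sh.copy"),
          StandardCopyOption.REPLACE_EXISTING);
    }
  }
}
{code}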

> Add debug information to application logs when a container fails
> 
>
> Key: YARN-4309
> URL: https://issues.apache.org/jira/browse/YARN-4309
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: YARN-4309.001.patch, YARN-4309.002.patch, 
> YARN-4309.003.patch, YARN-4309.004.patch
>
>
> Sometimes when a container fails, it can be pretty hard to figure out why it 
> failed.
> My proposal is that if a container fails, we collect information about the 
> container local dir and dump it into the container log dir. Ideally, I'd like 
> to tar up the directory entirely, but I'm not sure of the security and space 
> implications of such a approach. At the very least, we can list all the files 
> in the container local dir, and dump the contents of launch_container.sh(into 
> the container log dir).
> When log aggregation occurs, all this information will automatically get 
> collected and make debugging such failures much easier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4410) hadoop

2015-12-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035468#comment-15035468
 ] 

Naganarasimha G R commented on YARN-4410:
-

What's this jira for? If it was raised by mistake, please resolve it as invalid!

> hadoop
> --
>
> Key: YARN-4410
> URL: https://issues.apache.org/jira/browse/YARN-4410
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: qeko
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100

2015-12-02 Thread NING DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035471#comment-15035471
 ] 

NING DING commented on YARN-4398:
-

I uploaded a new patch that removes some useless whitespace.
The current test cases already cover the code modified in this patch. The patch 
resolves a performance issue, so no new unit test cases are needed.

> Yarn recover functionality causes the cluster running slowly and the cluster 
> usage rate is far below 100
> 
>
> Key: YARN-4398
> URL: https://issues.apache.org/jira/browse/YARN-4398
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: NING DING
> Attachments: YARN-4398.2.patch, YARN-4398.3.patch, YARN-4398.4.patch
>
>
> In my hadoop cluster, the resourceManager recover functionality is enabled 
> with FileSystemRMStateStore.
> I found this cause the yarn cluster running slowly and cluster usage rate is 
> just 50 even there are many pending Apps. 
> The scenario is below.
> In thread A, the RMAppImpl$RMAppNewlySavingTransition is calling 
> storeNewApplication method defined in RMStateStore. This storeNewApplication 
> method is synchronized.
> {code:title=RMAppImpl.java|borderStyle=solid}
>   private static final class RMAppNewlySavingTransition extends 
> RMAppTransition {
> @Override
> public void transition(RMAppImpl app, RMAppEvent event) {
>   // If recovery is enabled then store the application information in a
>   // non-blocking call so make sure that RM has stored the information
>   // needed to restart the AM after RM restart without further client
>   // communication
>   LOG.info("Storing application with id " + app.applicationId);
>   app.rmContext.getStateStore().storeNewApplication(app);
> }
>   }
> {code}
> {code:title=RMStateStore.java|borderStyle=solid}
> public synchronized void storeNewApplication(RMApp app) {
> ApplicationSubmissionContext context = app
> 
> .getApplicationSubmissionContext();
> assert context instanceof ApplicationSubmissionContextPBImpl;
> ApplicationStateData appState =
> ApplicationStateData.newInstance(
> app.getSubmitTime(), app.getStartTime(), context, app.getUser());
> dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
>   }
> {code}
> In thread B, the FileSystemRMStateStore is calling 
> storeApplicationStateInternal method. It's also synchronized.
> This storeApplicationStateInternal method saves an ApplicationStateData into 
> HDFS and it normally costs 90~300 milliseconds in my hadoop cluster.
> {code:title=FileSystemRMStateStore.java|borderStyle=solid}
> public synchronized void storeApplicationStateInternal(ApplicationId appId,
>   ApplicationStateData appStateDataPB) throws Exception {
> Path appDirPath = getAppDir(rmAppRoot, appId);
> mkdirsWithRetries(appDirPath);
> Path nodeCreatePath = getNodePath(appDirPath, appId.toString());
> LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath);
> byte[] appStateData = appStateDataPB.getProto().toByteArray();
> try {
>   // currently throw all exceptions. May need to respond differently for 
> HA
>   // based on whether we have lost the right to write to FS
>   writeFileWithRetries(nodeCreatePath, appStateData, true);
> } catch (Exception e) {
>   LOG.info("Error storing info for app: " + appId, e);
>   throw e;
> }
>   }
> {code}
> Think thread B firstly comes into 
> FileSystemRMStateStore.storeApplicationStateInternal method, then thread A 
> will be blocked for a while because of synchronization. In ResourceManager 
> there is only one RMStateStore instance. In my cluster it's 
> FileSystemRMStateStore type.
> Debug the RMAppNewlySavingTransition.transition method, the thread stack 
> shows it's called form AsyncDispatcher.dispatch method. This method code is 
> as below. 
> {code:title=AsyncDispatcher.java|borderStyle=solid}
>   protected void dispatch(Event event) {
> //all events go thru this loop
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Dispatching the event " + event.getClass().getName() + "."
>   + event.toString());
> }
> Class type = event.getType().getDeclaringClass();
> try{
>   EventHandler handler = eventDispatchers.get(type);
>   if(handler != null) {
> handler.handle(event);
>   } else {
> throw new Exception("No handler for registered for " + type);
>   }
> } catch (Throwable t) {
>   //TODO Maybe log the state of the queue
>   LOG.fatal("Error in dispatcher thread", t);
>   // If serviceStop is called, we should exit this thread gracefully.
>   

[jira] [Created] (YARN-4413) Nodes in the includes list should not be listed as decommissioned in the UI

2015-12-02 Thread Daniel Templeton (JIRA)
Daniel Templeton created YARN-4413:
--

 Summary: Nodes in the includes list should not be listed as 
decommissioned in the UI
 Key: YARN-4413
 URL: https://issues.apache.org/jira/browse/YARN-4413
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.7.1
Reporter: Daniel Templeton
Assignee: Daniel Templeton


If I decommission a node and then move it from the excludes list back to the 
includes list, but I don't restart the node, the node will still be listed by 
the web UI as decommissioned until either the NM or RM is restarted.  Ideally, 
removing the node from the excludes list and putting it back into the includes 
list should cause the node to be reported as shutdown instead.

CC [~kshukla]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4292) ResourceUtilization should be a part of NodeInfo REST API

2015-12-02 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036619#comment-15036619
 ] 

Wangda Tan commented on YARN-4292:
--

Looks good, +1, will commit in a few days if there are no opposing opinions. Thanks 
[~sunilg].

> ResourceUtilization should be a part of NodeInfo REST API
> -
>
> Key: YARN-4292
> URL: https://issues.apache.org/jira/browse/YARN-4292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Sunil G
> Attachments: 0001-YARN-4292.patch, 0002-YARN-4292.patch, 
> 0003-YARN-4292.patch, 0004-YARN-4292.patch, 0005-YARN-4292.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3623) We should have a config to indicate the Timeline Service version

2015-12-02 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036654#comment-15036654
 ] 

Li Lu commented on YARN-3623:
-

Hi [~Naganarasimha], sorry for the late reply (just came back from a vacation). 
The plan sounds good to me. One thing is that we may not need to mark this JIRA as 
ATS v1.5, since the fix here is a rather general one: from v1.5 on we need to 
handle this configuration in ATS correctly, thus I think it'll be a general 
JIRA not attached to any specific version of ATS. Now that we have the 
patch for this JIRA, can we proceed with the rest of the plan? It will 
certainly be helpful to hear if anyone has any concerns about Naga's plan. 

> We should have a config to indicate the Timeline Service version
> 
>
> Key: YARN-3623
> URL: https://issues.apache.org/jira/browse/YARN-3623
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Naganarasimha G R
> Attachments: YARN-3623-2015-11-19.1.patch
>
>
> So far RM, MR AM, DA AM added/changed new config to enable the feature to 
> write the timeline data to v2 server. It's good to have a YARN 
> timeline-service.version config like timeline-service.enable to indicate the 
> version of the running timeline service with the given YARN cluster. It's 
> beneficial for users to more smoothly move from v1 to v2, as they don't need 
> to change the existing config, but switch this config from v1 to v2. And each 
> framework doesn't need to have their own v1/v2 config.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4405) Support node label store in non-appendable file system

2015-12-02 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4405:
-
Attachment: YARN-4405.2.patch

Attached ver.2 patch, which fixes the test failures and findbugs warnings.

> Support node label store in non-appendable file system
> --
>
> Key: YARN-4405
> URL: https://issues.apache.org/jira/browse/YARN-4405
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4405.1.patch, YARN-4405.2.patch
>
>
> The existing node label file system store implementation uses append to write 
> edit logs. However, some file systems don't support append, so we need to add an 
> implementation that supports such non-appendable file systems as well. 
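
For illustration only, a minimal sketch of the direction such a store could take (hypothetical class and file names, not the attached patch): rewrite the full label state on every change rather than appending edit-log records.

{code:title=NonAppendableStoreSketch.java|borderStyle=solid}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch: instead of appending edit-log records, rewrite the
 * complete label state into a temporary file and swap it over the previous
 * mirror, so the store works on file systems without append support.
 */
public class NonAppendableStoreSketch {
  private final FileSystem fs;
  private final Path mirrorPath;
  private final Path tmpPath;

  public NonAppendableStoreSketch(FileSystem fs, Path storeRoot) {
    this.fs = fs;
    this.mirrorPath = new Path(storeRoot, "nodelabel.mirror");
    this.tmpPath = new Path(storeRoot, "nodelabel.mirror.tmp");
  }

  /** Persist the complete current state; append is never required. */
  public void writeFullState(byte[] serializedLabels) throws IOException {
    try (FSDataOutputStream out = fs.create(tmpPath, true)) {
      out.write(serializedLabels);
    }
    fs.delete(mirrorPath, false);    // remove the old mirror, if present
    fs.rename(tmpPath, mirrorPath);  // publish the new mirror
  }
}
{code}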



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2015-12-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036629#comment-15036629
 ] 

Hadoop QA commented on YARN-4311:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
2s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 
19s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 
28s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 47s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
47s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
44s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 29s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 4s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
36s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 
43s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 10m 43s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 
54s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 10m 54s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 13s 
{color} | {color:red} Patch generated 5 new checkstyle issues in root (total 
was 396, now 400). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 43s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
47s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
23s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 22s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 40s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 26s {color} 
| {color:red} hadoop-yarn-api in the patch failed with JDK v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 47s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 54s 
{color} | {color:green} hadoop-sls in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 25s {color} 
| {color:red} hadoop-yarn-api in the patch failed with JDK v1.7.0_85. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 54s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 59s 
{color} | {color:green} hadoop-sls in the patch passed with JDK v1.7.0_85. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
23s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 218m 41s 

[jira] [Commented] (YARN-4413) Nodes in the includes list should not be listed as decommissioned in the UI

2015-12-02 Thread Kuhu Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036643#comment-15036643
 ] 

Kuhu Shukla commented on YARN-4413:
---

Thanks for reporting this [~templedf]. Was a node refresh done after the file 
change? If yes, then I think, since this metric is updated during 
AddNodeTransition (which updates the rejoined-node metrics), there is no 
transition that takes care of this until the node tries to register/heartbeat 
(as it is absent from all RMNodeImpl lists). One way could be to do this check 
in {{refreshNodes}}, as sketched below. 
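
A rough sketch of that direction, with hypothetical names (the real metric lives in the RM's cluster metrics, which this toy class does not touch): during a refresh, any host that is back on the include list but is still tracked as decommissioned gets the count corrected.

{code}
import java.util.Set;

// Hypothetical sketch, not an actual patch: reconcile the decommissioned
// count against the include list during a node refresh.
public final class DecommissionReconciler {
  private int decommissionedCount;

  public DecommissionReconciler(int currentDecommissionedCount) {
    this.decommissionedCount = currentDecommissionedCount;
  }

  /** Returns the corrected decommissioned-node count. */
  public int reconcile(Set<String> includedHosts, Set<String> decommissionedHosts) {
    for (String host : includedHosts) {
      if (decommissionedHosts.remove(host)) {
        decommissionedCount--;   // the node is includable again
      }
    }
    return decommissionedCount;
  }
}
{code}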

> Nodes in the includes list should not be listed as decommissioned in the UI
> ---
>
> Key: YARN-4413
> URL: https://issues.apache.org/jira/browse/YARN-4413
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>
> If I decommission a node and then move it from the excludes list back to the 
> includes list, but I don't restart the node, the node will still be listed by 
> the web UI as decommissioned until either the NM or RM is restarted. Ideally, 
> removing the node from the excludes list and putting it back into the 
> includes list should cause the node to be reported as shutdown instead.
> CC [~kshukla]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-02 Thread Kuhu Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036616#comment-15036616
 ] 

Kuhu Shukla commented on YARN-4406:
---

I agree. I was thinking about that too. During {{registerwithRM()}} we throw a 
YarnException while on the ResourceTrackerService side we just send NodeAction 
as SHUTDOWN. We could in fact update InactiveRMNode list with this node, so 
that it is consistent. Let me know what you think. I will put up a patch soon.

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Assignee: Kuhu Shukla
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4293) ResourceUtilization should be a part of yarn node CLI

2015-12-02 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036630#comment-15036630
 ] 

Wangda Tan commented on YARN-4293:
--

Thanks [~sunilg].

[~kasha], I found the biggest change in this patch is moving the 
ResourceUtilization class from server.api to api. Do you think 
ResourceUtilization should be part of the user-facing API?

> ResourceUtilization should be a part of yarn node CLI
> -
>
> Key: YARN-4293
> URL: https://issues.apache.org/jira/browse/YARN-4293
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Sunil G
> Attachments: 0001-YARN-4293.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4413) Nodes in the includes list should not be listed as decommissioned in the UI

2015-12-02 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036645#comment-15036645
 ] 

Daniel Templeton commented on YARN-4413:


That's what I was thinking.

> Nodes in the includes list should not be listed as decommissioned in the UI
> ---
>
> Key: YARN-4413
> URL: https://issues.apache.org/jira/browse/YARN-4413
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>
> If I decommission a node and then move it from the excludes list back to the 
> includes list, but I don't restart the node, the node will still be listed by 
> the web UI as decommissioned until either the NM or RM is restarted. Ideally, 
> removing the node from the excludes list and putting it back into the 
> includes list should cause the node to be reported as shutdown instead.
> CC [~kshukla]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4293) ResourceUtilization should be a part of yarn node CLI

2015-12-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036771#comment-15036771
 ] 

Hadoop QA commented on YARN-4293:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 4s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
7s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 47s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
50s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 7m 
39s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 57s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 33s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 6s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 9m 6s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 21m 42s 
{color} | {color:red} root-jdk1.8.0_66 with JDK v1.8.0_66 generated 1 new 
issues (was 751, now 751). {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 9m 6s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 27s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 9m 27s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 31m 9s {color} 
| {color:red} root-jdk1.7.0_85 with JDK v1.7.0_85 generated 1 new issues (was 
745, now 745). {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 9m 27s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 5s 
{color} | {color:red} Patch generated 8 new checkstyle issues in root (total 
was 254, now 261). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 46s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
49s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 8m 
47s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 59s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 37s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 25s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 2m 1s {color} | 
{color:red} hadoop-yarn-common in the patch failed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 25s 
{color} | {color:green} hadoop-yarn-server-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 53s 
{color} | {color:green} 

[jira] [Commented] (YARN-4413) Nodes in the includes list should not be listed as decommissioned in the UI

2015-12-02 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036644#comment-15036644
 ] 

Daniel Templeton commented on YARN-4413:


Yes.  The refresh marks nodes newly added to the excludes list as 
decommissioned, but it doesn't do anything for nodes newly added to the 
includes list.

> Nodes in the includes list should not be listed as decommissioned in the UI
> ---
>
> Key: YARN-4413
> URL: https://issues.apache.org/jira/browse/YARN-4413
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>
> If I decommission a node and then move it from the excludes list back to the 
> includes list, but I don't restart the node, the node will still be listed by 
> the web UI as decommissioned until either the NM or RM is restarted. Ideally, 
> removing the node from the excludes list and putting it back into the 
> includes list should cause the node to be reported as shutdown instead.
> CC [~kshukla]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)

2015-12-02 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036708#comment-15036708
 ] 

Jian He commented on YARN-3840:
---

Latest patch looks good to me, thanks [~varun_saxena].

> Resource Manager web ui issue when sorting application by id (with 
> application having id > 9999)
> 
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: LINTE
>Assignee: Varun Saxena
> Fix For: 2.8.0, 2.7.3
>
> Attachments: RMApps.png, RMApps_Sorted.png, YARN-3840-1.patch, 
> YARN-3840-2.patch, YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch, 
> YARN-3840-6.patch, YARN-3840.reopened.001.patch, yarn-3840-7.patch
>
>
> On the WEBUI, the global main view page : 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> 9999.
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler

2015-12-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036821#comment-15036821
 ] 

Hadoop QA commented on YARN-4225:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
58s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 0s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 22s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
29s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 4s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
57s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
45s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 49s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 8s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
57s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 5s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 5s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 18s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 18s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 18s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 29s 
{color} | {color:red} Patch generated 1 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 50, now 50). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 6s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
54s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 34s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
introduced 1 new FindBugs issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 51s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 9s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 27s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 4s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 64m 25s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 49m 30s {color} 
| {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 26s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_85. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | 

[jira] [Updated] (YARN-2575) Consider creating separate ACLs for Reservation create/update/delete ops

2015-12-02 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-2575:
-
Assignee: Sean Po  (was: Subru Krishnan)

> Consider creating separate ACLs for Reservation create/update/delete ops
> 
>
> Key: YARN-2575
> URL: https://issues.apache.org/jira/browse/YARN-2575
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Sean Po
>
> YARN-1051 introduces the ReservationSystem and in the current implementation 
> anyone who can submit applications can also submit reservations. This JIRA is 
> to evaluate creating separate ACLs for Reservation create/update/delete ops.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4411) ResourceManager IllegalArgumentException error

2015-12-02 Thread yarntime (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037003#comment-15037003
 ] 

yarntime commented on YARN-4411:


Hi Naganarasimha G R, thank you very much.

> ResourceManager IllegalArgumentException error
> --
>
> Key: YARN-4411
> URL: https://issues.apache.org/jira/browse/YARN-4411
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: yarntime
>Assignee: yarntime
>
> In version 2.7.1, line 1914 may cause an IllegalArgumentException in 
> RMAppAttemptImpl:
>   YarnApplicationAttemptState.valueOf(this.getState().toString())
> caused by this.getState() returning type RMAppAttemptState, which may not be 
> convertible to YarnApplicationAttemptState.
> {noformat}
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.LAUNCHED_UNMANAGED_SAVING
> at java.lang.Enum.valueOf(Enum.java:236)
> at 
> org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:1870)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttemptReport(ClientRMService.java:355)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationAttemptReport(ApplicationClientProtocolPBServiceImpl.java:355)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> {noformat}
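
For illustration, a minimal sketch of one way to avoid the valueOf() trap (a hypothetical helper, assuming the Hadoop enums on the classpath; the mapping chosen for internal-only states is an assumption, not the committed fix):

{code}
// Hypothetical helper: map the internal RMAppAttemptState to the public
// YarnApplicationAttemptState explicitly instead of calling valueOf() on the
// internal enum's name. Mapping LAUNCHED_UNMANAGED_SAVING to SUBMITTED is an
// illustrative assumption.
public static YarnApplicationAttemptState toPublicAttemptState(
    RMAppAttemptState state) {
  switch (state) {
    case NEW:
      return YarnApplicationAttemptState.NEW;
    case SUBMITTED:
    case LAUNCHED_UNMANAGED_SAVING:   // internal-only, no public counterpart
      return YarnApplicationAttemptState.SUBMITTED;
    case RUNNING:
      return YarnApplicationAttemptState.RUNNING;
    case FINISHED:
      return YarnApplicationAttemptState.FINISHED;
    case FAILED:
      return YarnApplicationAttemptState.FAILED;
    case KILLED:
      return YarnApplicationAttemptState.KILLED;
    default:
      // any other internal-only state maps to the closest public one
      return YarnApplicationAttemptState.SUBMITTED;
  }
}
{code}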



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4411) ResourceManager IllegalArgumentException error

2015-12-02 Thread yarntime (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037008#comment-15037008
 ] 

yarntime commented on YARN-4411:


OK, thank you very much.

> ResourceManager IllegalArgumentException error
> --
>
> Key: YARN-4411
> URL: https://issues.apache.org/jira/browse/YARN-4411
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: yarntime
>Assignee: yarntime
>
> In version 2.7.1, line 1914 may cause an IllegalArgumentException in 
> RMAppAttemptImpl:
>   YarnApplicationAttemptState.valueOf(this.getState().toString())
> caused by this.getState() returning type RMAppAttemptState, which may not be 
> convertible to YarnApplicationAttemptState.
> {noformat}
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.LAUNCHED_UNMANAGED_SAVING
> at java.lang.Enum.valueOf(Enum.java:236)
> at 
> org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:1870)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttemptReport(ClientRMService.java:355)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationAttemptReport(ApplicationClientProtocolPBServiceImpl.java:355)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100

2015-12-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036968#comment-15036968
 ] 

Hudson commented on YARN-4398:
--

ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #658 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/658/])
YARN-4398. Remove unnecessary synchronization in RMStateStore. (jianhe: rev 
6b9a5beb2b2f9589ef86670f2d763e8488ee5e90)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
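
For readers following along, a minimal sketch of the direction of that change (not the committed diff verbatim): storeNewApplication only hands an event to the async dispatcher, so it does not need the method-level lock and no longer queues up behind the slow, synchronized state-store writes.

{code:title=RMStateStore.java (sketch)|borderStyle=solid}
// Before the change, this method was declared synchronized, so a cheap event
// dispatch had to wait for slow storeApplicationStateInternal() HDFS writes
// holding the same RMStateStore lock. Dropping the method-level lock keeps the
// dispatch non-blocking; the dispatcher itself is thread-safe.
public void storeNewApplication(RMApp app) {
  ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
  assert context instanceof ApplicationSubmissionContextPBImpl;
  ApplicationStateData appState = ApplicationStateData.newInstance(
      app.getSubmitTime(), app.getStartTime(), context, app.getUser());
  dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
}
{code}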


> Yarn recover functionality causes the cluster running slowly and the cluster 
> usage rate is far below 100
> 
>
> Key: YARN-4398
> URL: https://issues.apache.org/jira/browse/YARN-4398
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: NING DING
>Assignee: NING DING
> Fix For: 2.7.3
>
> Attachments: YARN-4398.2.patch, YARN-4398.3.patch, YARN-4398.4.patch
>
>
> In my hadoop cluster, the ResourceManager recovery functionality is enabled 
> with FileSystemRMStateStore.
> I found this causes the YARN cluster to run slowly and the cluster usage rate is 
> just 50% even when there are many pending apps. 
> The scenario is below.
> In thread A, the RMAppImpl$RMAppNewlySavingTransition is calling 
> storeNewApplication method defined in RMStateStore. This storeNewApplication 
> method is synchronized.
> {code:title=RMAppImpl.java|borderStyle=solid}
>   private static final class RMAppNewlySavingTransition extends 
> RMAppTransition {
> @Override
> public void transition(RMAppImpl app, RMAppEvent event) {
>   // If recovery is enabled then store the application information in a
>   // non-blocking call so make sure that RM has stored the information
>   // needed to restart the AM after RM restart without further client
>   // communication
>   LOG.info("Storing application with id " + app.applicationId);
>   app.rmContext.getStateStore().storeNewApplication(app);
> }
>   }
> {code}
> {code:title=RMStateStore.java|borderStyle=solid}
> public synchronized void storeNewApplication(RMApp app) {
> ApplicationSubmissionContext context = app
> 
> .getApplicationSubmissionContext();
> assert context instanceof ApplicationSubmissionContextPBImpl;
> ApplicationStateData appState =
> ApplicationStateData.newInstance(
> app.getSubmitTime(), app.getStartTime(), context, app.getUser());
> dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
>   }
> {code}
> In thread B, the FileSystemRMStateStore is calling 
> storeApplicationStateInternal method. It's also synchronized.
> This storeApplicationStateInternal method saves an ApplicationStateData into 
> HDFS and it normally costs 90~300 milliseconds in my hadoop cluster.
> {code:title=FileSystemRMStateStore.java|borderStyle=solid}
> public synchronized void storeApplicationStateInternal(ApplicationId appId,
>   ApplicationStateData appStateDataPB) throws Exception {
> Path appDirPath = getAppDir(rmAppRoot, appId);
> mkdirsWithRetries(appDirPath);
> Path nodeCreatePath = getNodePath(appDirPath, appId.toString());
> LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath);
> byte[] appStateData = appStateDataPB.getProto().toByteArray();
> try {
>   // currently throw all exceptions. May need to respond differently for 
> HA
>   // based on whether we have lost the right to write to FS
>   writeFileWithRetries(nodeCreatePath, appStateData, true);
> } catch (Exception e) {
>   LOG.info("Error storing info for app: " + appId, e);
>   throw e;
> }
>   }
> {code}
> Suppose thread B first enters the 
> FileSystemRMStateStore.storeApplicationStateInternal method; then thread A 
> will be blocked for a while because of synchronization. In ResourceManager 
> there is only one RMStateStore instance. In my cluster it's of 
> FileSystemRMStateStore type.
> Debugging the RMAppNewlySavingTransition.transition method, the thread stack 
> shows it's called from the AsyncDispatcher.dispatch method. This method's code 
> is as below. 
> {code:title=AsyncDispatcher.java|borderStyle=solid}
>   protected void dispatch(Event event) {
> //all events go thru this loop
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Dispatching the event " + event.getClass().getName() + "."
>   + event.toString());
> }
> Class type = event.getType().getDeclaringClass();
> 

[jira] [Commented] (YARN-4309) Add debug information to application logs when a container fails

2015-12-02 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036979#comment-15036979
 ] 

Sidharta Seethana commented on YARN-4309:
-

hi [~vvasudev],

I am using the find command that you have in the patch against broken symlinks 
- it is not clear to me how broken symlink info is captured (please see below). 
Could you please clarify?

{code}
q (19:50:35) ~/symlink-test$ ls -l
total 0
q (19:50:47) ~/symlink-test$ ln -s world hello
q (19:51:03) ~/symlink-test$ find -L . -maxdepth 5 -type l -ls
21492794320 lrwxrwxrwx   1 sseethana sseethana5 Dec  2 19:51 
./hello -> world
q (19:51:15) ~/symlink-test$ echo $?
0
q (19:51:52) ~/symlink-test$ uname -a
Linux q 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 
x86_64 GNU/Linux
{code}

> Add debug information to application logs when a container fails
> 
>
> Key: YARN-4309
> URL: https://issues.apache.org/jira/browse/YARN-4309
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: YARN-4309.001.patch, YARN-4309.002.patch, 
> YARN-4309.003.patch, YARN-4309.004.patch, YARN-4309.005.patch
>
>
> Sometimes when a container fails, it can be pretty hard to figure out why it 
> failed.
> My proposal is that if a container fails, we collect information about the 
> container local dir and dump it into the container log dir. Ideally, I'd like 
> to tar up the directory entirely, but I'm not sure of the security and space 
> implications of such an approach. At the very least, we can list all the files 
> in the container local dir, and dump the contents of launch_container.sh (into 
> the container log dir).
> When log aggregation occurs, all this information will automatically get 
> collected and make debugging such failures much easier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period

2015-12-02 Thread Xianyin Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037185#comment-15037185
 ] 

Xianyin Xin commented on YARN-4403:
---

And will provide new patch of YARN-4177 once this is in.

> (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating 
> period
> 
>
> Key: YARN-4403
> URL: https://issues.apache.org/jira/browse/YARN-4403
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-4403.patch
>
>
> Currently, (AM/NM/Container)LivelinessMonitor uses the current system time to 
> calculate the expiry period, which could be broken by settimeofday. We 
> should use Time.monotonicNow() instead.
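
As a concrete illustration (a sketch, not the attached patch), an expiry check based on {{Time.monotonicNow()}} keeps measuring the elapsed period correctly even if the wall clock is moved by settimeofday:

{code}
import org.apache.hadoop.util.Time;

// Sketch only: track the expiry period with a monotonic clock so that
// settimeofday/NTP jumps cannot make a live AM/NM/container look expired,
// or keep a dead one looking alive.
public class MonotonicExpiryTracker {
  private final long expireIntervalMs;
  private volatile long lastPingMonotonicMs;

  public MonotonicExpiryTracker(long expireIntervalMs) {
    this.expireIntervalMs = expireIntervalMs;
    this.lastPingMonotonicMs = Time.monotonicNow();
  }

  public void onPing() {
    // Time.monotonicNow() is unaffected by wall-clock changes,
    // unlike System.currentTimeMillis().
    lastPingMonotonicMs = Time.monotonicNow();
  }

  public boolean isExpired() {
    return Time.monotonicNow() - lastPingMonotonicMs > expireIntervalMs;
  }
}
{code}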



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4389) "yarn.am.blacklisting.enabled" and "yarn.am.blacklisting.disable-failure-threshold" should be app specific rather than a setting for whole YARN cluster

2015-12-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037217#comment-15037217
 ] 

Sunil G commented on YARN-4389:
---

Test case failures are known ones, and there are separate tickets to handle them.

> "yarn.am.blacklisting.enabled" and 
> "yarn.am.blacklisting.disable-failure-threshold" should be app specific 
> rather than a setting for whole YARN cluster
> ---
>
> Key: YARN-4389
> URL: https://issues.apache.org/jira/browse/YARN-4389
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications
>Reporter: Junping Du
>Assignee: Sunil G
>Priority: Critical
> Attachments: 0001-YARN-4389.patch, 0002-YARN-4389.patch
>
>
> "yarn.am.blacklisting.enabled" and 
> "yarn.am.blacklisting.disable-failure-threshold" should be application 
> specific rather than a setting in cluster level, or we should't maintain 
> amBlacklistingEnabled and blacklistDisableThreshold in per rmApp level. We 
> should allow each am to override this config, i.e. via submissionContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2885) Create AMRMProxy request interceptor for distributed scheduling decisions for queueable containers

2015-12-02 Thread Konstantinos Karanasos (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037424#comment-15037424
 ] 

Konstantinos Karanasos commented on YARN-2885:
--

Adding some more on point #2 (I agree with the rest)...
First I agree that the AM should not know whether a container came from the RM 
or from a distributed scheduler.
Regarding the AllocateRequest, I don't think it is currently used in the code, 
so it can be removed.
However, it is used in the RegisterAMRequest to make sure that both the NM and 
the RM have distributed scheduling enabled when setting some of the parameters 
related to the dist scheduling. If we assume that all nodes have dist 
scheduling enabled as long as it is enabled by the RM, then keeping the 
isDistributedScheduling boolean in the RegisterRequest is not needed either. 
After all it is only for setting a few parameters (even if we want to disable 
dist scheduling in a particular NM, that NM can simply discard these 
parameters).

That said, I am not sure if it is required to create a wrapper at this point 
for the AM protocol.

> Create AMRMProxy request interceptor for distributed scheduling decisions for 
> queueable containers
> --
>
> Key: YARN-2885
> URL: https://issues.apache.org/jira/browse/YARN-2885
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Arun Suresh
> Attachments: YARN-2885-yarn-2877.001.patch
>
>
> We propose to add a Local ResourceManager (LocalRM) to the NM in order to 
> support distributed scheduling decisions. 
> Architecturally we leverage the RMProxy, introduced in YARN-2884. 
> The LocalRM makes distributed decisions for queueable container requests. 
> Guaranteed-start requests are still handled by the central RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4361) Total resource count mistake:NodeRemovedSchedulerEvent in ReconnectNodeTransition will reduce the newNode.getTotalCapability() in Multi-thread model

2015-12-02 Thread jialei weng (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037226#comment-15037226
 ] 

jialei weng commented on YARN-4361:
---

Yes, I checked the patch; it can also solve this issue. Thanks.

> Total resource count mistake:NodeRemovedSchedulerEvent in 
> ReconnectNodeTransition will reduce the newNode.getTotalCapability() in 
> Multi-thread model
> 
>
> Key: YARN-4361
> URL: https://issues.apache.org/jira/browse/YARN-4361
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.2
>Reporter: jialei weng
>  Labels: patch
> Attachments: YARN-4361v1.patch
>
>
> Total resource count mistake:
> NodeRemovedSchedulerEvent in ReconnectNodeTransition will reduce the 
> newNode.getTotalCapability() in a multi-threaded model. Since the RMNode and 
> the scheduler are on different event queues, the remove-update-add operations 
> cannot be guaranteed to happen in sequence. Sometimes the total resource will 
> be reduced by newNode.getTotalCapability() when handling 
> NodeRemovedSchedulerEvent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period

2015-12-02 Thread Xianyin Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037284#comment-15037284
 ] 

Xianyin Xin commented on YARN-4403:
---

Thanks, [~sunilg].

> (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating 
> period
> 
>
> Key: YARN-4403
> URL: https://issues.apache.org/jira/browse/YARN-4403
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-4403.patch
>
>
> Currently, (AM/NM/Container)LivelinessMonitor uses the current system time to 
> calculate the expiry period, which could be broken by settimeofday. We 
> should use Time.monotonicNow() instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100

2015-12-02 Thread NING DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037381#comment-15037381
 ] 

NING DING commented on YARN-4398:
-

[~jianhe], thank you.

> Yarn recover functionality causes the cluster running slowly and the cluster 
> usage rate is far below 100
> 
>
> Key: YARN-4398
> URL: https://issues.apache.org/jira/browse/YARN-4398
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: NING DING
>Assignee: NING DING
> Fix For: 2.7.3
>
> Attachments: YARN-4398.2.patch, YARN-4398.3.patch, YARN-4398.4.patch
>
>
> In my hadoop cluster, the ResourceManager recovery functionality is enabled 
> with FileSystemRMStateStore.
> I found this causes the YARN cluster to run slowly and the cluster usage rate is 
> just 50% even when there are many pending apps. 
> The scenario is below.
> In thread A, the RMAppImpl$RMAppNewlySavingTransition is calling 
> storeNewApplication method defined in RMStateStore. This storeNewApplication 
> method is synchronized.
> {code:title=RMAppImpl.java|borderStyle=solid}
>   private static final class RMAppNewlySavingTransition extends 
> RMAppTransition {
> @Override
> public void transition(RMAppImpl app, RMAppEvent event) {
>   // If recovery is enabled then store the application information in a
>   // non-blocking call so make sure that RM has stored the information
>   // needed to restart the AM after RM restart without further client
>   // communication
>   LOG.info("Storing application with id " + app.applicationId);
>   app.rmContext.getStateStore().storeNewApplication(app);
> }
>   }
> {code}
> {code:title=RMStateStore.java|borderStyle=solid}
> public synchronized void storeNewApplication(RMApp app) {
> ApplicationSubmissionContext context = app
> 
> .getApplicationSubmissionContext();
> assert context instanceof ApplicationSubmissionContextPBImpl;
> ApplicationStateData appState =
> ApplicationStateData.newInstance(
> app.getSubmitTime(), app.getStartTime(), context, app.getUser());
> dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
>   }
> {code}
> In thread B, the FileSystemRMStateStore is calling 
> storeApplicationStateInternal method. It's also synchronized.
> This storeApplicationStateInternal method saves an ApplicationStateData into 
> HDFS and it normally costs 90~300 milliseconds in my hadoop cluster.
> {code:title=FileSystemRMStateStore.java|borderStyle=solid}
> public synchronized void storeApplicationStateInternal(ApplicationId appId,
>   ApplicationStateData appStateDataPB) throws Exception {
> Path appDirPath = getAppDir(rmAppRoot, appId);
> mkdirsWithRetries(appDirPath);
> Path nodeCreatePath = getNodePath(appDirPath, appId.toString());
> LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath);
> byte[] appStateData = appStateDataPB.getProto().toByteArray();
> try {
>   // currently throw all exceptions. May need to respond differently for 
> HA
>   // based on whether we have lost the right to write to FS
>   writeFileWithRetries(nodeCreatePath, appStateData, true);
> } catch (Exception e) {
>   LOG.info("Error storing info for app: " + appId, e);
>   throw e;
> }
>   }
> {code}
> Suppose thread B first enters the 
> FileSystemRMStateStore.storeApplicationStateInternal method; then thread A 
> will be blocked for a while because of synchronization. In ResourceManager 
> there is only one RMStateStore instance. In my cluster it's of 
> FileSystemRMStateStore type.
> Debugging the RMAppNewlySavingTransition.transition method, the thread stack 
> shows it's called from the AsyncDispatcher.dispatch method. This method's code 
> is as below. 
> {code:title=AsyncDispatcher.java|borderStyle=solid}
>   protected void dispatch(Event event) {
> //all events go thru this loop
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Dispatching the event " + event.getClass().getName() + "."
>   + event.toString());
> }
> Class type = event.getType().getDeclaringClass();
> try{
>   EventHandler handler = eventDispatchers.get(type);
>   if(handler != null) {
> handler.handle(event);
>   } else {
> throw new Exception("No handler for registered for " + type);
>   }
> } catch (Throwable t) {
>   //TODO Maybe log the state of the queue
>   LOG.fatal("Error in dispatcher thread", t);
>   // If serviceStop is called, we should exit this thread gracefully.
>   if (exitOnDispatchException
>   && (ShutdownHookManager.get().isShutdownInProgress()) == false

[jira] [Commented] (YARN-2885) Create AMRMProxy request interceptor for distributed scheduling decisions for queueable containers

2015-12-02 Thread Konstantinos Karanasos (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037404#comment-15037404
 ] 

Konstantinos Karanasos commented on YARN-2885:
--

Thank you for the patch, [~asuresh].
Adding some more comments to this first version:
# Given that the list of nodes to be used for distributed scheduling ("top-k 
nodes") is ordered, we need to send the whole list at each AllocateResponse (it 
will become complicated to do so by sending just the delta of the list in the 
form of new/removed nodes).
# Given the above point, we will not need to have a node list in the 
RegisterApplicationMasterResponse.
# I suggest removing from this JIRA the two parameters for setting limits on the 
number of QUEUEABLE containers, since YARN-2889 targets this functionality.
# I propose removing the support for locality from this first version of the 
JIRA. Getting it right requires more work (given that each LocalRM only sees a 
subset of the cluster's nodes), and it should probably be the objective of a 
separate sub-JIRA.
# When creating the Interceptor chain in the AMRMProxyService, make sure the 
DistSchedulerRequestInterceptor is always placed at the beginning of the chain.
# We could make DistSchedulerParameters a subclass of the 
DistSchedulerRequestInterceptor rather than a separate class.

> Create AMRMProxy request interceptor for distributed scheduling decisions for 
> queueable containers
> --
>
> Key: YARN-2885
> URL: https://issues.apache.org/jira/browse/YARN-2885
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Arun Suresh
> Attachments: YARN-2885-yarn-2877.001.patch
>
>
> We propose to add a Local ResourceManager (LocalRM) to the NM in order to 
> support distributed scheduling decisions. 
> Architecturally we leverage the RMProxy, introduced in YARN-2884. 
> The LocalRM makes distributed decisions for queueable container requests. 
> Guaranteed-start requests are still handled by the central RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4309) Add debug information to application logs when a container fails

2015-12-02 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037418#comment-15037418
 ] 

Varun Vasudev commented on YARN-4309:
-

bq. do we need to worry about -L following links outside of the current 
directory?

find will follow the links outside the current directory up to the maxdepth. 
This is useful because we symlink to resources outside the work dir from the 
container work dir (like the mapreduce jar, the job conf, etc.).

> Add debug information to application logs when a container fails
> 
>
> Key: YARN-4309
> URL: https://issues.apache.org/jira/browse/YARN-4309
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: YARN-4309.001.patch, YARN-4309.002.patch, 
> YARN-4309.003.patch, YARN-4309.004.patch, YARN-4309.005.patch
>
>
> Sometimes when a container fails, it can be pretty hard to figure out why it 
> failed.
> My proposal is that if a container fails, we collect information about the 
> container local dir and dump it into the container log dir. Ideally, I'd like 
> to tar up the directory entirely, but I'm not sure of the security and space 
> implications of such an approach. At the very least, we can list all the files 
> in the container local dir, and dump the contents of launch_container.sh (into 
> the container log dir).
> When log aggregation occurs, all this information will automatically get 
> collected and make debugging such failures much easier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4413) Nodes in the includes list should not be listed as decommissioned in the UI

2015-12-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037232#comment-15037232
 ] 

Sunil G commented on YARN-4413:
---

Hi [~templedf]
Thank you for raising this ticket.
As you mentioned, I could see that when a node is moved from the exclude to the 
include list and {{-refreshNodes}} is performed, some counts are still displayed 
in the UI. A restart will clear the metrics.

One point to note here: the way I see it, I do not think we can remove or reset 
this decommissioned count directly by only looking at the include list. There can 
be cases where we would have done {{graceful decommissioning}}, and this can 
add a few nodes to the decommissioned list which are not one-to-one mapped with 
the exclude list.
So I feel we should look at both lists upon refresh and remove/add nodes based on 
the entries in both files and in memory.

> Nodes in the includes list should not be listed as decommissioned in the UI
> ---
>
> Key: YARN-4413
> URL: https://issues.apache.org/jira/browse/YARN-4413
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>
> If I decommission a node and then move it from the excludes list back to the 
> includes list, but I don't restart the node, the node will still be listed by 
> the web UI as decommissioned until either the NM or RM is restarted. Ideally, 
> removing the node from the excludes list and putting it back into the 
> includes list should cause the node to be reported as shutdown instead.
> CC [~kshukla]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover

2015-12-02 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037280#comment-15037280
 ] 

Naganarasimha G R commented on YARN-4392:
-

[~xgong],
bq. Will it cause any issue if the APP_CREATED event is missing? If that only 
causes the related information to be missing in the ATS webui/webservice, I am OK 
with not re-sending the ATS events on recovery.
IMO even if it causes an issue we need to correct it, as there is another 
scenario: when the RM is started much before the ATS server, there is a 
possibility that ATS will miss the App start events but might still receive the 
App finish events.

> ApplicationCreatedEvent event time resets after RM restart/failover
> ---
>
> Key: YARN-4392
> URL: https://issues.apache.org/jira/browse/YARN-4392
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Xuan Gong
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch, 
> YARN-4392.2.patch
>
>
> {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - 
> Finished time 1437453994768 is ahead of started time 1440308399674 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437454008244 is ahead of started time 1440308399676 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444305171 is ahead of started time 1440308399653 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444293115 is ahead of started time 1440308399647 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444379645 is ahead of started time 1440308399656 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444361234 is ahead of started time 1440308399655 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444342029 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444323447 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143730006 is ahead of started time 1440308399660 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143715698 is ahead of started time 1440308399659 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143719060 is ahead of started time 1440308399658 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444393931 is ahead of started time 1440308399657
> {code} . 
> From ATS logs, we would see a large amount of 'stale alerts' messages 
> periodically



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-4397) if this addAll() function`s params is fault? @NodeListManager#getUnusableNodes()

2015-12-02 Thread Feng Yuan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Yuan resolved YARN-4397.
-
Resolution: Not A Problem

> if this addAll() function`s params is fault? 
> @NodeListManager#getUnusableNodes()
> 
>
> Key: YARN-4397
> URL: https://issues.apache.org/jira/browse/YARN-4397
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.6.0
>Reporter: Feng Yuan
> Fix For: 2.8.0
>
>
> code in NodeListManager#144L:
>   /**
>* Provides the currently unusable nodes. Copies it into provided 
> collection.
>* @param unUsableNodes
>*  Collection to which the unusable nodes are added
>* @return number of unusable nodes added
>*/
>   public int getUnusableNodes(Collection unUsableNodes) {
> unUsableNodes.addAll(unusableRMNodesConcurrentSet);
> return unusableRMNodesConcurrentSet.size();
>   }
> unUsableNodes and unusableRMNodesConcurrentSet's sequence is wrong.
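
For reference, the parameter direction is consistent with the javadoc quoted 
above: Collection.addAll(source) copies the elements of its argument into the 
collection it is invoked on, so 
unUsableNodes.addAll(unusableRMNodesConcurrentSet) does copy the unusable set 
into the caller-provided collection. A tiny stand-alone check:

{code}
// Stand-alone check of the addAll() direction; names mirror the snippet above.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AddAllDirectionCheck {
  public static void main(String[] args) {
    Set<String> unusableRMNodesConcurrentSet = new HashSet<>();
    unusableRMNodesConcurrentSet.add("node-1:8041");

    List<String> unUsableNodes = new ArrayList<>();      // caller-provided collection
    unUsableNodes.addAll(unusableRMNodesConcurrentSet);  // source is copied into it

    System.out.println(unUsableNodes);                   // prints [node-1:8041]
  }
}
{code}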



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2885) Create AMRMProxy request interceptor for distributed scheduling decisions for queueable containers

2015-12-02 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037386#comment-15037386
 ] 

Arun Suresh commented on YARN-2885:
---

Thank you for the review, [~leftnoteasy]!! Let me try to clarify your 
concerns.. [~kkaranasos], correct me if I'm wrong..

bq. I'm not sure if it is possible that queueable resource requests could also 
be sent to the RM with this implementation.
What we were aiming for is to not send any queueable resource requests to the 
RM; they are handled by the LocalRM, the core functionality of which is now 
encapsulated in the DistSchedulerRequestInterceptor class. As [~sriramrao] had 
mentioned, we do plan to enforce policies around how the distributed scheduling 
is actually done on the NM. In the first cut (this JIRA), these policies, which 
WILL be pushed down from the RM, would be things like *maximum resource 
capability of allocated containers* or *set of nodes on which to target 
queueable containers*. These would be computed at the RM and sent back as part 
of the AllocateResponse and the RegisterResponse. The plan is to have that 
actual computation happen in the Coordinator running in the RM, which we plan 
to tackle as part of YARN-4412.
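
To make the splitting concrete, below is a minimal sketch of the interceptor 
idea, under the assumption that queueable asks are separated from guaranteed 
asks and satisfied locally while the rest follow the normal AM-RM path. The 
class and collaborator names (DistSchedRequestInterceptorSketch, LocalScheduler, 
RMClient, isQueueable) are illustrative; this is not the actual patch or the 
real AMRMProxy interceptor interface.

{code}
// Illustrative sketch only: class and collaborator names are hypothetical.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class DistSchedRequestInterceptorSketch {

  /** Hypothetical collaborators, declared only to keep the sketch self-contained. */
  public interface LocalScheduler {
    boolean isQueueable(ResourceRequest ask);   // however the ask ends up being tagged
    List<Container> allocate(List<ResourceRequest> asks);
  }
  public interface RMClient {
    AllocateResponse allocate(AllocateRequest request) throws Exception;
  }

  private final LocalScheduler localScheduler;  // NM-local allocator
  private final RMClient rmClient;              // normal AM-RM path

  public DistSchedRequestInterceptorSketch(LocalScheduler localScheduler,
      RMClient rmClient) {
    this.localScheduler = localScheduler;
    this.rmClient = rmClient;
  }

  public AllocateResponse allocate(AllocateRequest request) throws Exception {
    List<ResourceRequest> guaranteed = new ArrayList<>();
    List<ResourceRequest> queueable = new ArrayList<>();
    for (ResourceRequest ask : request.getAskList()) {
      if (localScheduler.isQueueable(ask)) {
        queueable.add(ask);
      } else {
        guaranteed.add(ask);
      }
    }

    // Queueable asks never leave the node: the LocalRM places them, e.g. using
    // the node list pushed down from the RM.
    List<Container> localContainers = localScheduler.allocate(queueable);

    // Guaranteed asks follow the normal AM-RM path.
    request.setAskList(guaranteed);
    AllocateResponse response = rmClient.allocate(request);

    // Merge the locally allocated containers into the response seen by the AM.
    List<Container> merged = new ArrayList<>(response.getAllocatedContainers());
    merged.addAll(localContainers);
    response.setAllocatedContainers(merged);
    return response;
  }
}
{code}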

bq. I'm not quite sure why isDistributedSchedulingEnabled is required for the 
AM's AllocateRequest and RegisterRequest
I totally agree that the AM should not be bothered with this. But if you 
notice, it is actually not set by the AM; it is set by the 
DistSchedulerRequestInterceptor when it proxies the AM calls. To further your 
point, I am also not really happy with putting things in the Allocate/Register 
responses that can be seen by the AM but are only relevant to the DistScheduler 
framework. I was thinking of the following alternatives:
# Create a wrapper protocol (Distributed Scheduling AM Protocol) over the AM 
protocol, which basically wraps each request/response with additional info that 
will be seen only by the DistScheduler running on the NM.
# Have a Distributed Scheduler AM Service running on the RM if DS is enabled. 
This will implement the new protocol (it will delegate all the AMProtocol calls 
to the AMService and will handle the DistScheduler-specific parts).
# Instead of having the DSReqInterceptor at the beginning of the AMRMProxy 
pipeline, add it to the end (or replace the DefaultReqInterceptor) and have it 
talk the new DistSchedulerAMProtocol (which wraps the Allocate/Register 
requests with the extra DS info).
What do you think? I will take a crack at this in the next patch.



Regarding #3, I just wanted a conf to specify that Dist Scheduling has been 
turned on; if set to false, it will revert to the default behavior of sending 
even the queueable requests to the RM.


I think most of #4 will be taken care of if we create a wrapper protocol as I 
mentioned earlier.
.. w.r.t. getContainerIdStart: technically, the containerId for each app starts 
from the RM epoch, which is what I wanted to pass on to the NM.
.. agreed, I will change the name of getNodeList.
.. w.r.t. containerTokenExpiryInterval: this gets sent from the RM and signifies 
the token expiry for allocated queueable containers; I don't think it would 
vary per NM.
.. w.r.t. getMin/MaxAllocatableCapability: we wanted this to be something that 
is specific to queueable containers and is policy driven (or decided by the 
Dist coordinator). I agree, we can change its name.

Regarding #5: agreed, I will make the changes to the public APIs in a separate JIRA.

Hope this makes sense?






> Create AMRMProxy request interceptor for distributed scheduling decisions for 
> queueable containers
> --
>
> Key: YARN-2885
> URL: https://issues.apache.org/jira/browse/YARN-2885
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Arun Suresh
> Attachments: YARN-2885-yarn-2877.001.patch
>
>
> We propose to add a Local ResourceManager (LocalRM) to the NM in order to 
> support distributed scheduling decisions. 
> Architecturally we leverage the RMProxy, introduced in YARN-2884. 
> The LocalRM makes distributed decisions for queuable containers requests. 
> Guaranteed-start requests are still handled by the central RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period

2015-12-02 Thread Xianyin Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037180#comment-15037180
 ] 

Xianyin Xin commented on YARN-4403:
---

Hi [~djp], this is a good suggestion; YARN-4177 has some discussion on this, so 
I have linked it.

> (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating 
> period
> 
>
> Key: YARN-4403
> URL: https://issues.apache.org/jira/browse/YARN-4403
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-4403.patch
>
>
> Currently, (AM/NM/Container)LivelinessMonitor uses the current system time to 
> calculate the expiry duration, which can be broken by settimeofday. We 
> should use Time.monotonicNow() instead.
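
To make the failure mode concrete, here is a minimal sketch (not part of the 
patch) contrasting a wall-clock expiry check with a monotonic one; the class 
and interval value are illustrative. A settimeofday() jump larger than the 
interval makes the wall-clock variant expire everything immediately, while the 
monotonic variant only reacts to real elapsed time.

{code}
// Illustrative sketch; the interval value and class are not from the patch.
import org.apache.hadoop.util.Time;

public class ExpiryCheckSketch {

  private static final long EXPIRE_INTERVAL_MS = 10 * 60 * 1000L;

  // Wall-clock based: a settimeofday() jump of more than the interval makes
  // every tracked entity look expired (or never expire) immediately.
  static boolean isExpiredWallClock(long lastHeartbeatWallClockMs) {
    return System.currentTimeMillis() - lastHeartbeatWallClockMs > EXPIRE_INTERVAL_MS;
  }

  // Monotonic based: only genuinely elapsed time can trigger expiry.
  static boolean isExpiredMonotonic(long lastHeartbeatMonotonicMs) {
    return Time.monotonicNow() - lastHeartbeatMonotonicMs > EXPIRE_INTERVAL_MS;
  }
}
{code}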



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4002) make ResourceTrackerService.nodeHeartbeat more concurrent

2015-12-02 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-4002:
-
Attachment: YARN-4002-v0.patch

Added a patch for this.

> make ResourceTrackerService.nodeHeartbeat more concurrent
> -
>
> Key: YARN-4002
> URL: https://issues.apache.org/jira/browse/YARN-4002
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Hong Zhiguo
>Assignee: Hong Zhiguo
>Priority: Critical
> Attachments: YARN-4002-v0.patch
>
>
> We have multiple RPC threads to handle NodeHeartbeatRequest from NMs. By 
> design the method ResourceTrackerService.nodeHeartbeat should be concurrent 
> enough to scale for large clusters.
> But we have a "BIG" lock in NodesListManager.isValidNode which I think it's 
> unnecessary.
> First, the fields "includes" and "excludes" of HostsFileReader are only 
> updated on "refresh nodes".  All RPC threads handling node heartbeats are 
> only readers.  So RWLock could be used to  alow concurrent access by RPC 
> threads.
> Second, since he fields "includes" and "excludes" of HostsFileReader are 
> always updated by "reference assignment", which is atomic in Java, the reader 
> side lock could just be skipped.
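
As a rough sketch of the two alternatives in the description, with illustrative 
field and class names rather than the real NodesListManager/HostsFileReader 
code: either guard the host sets with a ReentrantReadWriteLock so heartbeat 
handlers only take the read lock, or publish immutable sets through a single 
(volatile) reference so readers need no lock at all.

{code}
// Illustrative sketch; field and class names are not the real RM code.
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class HostsViewSketch {

  // Option 1: read-write lock. Many heartbeat threads read concurrently;
  // refreshNodes() takes the write lock only while swapping the sets.
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private Set<String> includes = new HashSet<>();
  private Set<String> excludes = new HashSet<>();

  public boolean isValidNodeWithLock(String host) {
    lock.readLock().lock();
    try {
      return (includes.isEmpty() || includes.contains(host)) && !excludes.contains(host);
    } finally {
      lock.readLock().unlock();
    }
  }

  public void refreshWithLock(Set<String> newIncludes, Set<String> newExcludes) {
    lock.writeLock().lock();
    try {
      includes = newIncludes;
      excludes = newExcludes;
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Option 2: lock-free reads. The refresh builds new immutable sets and
  // publishes them with a single (volatile) reference assignment.
  private volatile HostsView view =
      new HostsView(Collections.<String>emptySet(), Collections.<String>emptySet());

  public boolean isValidNodeLockFree(String host) {
    HostsView v = view;  // one volatile read yields a consistent include/exclude pair
    return (v.includes.isEmpty() || v.includes.contains(host)) && !v.excludes.contains(host);
  }

  public void refreshLockFree(Set<String> newIncludes, Set<String> newExcludes) {
    view = new HostsView(newIncludes, newExcludes);
  }

  private static final class HostsView {
    final Set<String> includes;
    final Set<String> excludes;
    HostsView(Set<String> includes, Set<String> excludes) {
      this.includes = Collections.unmodifiableSet(includes);
      this.excludes = Collections.unmodifiableSet(excludes);
    }
  }
}
{code}

One note on the lock-free variant: pairing the reference swap with a volatile 
field (or equivalent) is what guarantees readers see a consistent, fully 
constructed include/exclude pair.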



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4405) Support node label store in non-appendable file system

2015-12-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037246#comment-15037246
 ] 

Hadoop QA commented on YARN-4405:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
1s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 1s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 15s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
29s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 40s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
40s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 2s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 29s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 53s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 58s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 58s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 14s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 14s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 29s 
{color} | {color:red} Patch generated 7 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 264, now 267). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 41s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
41s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 14 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 55s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 24s {color} 
| {color:red} hadoop-yarn-api in the patch failed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 59s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 59m 59s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 26s {color} 
| {color:red} hadoop-yarn-api in the patch failed with JDK v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 14s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_85. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 60m 50s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_85. {color} |

[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling

2015-12-02 Thread Konstantinos Karanasos (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037335#comment-15037335
 ] 

Konstantinos Karanasos commented on YARN-2877:
--

Thank you for the detailed comments, [~leftnoteasy].

Regarding #1:
- Indeed the AM-LocalRM communication should be much more frequent than the 
LocalRM-RM (and subsequently AM-RM) communication, in order to achieve 
millisecond-latency allocations.
We are planning to address this by having smaller heartbeat intervals in the 
AM-LocalRM communication compared to the LocalRM-RM one. For instance, the 
AM-LocalRM heartbeat interval can be set to 50ms and the LocalRM-RM interval 
to 200ms (in other words, we will propagate only one in every four heartbeats 
to the RM).
We will soon create a sub-JIRA for this.
- Each NM will periodically estimate its expected queue wait time (YARN-2886). 
This can simply be based on the number of tasks currently in its queue, or 
(even better) based on the estimated execution times of those tasks (in case 
they are available). Then, this expected queue wait time is pushed through the 
NM-RM heartbeats to the ClusterMonitor (YARN-4412) that is running as a service 
in the RM. The ClusterMonitor gathers this information from all nodes, 
periodically computes the least loaded nodes (i.e., with the smallest queue 
wait times), and adds that list to the heartbeat response, so that all nodes 
(and in turn LocalRMs) get the list. This list is then used during scheduling 
in the LocalRM.
Note that simpler solutions (such as the power of two choices used in Sparrow) 
could be employed, but our experiments have shown that the above "top-k node 
list" leads to considerably better placement (and thus load balancing), 
especially when task durations are heterogeneous.
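
As a small illustration of the "top-k node list" computation described above 
(names such as NodeQueueLoad and topK are hypothetical, not the eventual 
YARN-4412 API): keep the k nodes with the smallest estimated queue wait time 
using a bounded max-heap over the per-node reports.

{code}
// Illustrative sketch; NodeQueueLoad and topK are hypothetical names.
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopKNodeSelector {

  public static final class NodeQueueLoad {
    public final String nodeId;
    public final long estimatedQueueWaitMs;  // reported on the NM-RM heartbeat
    public NodeQueueLoad(String nodeId, long estimatedQueueWaitMs) {
      this.nodeId = nodeId;
      this.estimatedQueueWaitMs = estimatedQueueWaitMs;
    }
  }

  /** Returns the k node ids with the smallest estimated queue wait time. */
  public static List<String> topK(Collection<NodeQueueLoad> reports, int k) {
    // Bounded max-heap keyed on wait time: the root is the worst of the
    // current best-k, so any better report evicts it.
    PriorityQueue<NodeQueueLoad> heap = new PriorityQueue<>(
        Comparator.comparingLong((NodeQueueLoad n) -> n.estimatedQueueWaitMs).reversed());
    for (NodeQueueLoad report : reports) {
      heap.offer(report);
      if (heap.size() > k) {
        heap.poll();  // drop the node with the largest wait time
      }
    }
    List<NodeQueueLoad> best = new ArrayList<>(heap);
    best.sort(Comparator.comparingLong(n -> n.estimatedQueueWaitMs));
    List<String> result = new ArrayList<>(best.size());
    for (NodeQueueLoad n : best) {
      result.add(n.nodeId);
    }
    return result;
  }
}
{code}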

Regarding #2:
This is a valid concern. The best way to minimize preemption is through the 
"top-k node list" technique described above. As the LocalRM will be placing the 
QUEUEABLE containers to the least loaded nodes, preemption will be minimized.
More techniques can be used to further mitigate the problem. For instance, we 
can "promote" a QUEUEABLE container to a GUARANTEED one in case it has been 
preempted more than k times.
Moreover, we can dynamically set limits to the number of QUEUEABLE containers 
accepted by a node in case of excessive load due to GUARANTEED containers.
That said, as you also mention, QUEUEABLE containers are more suitable for 
short-running tasks, where the probability of a container being preempted is 
smaller.

> Extend YARN to support distributed scheduling
> -
>
> Key: YARN-2877
> URL: https://issues.apache.org/jira/browse/YARN-2877
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager
>Reporter: Sriram Rao
>Assignee: Konstantinos Karanasos
> Attachments: distributed-scheduling-design-doc_v1.pdf
>
>
> This is an umbrella JIRA that proposes to extend YARN to support distributed 
> scheduling.  Briefly, some of the motivations for distributed scheduling are 
> the following:
> 1. Improve cluster utilization by opportunistically executing tasks on 
> otherwise idle resources on individual machines.
> 2. Reduce allocation latency for tasks where the scheduling time dominates 
> (i.e., task execution time is much less than the time required for obtaining 
> a container from the RM).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period

2015-12-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037208#comment-15037208
 ] 

Sunil G commented on YARN-4403:
---

Thanks [~xinxianyin] for updating this; I missed it somehow.
It seems you are handling the monotonic clock in various YARN code paths that 
use a clock, so this can be made a general YARN ticket.
Meanwhile I will raise an MR ticket to handle this for MapReduce; MAPREDUCE-6562 
is linked for the same.

> (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating 
> period
> 
>
> Key: YARN-4403
> URL: https://issues.apache.org/jira/browse/YARN-4403
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-4403.patch
>
>
> Currently, (AM/NM/Container)LivelinessMonitor uses the current system time to 
> calculate the expiry duration, which can be broken by settimeofday. We 
> should use Time.monotonicNow() instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-4361) Total resource count mistake:NodeRemovedSchedulerEvent in ReconnectNodeTransition will reduce the newNode.getTotalCapability() in Multi-thread model

2015-12-02 Thread jialei weng (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jialei weng resolved YARN-4361.
---
Resolution: Duplicate

> Total resource count mistake:NodeRemovedSchedulerEvent in 
> ReconnectNodeTransition will reduce the newNode.getTotalCapability() in 
> Multi-thread model
> 
>
> Key: YARN-4361
> URL: https://issues.apache.org/jira/browse/YARN-4361
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.2
>Reporter: jialei weng
>  Labels: patch
> Attachments: YARN-4361v1.patch
>
>
> Total resource count mistake:
> The NodeRemovedSchedulerEvent in ReconnectNodeTransition can reduce the total 
> by newNode.getTotalCapability() in a multi-threaded model. Since the RMNode 
> and the scheduler use different event queues, the remove-update-add operations 
> are not guaranteed to run in sequence, so the total resource is sometimes 
> reduced by newNode.getTotalCapability() when the NodeRemovedSchedulerEvent is 
> handled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4340) Add "list" API to reservation system

2015-12-02 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036873#comment-15036873
 ] 

Subru Krishnan commented on YARN-4340:
--

Thinking more about it, I am not sure we should have the user in the interface. 
We should automatically pick up the user from the context. If we do want to 
allow specifying a user, then it should be an Admin API and not a Client API IMHO.

> Add "list" API to reservation system
> 
>
> Key: YARN-4340
> URL: https://issues.apache.org/jira/browse/YARN-4340
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Carlo Curino
>Assignee: Sean Po
> Attachments: YARN-4340.v1.patch, YARN-4340.v2.patch, 
> YARN-4340.v3.patch, YARN-4340.v4.patch
>
>
> This JIRA tracks changes to the APIs of the reservation system, and enables 
> querying the reservation system on which reservation exists by "time-range, 
> reservation-id, username".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2877) Extend YARN to support distributed scheduling

2015-12-02 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036916#comment-15036916
 ] 

Wangda Tan commented on YARN-2877:
--

Thanks [~kkaranasos], [~asuresh],

I just caught up with the latest design doc; my 2 cents:
There are two major purposes of a distributed RM: 1) better allocation latency, 
and 2) leveraging idle resources.

#1 will be achieved when
- AM -> LocalRM communication can be done within a single RPC call (not a 
heartbeat like normal AM-RM allocation); otherwise it will be hard to achieve 
millisecond-level latency.
- The LocalRM has enough information to allocate resources on an NM that can be 
used directly without waiting. I think stochastic placement plus caching some 
information from other LocalRMs could solve the problem.

#2 can be achieved, but the distributed RM solution doesn't have a global 
picture of resources, and guaranteed containers can always preempt queueable 
containers, which could lead to excessive preemption of queueable containers.
If the RM can decide where to allocate queueable containers, it could avoid a 
lot of such preemptions (instead of allocating on a node that has lots of 
queueable containers, allocate on a node with "real" idle resources).
To me, this becomes a bigger issue if an application wants to use opportunistic 
resources to run normal containers (such as a 10-minute MR task). How to 
guarantee that the RM doesn't allocate more resources to a LocalRM for a long 
time is a problem. IMO a distributed RM is more suitable for short-lived (few 
seconds) and low-latency tasks.

> Extend YARN to support distributed scheduling
> -
>
> Key: YARN-2877
> URL: https://issues.apache.org/jira/browse/YARN-2877
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager
>Reporter: Sriram Rao
>Assignee: Konstantinos Karanasos
> Attachments: distributed-scheduling-design-doc_v1.pdf
>
>
> This is an umbrella JIRA that proposes to extend YARN to support distributed 
> scheduling.  Briefly, some of the motivations for distributed scheduling are 
> the following:
> 1. Improve cluster utilization by opportunistically executing tasks on 
> otherwise idle resources on individual machines.
> 2. Reduce allocation latency for tasks where the scheduling time dominates 
> (i.e., task execution time is much less than the time required for obtaining 
> a container from the RM).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers

2015-12-02 Thread Robert Kanter (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036857#comment-15036857
 ] 

Robert Kanter commented on YARN-4408:
-

Test failure looks unrelated.

> NodeManager still reports negative running containers
> -
>
> Key: YARN-4408
> URL: https://issues.apache.org/jira/browse/YARN-4408
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-4408.001.patch, YARN-4408.002.patch, 
> YARN-4408.003.patch
>
>
> YARN-1697 fixed a problem where the NodeManager metrics could report a 
> negative number of running containers.  However, it missed a rare case where 
> this can still happen.
> YARN-1697 added a flag to indicate if the container was actually launched 
> ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which 
> is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to 
> {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge 
> if we actually ran the container and incremented the gauge.  However, this 
> flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to 
> {{DONE}}.
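
A minimal sketch of the guard being discussed, with illustrative names rather 
than the actual NodeManager metrics code: the gauge is incremented only on the 
LOCALIZED -> RUNNING transition, so every terminal transition (including 
EXITED_WITH_SUCCESS -> DONE) has to check the same wasLaunched flag before 
decrementing, otherwise the count can go negative.

{code}
// Illustrative sketch; the real transitions live in the container state machine.
public class RunningContainersGaugeSketch {

  private int runningContainers;  // stands in for the NodeManager metrics gauge
  private boolean wasLaunched;    // set only if the container actually started

  void onContainerLaunched() {    // LOCALIZED -> RUNNING
    wasLaunched = true;
    runningContainers++;
  }

  void onContainerDone() {        // any terminal state -> DONE, incl. EXITED_WITH_SUCCESS
    if (wasLaunched) {            // decrement only what was actually incremented
      runningContainers--;
    }
  }
}
{code}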



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler

2015-12-02 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036864#comment-15036864
 ] 

Wangda Tan commented on YARN-4225:
--

Thanks [~eepayne],
bq. The use case is a newer client is querying an older server...
I'm wondering if this is a valid use case: IMHO, rolling upgrades should always 
be server-first. If we plan to support a newer client talking to an older 
server, we may experience many issues AND we would need to add this to Hadoop's 
compatibility policy.

> Add preemption status to yarn queue -status for capacity scheduler
> --
>
> Key: YARN-4225
> URL: https://issues.apache.org/jira/browse/YARN-4225
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN-4225.001.patch, YARN-4225.002.patch, 
> YARN-4225.003.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4409) Fix javadoc and checkstyle issues in timelineservice code

2015-12-02 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-4409:
---
Description: 
There are a large number of javadoc and checkstyle issues currently open in 
timelineservice code. We need to fix them before we merge it into trunk.

Refer to 
https://issues.apache.org/jira/browse/YARN-3862?focusedCommentId=15035267=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15035267
We still have 94 open checkstyle issues and javadocs failing for Java 8.

  was:There are a large number of javadoc and checkstyle issues currently open 
in timelineservice code. We need to fix them before we merge it into trunk.


> Fix javadoc and checkstyle issues in timelineservice code
> -
>
> Key: YARN-4409
> URL: https://issues.apache.org/jira/browse/YARN-4409
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>
> There are a large number of javadoc and checkstyle issues currently open in 
> timelineservice code. We need to fix them before we merge it into trunk.
> Refer to 
> https://issues.apache.org/jira/browse/YARN-3862?focusedCommentId=15035267=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15035267
> We still have 94 open checkstyle issues and javadocs failing for Java 8.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4309) Add debug information to application logs when a container fails

2015-12-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035588#comment-15035588
 ] 

Hadoop QA commented on YARN-4309:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
28s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 1s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 16s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
30s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 33s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
39s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
44s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 29s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 46s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 0s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 0s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 21s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 21s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 30s 
{color} | {color:red} Patch generated 3 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 356, now 356). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 33s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
39s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
25s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 50s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 59s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 56s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 26s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_85. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 16s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 22s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.7.0_85. {color} |

[jira] [Resolved] (YARN-4410) hadoop

2015-12-02 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S resolved YARN-4410.
-
Resolution: Invalid

It looks like this was created by mistake. Closing as invalid. 

> hadoop
> --
>
> Key: YARN-4410
> URL: https://issues.apache.org/jira/browse/YARN-4410
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: qeko
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4411) ResourceManager IllegalArgumentException error

2015-12-02 Thread yarntime (JIRA)
yarntime created YARN-4411:
--

 Summary: ResourceManager IllegalArgumentException error
 Key: YARN-4411
 URL: https://issues.apache.org/jira/browse/YARN-4411
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.1
Reporter: yarntime


In version 2.7.1, line 1914 of RMAppAttemptImpl may cause an 
IllegalArgumentException:
  YarnApplicationAttemptState.valueOf(this.getState().toString())
This is caused by this.getState() returning an RMAppAttemptState, which may not 
be convertible to a YarnApplicationAttemptState.

java.lang.IllegalArgumentException: No enum constant 
org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.LAUNCHED_UNMANAGED_SAVING
at java.lang.Enum.valueOf(Enum.java:236)
at 
org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.createApplicationAttemptReport(RMAppAttemptImpl.java:1870)
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationAttemptReport(ClientRMService.java:355)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationAttemptReport(ApplicationClientProtocolPBServiceImpl.java:355)
at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
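
For illustration, the exception arises because RMAppAttemptState has 
internal-only constants (such as LAUNCHED_UNMANAGED_SAVING) with no counterpart 
in the public YarnApplicationAttemptState, so a blind valueOf() on the name 
throws. Below is a hedged sketch of an explicit mapping that folds internal-only 
states into the nearest public one (illustrative, not the actual fix in YARN):

{code}
// Illustrative sketch; the mapping below is not the actual fix in YARN.
import org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptState;

public class AttemptStateMappingSketch {

  public static YarnApplicationAttemptState toPublicState(RMAppAttemptState state) {
    switch (state) {
      case NEW:
        return YarnApplicationAttemptState.NEW;
      case SUBMITTED:
        return YarnApplicationAttemptState.SUBMITTED;
      case LAUNCHED_UNMANAGED_SAVING:  // internal-only: fold into the closest public state
      case LAUNCHED:
        return YarnApplicationAttemptState.LAUNCHED;
      case RUNNING:
        return YarnApplicationAttemptState.RUNNING;
      case FINISHED:
        return YarnApplicationAttemptState.FINISHED;
      case FAILED:
        return YarnApplicationAttemptState.FAILED;
      case KILLED:
        return YarnApplicationAttemptState.KILLED;
      default:
        // The remaining internal states (scheduling/saving/finishing variants)
        // would need similar explicit handling in a real implementation.
        throw new IllegalArgumentException("Unmapped attempt state: " + state);
    }
  }
}
{code}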




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

