[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-11-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996545#comment-14996545
 ] 

Hadoop QA commented on YARN-2934:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 6s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
57s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 13s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 7s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
56s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
53s {color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 16s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in 
trunk has 3 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 15s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 55s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
9s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 34s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 34s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 11s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 11s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 58s 
{color} | {color:red} Patch generated 4 new checkstyle issues in root (total 
was 453, now 454). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
52s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 1s 
{color} | {color:red} The patch has 2 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 6s 
{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 introduced 1 new FindBugs issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 12s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 49s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 46s 
{color} | {color:green} hadoop-common in the patch passed with JDK v1.8.0_60. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_60. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 47s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 26s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 7s 
{color} | {color:green} hadoop-common in the patch passed with JDK v1.7.0_79. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 22s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_79. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 2s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 

[jira] [Commented] (YARN-4331) Restarting NodeManager leaves orphaned containers

2015-11-09 Thread Joseph Francis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996559#comment-14996559
 ] 

Joseph Francis commented on YARN-4331:
--

[~jlowe] Setting yarn.nodemanager.recovery.enabled=true does solve the issue 
with orphaned containers.
Note that the SIGKILL was only done locally to emulate a few production issues we 
had that caused nodemanagers to fall over.
Thanks very much for your clear explanation!
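
For reference, a minimal sketch of enabling NM recovery programmatically (e.g. for a test setup); in a real deployment the same properties go into yarn-site.xml, and the recovery directory value below is just an example path, not taken from this issue:
{code}
// Minimal sketch: yarn.nodemanager.recovery.enabled is the property named in the
// comment above; the recovery state directory property and its value are shown as
// an example and should be checked against your Hadoop version.
import org.apache.hadoop.conf.Configuration;

public class NmRecoveryConfig {
  public static Configuration withNmRecovery() {
    Configuration conf = new Configuration();
    conf.setBoolean("yarn.nodemanager.recovery.enabled", true);
    conf.set("yarn.nodemanager.recovery.dir", "/var/lib/hadoop-yarn/nm-recovery"); // example path
    return conf;
  }
}
{code}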

> Restarting NodeManager leaves orphaned containers
> -
>
> Key: YARN-4331
> URL: https://issues.apache.org/jira/browse/YARN-4331
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.7.1
>Reporter: Joseph Francis
>Priority: Critical
>
> We are seeing a lot of orphaned containers running in our production clusters.
> I tried to simulate this locally on my machine and can replicate the issue by 
> killing nodemanager.
> I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza 
> jobs.
> Steps:
> {quote}1. Deploy a job 
> 2. Issue a kill -9 signal to nodemanager 
> 3. We should see the AM and its container running without nodemanager
> 4. AM should die but the container still keeps running
> 5. Restarting nodemanager brings up new AM and container but leaves the 
> orphaned container running in the background
> {quote}
> This is effectively causing double processing of data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2015-11-09 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996513#comment-14996513
 ] 

Jun Gong commented on YARN-2047:


Sorry for the late reply. 

The issue aims to make sure that a lost NM's containers are marked expired by 
the RM even across an RM restart. What I suggested aims to solve that problem in 
another way. Any thoughts?

{quote}
If this is a required action then it would also imply that saving a such nodes 
would be a critical state change operation. So, e.g. decommission command from 
the admin should not complete until the store has been updated. Is that the 
case?
{quote}
Yes, it is. However, the store process is often very fast, so it might be 
acceptable.

> RM should honor NM heartbeat expiry after RM restart
> 
>
> Key: YARN-2047
> URL: https://issues.apache.org/jira/browse/YARN-2047
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>
> After the RM restarts, it forgets about existing NM's (and their potentially 
> decommissioned status too). After restart, the RM cannot maintain the 
> contract to the AM's that a lost NM's containers will be marked finished 
> within the expiry time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2015-11-09 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996539#comment-14996539
 ] 

Jun Gong commented on YARN-2047:


Another thought: the RM rebuilds containers' information from the AMs.

When an AM re-registers with the RM, it reports its running containers' information 
to the RM. The RM records them in a HashSet *amRunningContainers*, queries each of 
them by calling *getRMContainer(containerId)*, and removes them from 
*amRunningContainers* if the RMContainer exists. When an NM re-registers with the RM, 
the RM removes all the containers that the NM reports from *amRunningContainers*. 
After some time (the NM expiry interval), the RM iterates over *amRunningContainers* 
and tells the corresponding AMs that those containers have finished (see the sketch 
below).

The result seems the same as what this issue aims for. However, it requires adding 
to or modifying the AM's register RPC.
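
A rough, hypothetical sketch of that bookkeeping (class and method names are made up for illustration; the real change would also involve the *getRMContainer(containerId)* lookups and the AM register RPC, and proper locking):
{code}
// Illustrative only, not actual RM code: container IDs reported by re-registering
// AMs are remembered, removed again as NMs re-register and confirm them, and
// whatever remains after the NM expiry interval is reported back to the AMs as
// finished.
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class AmReportedContainerTracker {
  private final Set<String> amRunningContainers = new HashSet<>();

  // AM re-registered and reported the containers it believes are still running.
  public void onAmReregister(Collection<String> reportedContainerIds) {
    amRunningContainers.addAll(reportedContainerIds);
  }

  // NM re-registered; the containers it reports are accounted for, so drop them.
  public void onNmReregister(Collection<String> nmContainerIds) {
    amRunningContainers.removeAll(nmContainerIds);
  }

  // After the NM expiry interval: anything still here was never confirmed by an
  // NM, so the RM would tell the corresponding AMs those containers have finished.
  public Set<String> unconfirmedAfterExpiry() {
    return new HashSet<>(amRunningContainers);
  }
}
{code}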

> RM should honor NM heartbeat expiry after RM restart
> 
>
> Key: YARN-2047
> URL: https://issues.apache.org/jira/browse/YARN-2047
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>
> After the RM restarts, it forgets about existing NM's (and their potentially 
> decommissioned status too). After restart, the RM cannot maintain the 
> contract to the AM's that a lost NM's containers will be marked finished 
> within the expiry time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state in CS

2015-11-09 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3946:

Attachment: (was: YARN3946_attemptDiagnistic message.png)

> Allow fetching exact reason as to why a submitted app is in ACCEPTED state in 
> CS
> 
>
> Key: YARN-3946
> URL: https://issues.apache.org/jira/browse/YARN-3946
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Sumit Nigam
>Assignee: Naganarasimha G R
> Attachments: 3946WebImages.zip, YARN-3946.v1.001.patch, 
> YARN-3946.v1.002.patch
>
>
> Currently there is no direct way to get the exact reason as to why a 
> submitted app is still in ACCEPTED state. It should be possible to know 
> through RM REST API as to what aspect is not being met - say, queue limits 
> being reached, or core/ memory requirement not being met, or AM limit being 
> reached, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-4331) Restarting NodeManager leaves orphaned containers

2015-11-09 Thread Joseph Francis (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Francis resolved YARN-4331.
--
Resolution: Not A Problem

> Restarting NodeManager leaves orphaned containers
> -
>
> Key: YARN-4331
> URL: https://issues.apache.org/jira/browse/YARN-4331
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.7.1
>Reporter: Joseph Francis
>Priority: Critical
>
> We are seeing a lot of orphaned containers running in our production clusters.
> I tried to simulate this locally on my machine and can replicate the issue by 
> killing nodemanager.
> I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza 
> jobs.
> Steps:
> {quote}1. Deploy a job 
> 2. Issue a kill -9 signal to nodemanager 
> 3. We should see the AM and its container running without nodemanager
> 4. AM should die but the container still keeps running
> 5. Restarting nodemanager brings up new AM and container but leaves the 
> orphaned container running in the background
> {quote}
> This is effectively causing double processing of data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2015-11-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997478#comment-14997478
 ] 

Jason Lowe commented on YARN-4311:
--

Sorry, I couldn't find a reference to {{isUntracked}} in trunk or in 
YARN-3223, so I'm not sure I understand exactly what is being asked.

To be consistent with HDFS, the node should be gracefully decommissioned if it 
appears in the include and exclude lists simultaneously; otherwise, once it's 
removed from the include list, it's a hard decommission.
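
A tiny sketch of that decision rule as stated (hypothetical helper, not actual ResourceManager code):
{code}
// Sketch of the rule described above: in both lists -> graceful (as in HDFS),
// dropped from the include list -> hard decommission.
public class DecommissionRule {
  enum Action { GRACEFUL, HARD, NONE }

  static Action actionFor(boolean inIncludeList, boolean inExcludeList) {
    if (inIncludeList && inExcludeList) {
      return Action.GRACEFUL;   // listed in both include and exclude lists
    }
    if (!inIncludeList) {
      return Action.HARD;       // removed from the include list
    }
    return Action.NONE;
  }
}
{code}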

We could implement a "grace period" where nodes that were removed from the 
cluster are still "tracked" in the UI for a while before being removed.  That 
may help with some of the potentially confusing cases where a node is 
accidentally booted from the cluster.

> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-4311-v1.patch
>
>
> In order to fully forget about a node, removing the node from include and 
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The 
> tricky part that [~jlowe] pointed out was the case when include lists are not 
> used, in that case we don't want the nodes to fall off if they are not active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2015-11-09 Thread Kuhu Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997366#comment-14997366
 ] 

Kuhu Shukla commented on YARN-4311:
---

Thank you [~jlowe] for the comments.
For graceful refresh nodes, I was looking at YARN-41 and YARN-3223. For this 
fix, if we remove the node from all lists when isUntracked is true, the 
decommissioning node falls back to the same behavior as a decommissioned node. 
Would it be better if, for both {{refreshNodes}} and {{refreshNodesGracefully}}, 
an 'untracked' node were moved to the shutdown nodes irrespective of its previous 
state, and then taken out of the shutdown nodes after a timeout?

Let me know if this makes more sense. Thanks!

> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-4311-v1.patch
>
>
> In order to fully forget about a node, removing the node from include and 
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The 
> tricky part that [~jlowe] pointed out was the case when include lists are not 
> used, in that case we don't want the nodes to fall off if they are not active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)

2015-11-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997342#comment-14997342
 ] 

Hudson commented on YARN-3840:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #658 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/658/])
YARN-3840. Resource Manager web ui issue when sorting application by id 
(jianhe: rev 8fbea531d7f7b665f6f55af54c8ebf330118ff37)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AppPage.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/TaskPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebApp.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebApp.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/dt-plugin-1.10.7/sorting/natural.js.gz
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/AllApplicationsPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebPageUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/view/JQueryUI.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/TasksPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/AllContainersPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AppAttemptPage.java


> Resource Manager web ui issue when sorting application by id (with 
> application having id > 9999)
> 
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: LINTE
>Assignee: Mohammad Shahid Khan
> Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, 
> YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch, YARN-3840-6.patch, 
> yarn-3840-7.patch
>
>
> On the WEBUI, the global main view page : 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> 9999.
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)

2015-11-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997805#comment-14997805
 ] 

Hudson commented on YARN-3840:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2528 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2528/])
YARN-3840. Resource Manager web ui issue when sorting application by id 
(jianhe: rev 8fbea531d7f7b665f6f55af54c8ebf330118ff37)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/view/JQueryUI.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebApp.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/dt-plugin-1.10.7/sorting/natural.js.gz
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/TasksPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebApp.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AppPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AppAttemptPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/AllContainersPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/AllApplicationsPage.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/TaskPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebPageUtils.java


> Resource Manager web ui issue when sorting application by id (with 
> application having id > 9999)
> 
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: LINTE
>Assignee: Mohammad Shahid Khan
> Fix For: 2.8.0, 2.7.3
>
> Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, 
> YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch, YARN-3840-6.patch, 
> yarn-3840-7.patch
>
>
> On the WEBUI, the global main view page : 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> 9999.
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-09 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-4051:
---
Attachment: YARN-4051.04.patch

By default the NM registers with the RM after all containers are recovered, and 
the user can set a timeout value.
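
A generic sketch of that behavior (illustrative only, not the attached patch): registration blocks until recovery completes or a configurable timeout elapses.
{code}
// Generic illustration: the registration thread waits on a latch that is released
// once all containers are recovered, or gives up waiting after a user-set timeout.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class RecoveryGate {
  private final CountDownLatch recovered = new CountDownLatch(1);

  void markRecoveryDone() {              // called once all containers are recovered
    recovered.countDown();
  }

  void awaitBeforeRegistering(long timeoutMs) throws InterruptedException {
    // Proceed either when recovery completes or when the timeout elapses.
    recovered.await(timeoutMs, TimeUnit.MILLISECONDS);
  }
}
{code}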

> ContainerKillEvent is lost when container is  In New State and is recovering
> 
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch
>
>
> As in YARN-4050, the NM event dispatcher is blocked and the container is in the 
> New state; when we finish the application, the container is still alive even 
> after the NM event dispatcher is unblocked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1565) Add a way for YARN clients to get critical YARN system properties from the RM

2015-11-09 Thread Pradeep Subrahmanion (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997977#comment-14997977
 ] 

Pradeep Subrahmanion commented on YARN-1565:


Can anybody help me with how to proceed on this one?

> Add a way for YARN clients to get critical YARN system properties from the RM
> -
>
> Key: YARN-1565
> URL: https://issues.apache.org/jira/browse/YARN-1565
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.2.0
>Reporter: Steve Loughran
> Attachments: YARN-1565-001.patch, YARN-1565-002.patch, 
> YARN-1565-003.patch, YARN-1565-004.patch
>
>
> If you are trying to build up an AM request, you need to know
> # the limits of memory, core  for the chosen queue
> # the existing YARN classpath
> # the path separator for the target platform (so your classpath comes out 
> right)
> # cluster OS: in case you need some OS-specific changes
> The classpath can be in yarn-site.xml, but a remote client may not have that. 
> The site-xml file doesn't list Queue resource limits, cluster OS or the path 
> separator.
> A way to query the RM for these values would make it easier for YARN clients 
> to build up AM submissions with less guesswork and client-side config.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4325) purge app state from NM state-store should be independent of log aggregation

2015-11-09 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997932#comment-14997932
 ] 

zhangshilong commented on YARN-4325:


If the HDFS permissions are right, is there any other problem?
If yarn.log-aggregation-enable is set to false, does NM recovery work well?

> purge app state from NM state-store should be independent of log aggregation
> 
>
> Key: YARN-4325
> URL: https://issues.apache.org/jira/browse/YARN-4325
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
>
> On a long-running cluster, we found tens of thousands of stale apps still 
> being recovered during NM restart recovery. The reason is a wrong configuration 
> setting for log aggregation, so the end-of-log-aggregation events are not 
> received and stale apps are not purged properly. We should make sure the 
> removal of app state is independent of the log aggregation life cycle. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997974#comment-14997974
 ] 

Hadoop QA commented on YARN-4051:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 6s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
12s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 50s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
27s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
37s {color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 17s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in 
trunk has 3 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 23s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 45s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 46s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 46s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 25s 
{color} | {color:red} Patch generated 1 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 265, now 265). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
54s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 18s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 48s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 19s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_60. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 46s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 22m 50s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_79. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 2s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 23m 23s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 

[jira] [Commented] (YARN-4338) NPE in RegularContainerAllocator.preCheckForNewContainer()

2015-11-09 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997807#comment-14997807
 ] 

Naganarasimha G R commented on YARN-4338:
-

IMHO, I think it's worth a try since null is treated as the default label anyway, 
so functionally it's fine. Even if it fails I expect some test cases to fail, but 
it will prevent future test cases from having to handle this explicitly. 
Thoughts?
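
As an illustration of the null-safe form this comment argues for (a fragment paralleling the code quoted in the issue description below, not a committed fix):
{code}
// Hypothetical null-safe variant of the quoted check: a null node label
// expression is treated the same as the default label (NO_LABEL).
String labelExpression = anyRequest.getNodeLabelExpression();
if (labelExpression == null
    || RMNodeLabelsManager.NO_LABEL.equals(labelExpression)) {
  missedNonPartitionedRequestSchedulingOpportunity =
      application.addMissedNonPartitionedRequestSchedulingOpportunity(priority);
}
{code}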


> NPE in RegularContainerAllocator.preCheckForNewContainer()
> --
>
> Key: YARN-4338
> URL: https://issues.apache.org/jira/browse/YARN-4338
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xinwei Qin 
>Priority: Minor
> Attachments: YARN-4338.001.patch
>
>
> The codes in RegularContainerAllocator.preCheckForNewContainer():
> {code}
> if (anyRequest.getNodeLabelExpression()
> .equals(RMNodeLabelsManager.NO_LABEL)) {
>   missedNonPartitionedRequestSchedulingOpportunity =
>   application
>   .addMissedNonPartitionedRequestSchedulingOpportunity(priority);
> }
> {code}
> {code}anyRequest.getNodeLabelExpression(){code}may return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4341) add doc about timeline performance tool usage

2015-11-09 Thread Chang Li (JIRA)
Chang Li created YARN-4341:
--

 Summary: add doc about timeline performance tool usage
 Key: YARN-4341
 URL: https://issues.apache.org/jira/browse/YARN-4341
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Chang Li
Assignee: Chang Li






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4339) optimize timeline server performance tool

2015-11-09 Thread Chang Li (JIRA)
Chang Li created YARN-4339:
--

 Summary: optimize timeline server performance tool
 Key: YARN-4339
 URL: https://issues.apache.org/jira/browse/YARN-4339
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Chang Li
Assignee: Chang Li


As [~Naganarasimha] suggested in YARN-2556, the test could be optimized by having 
some initial LevelDB data before testing the performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2015-11-09 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218.2.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.2.patch, 
> YARN-4218.patch, YARN-4218.wip.patch, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job and preemption metrics shows how many containers were preempted on a job. 
> However we don't have a metric showing the resource*time footprint cost of 
> preemption. In other words, we know how many containers were preempted but we 
> don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that has 100 containers preempted that only lasted a minute each and were 
> very small is going to be less impacted than a job that only lost a single 
> container but that container was huge and had been running for 3 days.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4340) Add "list" API to reservation system

2015-11-09 Thread Carlo Curino (JIRA)
Carlo Curino created YARN-4340:
--

 Summary: Add "list" API to reservation system
 Key: YARN-4340
 URL: https://issues.apache.org/jira/browse/YARN-4340
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Carlo Curino
Assignee: Sean Po


This JIRA tracks changes to the APIs of the reservation system, and enables 
querying the reservation system on which reservation exists by "time-range, 
reservation-id, username".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3862) Decide which contents to retrieve and send back in response in TimelineReader

2015-11-09 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3862:
---
Attachment: YARN-3862-YARN-2928.wip.03.patch

> Decide which contents to retrieve and send back in response in TimelineReader
> -
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch, YARN-3862-YARN-2928.wip.03.patch
>
>
> Currently, we will retrieve all the contents of the field if that field is 
> specified in the query API. In case of configs and metrics, this can become a 
> lot of data even though the user doesn't need it. So we need to provide a way 
> to query only a set of configs or metrics.
> As a comma-separated list of configs/metrics to be returned will be quite 
> cumbersome to specify, we have to support either of the following options:
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window to return metrics in 
> that window. This may be useful in plotting graphs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server

2015-11-09 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997139#comment-14997139
 ] 

Chang Li commented on YARN-2556:


Created YARN-4341 to track the work of adding documentation about timeline performance tool usage.

> Tool to measure the performance of the timeline server
> --
>
> Key: YARN-2556
> URL: https://issues.apache.org/jira/browse/YARN-2556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Chang Li
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, 
> YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, 
> YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, 
> YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.15.patch, 
> YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, 
> YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, 
> YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch
>
>
> We need to be able to understand the capacity model for the timeline server 
> to give users the tools they need to deploy a timeline server with the 
> correct capacity.
> I propose we create a mapreduce job that can measure timeline server write 
> and read performance. Transactions per second, I/O for both read and write 
> would be a good start.
> This could be done as an example or test job that could be tied into gridmix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3862) Decide which contents to retrieve and send back in response in TimelineReader

2015-11-09 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997161#comment-14997161
 ] 

Varun Saxena commented on YARN-3862:


Attached a WIP patch.

This patch attempts to handle all the tables (creation of the filter list based 
on fields) and to do prefix matching for configs and metrics. The previous WIP 
patch only attempted to handle the entity table because of the implications of 
this patch on config and metric filters' matching. I have handled this scenario 
in this patch. 
When YARN-3863 is done, some changes will be warranted though (some conditions 
to pass config and metric filters will have to be removed). 
I have added a few tests to cover the change as well.

I have still not hooked up this code to the REST API layer.
For that, we first need to decide whether the TimelineFilter code will be 
part of our object model or not.
For prefix matching of configs and metrics to return, at the REST layer this 
can simply come as a query param (a comma-separated list).
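
A generic illustration of what such prefix matching could look like (this is not the TimelineReader code; the helper below is made up to show the idea of returning only configs whose keys start with a requested prefix):
{code}
// Generic sketch: keep only entries whose key starts with one of the requested
// prefixes, e.g. prefixes parsed from a comma-separated query parameter.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ConfigPrefixFilter {
  public static Map<String, String> filterByPrefix(Map<String, String> configs,
                                                   Set<String> prefixes) {
    Map<String, String> result = new HashMap<>();
    for (Map.Entry<String, String> e : configs.entrySet()) {
      for (String prefix : prefixes) {
        if (e.getKey().startsWith(prefix)) {
          result.put(e.getKey(), e.getValue());
          break;
        }
      }
    }
    return result;
  }
}
{code}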

But when we code for complex filters (especially metric filters) in YARN-3863, 
we will have to support SQL-type queries with ANDs, ORs, >, <, = operators, etc.
If we make TimelineFilter part of our client object model and interpret 
filters as a JSON string associated with a query param, we might have to 
rethink a few of the classes and include additional checks (as this will be 
used by the client).
This can increase the size of the URL though.

If we do not include the filter as part of our object model, we will have to decide 
how to specify complex config and metric filters containing ANDs, ORs and 
different relational operators (because some of the symbols will be reserved) 
and reach a consensus on that.


> Decide which contents to retrieve and send back in response in TimelineReader
> -
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch, YARN-3862-YARN-2928.wip.03.patch
>
>
> Currently, we will retrieve all the contents of the field if that field is 
> specified in the query API. In case of configs and metrics, this can become a 
> lot of data even though the user doesn't need it. So we need to provide a way 
> to query only a set of configs or metrics.
> As a comma-separated list of configs/metrics to be returned will be quite 
> cumbersome to specify, we have to support either of the following options:
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window to return metrics in 
> that window. This may be useful in plotting graphs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4132) Nodemanagers should try harder to connect to the RM

2015-11-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996997#comment-14996997
 ] 

Jason Lowe commented on YARN-4132:
--

Thanks for updating the patch, Chang!

createRMProxy(conf, protocol, instance) should be implemented in terms of 
createRMProxy(retryTime, retryInterval, conf, protocol, instance) rather than 
copying the code.  It can do the conf lookups to get the retry values and call 
the other.  Then I don't see a need to check for -1 values.
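
A minimal sketch of that delegation, with illustrative property names (the actual names and signatures are what is being reviewed here, so treat everything below as an assumption):
{code}
// Sketch: the short overload looks up the retry settings from the Configuration
// and delegates to the fully parameterized overload instead of duplicating it.
import org.apache.hadoop.conf.Configuration;

public class RMProxySketch {
  // Property names are illustrative placeholders, not the names under review.
  static final String NM_RM_CONNECT_MAX_WAIT_MS = "yarn.nodemanager.resourcemanager.connect.max-wait.ms";
  static final String NM_RM_CONNECT_RETRY_INTERVAL_MS = "yarn.nodemanager.resourcemanager.connect.retry-interval.ms";

  public static <T> T createRMProxy(Configuration conf, Class<T> protocol, T instance) {
    long retryTime = conf.getLong(NM_RM_CONNECT_MAX_WAIT_MS, -1);
    long retryInterval = conf.getLong(NM_RM_CONNECT_RETRY_INTERVAL_MS, -1);
    // Delegate instead of copying the body of the other overload.
    return createRMProxy(retryTime, retryInterval, conf, protocol, instance);
  }

  public static <T> T createRMProxy(long retryTime, long retryInterval,
      Configuration conf, Class<T> protocol, T instance) {
    // ... build the retry policy from retryTime/retryInterval and create the proxy ...
    return instance; // placeholder
  }
}
{code}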

".rm." should be ".resourcemanager.".  There's already precedent in the 
nodemanager.resourcemanager.minimum.version property.  Similarly "retry.ms" 
should be "retry-interval.ms" to be consistent with the existing 
resourcemanager properties.

The added test takes a long time to run for just one test (around 25 seconds); 
please tune down the retry intervals.

Style nit: usually extra parameters for a function overload of an existing 
function are passed at the end of the other form.  Not a must-fix.

> Nodemanagers should try harder to connect to the RM
> ---
>
> Key: YARN-4132
> URL: https://issues.apache.org/jira/browse/YARN-4132
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4132.2.patch, YARN-4132.3.patch, YARN-4132.4.patch, 
> YARN-4132.patch
>
>
> Being part of the cluster, nodemanagers should try very hard (and possibly 
> never give up) to connect to a resourcemanager. Minimally we should have a 
> separate config to set how aggressively a nodemanager will connect to the RM 
> separate from what clients will do.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4053) Change the way metric values are stored in HBase Storage

2015-11-09 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997121#comment-14997121
 ] 

Varun Saxena commented on YARN-4053:


Vrushali, thanks for your comments.
I would like to work on this. Let me take a stab at this one; I will have the 
bandwidth.
I hope that's fine. You can help me with the reviews.

Coming to the points, 
I agree that a flag is not good for extensibility. As I said earlier, a flag should 
be fine for now as we have only 2 choices (generic or long) and we can extend 
later. 
But eventually we will have to have different handlers for different types, so why 
not do it now. Hence, let's go with the proposal above.

Moreover, yes, we need to have proper handling based on data type or conversion 
mechanism in FlowScanner too. As mentioned in an earlier comment, I was 
thinking we can indicate this in attributes. But I guess your proposal sounds 
better. We can identify the column/column prefix in flow scanner as well and 
convert based on the converter attached to it.
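
As a generic illustration of that idea (not the patch code), a converter attached to a column or column prefix could be as small as:
{code}
// Hypothetical converter interface: both the write path and the FlowScanner would
// use the converter attached to a column/column prefix to encode/decode values.
public interface ValueConverter<T> {
  byte[] encode(T value);
  T decode(byte[] bytes);
}
{code}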

bq. it missed one of the places in the current patch for example
Which place? MIN/MAX handling?

bq. For single value vs time series, we suggest using a column prefix to 
distinguish them
Do we need to differentiate between SINGLE_VALUE and TIME_SERIES if by 
default it will be read as SINGLE_VALUE? Because we will be storing multiple 
values even for a metric of type SINGLE_VALUE. Do you mean that on the read side, 
only the latest value of a metric is to be returned if it's of type SINGLE_VALUE 
(even if the client asks for TIME_SERIES)? Again, the assumption here is that 
the client will always send the metric type (SINGLE_VALUE or TIME_SERIES) 
consistently.

bq. For the read path, we can assume it is a single value unless specifically 
specified by the client as a time series (as clients would need to intend to 
read time series explicitly).
We can return TIME_SERIES by indicating something like METRICS_TIME_SERIES as 
fields. If we do so, it will have implications for YARN-3862.
Now the question is whether to return values for multiple timestamps even for 
a metric of type SINGLE_VALUE if the client asks for it. What if the client wants 
to see the values of a gauge (which might be considered a SINGLE_VALUE) over a 
period of time, for instance? If yes, do we even need to differentiate between the 
two types?

bq. We finally concluded that we should start with storing longs only and make 
the code strictly accept longs 
JAX-RS, i.e. the REST API layer, will convert an integral value to an Integer 
automatically if it is less than Integer.MAX_VALUE, so I guess we will have to 
handle ints and shorts as well; if the value is an Integer, for instance, we can 
call Integer#longValue to convert it to long.
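
A small, generic illustration of that normalization (hypothetical helper, not the patch code):
{code}
// Normalize the integral types JAX-RS may hand us (Integer, Short, ...) to a long
// before writing the metric value; non-integral types are rejected.
public class MetricValueNormalizer {
  public static long toLong(Number value) {
    if (value instanceof Long || value instanceof Integer
        || value instanceof Short || value instanceof Byte) {
      return value.longValue();   // Integer#longValue etc. widen safely to long
    }
    throw new IllegalArgumentException(
        "Metric values are expected to be integral, got: " + value.getClass());
  }
}
{code}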

bq. Regarding indicating whether to aggregate or not, we suggest to rely mostly 
on the flow run aggregation. For those use cases that need to access metrics 
off of tables other than the flow run table (e.g. time-based aggregation), we 
need to explore ways to specify this information as input (config, etc.)
I hope Li Lu is fine with this because I remember him saying on YARN-3816 that 
he will be using it for offline aggregation in YARN-3817. I think rows from 
the application table are being used in the MR job there. Are you suggesting that 
for offline aggregation, based on config, we aggregate all the application 
metrics (to flow or user) or nothing?
Or that we configure a set of metrics to aggregate in some config?

> Change the way metric values are stored in HBase Storage
> 
>
> Key: YARN-4053
> URL: https://issues.apache.org/jira/browse/YARN-4053
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-4053-YARN-2928.01.patch, 
> YARN-4053-YARN-2928.02.patch
>
>
> Currently HBase implementation uses GenericObjectMapper to convert and store 
> values in backend HBase storage. This converts everything into a string 
> representation(ASCII/UTF-8 encoded byte array).
> While this is fine in most cases, it does not quite serve our use case for 
> metrics. 
> So we need to decide how are we going to encode and decode metric values and 
> store them in HBase.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3862) Decide which contents to retrieve and send back in response in TimelineReader

2015-11-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997153#comment-14997153
 ] 

Hadoop QA commented on YARN-3862:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 7s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 6m 21s 
{color} | {color:red} root in YARN-2928 failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 2m 32s 
{color} | {color:red} hadoop-yarn-server-timelineservice in YARN-2928 failed 
with JDK v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 12s 
{color} | {color:red} hadoop-yarn-server-timelineservice in YARN-2928 failed 
with JDK v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
12s {color} | {color:green} YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
21s {color} | {color:green} YARN-2928 passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 12s 
{color} | {color:red} hadoop-yarn-server-timelineservice in YARN-2928 failed. 
{color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 15s 
{color} | {color:red} hadoop-yarn-server-timelineservice in YARN-2928 failed 
with JDK v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s 
{color} | {color:green} YARN-2928 passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 12s 
{color} | {color:red} hadoop-yarn-server-timelineservice in the patch failed. 
{color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 9s 
{color} | {color:red} hadoop-yarn-server-timelineservice in the patch failed 
with JDK v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 9s {color} 
| {color:red} hadoop-yarn-server-timelineservice in the patch failed with JDK 
v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 13s 
{color} | {color:red} hadoop-yarn-server-timelineservice in the patch failed 
with JDK v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 13s {color} 
| {color:red} hadoop-yarn-server-timelineservice in the patch failed with JDK 
v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 10s 
{color} | {color:red} Patch generated 43 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice
 (total was 102, now 128). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 1 line(s) with tabs. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 13s 
{color} | {color:red} hadoop-yarn-server-timelineservice in the patch failed. 
{color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 11s 
{color} | {color:red} hadoop-yarn-server-timelineservice in the patch failed 
with JDK v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 9s {color} | 
{color:red} hadoop-yarn-server-timelineservice in the patch failed with JDK 
v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 12s {color} 
| {color:red} hadoop-yarn-server-timelineservice in the patch failed with JDK 
v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 18s 
{color} | {color:red} Patch generated 1 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 13m 24s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.7.1 Server=1.7.1 
Image:test-patch-base-hadoop-date2015-11-09 |
| JIRA Patch URL | 

[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server

2015-11-09 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996875#comment-14996875
 ] 

Chang Li commented on YARN-2556:


Thanks [~Naganarasimha] for suggesting the optimization! +1 on the idea of creating 
some initial LevelDB data before testing the performance. Created YARN-4339 to work 
on this idea.

> Tool to measure the performance of the timeline server
> --
>
> Key: YARN-2556
> URL: https://issues.apache.org/jira/browse/YARN-2556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Chang Li
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, 
> YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, 
> YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, 
> YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.15.patch, 
> YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, 
> YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, 
> YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch
>
>
> We need to be able to understand the capacity model for the timeline server 
> to give users the tools they need to deploy a timeline server with the 
> correct capacity.
> I propose we create a mapreduce job that can measure timeline server write 
> and read performance. Transactions per second, I/O for both read and write 
> would be a good start.
> This could be done as an example or test job that could be tied into gridmix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit

2015-11-09 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996886#comment-14996886
 ] 

Eric Payne commented on YARN-3769:
--

bq. you don't need to do componentwiseMax here, since minPendingAndPreemptable 
<= headroom, and you can use subtractFrom to make the code simpler.
[~leftnoteasy], you are right, we do know that {{minPendingAndPreemptable <= 
headroom}}. Thanks for the catch. I will make those changes.
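For reference, a minimal sketch of the simplification being discussed, assuming the headroom bookkeeping goes through the Resources utility class; this is an illustration of the idea, not the actual YARN-3769 patch:
{code}
// Since minPendingAndPreemptable <= headroom is already guaranteed, the
// remaining headroom can be reduced with an in-place subtraction instead of
// recomputing a componentwiseMax over the two resources.
Resources.subtractFrom(headroom, minPendingAndPreemptable);
{code}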

> Preemption occurring unnecessarily because preemption doesn't consider user 
> limit
> -
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, 
> YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, 
> YARN-3769.003.patch, YARN-3769.004.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996687#comment-14996687
 ] 

Jason Lowe commented on YARN-4051:
--

If I understand this correctly, we're saying that the problem described in 
YARN-4050 is holding up the main event dispatcher and the NM is semi-hung, yet 
we want to hurry and register with the ResourceManager before containers have 
recovered?  Seems to me we need to address the problem described in YARN-4050 
if possible (e.g.: skip HDFS operations if we recovered at least one container 
in the running or completed states since we know it must have done HDFS init in 
the previous NM instance).  Otherwise we are hacking around the fact that we 
registered too soon and aren't able to properly handle the out-of-order events. 
 I'd much rather deal with the root cause if possible than patch all the 
separate symptoms.
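A rough sketch of the skip described above, with hypothetical names for the recovery state and the blocking HDFS call (an illustration of the idea only, not actual NM code):
{code}
// If any recovered container already reached LAUNCHED or COMPLETED, the
// previous NM instance must have performed the HDFS init for this app, so
// the blocking remote-log-dir check can be skipped during recovery.
boolean hdfsInitDoneByPreviousNM = false;
for (RecoveredContainerState c : recoveredContainers) {  // hypothetical iteration
  if (c.getStatus() == RecoveredContainerStatus.LAUNCHED
      || c.getStatus() == RecoveredContainerStatus.COMPLETED) {
    hdfsInitDoneByPreviousNM = true;
    break;
  }
}
if (!hdfsInitDoneByPreviousNM) {
  verifyAndCreateRemoteLogDir(conf);  // the HDFS call that blocks when the NN is slow
}
{code}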

> ContainerKillEvent is lost when container is  In New State and is recovering
> 
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New 
> state, when we finish application, the container still alive even after NM 
> event dispatcher is unblocked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server

2015-11-09 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996832#comment-14996832
 ] 

Chang Li commented on YARN-2556:


Hi [~xgong], here is the usage printed out by the tool: {code} 
Usage: [-m ] number of mappers (default: 1)
 [-v] timeline service version
 [-mtype ]
  1. simple entity write mapper
  2. jobhistory files replay mapper
 [-s <(KBs)test>] number of KB per put (mtype=1, default: 1 KB)
 [-t] package sending iterations per mapper (mtype=1, default: 100)
 [-d ] root path of job history files (mtype=2)
 [-r ] (mtype=2)
  1. write all entities for a job in one put (default)
  2. write one entity at a time{code}
There are two different test modes. One is the simple entity writer, where each 
mapper creates entities of the specified size and puts them to the timeline server. 
The other mode replays job history files, which offers a more realistic test. For 
the job history file replay test, you put the test job history files (both the job 
history file and the job conf file) under a directory and point the tool at that 
directory with the -d option. You select the test mode with the -mtype option. 
Right now the usage is only printed when you pass wrong options, not when you pass 
no options at all. When you give no parameters, the test runs in simple entity 
write mode with the default settings. So maybe we want to print this usage when no 
parameters are passed?

> Tool to measure the performance of the timeline server
> --
>
> Key: YARN-2556
> URL: https://issues.apache.org/jira/browse/YARN-2556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Chang Li
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, 
> YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, 
> YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, 
> YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.15.patch, 
> YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, 
> YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, 
> YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch
>
>
> We need to be able to understand the capacity model for the timeline server 
> to give users the tools they need to deploy a timeline server with the 
> correct capacity.
> I propose we create a mapreduce job that can measure timeline server write 
> and read performance. Transactions per second, I/O for both read and write 
> would be a good start.
> This could be done as an example or test job that could be tied into gridmix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4183) Enabling generic application history forces every job to get a timeline service delegation token

2015-11-09 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997327#comment-14997327
 ] 

Jonathan Eagles commented on YARN-4183:
---

Here are the requirements that users at scale need, and unfortunately the 
config design does not allow for this properly. Let me draw up what the 
requirements in my mind should be, based on my current knowledge. This is by no 
means an edict, just a conversation starting point, so you know where I'm 
coming from.

# Jobs that make use of the timeline service may have a hard or soft runtime 
dependency on the timeline service.
-- Jobs that interact directly with the timeline service (TimelineClient) 
should obtain a delegation token to use the service and optionally allow for a 
non-fatal runtime dependency (the job is allowed to run, but no history is written).
-- Jobs that don't interact with the timeline service 
(EntityFileTimelineClient) should obtain HDFS delegation tokens, but should 
not obtain timeline service delegation tokens.
# Jobs that don't make use of the timeline service should have no runtime 
dependency on the timeline service and should be allowed to submit and 
run jobs freely regardless of the timeline service status.
# YARN services that interact with the timeline server (Generic History 
Server) may have a runtime dependency on the timeline service that does not 
disrupt job submission.

The issue regarding this jira is that putting yarn.timeline-service.enabled in 
the client xml (breaks #2 above) forces every job (both MR, not using the timeline 
service, and Tez, using the timeline service) to have a runtime dependency on the 
timeline service. This places an artificial runtime dependency on the timeline 
service, which is neither highly available nor highly scalable until v2.0.

The issue with putting yarn.timeline-service.enabled in the resource 
manager (breaks #3 above) is that every YarnClientImpl (used in job status, 
used in job submission) now reaches out to get a delegation token. This 
makes the timeline service (neither highly scalable nor highly available until 
v2.0) a runtime dependency for job submission and gets many unnecessary 
delegation tokens for YarnClients that never intend to use them.
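For illustration, a simplified sketch of the client-side gate being described, assuming the standard YarnConfiguration constant for yarn.timeline-service.enabled and an already constructed TimelineClient; the helper at the end is hypothetical and this is not the actual YarnClientImpl code:
{code}
// When the flag is on in the client config, submission fetches a timeline
// delegation token, pulling the timeline service into the job's runtime path.
if (conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED,
    YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED)) {
  Token<TimelineDelegationTokenIdentifier> timelineToken =
      timelineClient.getDelegationToken(renewer);
  addTimelineToken(appContext, timelineToken);  // hypothetical helper
}
{code}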

> Enabling generic application history forces every job to get a timeline 
> service delegation token
> 
>
> Key: YARN-4183
> URL: https://issues.apache.org/jira/browse/YARN-4183
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Mit Desai
>Assignee: Mit Desai
> Fix For: 3.0.0, 2.8.0, 2.7.2
>
> Attachments: YARN-4183.1.patch
>
>
> When enabling just the Generic History Server and not the timeline server, 
> the system metrics publisher will not publish the events to the timeline 
> store as it checks if the timeline server and system metrics publisher are 
> enabled before creating a timeline client.
> To make it work, if the timeline service flag is turned on, it will force 
> every yarn application to get a delegation token.
> Instead of checking if timeline service is enabled, we should be checking if 
> application history server is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2015-11-09 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997181#comment-14997181
 ] 

Bikas Saha commented on YARN-2047:
--

I think the general idea is that the AM cannot be trusted about allocated 
resources or running containers.

> RM should honor NM heartbeat expiry after RM restart
> 
>
> Key: YARN-2047
> URL: https://issues.apache.org/jira/browse/YARN-2047
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>
> After the RM restarts, it forgets about existing NM's (and their potentially 
> decommissioned status too). After restart, the RM cannot maintain the 
> contract to the AM's that a lost NM's containers will be marked finished 
> within the expiry time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997296#comment-14997296
 ] 

Jason Lowe commented on YARN-4334:
--

Thanks for the prototype, Chang!

Ideally when attempting to recover from an old state we should still remember 
the apps but recover them in a completed state (either killed or failed).  It 
looks like the prototype will cause the RM to completely forget everything, 
which isn't ideal.  Without recovering the state but still leaving it in the 
state store, we risk a situation like the following:
# RM restarts late, recovers nothing
# RM updates the store timestamp
# RM restarts 
# RM tries to recover all the old state left from the first instance that 
wasn't cleaned up in the second

Was there a reason to use a raw thread and sleeps for the update rather than a 
Timer?  In either case it needs to be a daemon thread.
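As a point of reference, a minimal sketch of the java.util.Timer alternative, with hypothetical names for the store call and the interval (an illustration, not the YARN-4334 prototype):
{code}
// A daemon Timer periodically refreshes the liveness timestamp in the state
// store, replacing a hand-rolled Thread with sleep() calls.
Timer livenessTimer = new Timer("RMStateStoreLivenessUpdater", true /* daemon */);
livenessTimer.scheduleAtFixedRate(new TimerTask() {
  @Override
  public void run() {
    stateStore.storeLivenessTimestamp(System.currentTimeMillis());  // hypothetical method
  }
}, 0L, updateIntervalMs);  // interval from the (to-be-documented) conf setting
{code}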

The recovery code should check the version first before doing anything else 
with the state store.

The conf settings give no hints in their names nor any documentation as to what 
units to use.  Is it milliseconds?  minutes?  hours?  Why a default of 1?

"RMLivenessKey" should be a static final constant to avoid the chance of typos.

The code has no check for the key missing a value -- db.get will return null if 
the key is missing.

Nit: a setting of zero should be equivalent to a -1 setting.  It makes no sense 
to configure it so the store is always expired.



> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)

2015-11-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997214#comment-14997214
 ] 

Hudson commented on YARN-3840:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #647 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/647/])
YARN-3840. Resource Manager web ui issue when sorting application by id 
(jianhe: rev 8fbea531d7f7b665f6f55af54c8ebf330118ff37)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/AllContainersPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebPageUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/dt-plugin-1.10.7/sorting/natural.js.gz
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/TaskPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebApp.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/AllApplicationsPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AppAttemptPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AppPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/view/JQueryUI.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebApp.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/TasksPage.java


> Resource Manager web ui issue when sorting application by id (with 
> application having id > )
> 
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: LINTE
>Assignee: Mohammad Shahid Khan
> Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, 
> YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch, YARN-3840-6.patch, 
> yarn-3840-7.patch
>
>
> On the WEBUI, the global main view page : 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> .
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server

2015-11-09 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997275#comment-14997275
 ] 

Sangjin Lee commented on YARN-2556:
---

+1 with the proposal to add documentation. The command line help is useful, but 
it would be good to have a little more detail in the documentation.

> Tool to measure the performance of the timeline server
> --
>
> Key: YARN-2556
> URL: https://issues.apache.org/jira/browse/YARN-2556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Jonathan Eagles
>Assignee: Chang Li
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0
>
> Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, 
> YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, 
> YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, 
> YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.15.patch, 
> YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, 
> YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, 
> YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch
>
>
> We need to be able to understand the capacity model for the timeline server 
> to give users the tools they need to deploy a timeline server with the 
> correct capacity.
> I propose we create a mapreduce job that can measure timeline server write 
> and read performance. Transactions per second, I/O for both read and write 
> would be a good start.
> This could be done as an example or test job that could be tied into gridmix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)

2015-11-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997185#comment-14997185
 ] 

Hudson commented on YARN-3840:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #8780 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8780/])
YARN-3840. Resource Manager web ui issue when sorting application by id 
(jianhe: rev 8fbea531d7f7b665f6f55af54c8ebf330118ff37)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AppAttemptPage.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/view/JQueryUI.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/AllApplicationsPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/dt-plugin-1.10.7/sorting/natural.js.gz
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/TasksPage.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/TaskPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebPageUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/AllContainersPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebApp.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AppPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebApp.java


> Resource Manager web ui issue when sorting application by id (with 
> application having id > )
> 
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: LINTE
>Assignee: Mohammad Shahid Khan
> Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, 
> YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch, YARN-3840-6.patch, 
> yarn-3840-7.patch
>
>
> On the WEBUI, the global main view page : 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> .
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4183) Enabling generic application history forces every job to get a timeline service delegation token

2015-11-09 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997267#comment-14997267
 ] 

Sangjin Lee commented on YARN-4183:
---

Sorry I missed this one as well.

Maybe this is a FAQ somewhere, but what are the relationships among the 
following 3 settings?
# yarn.timeline-service.enabled
# yarn.timeline-service.generic-application-history.enabled
# yarn.resourcemanager.system-metrics-publisher.enabled

Can (1) and (2) be set independently, or does setting one have an implication 
on the other? How about (3)?

From the v.2 perspective, there is no separate "generic application history 
service" anyway, and we will have to handle this problem in a different 
manner.

> Enabling generic application history forces every job to get a timeline 
> service delegation token
> 
>
> Key: YARN-4183
> URL: https://issues.apache.org/jira/browse/YARN-4183
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Mit Desai
>Assignee: Mit Desai
> Fix For: 3.0.0, 2.8.0, 2.7.2
>
> Attachments: YARN-4183.1.patch
>
>
> When enabling just the Generic History Server and not the timeline server, 
> the system metrics publisher will not publish the events to the timeline 
> store as it checks if the timeline server and system metrics publisher are 
> enabled before creating a timeline client.
> To make it work, if the timeline service flag is turned on, it will force 
> every yarn application to get a delegation token.
> Instead of checking if timeline service is enabled, we should be checking if 
> application history server is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (YARN-4334) Ability to avoid ResourceManager recovery if state store is "too old"

2015-11-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997296#comment-14997296
 ] 

Jason Lowe edited comment on YARN-4334 at 11/9/15 8:31 PM:
---

Thanks for the prototype, Chang!

Ideally when attempting to recover from an old state we should still remember 
the apps but recover them in a completed state (either killed or failed).  It 
looks like the prototype will cause the RM to completely forget everything, 
which isn't ideal.  Without recovering the state but still leaving it in the 
state store, we risk a situation like the following:
# RM restarts late, recovers nothing
# RM updates the store timestamp
# RM restarts 
# RM tries to recover all the old state left from the first instance that 
wasn't cleaned up in the second

Was there a reason to use a raw thread and sleeps for the update rather than a 
Timer?  In either case it needs to be a daemon thread.

The recovery code should check the version first before doing anything else 
with the state store.

The conf settings give no hints in their names nor any documentation as to what 
units to use.  Is it milliseconds?  minutes?  hours?  Why a default of 1?

"RMLivenessKey" should be a static final constant to avoid the chance of typos.

The code has no check for the key missing a value -- db.get will return null if 
the key is missing.

Nit: a setting of zero should be equivalent to a -1 setting.  It makes no sense 
to configure it so the store is always expired.




was (Author: jlowe):
Thanks for the prototype, Chang!

Ideally when attempting to recover from an old state we should still remember 
the apps but recover them in a completed state (either killed or failed).  It 
looks like the prototype will cause the RM to completely forget everything 
which isn't ideal.  WIthout recovering the state but yet leaving it in the 
state store then we risk a situation like the following:
# RM restarts late, recovers nothing
# RM updates the store timestamp
# RM restarts 
# RM tries to recover all the old state left from the first instance that 
wasn't cleaned up in the second

Was there a reason to use a raw thread and sleeps for the update rather than a 
Timer?  In either case it needs to be a daemon thread.

The recovery code should check the version first before doing anything else 
with the state store.

The conf settings give no hints in their name nor any documentation as to what 
units to use.  Is it millseconds?  minutes?  hours?  Why a default of 1?

"RMLivenessKey" should be a static final constant to avoid the chance of typos.

The code has no check for the key missing a value -- db.get will return NULL if 
the 

Nit: a setting of zero should be equivalent to a -1 setting.  It makes no sense 
to configure it so the store is always expired.



> Ability to avoid ResourceManager recovery if state store is "too old"
> -
>
> Key: YARN-4334
> URL: https://issues.apache.org/jira/browse/YARN-4334
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jason Lowe
>Assignee: Chang Li
> Attachments: YARN-4334.wip.patch
>
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4287) Capacity Scheduler: Rack Locality improvement

2015-11-09 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated YARN-4287:
-
Attachment: YARN-4287-minimal-v3.patch

Noticed a simple spelling error.

> Capacity Scheduler: Rack Locality improvement
> -
>
> Key: YARN-4287
> URL: https://issues.apache.org/jira/browse/YARN-4287
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Nathan Roberts
>Assignee: Nathan Roberts
> Attachments: YARN-4287-minimal-v2.patch, YARN-4287-minimal-v3.patch, 
> YARN-4287-minimal.patch, YARN-4287-v2.patch, YARN-4287-v3.patch, 
> YARN-4287-v4.patch, YARN-4287.patch
>
>
> YARN-4189 does an excellent job describing the issues with the current delay 
> scheduling algorithms within the capacity scheduler. The design proposal also 
> seems like a good direction.
> This jira proposes a simple interim solution to the key issue we've been 
> experiencing on a regular basis:
>  - rackLocal assignments trickle out due to nodeLocalityDelay. This can have 
> significant impact on things like CombineFileInputFormat which targets very 
> specific nodes in its split calculations.
> I'm not sure when YARN-4189 will become reality so I thought a simple interim 
> patch might make sense. The basic idea is simple: 
> 1) Separate delays for rackLocal, and OffSwitch (today there is only 1)
> 2) When we're getting rackLocal assignments, subsequent rackLocal assignments 
> should not be delayed
> Patch will be uploaded shortly. No big deal if the consensus is to go 
> straight to YARN-4189. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4234) New put APIs in TimelineClient for ats v1.5

2015-11-09 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-4234:

Attachment: YARN-4234.20151109.patch

> New put APIs in TimelineClient for ats v1.5
> ---
>
> Key: YARN-4234
> URL: https://issues.apache.org/jira/browse/YARN-4234
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-4234.1.patch, YARN-4234.2.patch, 
> YARN-4234.20151109.patch, YARN-4234.3.patch
>
>
> In this ticket, we will add new put APIs in timelineClient to let 
> clients/applications have the option to use ATS v1.5



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)

2015-11-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997899#comment-14997899
 ] 

Hudson commented on YARN-3840:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #589 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/589/])
YARN-3840. Resource Manager web ui issue when sorting application by id 
(jianhe: rev 8fbea531d7f7b665f6f55af54c8ebf330118ff37)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/AllApplicationsPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/dt-plugin-1.10.7/sorting/natural.js.gz
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebApp.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/TasksPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AppPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebPageUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AppAttemptPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/view/JQueryUI.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebApp.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/TaskPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/AllContainersPage.java


> Resource Manager web ui issue when sorting application by id (with 
> application having id > )
> 
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: LINTE
>Assignee: Mohammad Shahid Khan
> Fix For: 2.8.0, 2.7.3
>
> Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, 
> YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch, YARN-3840-6.patch, 
> yarn-3840-7.patch
>
>
> On the WEBUI, the global main view page : 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> .
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4339) optimize timeline server performance tool

2015-11-09 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997542#comment-14997542
 ] 

Naganarasimha G R commented on YARN-4339:
-

Thanks for raising this issue [~lichangleo], 
I would like the following along with it:
* It should be configurable whether to enable or disable populating the data 
(as it doesn't have any impact on ATS v2, and I am not sure about ATS v1.5).
* The amount of data to be populated (number and size) can also be captured.

> optimize timeline server performance tool
> -
>
> Key: YARN-4339
> URL: https://issues.apache.org/jira/browse/YARN-4339
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>
> As [~Naganarasimha] suggest in YARN-2556 that test could be optimized by 
> having some initial Level DB data before testing the performance



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2015-11-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997648#comment-14997648
 ] 

Jason Lowe edited comment on YARN-4311 at 11/9/15 11:31 PM:


bq. Could these be part of shutdown nodes or do we need a separate category for 
such nodes? Would just the count of such nodes suffice or do we want to view 
them while its within the grace period?

The intent is the list of nodes would be visible from the UI for some period of 
time, so users can see where a particular node went after the update.  I think 
these nodes could be part of the shutdown category since they were told to 
shutdown and leave the cluster.


was (Author: jlowe):
bq. Could these be part of shutdown nodes or do we need a separate category for 
such nodes? Would just the count of such nodes suffice or do we want to view 
them while its within the grace period?

The intent is the list of nodes would be visible from the UI for some period of 
time, so they can see where a particular node went after the update.  I think 
they could be part of the shutdown category since they were told to shutdown 
and leave the cluster.

> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-4311-v1.patch
>
>
> In order to fully forget about a node, removing the node from include and 
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The 
> tricky part that [~jlowe] pointed out was the case when include lists are not 
> used, in that case we don't want the nodes to fall off if they are not active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2015-11-09 Thread Kuhu Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997655#comment-14997655
 ] 

Kuhu Shukla commented on YARN-4311:
---

Thanks [~jlowe]. Will rework my patch accordingly.

> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-4311-v1.patch
>
>
> In order to fully forget about a node, removing the node from include and 
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The 
> tricky part that [~jlowe] pointed out was the case when include lists are not 
> used, in that case we don't want the nodes to fall off if they are not active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4338) NPE in RegularContainerAllocator.preCheckForNewContainer()

2015-11-09 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997654#comment-14997654
 ] 

Wangda Tan commented on YARN-4338:
--

Thanks for comments: [~sunilg]/[~Naganarasimha].

[~xinwei], I would prefer to keep the main logic as-is and fix the tests. The major 
concern is that people may think the node label expression requires a null check in CS 
logic, which could reduce code readability. I'm OK with common code (such as 
AppSchedulingInfo) checking null for nodeLabelExpression. Could you fix the tests of 
YARN-2618 instead of updating RegularContainerAllocator?

> NPE in RegularContainerAllocator.preCheckForNewContainer()
> --
>
> Key: YARN-4338
> URL: https://issues.apache.org/jira/browse/YARN-4338
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xinwei Qin 
>Priority: Minor
> Attachments: YARN-4338.001.patch
>
>
> The codes in RegularContainerAllocator.preCheckForNewContainer():
> {code}
> if (anyRequest.getNodeLabelExpression()
> .equals(RMNodeLabelsManager.NO_LABEL)) {
>   missedNonPartitionedRequestSchedulingOpportunity =
>   application
>   .addMissedNonPartitionedRequestSchedulingOpportunity(priority);
> }
> {code}
> {code}anyRequest.getNodeLabelExpression(){code}may return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2015-11-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997648#comment-14997648
 ] 

Jason Lowe commented on YARN-4311:
--

bq. Could these be part of shutdown nodes or do we need a separate category for 
such nodes? Would just the count of such nodes suffice or do we want to view 
them while its within the grace period?

The intent is the list of nodes would be visible from the UI for some period of 
time, so they can see where a particular node went after the update.  I think 
they could be part of the shutdown category since they were told to shutdown 
and leave the cluster.

> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-4311-v1.patch
>
>
> In order to fully forget about a node, removing the node from include and 
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The 
> tricky part that [~jlowe] pointed out was the case when include lists are not 
> used, in that case we don't want the nodes to fall off if they are not active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4287) Capacity Scheduler: Rack Locality improvement

2015-11-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997650#comment-14997650
 ] 

Hadoop QA commented on YARN-4287:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
48s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
17s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
25s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
34s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 15s 
{color} | {color:red} Patch generated 4 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 (total was 198, now 202). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
17s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
36s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 10s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 41s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
29s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 151m 16s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_79 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.7.0 Server=1.7.0 
Image:test-patch-base-hadoop-date2015-11-09 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12771428/YARN-4287-minimal-v3.patch
 |
| JIRA Issue | YARN-4287 |
| Optional Tests |  asflicense  javac  javadoc  mvninstall  unit  findbugs  
checkstyle  compile  |
| uname | Linux 3182d018451a 

[jira] [Created] (YARN-4338) NPE in RegularContainerAllocator.preCheckForNewContainer()

2015-11-09 Thread Xinwei Qin (JIRA)
Xinwei Qin  created YARN-4338:
-

 Summary: NPE in RegularContainerAllocator.preCheckForNewContainer()
 Key: YARN-4338
 URL: https://issues.apache.org/jira/browse/YARN-4338
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Xinwei Qin 
Priority: Minor


The codes in RegularContainerAllocator.preCheckForNewContainer():
{code}
if (anyRequest.getNodeLabelExpression()
.equals(RMNodeLabelsManager.NO_LABEL)) {
  missedNonPartitionedRequestSchedulingOpportunity =
  application
  .addMissedNonPartitionedRequestSchedulingOpportunity(priority);
}
{code}
{code}anyRequest.getNodeLabelExpression(){code}may return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4338) NPE in RegularContainerAllocator.preCheckForNewContainer()

2015-11-09 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996185#comment-14996185
 ] 

Sunil G commented on YARN-4338:
---

Recently, with YARN-4250, there is a chance that 
{{anyRequest.getNodeLabelExpression()}} becomes null because 
ApplicationMasterService may not always normalize the expression.

> NPE in RegularContainerAllocator.preCheckForNewContainer()
> --
>
> Key: YARN-4338
> URL: https://issues.apache.org/jira/browse/YARN-4338
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xinwei Qin 
>Priority: Minor
> Attachments: YARN-4338.001.patch
>
>
> The codes in RegularContainerAllocator.preCheckForNewContainer():
> {code}
> if (anyRequest.getNodeLabelExpression()
> .equals(RMNodeLabelsManager.NO_LABEL)) {
>   missedNonPartitionedRequestSchedulingOpportunity =
>   application
>   .addMissedNonPartitionedRequestSchedulingOpportunity(priority);
> }
> {code}
> {code}anyRequest.getNodeLabelExpression(){code}may return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4338) NPE in RegularContainerAllocator.preCheckForNewContainer()

2015-11-09 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996191#comment-14996191
 ] 

Naganarasimha G R commented on YARN-4338:
-

Hi [~xinwei],
   What was the scenario in which you got this NPE? 

> NPE in RegularContainerAllocator.preCheckForNewContainer()
> --
>
> Key: YARN-4338
> URL: https://issues.apache.org/jira/browse/YARN-4338
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xinwei Qin 
>Priority: Minor
> Attachments: YARN-4338.001.patch
>
>
> The codes in RegularContainerAllocator.preCheckForNewContainer():
> {code}
> if (anyRequest.getNodeLabelExpression()
> .equals(RMNodeLabelsManager.NO_LABEL)) {
>   missedNonPartitionedRequestSchedulingOpportunity =
>   application
>   .addMissedNonPartitionedRequestSchedulingOpportunity(priority);
> }
> {code}
> {code}anyRequest.getNodeLabelExpression(){code}may return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4338) NPE in RegularContainerAllocator.preCheckForNewContainer()

2015-11-09 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996241#comment-14996241
 ] 

Naganarasimha G R commented on YARN-4338:
-

Hi [~xinwei],
In that case it's expected to come through ApplicationMasterService, so maybe 
it's sufficient to rectify the test case with the default label "".
[~sunilg] & [~wangda],
but as we are coming across this more frequently, how about correcting it by 
setting the default label when using the other overloaded methods, or even 
checking for null in the main overloaded method and setting it to the default, 
i.e. ""? (See the sketch below.)
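A minimal sketch of that normalization idea, assuming it would live wherever the ResourceRequest is finalized; this is only an illustration of the suggestion, not a patch:
{code}
// Default a null node-label expression to NO_LABEL ("") so downstream
// scheduler code such as RegularContainerAllocator never dereferences null.
if (request.getNodeLabelExpression() == null) {
  request.setNodeLabelExpression(RMNodeLabelsManager.NO_LABEL);
}
{code}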


> NPE in RegularContainerAllocator.preCheckForNewContainer()
> --
>
> Key: YARN-4338
> URL: https://issues.apache.org/jira/browse/YARN-4338
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xinwei Qin 
>Priority: Minor
> Attachments: YARN-4338.001.patch
>
>
> The codes in RegularContainerAllocator.preCheckForNewContainer():
> {code}
> if (anyRequest.getNodeLabelExpression()
> .equals(RMNodeLabelsManager.NO_LABEL)) {
>   missedNonPartitionedRequestSchedulingOpportunity =
>   application
>   .addMissedNonPartitionedRequestSchedulingOpportunity(priority);
> }
> {code}
> {code}anyRequest.getNodeLabelExpression(){code}may return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-11-09 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996336#comment-14996336
 ] 

Varun Saxena commented on YARN-2934:


Thanks [~Naganarasimha] for uploading the patch. Sorry I could not do a thorough 
review earlier.
I had a cursory glance at the latest patch. A few quick comments.

* In the code below, *instead of compiling the pattern again and again*, we can 
compile it once and store it in a static variable (because it is taken from config 
and hence won't change). Pattern#compile incurs a performance overhead if called 
repeatedly. See the sketch after this list.
{code}
String errorFileNameRegexPattern =
   conf.get(YarnConfiguration.NM_CONTAINER_ERROR_FILE_NAME_PATTERN,
 YarnConfiguration.DEFAULT_NM_CONTAINER_ERROR_FILE_NAME_PATTERN);
Pattern pattern = null;
try {
  pattern =
  Pattern.compile(errorFileNameRegexPattern, Pattern.CASE_INSENSITIVE);
} catch (PatternSyntaxException e) {
  pattern = Pattern.compile(
  YarnConfiguration.DEFAULT_NM_CONTAINER_ERROR_FILE_NAME_PATTERN,
  Pattern.CASE_INSENSITIVE);
   }
{code}

* Also IMO, at least a warning log should be printed if the configured pattern 
cannot compile. This can alert the user to the wrong configuration.
Should we consider not starting up the NM in this case (if the config is wrong)? 
Maybe it's not important enough a config to refuse to start the NM; an alert 
message should be enough.
* Moreover, you can also consider using Configuration#getPattern, but take care 
to use it only once.
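A minimal sketch of the compile-once idea from the first bullet, assuming a class-level LOG and the configuration constants introduced by the patch; an illustration only, not the actual change:
{code}
private static volatile Pattern errorFileNamePattern;

private static Pattern getErrorFileNamePattern(Configuration conf) {
  if (errorFileNamePattern == null) {
    String regex = conf.get(
        YarnConfiguration.NM_CONTAINER_ERROR_FILE_NAME_PATTERN,
        YarnConfiguration.DEFAULT_NM_CONTAINER_ERROR_FILE_NAME_PATTERN);
    try {
      errorFileNamePattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    } catch (PatternSyntaxException e) {
      // Warn so the user knows the configured pattern was ignored.
      LOG.warn("Invalid value for "
          + YarnConfiguration.NM_CONTAINER_ERROR_FILE_NAME_PATTERN
          + ", falling back to the default pattern", e);
      errorFileNamePattern = Pattern.compile(
          YarnConfiguration.DEFAULT_NM_CONTAINER_ERROR_FILE_NAME_PATTERN,
          Pattern.CASE_INSENSITIVE);
    }
  }
  return errorFileNamePattern;
}
{code}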

> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-11-09 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996335#comment-14996335
 ] 

Naganarasimha G R commented on YARN-2934:
-

The findbugs warning is not related to this jira, and the checkstyle & whitespace 
issues can be corrected as part of the next patch. Waiting for review comments! cc 
[~jira.shegalov] & [~bikassaha].

> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4050) NM event dispatcher may blocked by LogAggregationService if NameNode is slow

2015-11-09 Thread sandflee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandflee updated YARN-4050:
---
Assignee: (was: sandflee)

> NM event dispatcher may blocked by LogAggregationService if NameNode is slow
> 
>
> Key: YARN-4050
> URL: https://issues.apache.org/jira/browse/YARN-4050
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: sandflee
>
> env:  nm restart and log aggregation is enabled. 
> NN is almost dead, when we restart NM, NM event dispatcher is blocked until 
> NN returns to normal.It seems. NM recovered app  and send APPLICATION_START 
> event to log aggregation service, it will check log dir permission in 
> HDFS(BLOCKED) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4338) NPE in RegularContainerAllocator.preCheckForNewContainer()

2015-11-09 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996338#comment-14996338
 ] 

Naganarasimha G R commented on YARN-4338:
-

I meant => ??how about correcting it in *ResourceRequest* by setting the default 
label when using the other overloaded methods, or checking for null in the main 
overloaded method and setting it to the default label, i.e. ""??

> NPE in RegularContainerAllocator.preCheckForNewContainer()
> --
>
> Key: YARN-4338
> URL: https://issues.apache.org/jira/browse/YARN-4338
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xinwei Qin 
>Priority: Minor
> Attachments: YARN-4338.001.patch
>
>
> The codes in RegularContainerAllocator.preCheckForNewContainer():
> {code}
> if (anyRequest.getNodeLabelExpression()
> .equals(RMNodeLabelsManager.NO_LABEL)) {
>   missedNonPartitionedRequestSchedulingOpportunity =
>   application
>   .addMissedNonPartitionedRequestSchedulingOpportunity(priority);
> }
> {code}
> {code}anyRequest.getNodeLabelExpression(){code}may return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4051) ContainerKillEvent is lost when container is In New State and is recovering

2015-11-09 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996345#comment-14996345
 ] 

sandflee commented on YARN-4051:


Is it possible for the finish application or complete container requests to 
arrive at this point?   
yes, we see this in YARN-4050.  If we register to RM after complete container 
recover, we must face the risk that the container running on this node will be 
killed if container recovery takes much more time(in YARN-4050), for 
long-runing-services, maybe not so perfect.

> ContainerKillEvent is lost when container is  In New State and is recovering
> 
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: sandflee
>Assignee: sandflee
>Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New 
> state, when we finish application, the container still alive even after NM 
> event dispatcher is unblocked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4338) NPE in RegularContainerAllocator.preCheckForNewContainer()

2015-11-09 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997663#comment-14997663
 ] 

Wangda Tan commented on YARN-4338:
--

I don't know what the impact of it would be, since we're leveraging 
nodeLabelExpression == null to represent "unset" in ResourceRequest. I think 
some code paths will fail if ResourceRequest.getNodeLabelExpression() returns "" 
when it is null.

> NPE in RegularContainerAllocator.preCheckForNewContainer()
> --
>
> Key: YARN-4338
> URL: https://issues.apache.org/jira/browse/YARN-4338
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xinwei Qin 
>Priority: Minor
> Attachments: YARN-4338.001.patch
>
>
> The code in RegularContainerAllocator.preCheckForNewContainer():
> {code}
> if (anyRequest.getNodeLabelExpression()
> .equals(RMNodeLabelsManager.NO_LABEL)) {
>   missedNonPartitionedRequestSchedulingOpportunity =
>   application
>   .addMissedNonPartitionedRequestSchedulingOpportunity(priority);
> }
> {code}
> {code}anyRequest.getNodeLabelExpression(){code} may return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4287) Capacity Scheduler: Rack Locality improvement

2015-11-09 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997573#comment-14997573
 ] 

Wangda Tan commented on YARN-4287:
--

Thanks for the update, [~nroberts].

The patch generally looks good; a few comments:
- Could you add a comment at
{code}
  return (Math.min(rmContext.getScheduler().getNumClusterNodes(), 
  (requiredContainers * localityWaitFactor)) < missedOpportunities);
{code}
so that people reading the code understand why missedOpportunity needs to be 
capped by numClusterNodes? (See the sketch after this list.)
- I would suggest adding tests for missedOpportunity being capped by 
numClusterNodes and for resetSchedulingOpportunity on rack requests.
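
As an illustration of what such a comment might say ({{canRelaxLocality}} is a 
hypothetical method name, and the rationale is inferred from the snippet rather 
than taken from the patch):
{code}
// Hypothetical sketch: once an application has missed as many scheduling
// opportunities as there are nodes, it has effectively been offered (and skipped)
// the whole cluster, so waiting longer buys no extra locality. Capping
// requiredContainers * localityWaitFactor at numClusterNodes therefore bounds
// the delay before locality is relaxed.
boolean canRelaxLocality(int numClusterNodes, int requiredContainers,
    float localityWaitFactor, long missedOpportunities) {
  return Math.min(numClusterNodes,
      requiredContainers * localityWaitFactor) < missedOpportunities;
}
{code}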

> Capacity Scheduler: Rack Locality improvement
> -
>
> Key: YARN-4287
> URL: https://issues.apache.org/jira/browse/YARN-4287
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Nathan Roberts
>Assignee: Nathan Roberts
> Attachments: YARN-4287-minimal-v2.patch, YARN-4287-minimal-v3.patch, 
> YARN-4287-minimal.patch, YARN-4287-v2.patch, YARN-4287-v3.patch, 
> YARN-4287-v4.patch, YARN-4287.patch
>
>
> YARN-4189 does an excellent job describing the issues with the current delay 
> scheduling algorithms within the capacity scheduler. The design proposal also 
> seems like a good direction.
> This jira proposes a simple interim solution to the key issue we've been 
> experiencing on a regular basis:
>  - rackLocal assignments trickle out due to nodeLocalityDelay. This can have 
> significant impact on things like CombineFileInputFormat which targets very 
> specific nodes in its split calculations.
> I'm not sure when YARN-4189 will become reality so I thought a simple interim 
> patch might make sense. The basic idea is simple: 
> 1) Separate delays for rackLocal, and OffSwitch (today there is only 1)
> 2) When we're getting rackLocal assignments, subsequent rackLocal assignments 
> should not be delayed
> Patch will be uploaded shortly. No big deal if the consensus is to go 
> straight to YARN-4189. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)

2015-11-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997652#comment-14997652
 ] 

Hudson commented on YARN-3840:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2588 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2588/])
YARN-3840. Resource Manager web ui issue when sorting application by id 
(jianhe: rev 8fbea531d7f7b665f6f55af54c8ebf330118ff37)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AppPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/AllContainersPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/dt-plugin-1.10.7/sorting/natural.js.gz
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AppAttemptPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebApp.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/view/JQueryUI.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/TaskPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebPageUtils.java
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/TasksPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/AllApplicationsPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebApp.java
* hadoop-yarn-project/CHANGES.txt


> Resource Manager web ui issue when sorting application by id (with 
> application having id > 9999)
> 
>
> Key: YARN-3840
> URL: https://issues.apache.org/jira/browse/YARN-3840
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: LINTE
>Assignee: Mohammad Shahid Khan
> Fix For: 2.8.0, 2.7.3
>
> Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, 
> YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch, YARN-3840-6.patch, 
> yarn-3840-7.patch
>
>
> On the WEBUI, the global main view page: 
> http://resourcemanager:8088/cluster/apps doesn't display applications over 
> 9999.
> With command line it works (# yarn application -list).
> Regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4338) NPE in RegularContainerAllocator.preCheckForNewContainer()

2015-11-09 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997659#comment-14997659
 ] 

Naganarasimha G R commented on YARN-4338:
-

Hi [~wangda],
How about setting the default label in ResourceRequest when it is not set?


> NPE in RegularContainerAllocator.preCheckForNewContainer()
> --
>
> Key: YARN-4338
> URL: https://issues.apache.org/jira/browse/YARN-4338
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xinwei Qin 
>Priority: Minor
> Attachments: YARN-4338.001.patch
>
>
> The code in RegularContainerAllocator.preCheckForNewContainer():
> {code}
> if (anyRequest.getNodeLabelExpression()
> .equals(RMNodeLabelsManager.NO_LABEL)) {
>   missedNonPartitionedRequestSchedulingOpportunity =
>   application
>   .addMissedNonPartitionedRequestSchedulingOpportunity(priority);
> }
> {code}
> {code}anyRequest.getNodeLabelExpression(){code} may return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state in CS

2015-11-09 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997734#comment-14997734
 ] 

Wangda Tan commented on YARN-3946:
--

1) Is it possible to merge amLaunchDiagnostics with the other diagnostics? That 
could simplify the RMAppAttemptImpl implementation.
2) Could you take a look at my previous comment?
bq. Since RMAppAttempt and SchedulerApplicationAttempt have a 1-to-1 
relationship, we can save a reference to RMAppAttempt in 
SchedulerApplicationAttempt, which avoids getting it from 
RMContext.getRMApps()...

3) I feel this may not be needed (no code change needed for your latest patch):
bq. Since String is immutable, amLaunchDiagnostics could be volatile so we 
don't need to acquire locks.
Since createApplicationAttemptReport currently holds a big readLock, we don't 
need to spend extra effort on the volatile.

4) Suggestions about the diagnostic message:
- Have an internal field that records when the latest update for the app 
happened. We can print it with the diagnostic message, e.g. {{\[23 sec ago\] }}.
- We can also use that field to prevent excessive updating of the diagnostic 
message; currently it is updated on every heartbeat for every accessed 
application. I think we should limit the update frequency to avoid overhead; 
hardcoding it to 1 sec seems fine to me for now, and we can make it 
configurable if people start complaining :) (see the sketch after this list)
- Generally, I think the message format could be:
{{Last update from scheduler: <time> (such as 23 sec ago); <state> (such as 
"Application is activated, waiting for allocating AM container"); Details: 
(instead of GenericInfo) Partition=x, queue's absolute capacity ... (and other 
fields in your patch)}}
- After the AM container is allocated and running, the above message is still 
useful because people can tell whether the application is actively allocating 
resources or is waiting in the queue.
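
For the throttling idea above, a minimal sketch (class, field, and method names 
are hypothetical, not from the patch) of updating the message at most once per 
second and computing the "N sec ago" prefix at read time:
{code}
// Hypothetical sketch: throttle diagnostic updates to at most one per second and
// render the elapsed-time prefix only when the report is built.
public class ThrottledDiagnostics {
  private static final long MIN_UPDATE_INTERVAL_MS = 1000L;

  private volatile long lastUpdateTime = 0L;
  private volatile String message = "";

  /** Record a new message unless the last update was less than a second ago. */
  public void maybeUpdate(String newMessage) {
    long now = System.currentTimeMillis();
    if (now - lastUpdateTime < MIN_UPDATE_INTERVAL_MS) {
      return;  // skip: updated too recently
    }
    lastUpdateTime = now;
    message = newMessage;
  }

  /** Build the user-visible text, e.g. "[23 sec ago] Application is activated ...". */
  public String getDiagnostics() {
    long ageSec = (System.currentTimeMillis() - lastUpdateTime) / 1000;
    return "[" + ageSec + " sec ago] " + message;
  }
}
{code}
Computing the age at read time keeps the stored string unchanged between 
updates, which also fits the volatile discussion in point 3.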

> Allow fetching exact reason as to why a submitted app is in ACCEPTED state in 
> CS
> 
>
> Key: YARN-3946
> URL: https://issues.apache.org/jira/browse/YARN-3946
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Sumit Nigam
>Assignee: Naganarasimha G R
> Attachments: 3946WebImages.zip, YARN-3946.v1.001.patch, 
> YARN-3946.v1.002.patch
>
>
> Currently there is no direct way to get the exact reason as to why a 
> submitted app is still in ACCEPTED state. It should be possible to know 
> through RM REST API as to what aspect is not being met - say, queue limits 
> being reached, or core/ memory requirement not being met, or AM limit being 
> reached, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-11-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996236#comment-14996236
 ] 

Hadoop QA commented on YARN-2934:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 6s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
2s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 50s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
27s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
38s {color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 16s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in 
trunk has 3 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 23s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 50s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 47s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 47s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 26s 
{color} | {color:red} Patch generated 3 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 277, now 278). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
37s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 9 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
55s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 25s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 51s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 20s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_60. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 52s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 30s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_79. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 3s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 5s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
22s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} 

[jira] [Updated] (YARN-4338) NPE in RegularContainerAllocator.preCheckForNewContainer()

2015-11-09 Thread Xinwei Qin (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinwei Qin  updated YARN-4338:
--
Attachment: YARN-4338.001.patch

A simple fix is to check whether the value is null.
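
For illustration, one null-safe form of the quoted check simply puts the 
constant on the left (a sketch only, not necessarily the attached patch; 
whether a null label should instead be treated as the default label is the 
open question in this discussion):
{code}
// Sketch: comparing against the constant first avoids the NPE when
// getNodeLabelExpression() returns null; a null label is then simply treated as
// "not the default partition".
if (RMNodeLabelsManager.NO_LABEL.equals(anyRequest.getNodeLabelExpression())) {
  missedNonPartitionedRequestSchedulingOpportunity =
      application.addMissedNonPartitionedRequestSchedulingOpportunity(priority);
}
{code}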

> NPE in RegularContainerAllocator.preCheckForNewContainer()
> --
>
> Key: YARN-4338
> URL: https://issues.apache.org/jira/browse/YARN-4338
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xinwei Qin 
>Priority: Minor
> Attachments: YARN-4338.001.patch
>
>
> The code in RegularContainerAllocator.preCheckForNewContainer():
> {code}
> if (anyRequest.getNodeLabelExpression()
> .equals(RMNodeLabelsManager.NO_LABEL)) {
>   missedNonPartitionedRequestSchedulingOpportunity =
>   application
>   .addMissedNonPartitionedRequestSchedulingOpportunity(priority);
> }
> {code}
> {code}anyRequest.getNodeLabelExpression(){code} may return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4338) NPE in RegularContainerAllocator.preCheckForNewContainer()

2015-11-09 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996194#comment-14996194
 ] 

Sunil G commented on YARN-4338:
---

I missed adding a point earlier: are you using a custom scheduler here?

> NPE in RegularContainerAllocator.preCheckForNewContainer()
> --
>
> Key: YARN-4338
> URL: https://issues.apache.org/jira/browse/YARN-4338
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xinwei Qin 
>Priority: Minor
> Attachments: YARN-4338.001.patch
>
>
> The code in RegularContainerAllocator.preCheckForNewContainer():
> {code}
> if (anyRequest.getNodeLabelExpression()
> .equals(RMNodeLabelsManager.NO_LABEL)) {
>   missedNonPartitionedRequestSchedulingOpportunity =
>   application
>   .addMissedNonPartitionedRequestSchedulingOpportunity(priority);
> }
> {code}
> {code}anyRequest.getNodeLabelExpression(){code} may return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4338) NPE in RegularContainerAllocator.preCheckForNewContainer()

2015-11-09 Thread Xinwei Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996246#comment-14996246
 ] 

Xinwei Qin  commented on YARN-4338:
---

Thanks [~Naganarasimha] for your suggestion; the test case passed with this 
modification.

> NPE in RegularContainerAllocator.preCheckForNewContainer()
> --
>
> Key: YARN-4338
> URL: https://issues.apache.org/jira/browse/YARN-4338
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xinwei Qin 
>Priority: Minor
> Attachments: YARN-4338.001.patch
>
>
> The code in RegularContainerAllocator.preCheckForNewContainer():
> {code}
> if (anyRequest.getNodeLabelExpression()
> .equals(RMNodeLabelsManager.NO_LABEL)) {
>   missedNonPartitionedRequestSchedulingOpportunity =
>   application
>   .addMissedNonPartitionedRequestSchedulingOpportunity(priority);
> }
> {code}
> {code}anyRequest.getNodeLabelExpression(){code} may return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2934) Improve handling of container's stderr

2015-11-09 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-2934:

Attachment: YARN-2934.v1.005.patch

Thanks for the comments, [~varun_saxena].
bq. Pattern#compile incurs a performance overhead if called again and again.
I had planned to handle this after further comments, and it is not in a 
critical/repetitive code path, but it is worth optimizing anyway, so I have 
done it in this patch.

bq. Should we consider not starting up NM in this case (if config is wrong)? 
Maybe its not that important a config to not start NM. An alert message should 
be enough.
As discussed, an alert is enough since it is not critical.

bq. Moreover, you can also consider using Configuration#getPattern, but take 
care of using it only once.
Yes, this would be useful and it also takes care of your 2nd comment, so I am 
using it, but I am adding one more method there to ignore the case.
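
A minimal sketch of the "compile once, ignore case" idea (the class name, 
config key, and default value below are hypothetical, not taken from the patch):
{code}
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;

// Hypothetical helper: read the error-pattern config once and compile it with
// CASE_INSENSITIVE, so Pattern.compile is not invoked repeatedly at runtime.
public class ErrorFilePattern {
  // Key and default are illustrative only.
  private static final String PATTERN_KEY =
      "yarn.nodemanager.container.stderr.pattern";
  private static final String PATTERN_DEFAULT = ".*stderr.*";

  private final Pattern pattern;

  public ErrorFilePattern(Configuration conf) {
    // Compiled exactly once, ignoring case.
    pattern = Pattern.compile(
        conf.get(PATTERN_KEY, PATTERN_DEFAULT), Pattern.CASE_INSENSITIVE);
  }

  public boolean matches(String fileName) {
    return pattern.matcher(fileName).matches();
  }
}
{code}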

> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch, YARN-2934.v1.004.patch, YARN-2934.v1.005.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)