[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometimes fails in trunk

2015-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600817#comment-14600817
 ] 

Hadoop QA commented on YARN-2871:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 51s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 31s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 34s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 46s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 25s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  50m 49s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  88m 30s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741794/YARN-2871.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a815cc1 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8341/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8341/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8341/console |


This message was automatically generated.

 TestRMRestart#testRMRestartGetApplicationList sometimes fails in trunk
 -

 Key: YARN-2871
 URL: https://issues.apache.org/jira/browse/YARN-2871
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: zhihai xu
Priority: Minor
 Attachments: YARN-2871.000.patch, YARN-2871.001.patch


 From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
 {code}
 Failed tests:
   TestRMRestart.testRMRestartGetApplicationList:957
 rMAppManager.logApplicationSummary(
 isA(org.apache.hadoop.yarn.api.records.ApplicationId)
 );
 Wanted 3 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
 But was 2 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
 {code}
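 For illustration, a minimal self-contained sketch (not necessarily the approach 
 taken in the attached patches) of making such a count check tolerant of 
 asynchronous event handling with Mockito's timeout() verification mode; the 
 SummaryLogger interface below is a hypothetical stand-in for 
 RMAppManager#logApplicationSummary:
 {code}
import static org.mockito.Mockito.anyString;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.timeout;
import static org.mockito.Mockito.verify;

public class AsyncVerifySketch {
  // Hypothetical stand-in for RMAppManager#logApplicationSummary.
  interface SummaryLogger {
    void logApplicationSummary(String appId);
  }

  public static void main(String[] args) {
    final SummaryLogger logger = mock(SummaryLogger.class);

    // Simulate three asynchronous "application completed" events.
    for (int i = 1; i <= 3; i++) {
      final String appId = "application_000" + i;
      new Thread(new Runnable() {
        @Override
        public void run() {
          try { Thread.sleep(200); } catch (InterruptedException ignored) { }
          logger.logApplicationSummary(appId);
        }
      }).start();
    }

    // Wait up to 5 seconds for all three invocations instead of failing when
    // only two have been handled so far, which is the flakiness shown above.
    verify(logger, timeout(5000).times(3)).logApplicationSummary(anyString());
  }
}
 {code}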



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuous killing of containers across queues

2015-06-25 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601848#comment-14601848
 ] 

Wangda Tan commented on YARN-3849:
--

[~sunilg],
Trying to understand this issue: when toObtainResource becomes 10,0, and 
assuming the container sizes are c1=2,1, c2=5,3, c3=4,2, c4=2,1, the preemption 
policy will kill c1..c3. My understanding of this problem is that the preemption 
policy can preempt one of the resource types (CPU/memory) more than needed, but 
I'm not sure why it preempts all containers except the AM.
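
For illustration, a minimal sketch (not the actual preemption policy code) of 
how a per-container loop driven by a single "is there still anything to obtain" 
check can over-preempt one resource type, using the numbers above:
{code}
import java.util.Arrays;
import java.util.List;

public class PreemptionSketch {
  // Tiny two-dimensional resource: memory, CPU.
  static class Res {
    final int mem, cpu;
    Res(int mem, int cpu) { this.mem = mem; this.cpu = cpu; }
    @Override public String toString() { return mem + "," + cpu; }
  }

  public static void main(String[] args) {
    // toObtainResource = 10,0 and containers c1..c4 from the comment above.
    int toObtainMem = 10, toObtainCpu = 0;
    List<Res> containers = Arrays.asList(
        new Res(2, 1), new Res(5, 3), new Res(4, 2), new Res(2, 1));

    for (Res c : containers) {
      if (toObtainMem <= 0 && toObtainCpu <= 0) {
        break; // nothing left to obtain in either dimension
      }
      toObtainMem -= c.mem;
      toObtainCpu -= c.cpu;
      System.out.println("preempt " + c + " -> remaining "
          + toObtainMem + "," + toObtainCpu);
    }
    // Ends with remaining -1,-6: c1..c3 are killed, memory is satisfied, but CPU
    // (which was never needed) has been preempted well beyond zero.
  }
}
{code}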

 Too much of preemption activity causing continuous killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical

 Two queues are used. Each queue is given a capacity of 0.5. The Dominant 
 Resource policy is used.
 1. An app is submitted in QueueA which consumes the full cluster capacity.
 2. After an app is submitted in QueueB, there is some demand and preemption is 
 invoked in QueueA.
 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we 
 observed that all containers other than the AM are getting killed in QueueA.
 4. Now the app in QueueB tries to take over the cluster with the current free 
 space. But there is some updated demand from the app in QueueA, which lost 
 its containers earlier, and preemption is now kicked in on QueueB.
 The scenario in steps 3 and 4 keeps happening in a loop, so neither of the 
 apps completes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-25 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3793:
---
Attachment: YARN-3793.01.patch

 Several NPEs when deleting local files on NM recovery
 -

 Key: YARN-3793
 URL: https://issues.apache.org/jira/browse/YARN-3793
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3793.01.patch


 When NM work-preserving restart is enabled, we see several NPEs on recovery. 
 These seem to correspond to sub-directories that need to be deleted. I wonder 
 if null pointers here mean incorrect tracking of these resources and a 
 potential leak. This JIRA is to investigate and fix anything required.
 Logs show:
 {noformat}
 2015-05-18 07:06:10,225 INFO 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
 absolute path : null
 2015-05-18 07:06:10,224 ERROR 
 org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
 execution of task in DeletionService
 java.lang.NullPointerException
 at 
 org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
 at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
 {noformat}
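 For illustration, a minimal sketch (a hypothetical helper, not the code in the 
 attached patch) of guarding the deletion path against null entries so a 
 recovered task with missing sub-directory state does not reach 
 FileContext#delete with a null path:
 {code}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class SafeDeleteSketch {
  // Skip null paths instead of handing them to FileContext#delete.
  static void deleteAsUser(FileContext lfs, Path baseDir, Path... subDirs) {
    List<Path> targets = new ArrayList<Path>();
    if (subDirs == null || subDirs.length == 0) {
      if (baseDir != null) {
        targets.add(baseDir);
      }
    } else {
      for (Path sub : subDirs) {
        if (sub == null) {
          continue; // recovered task lost this entry; nothing to delete
        }
        targets.add(baseDir == null ? sub : new Path(baseDir, sub));
      }
    }
    for (Path p : targets) {
      try {
        lfs.delete(p, true); // recursive delete
      } catch (Exception e) {
        // log and keep going with the remaining paths
        System.err.println("Failed to delete " + p + ": " + e);
      }
    }
  }
}
 {code}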



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-25 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3793:
---
Priority: Critical  (was: Major)

 Several NPEs when deleting local files on NM recovery
 -

 Key: YARN-3793
 URL: https://issues.apache.org/jira/browse/YARN-3793
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3793.01.patch


 When NM work-preserving restart is enabled, we see several NPEs on recovery. 
 These seem to correspond to sub-directories that need to be deleted. I wonder 
 if null pointers here mean incorrect tracking of these resources and a 
 potential leak. This JIRA is to investigate and fix anything required.
 Logs show:
 {noformat}
 2015-05-18 07:06:10,225 INFO 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
 absolute path : null
 2015-05-18 07:06:10,224 ERROR 
 org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
 execution of task in DeletionService
 java.lang.NullPointerException
 at 
 org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
 at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (YARN-3855) If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup

2015-06-25 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602341#comment-14602341
 ] 

Jian He edited comment on YARN-3855 at 6/26/15 4:35 AM:


I believe what you suggested is generally good practice for setting up a secure 
cluster. Btw, the patch did not enable/enforce any of this. People can configure 
whatever they want for the HTTP authentication regardless of how the rest of the 
components are set up, just as before this jira. The point of this jira is to 
prevent the scenario where a user cannot view any application (even their own 
application) in any way unless the daemon is restarted.


was (Author: jianhe):
I believe what you suggested is generally good practice for setting up a secure 
cluster. Btw, the patch did not enable/enforce any of this. People can configure 
whatever they want for the HTTP authentication regardless of how the rest of the 
components are set up, just as before this jira. The point of this jira is to 
prevent the scenario where a user cannot view the applications in any way unless 
the daemon is restarted.

 If acl is enabled and http.authentication.type is simple, user cannot view 
 the app page in default setup
 

 Key: YARN-3855
 URL: https://issues.apache.org/jira/browse/YARN-3855
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-3855.1.patch, YARN-3855.2.patch


 If all ACLs (admin acl, queue-admin-acls etc.) are set up properly and 
 http.authentication.type is 'simple' in secure mode, a user cannot view the 
 application web page in the default setup because the incoming user is always 
 considered to be dr.who. The user also cannot pass user.name to indicate the 
 incoming user name, because AuthenticationFilterInitializer is not enabled by 
 default. This is inconvenient from the user's perspective. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages

2015-06-25 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated YARN-3826:


The timed out test is not related to the patch.

+1, will commit it shortly.

 Race condition in ResourceTrackerService leads to wrong diagnostics messages
 

 Key: YARN-3826
 URL: https://issues.apache.org/jira/browse/YARN-3826
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, 
 YARN-3826.03.patch


 Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which 
 can be called concurrently, the static {{resync}} and {{shutdown}} responses 
 may have wrong diagnostics messages in some cases.
 On the other hand, these static members can hardly save any memory, since the 
 normal heartbeat responses are created for each heartbeat.
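 For illustration, a sketch of the build-a-fresh-response-per-heartbeat idea 
 (not necessarily identical to what the attached patches do), so each concurrent 
 nodeHeartbeat() call carries its own diagnostics instead of mutating a shared 
 static response:
 {code}
import org.apache.hadoop.yarn.server.api.protocolrecords.NodeHeartbeatResponse;
import org.apache.hadoop.yarn.server.api.records.NodeAction;
import org.apache.hadoop.yarn.util.Records;

public class HeartbeatResponses {
  // A new response object per call: no shared state to race on.
  static NodeHeartbeatResponse newResyncResponse(String diagnostics) {
    NodeHeartbeatResponse response = Records.newRecord(NodeHeartbeatResponse.class);
    response.setNodeAction(NodeAction.RESYNC);
    response.setDiagnosticsMessage(diagnostics);
    return response;
  }

  static NodeHeartbeatResponse newShutdownResponse(String diagnostics) {
    NodeHeartbeatResponse response = Records.newRecord(NodeHeartbeatResponse.class);
    response.setNodeAction(NodeAction.SHUTDOWN);
    response.setDiagnosticsMessage(diagnostics);
    return response;
  }
}
 {code}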



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor

2015-06-25 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated YARN-3745:

Hadoop Flags: Reviewed

+1, latest patch looks good to me, will commit it shortly.

 SerializedException should also try to instantiate internal exception with 
 the default constructor
 --

 Key: YARN-3745
 URL: https://issues.apache.org/jira/browse/YARN-3745
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lavkesh Lahngir
Assignee: Lavkesh Lahngir
 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, 
 YARN-3745.patch


 While deserialising a SerializedException, it tries to create the internal 
 exception in instantiateException() with cn = 
 cls.getConstructor(String.class).
 If cls does not have a constructor with a String parameter, it throws 
 NoSuchMethodException, 
 for example for the ClosedChannelException class.  
 We should also try to instantiate the exception with the default constructor so 
 that the inner exception can be propagated.
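 For illustration, a standalone sketch of that fallback (not the actual 
 SerializedExceptionPBImpl code): try the (String) constructor first and, if the 
 class only has a no-arg constructor (as ClosedChannelException does), fall back 
 to it instead of failing with NoSuchMethodException:
 {code}
import java.lang.reflect.Constructor;
import java.nio.channels.ClosedChannelException;

public class InstantiateSketch {
  static Throwable instantiate(Class<? extends Throwable> cls, String message)
      throws Exception {
    try {
      // Preferred: constructor taking the serialized message.
      Constructor<? extends Throwable> cn = cls.getConstructor(String.class);
      return cn.newInstance(message);
    } catch (NoSuchMethodException e) {
      // Fallback: default constructor (the message is dropped, but the inner
      // exception type is still propagated).
      Constructor<? extends Throwable> cn = cls.getConstructor();
      return cn.newInstance();
    }
  }

  public static void main(String[] args) throws Exception {
    // Uses the (String) constructor.
    System.out.println(instantiate(IllegalStateException.class, "boom"));
    // Falls back to the default constructor.
    System.out.println(instantiate(ClosedChannelException.class, "boom"));
  }
}
 {code}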



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService: potential wrong diagnostics messages

2015-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600981#comment-14600981
 ] 

Hadoop QA commented on YARN-3826:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  16m 46s | Pre-patch trunk has 3 extant 
Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 35s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 35s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 14s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 26s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 24s | Tests passed in 
hadoop-yarn-server-common. |
| {color:red}-1{color} | yarn tests |  61m  0s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | | 101m 33s | |
\\
\\
|| Reason || Tests ||
| Timed out tests | 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741817/YARN-3826.03.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a815cc1 |
| Pre-patch Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8342/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html
 |
| hadoop-yarn-server-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8342/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8342/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8342/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8342/console |


This message was automatically generated.

 Race condition in ResourceTrackerService: potential wrong diagnostics messages
 --

 Key: YARN-3826
 URL: https://issues.apache.org/jira/browse/YARN-3826
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, 
 YARN-3826.03.patch


 Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which 
 can be called concurrently, the static {{resync}} and {{shutdown}} responses 
 may have wrong diagnostics messages in some cases.
 On the other hand, these static members can hardly save any memory, since the 
 normal heartbeat responses are created for each heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601145#comment-14601145
 ] 

Hudson commented on YARN-3745:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8066 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8066/])
YARN-3745. SerializedException should also try to instantiate internal 
(devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java
* hadoop-yarn-project/CHANGES.txt


 SerializedException should also try to instantiate internal exception with 
 the default constructor
 --

 Key: YARN-3745
 URL: https://issues.apache.org/jira/browse/YARN-3745
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lavkesh Lahngir
Assignee: Lavkesh Lahngir
 Fix For: 2.8.0

 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, 
 YARN-3745.patch


 While deserialising a SerializedException, it tries to create the internal 
 exception in instantiateException() with cn = 
 cls.getConstructor(String.class).
 If cls does not have a constructor with a String parameter, it throws 
 NoSuchMethodException, 
 for example for the ClosedChannelException class.  
 We should also try to instantiate the exception with the default constructor so 
 that the inner exception can be propagated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600995#comment-14600995
 ] 

Hudson commented on YARN-3826:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8065 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8065/])
YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: 
rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java


 Race condition in ResourceTrackerService leads to wrong diagnostics messages
 

 Key: YARN-3826
 URL: https://issues.apache.org/jira/browse/YARN-3826
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Fix For: 2.8.0

 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, 
 YARN-3826.03.patch


 Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which 
 can be called concurrently, the static {{resync}} and {{shutdown}} responses 
 may have wrong diagnostics messages in some cases.
 On the other hand, these static members can hardly save any memory, since the 
 normal heartbeat responses are created for each heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3830) AbstractYarnScheduler.createReleaseCache may try to clean a null attempt

2015-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601042#comment-14601042
 ] 

Hadoop QA commented on YARN-3830:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m  9s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 39s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 53s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 21s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 49s | The applied patch generated  1 
new checkstyle issues (total was 37, now 31). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 30s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 24s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |  60m 43s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  99m  4s | |
\\
\\
|| Reason || Tests ||
| Timed out tests | 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741828/YARN-3830_2.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a815cc1 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8344/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8344/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8344/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8344/console |


This message was automatically generated.

 AbstractYarnScheduler.createReleaseCache may try to clean a null attempt
 

 Key: YARN-3830
 URL: https://issues.apache.org/jira/browse/YARN-3830
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: nijel
Assignee: nijel
 Attachments: YARN-3830_1.patch, YARN-3830_2.patch


 org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.createReleaseCache()
 {code}
 protected void createReleaseCache() {
   // Cleanup the cache after nm expire interval.
   new Timer().schedule(new TimerTask() {
     @Override
     public void run() {
       for (SchedulerApplication<T> app : applications.values()) {
         T attempt = app.getCurrentAppAttempt();
         synchronized (attempt) {
           for (ContainerId containerId : attempt.getPendingRelease()) {
             RMAuditLogger.logFailure(
 {code}
 Here the attempt can be null since the attempt is created later, so a 
 NullPointerException will occur:
 {code}
 2015-06-19 09:29:16,195 | ERROR | Timer-3 | Thread Thread[Timer-3,5,main] 
 threw an Exception. | YarnUncaughtExceptionHandler.java:68
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler$1.run(AbstractYarnScheduler.java:457)
   at java.util.TimerThread.mainLoop(Timer.java:555)
   at java.util.TimerThread.run(Timer.java:505)
 {code}
 This will skip the other applications in this run.
 We can add a null check and continue with the other applications.
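 For illustration, a simplified self-contained sketch of that null check; the 
 App and Attempt classes below are stand-ins for SchedulerApplication and its 
 attempt type, not the real scheduler classes or the attached patch:
 {code}
import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentHashMap;

public class ReleaseCacheSketch {
  static class Attempt { /* pending releases would live here */ }

  static class App {
    volatile Attempt currentAttempt; // created later, may still be null
    Attempt getCurrentAppAttempt() { return currentAttempt; }
  }

  final Map<String, App> applications = new ConcurrentHashMap<String, App>();

  void createReleaseCache(long nmExpireIntervalMs) {
    new Timer("release-cache-cleanup", true).schedule(new TimerTask() {
      @Override
      public void run() {
        for (App app : applications.values()) {
          Attempt attempt = app.getCurrentAppAttempt();
          if (attempt == null) {
            // Attempt not created yet: nothing to clean for this app, and the
            // remaining applications are still processed.
            continue;
          }
          synchronized (attempt) {
            // ... clear the attempt's pending releases and audit-log, as in the
            // snippet quoted in the description ...
          }
        }
      }
    }, nmExpireIntervalMs);
  }
}
 {code}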



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3848) TestNodeLabelContainerAllocation is timing out

2015-06-25 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-3848:


 Summary: TestNodeLabelContainerAllocation is timing out
 Key: YARN-3848
 URL: https://issues.apache.org/jira/browse/YARN-3848
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Jason Lowe


A number of builds, pre-commit and otherwise, have been failing recently 
because TestNodeLabelContainerAllocation has timed out.  See 
https://builds.apache.org/job/Hadoop-Yarn-trunk/969/, YARN-3830, YARN-3802, or 
YARN-3826 for examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601070#comment-14601070
 ] 

Hudson commented on YARN-3809:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #969 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/969/])
YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's 
threads all hang. Contributed by Jun Gong (jlowe: rev 
2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java


 Failed to launch new attempts because ApplicationMasterLauncher's threads all 
 hang
 --

 Key: YARN-3809
 URL: https://issues.apache.org/jira/browse/YARN-3809
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.7.1

 Attachments: YARN-3809.01.patch, YARN-3809.02.patch, 
 YARN-3809.03.patch


 ApplicationMasterLauncher creates a thread pool of size 10 to deal with 
 AMLauncherEventType events (LAUNCH and CLEANUP).
 In our cluster, there were many NMs with 10+ AMs running on them, and one shut 
 down for some reason. After the RM found the NM LOST, it cleaned up the AMs 
 running on it. ApplicationMasterLauncher then needed to handle these 10+ 
 CLEANUP events. ApplicationMasterLauncher's thread pool would fill up, and the 
 threads all hang in the call containerMgrProxy.stopContainers(stopRequest) 
 because the NM was down and the default RPC timeout is 15 minutes. This means 
 that for 15 minutes ApplicationMasterLauncher could not handle new events such 
 as LAUNCH, so new attempts fail to launch because of the timeout.
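 Since the committed change (see the file list above) touches 
 YarnConfiguration.java and yarn-default.xml, the fix appears to make the 
 launcher's thread count configurable. For illustration, a sketch of such a 
 pool; the property name and default below are assumptions for the example, not 
 values confirmed from the patch:
 {code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;

public class LauncherPoolSketch {
  // Hypothetical key/default, only mirroring the style of other RM settings.
  static final String THREAD_COUNT_KEY =
      "yarn.resourcemanager.amlauncher.thread-count";
  static final int DEFAULT_THREAD_COUNT = 50;

  static ThreadPoolExecutor createLauncherPool(Configuration conf) {
    int threads = conf.getInt(THREAD_COUNT_KEY, DEFAULT_THREAD_COUNT);
    // Fixed-size pool: with a hard-coded 10, ten stopContainers() calls stuck on
    // a 15-minute RPC timeout starve new LAUNCH events, as described above.
    return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<Runnable>());
  }
}
 {code}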



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601072#comment-14601072
 ] 

Hudson commented on YARN-3790:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #969 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/969/])
YARN-3790. usedResource from rootQueue metrics may get stale data for FS 
scheduler after recovering the container (Zhihai Xu via rohithsharmaks) 
(rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 
2df00d53d13d16628b6bde5e05133d239f138f52)
* hadoop-yarn-project/CHANGES.txt


 usedResource from rootQueue metrics may get stale data for FS scheduler after 
 recovering the container
 --

 Key: YARN-3790
 URL: https://issues.apache.org/jira/browse/YARN-3790
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler, test
Reporter: Rohith Sharma K S
Assignee: zhihai xu
 Fix For: 2.8.0

 Attachments: YARN-3790.000.patch


 Failure trace is as follows
 {noformat}
 Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
 testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
   Time elapsed: 6.502 sec   FAILURE!
 java.lang.AssertionError: expected:6144 but was:8192
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at org.junit.Assert.assertEquals(Assert.java:555)
   at org.junit.Assert.assertEquals(Assert.java:542)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3360) Add JMX metrics to TimelineDataManager

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601074#comment-14601074
 ] 

Hudson commented on YARN-3360:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #969 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/969/])
YARN-3360. Add JMX metrics to TimelineDataManager (Jason Lowe via jeagles) 
(jeagles: rev 4c659ddbf7629aae92e66a5b54893e9c1c68dfb0)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManagerMetrics.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestTimelineDataManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java


 Add JMX metrics to TimelineDataManager
 --

 Key: YARN-3360
 URL: https://issues.apache.org/jira/browse/YARN-3360
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Jason Lowe
  Labels: BB2015-05-TBR
 Fix For: 3.0.0, 2.8.0

 Attachments: YARN-3360.001.patch, YARN-3360.002.patch, 
 YARN-3360.003.patch


 The TimelineDataManager currently has no metrics, outside of the standard JVM 
 metrics.  It would be very useful to at least log basic counts of method 
 calls, time spent in those calls, and number of entities/events involved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601073#comment-14601073
 ] 

Hudson commented on YARN-3832:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #969 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/969/])
YARN-3832. Resource Localization fails on a cluster due to existing cache 
directories. Contributed by Brahma Reddy Battula (jlowe: rev 
8d58512d6e6d9fe93784a9de2af0056bcc316d96)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java


 Resource Localization fails on a cluster due to existing cache directories
 --

 Key: YARN-3832
 URL: https://issues.apache.org/jira/browse/YARN-3832
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Ranga Swamy
Assignee: Brahma Reddy Battula
Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3832.patch


  *We have found that resource localization fails on a cluster with the 
 following error.* 
  
 We got this error in the hadoop-2.7.0 release, even though it was fixed in 
 2.6.0 (YARN-2624).
 {noformat}
 Application application_1434703279149_0057 failed 2 times due to AM Container 
 for appattempt_1434703279149_0057_02 exited with exitCode: -1000
 For more detailed output, check application tracking 
 page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then,
  click on links to logs of each attempt.
 Diagnostics: Rename cannot overwrite non empty destination directory 
 /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
 java.io.IOException: Rename cannot overwrite non empty destination directory 
 /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
 at 
 org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735)
 at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244)
 at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
 at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Failing this attempt. Failing the application.
 {noformat}
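 The failure above comes from FileContext#rename refusing to overwrite a 
 non-empty destination left over from a previous NM run. For illustration only, 
 a sketch of removing such a stale destination before the rename; the committed 
 fix (per the file list above) is in ResourceLocalizationService rather than in 
 the rename call itself:
 {code}
import java.io.IOException;

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Options;
import org.apache.hadoop.fs.Path;

public class LocalizeRenameSketch {
  static void moveIntoPlace(FileContext lfs, Path downloaded, Path destination)
      throws IOException {
    if (lfs.util().exists(destination)) {
      // Stale directory from an earlier run (e.g. filecache/39): remove it so
      // the rename below does not fail.
      lfs.delete(destination, true);
    }
    lfs.rename(downloaded, destination, Options.Rename.NONE);
  }
}
 {code}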



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages

2015-06-25 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated YARN-3826:

Hadoop Flags: Reviewed

 Race condition in ResourceTrackerService leads to wrong diagnostics messages
 

 Key: YARN-3826
 URL: https://issues.apache.org/jira/browse/YARN-3826
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Fix For: 2.8.0

 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, 
 YARN-3826.03.patch


 Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which 
 can be called concurrently, the static {{resync}} and {{shutdown}} responses 
 may have wrong diagnostics messages in some cases.
 On the other hand, these static members can hardly save any memory, since the 
 normal heartbeat responses are created for each heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601051#comment-14601051
 ] 

Hudson commented on YARN-3809:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #239 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/239/])
YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's 
threads all hang. Contributed by Jun Gong (jlowe: rev 
2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/CHANGES.txt


 Failed to launch new attempts because ApplicationMasterLauncher's threads all 
 hang
 --

 Key: YARN-3809
 URL: https://issues.apache.org/jira/browse/YARN-3809
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.7.1

 Attachments: YARN-3809.01.patch, YARN-3809.02.patch, 
 YARN-3809.03.patch


 ApplicationMasterLauncher creates a thread pool of size 10 to deal with 
 AMLauncherEventType events (LAUNCH and CLEANUP).
 In our cluster, there were many NMs with 10+ AMs running on them, and one shut 
 down for some reason. After the RM found the NM LOST, it cleaned up the AMs 
 running on it. ApplicationMasterLauncher then needed to handle these 10+ 
 CLEANUP events. ApplicationMasterLauncher's thread pool would fill up, and the 
 threads all hang in the call containerMgrProxy.stopContainers(stopRequest) 
 because the NM was down and the default RPC timeout is 15 minutes. This means 
 that for 15 minutes ApplicationMasterLauncher could not handle new events such 
 as LAUNCH, so new attempts fail to launch because of the timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601053#comment-14601053
 ] 

Hudson commented on YARN-3790:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #239 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/239/])
YARN-3790. usedResource from rootQueue metrics may get stale data for FS 
scheduler after recovering the container (Zhihai Xu via rohithsharmaks) 
(rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* hadoop-yarn-project/CHANGES.txt
Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 
2df00d53d13d16628b6bde5e05133d239f138f52)
* hadoop-yarn-project/CHANGES.txt


 usedResource from rootQueue metrics may get stale data for FS scheduler after 
 recovering the container
 --

 Key: YARN-3790
 URL: https://issues.apache.org/jira/browse/YARN-3790
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler, test
Reporter: Rohith Sharma K S
Assignee: zhihai xu
 Fix For: 2.8.0

 Attachments: YARN-3790.000.patch


 Failure trace is as follows
 {noformat}
 Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
 testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
   Time elapsed: 6.502 sec   FAILURE!
 java.lang.AssertionError: expected:6144 but was:8192
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at org.junit.Assert.assertEquals(Assert.java:555)
   at org.junit.Assert.assertEquals(Assert.java:542)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3360) Add JMX metrics to TimelineDataManager

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601055#comment-14601055
 ] 

Hudson commented on YARN-3360:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #239 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/239/])
YARN-3360. Add JMX metrics to TimelineDataManager (Jason Lowe via jeagles) 
(jeagles: rev 4c659ddbf7629aae92e66a5b54893e9c1c68dfb0)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManagerMetrics.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestTimelineDataManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java


 Add JMX metrics to TimelineDataManager
 --

 Key: YARN-3360
 URL: https://issues.apache.org/jira/browse/YARN-3360
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Jason Lowe
  Labels: BB2015-05-TBR
 Fix For: 3.0.0, 2.8.0

 Attachments: YARN-3360.001.patch, YARN-3360.002.patch, 
 YARN-3360.003.patch


 The TimelineDataManager currently has no metrics, outside of the standard JVM 
 metrics.  It would be very useful to at least log basic counts of method 
 calls, time spent in those calls, and number of entities/events involved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601054#comment-14601054
 ] 

Hudson commented on YARN-3832:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #239 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/239/])
YARN-3832. Resource Localization fails on a cluster due to existing cache 
directories. Contributed by Brahma Reddy Battula (jlowe: rev 
8d58512d6e6d9fe93784a9de2af0056bcc316d96)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* hadoop-yarn-project/CHANGES.txt


 Resource Localization fails on a cluster due to existing cache directories
 --

 Key: YARN-3832
 URL: https://issues.apache.org/jira/browse/YARN-3832
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Ranga Swamy
Assignee: Brahma Reddy Battula
Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3832.patch


  *We have found that resource localization fails on a cluster with the 
 following error.* 
  
 We got this error in the hadoop-2.7.0 release, even though it was fixed in 
 2.6.0 (YARN-2624).
 {noformat}
 Application application_1434703279149_0057 failed 2 times due to AM Container 
 for appattempt_1434703279149_0057_02 exited with exitCode: -1000
 For more detailed output, check application tracking 
 page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then,
  click on links to logs of each attempt.
 Diagnostics: Rename cannot overwrite non empty destination directory 
 /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
 java.io.IOException: Rename cannot overwrite non empty destination directory 
 /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
 at 
 org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735)
 at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244)
 at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
 at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Failing this attempt. Failing the application.
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages

2015-06-25 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated YARN-3826:

Summary: Race condition in ResourceTrackerService leads to wrong 
diagnostics messages  (was: Race condition in ResourceTrackerService: potential 
wrong diagnostics messages)

 Race condition in ResourceTrackerService leads to wrong diagnostics messages
 

 Key: YARN-3826
 URL: https://issues.apache.org/jira/browse/YARN-3826
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, 
 YARN-3826.03.patch


 Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which 
 can be called concurrently, the static {{resync}} and {{shutdown}} responses 
 may have wrong diagnostics messages in some cases.
 On the other hand, these static members can hardly save any memory, since the 
 normal heartbeat responses are created for each heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages

2015-06-25 Thread Chengbing Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601032#comment-14601032
 ] 

Chengbing Liu commented on YARN-3826:
-

Thanks [~devaraj.k] for review and committing!

 Race condition in ResourceTrackerService leads to wrong diagnostics messages
 

 Key: YARN-3826
 URL: https://issues.apache.org/jira/browse/YARN-3826
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Fix For: 2.8.0

 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, 
 YARN-3826.03.patch


 Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which 
 can be called concurrently, the static {{resync}} and {{shutdown}} responses 
 may have wrong diagnostics messages in some cases.
 On the other hand, these static members can hardly save any memory, since the 
 normal heartbeat responses are created for each heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-25 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601865#comment-14601865
 ] 

Varun Saxena commented on YARN-3793:


While the NPEs are a problem, a closer look at the code shows that there is a 
bigger problem here: *container logs can be lost* if a disk has become 
bad (over 90% full).

When an application finishes, we upload logs after aggregation by calling 
{{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in 
the disk-full case returns nothing. So none of the container logs are 
aggregated and uploaded.
But on application finish, we also call 
{{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the 
application directory which contains the container logs, because it calls 
{{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks 
as well.

So we are left with neither the aggregated logs for the app nor the individual 
container logs for the app.

This sounds like a critical issue if not a blocker. [~kasha], [~jlowe], can you 
have a look? I will upload a patch shortly.




 Several NPEs when deleting local files on NM recovery
 -

 Key: YARN-3793
 URL: https://issues.apache.org/jira/browse/YARN-3793
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Varun Saxena

 When NM work-preserving restart is enabled, we see several NPEs on recovery. 
 These seem to correspond to sub-directories that need to be deleted. I wonder 
 if null pointers here mean incorrect tracking of these resources and a 
 potential leak. This JIRA is to investigate and fix anything required.
 Logs show:
 {noformat}
 2015-05-18 07:06:10,225 INFO 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
 absolute path : null
 2015-05-18 07:06:10,224 ERROR 
 org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
 execution of task in DeletionService
 java.lang.NullPointerException
 at 
 org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
 at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3830) AbstractYarnScheduler.createReleaseCache may try to clean a null attempt

2015-06-25 Thread nijel (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nijel updated YARN-3830:

Attachment: YARN-3830_2.patch

Thanks [~xgong] for the comment.
Updated the patch.
Please review.

 AbstractYarnScheduler.createReleaseCache may try to clean a null attempt
 

 Key: YARN-3830
 URL: https://issues.apache.org/jira/browse/YARN-3830
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: nijel
Assignee: nijel
 Attachments: YARN-3830_1.patch, YARN-3830_2.patch


 org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.createReleaseCache()
 {code}
 protected void createReleaseCache() {
   // Cleanup the cache after nm expire interval.
   new Timer().schedule(new TimerTask() {
     @Override
     public void run() {
       for (SchedulerApplication<T> app : applications.values()) {
         T attempt = app.getCurrentAppAttempt();
         synchronized (attempt) {
           for (ContainerId containerId : attempt.getPendingRelease()) {
             RMAuditLogger.logFailure(
 {code}
 Here the attempt can be null since the attempt is created later, so a 
 NullPointerException will occur:
 {code}
 2015-06-19 09:29:16,195 | ERROR | Timer-3 | Thread Thread[Timer-3,5,main] 
 threw an Exception. | YarnUncaughtExceptionHandler.java:68
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler$1.run(AbstractYarnScheduler.java:457)
   at java.util.TimerThread.mainLoop(Timer.java:555)
   at java.util.TimerThread.run(Timer.java:505)
 {code}
 This will skip the other applications in this run.
 We can add a null check and continue with the other applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3846) RM Web UI queue filter not working

2015-06-25 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601859#comment-14601859
 ] 

Wangda Tan commented on YARN-3846:
--

[~mohdshahidkhan],
Could you try https://issues.apache.org/jira/browse/YARN-2238 to see if this 
problem is already resolved?

 RM Web UI queue filter not working
 ---

 Key: YARN-3846
 URL: https://issues.apache.org/jira/browse/YARN-3846
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.7.0
Reporter: Mohammad Shahid Khan
Assignee: Mohammad Shahid Khan

 Clicking on the root queue shows all of the applications,
 but clicking on a leaf queue does not filter the applications related to the 
 clicked queue.
 The regular expression seems to be wrong: 
 {code}
 q = '^' + q.substr(q.lastIndexOf(':') + 2) + '$';,
 {code}
 For example:
 1. Suppose the queue name is b.
 Then the above expression will try to substr at index 1: 
 q.lastIndexOf(':') = -1
 -1 + 2 = 1
 which is wrong; it should look at index 0.
 2. If the queue name is ab.x,
 then it will parse it to .x, 
 but it should be x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3508) Preemption processing occurring on the main RM dispatcher

2015-06-25 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601884#comment-14601884
 ] 

Varun Saxena commented on YARN-3508:


[~jlowe]/[~jianhe]/[~leftnoteasy]/[~rohithsharma], can one of the committers 
have a look at this? :)

 Preemption processing occurring on the main RM dispatcher
 

 Key: YARN-3508
 URL: https://issues.apache.org/jira/browse/YARN-3508
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Varun Saxena
 Attachments: YARN-3508.002.patch, YARN-3508.01.patch


 We recently saw the RM for a large cluster lag far behind on the 
 AsyncDispatcher event queue.  The AsyncDispatcher thread was consistently 
 blocked on the highly-contended CapacityScheduler lock trying to dispatch 
 preemption-related events for RMContainerPreemptEventDispatcher.  Preemption 
 processing should occur on the scheduler event dispatcher thread or a 
 separate thread to avoid delaying the processing of other events in the 
 primary dispatcher queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3508) Preemption processing occurring on the main RM dispatcher

2015-06-25 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601947#comment-14601947
 ] 

Wangda Tan commented on YARN-3508:
--

I tend to support [~jlowe] and [~jianhe]'s suggestion: make preemption events 
go directly to the scheduler event queue.

I think we cannot assume preemption events have higher priority than other 
events; in most cases, preemption events just notify the AM about something 
that will happen. And managing two queues for the scheduler can be complex 
(how to balance them, etc.). To reduce complexity, I suggest maintaining only 
one queue for the scheduler until we have to do otherwise.
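
A minimal, self-contained sketch of the single-queue idea (not the RM code): 
preemption events share the scheduler's existing event queue and are handled 
by the same dispatcher thread, off the main RM dispatcher.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified stand-in for the scheduler's async dispatcher: one queue, one
// thread. Preemption events are just another event type on the same queue.
public class SingleQueueDispatcherSketch {
  interface SchedulerEvent {}
  static class NodeUpdateEvent implements SchedulerEvent {}
  static class PreemptContainerEvent implements SchedulerEvent {}

  private final BlockingQueue<SchedulerEvent> queue = new LinkedBlockingQueue<>();
  private final Thread dispatcher = new Thread(() -> {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        SchedulerEvent e = queue.take();
        handle(e);  // runs on the scheduler dispatcher thread, not the main RM dispatcher
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
  }, "scheduler-event-dispatcher");

  void start() { dispatcher.start(); }

  void dispatch(SchedulerEvent e) { queue.add(e); }

  private void handle(SchedulerEvent e) {
    if (e instanceof PreemptContainerEvent) {
      // acquire the scheduler lock here, off the main RM dispatcher
    } else if (e instanceof NodeUpdateEvent) {
      // normal scheduling work
    }
  }
}
{code}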

 Preemption processing occurring on the main RM dispatcher
 

 Key: YARN-3508
 URL: https://issues.apache.org/jira/browse/YARN-3508
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Varun Saxena
 Attachments: YARN-3508.002.patch, YARN-3508.01.patch


 We recently saw the RM for a large cluster lag far behind on the 
 AsyncDispatcher event queue.  The AsyncDispatcher thread was consistently 
 blocked on the highly-contended CapacityScheduler lock trying to dispatch 
 preemption-related events for RMContainerPreemptEventDispatcher.  Preemption 
 processing should occur on the scheduler event dispatcher thread or a 
 separate thread to avoid delaying the processing of other events in the 
 primary dispatcher queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-25 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601956#comment-14601956
 ] 

Jason Lowe commented on YARN-3793:
--

It sounds like the NPEs are scary in the logs but benign in practice, since 
they occur in situations where we don't actually want to delete anything anyway.

Regarding loss of logs, I agree with your analysis.  It makes me think there 
should be a getLogDirsForRead that can be used in places that search for files 
that are already there.  The NPE and the log loss are unrelated, so arguably 
the log-loss blocker should be tracked in a separate JIRA.

 Several NPEs when deleting local files on NM recovery
 -

 Key: YARN-3793
 URL: https://issues.apache.org/jira/browse/YARN-3793
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3793.01.patch


 When NM work-preserving restart is enabled, we see several NPEs on recovery. 
 These seem to correspond to sub-directories that need to be deleted. I wonder 
 if null pointers here mean incorrect tracking of these resources and a 
 potential leak. This JIRA is to investigate and fix anything required.
 Logs show:
 {noformat}
 2015-05-18 07:06:10,225 INFO 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
 absolute path : null
 2015-05-18 07:06:10,224 ERROR 
 org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
 execution of task in DeletionService
 java.lang.NullPointerException
 at 
 org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
 at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1965) Interrupted exception when closing YarnClient

2015-06-25 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601965#comment-14601965
 ] 

zhihai xu commented on YARN-1965:
-

Should this be a Hadoop Common issue? It looks like all the changes are in the 
hadoop-common project.

 Interrupted exception when closing YarnClient
 -

 Key: YARN-1965
 URL: https://issues.apache.org/jira/browse/YARN-1965
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 2.3.0
Reporter: Oleg Zhurakousky
Assignee: Kuhu Shukla
Priority: Minor
  Labels: newbie
 Attachments: YARN-1965-v2.patch, YARN-1965.patch


 It's more of a nuisance than a bug, but nevertheless: 
 {code}
 16:16:48,709 ERROR pool-1-thread-1 ipc.Client:195 - Interrupted while waiting 
 for clientExecutorto stop
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2072)
   at 
 java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468)
   at 
 org.apache.hadoop.ipc.Client$ClientExecutorServiceFactory.unrefAndCleanup(Client.java:191)
   at org.apache.hadoop.ipc.Client.stop(Client.java:1235)
   at org.apache.hadoop.ipc.ClientCache.stopClient(ClientCache.java:100)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.close(ProtobufRpcEngine.java:251)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.close(ApplicationClientProtocolPBClientImpl.java:112)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:621)
   at 
 org.apache.hadoop.io.retry.DefaultFailoverProxyProvider.close(DefaultFailoverProxyProvider.java:57)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.close(RetryInvocationHandler.java:206)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStop(YarnClientImpl.java:124)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
 . . .
 {code}
 It happens sporadically when stopping YarnClient. 
 Looking at the code in Client's 'unrefAndCleanup', it's not immediately obvious 
 why and who throws the interrupt, but in any event it should not be logged as 
 ERROR; a WARN with no stack trace is probably enough.
 Also, for consistency and correctness, you may want to interrupt the current 
 thread as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3793) Several NPEs when deleting local files on NM recovery

2015-06-25 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601974#comment-14601974
 ] 

Varun Saxena commented on YARN-3793:


Thanks for looking at this [~jlowe].
I will raise a separate JIRA for this.

getLogDirsForRead will be the same as getLogDirsForCleanup, but I guess it 
would be semantically more correct to use it.

 Several NPEs when deleting local files on NM recovery
 -

 Key: YARN-3793
 URL: https://issues.apache.org/jira/browse/YARN-3793
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3793.01.patch


 When NM work-preserving restart is enabled, we see several NPEs on recovery. 
 These seem to correspond to sub-directories that need to be deleted. I wonder 
 if null pointers here mean incorrect tracking of these resources and a 
 potential leak. This JIRA is to investigate and fix anything required.
 Logs show:
 {noformat}
 2015-05-18 07:06:10,225 INFO 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
 absolute path : null
 2015-05-18 07:06:10,224 ERROR 
 org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during 
 execution of task in DeletionService
 java.lang.NullPointerException
 at 
 org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
 at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3850) Container logs can be lost if disk is full

2015-06-25 Thread Varun Saxena (JIRA)
Varun Saxena created YARN-3850:
--

 Summary: Container logs can be lost if disk is full
 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker


*Container logs* can be lost if a disk has become bad (i.e. more than 90% full).
When an application finishes, we upload the logs after aggregation by calling 
{{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in 
the disk-full case returns nothing. So none of the container logs are 
aggregated and uploaded.
On application finish, we also call 
{{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the 
application directory which contains the container logs, because it calls 
{{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks 
as well.
So we are left with neither the aggregated logs for the app nor the individual 
container logs.
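A hedged sketch of the getLogDirsForRead idea discussed on YARN-3793 (the 
method name and behaviour here are assumptions for illustration, not the 
actual LocalDirsHandlerService API): reads should consider full-but-otherwise-
good disks, while writes should not.
{code}
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for LocalDirsHandlerService; field and method names are
// illustrative assumptions, not the real API.
public class DirsHandlerSketch {
  private final List<String> goodDirs = new ArrayList<>();  // healthy and below the usage threshold
  private final List<String> fullDirs = new ArrayList<>();  // healthy but above the usage threshold

  // For writes and aggregation targets: only dirs we are allowed to write to.
  public List<String> getLogDirs() {
    return new ArrayList<>(goodDirs);
  }

  // For reading files that already exist: include full disks too, so existing
  // container logs can still be found and aggregated.
  public List<String> getLogDirsForRead() {
    List<String> dirs = new ArrayList<>(goodDirs);
    dirs.addAll(fullDirs);
    return dirs;
  }
}
{code}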



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3611) Support Docker Containers In LinuxContainerExecutor

2015-06-25 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601981#comment-14601981
 ] 

Sidharta Seethana commented on YARN-3611:
-

[~ashahab] and I have been working together on this for the past few weeks. (We 
demoed this recently as well). I am going to file sub tasks so that we can make 
progress. 

thanks,
-Sidharta

 Support Docker Containers In LinuxContainerExecutor
 ---

 Key: YARN-3611
 URL: https://issues.apache.org/jira/browse/YARN-3611
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Sidharta Seethana
Assignee: Sidharta Seethana

 Support Docker Containers In LinuxContainerExecutor
 LinuxContainerExecutor provides useful functionality today with respect to 
 localization, cgroups based resource management and isolation for CPU, 
 network, disk etc. as well as security with a well-defined mechanism to 
 execute privileged operations using the container-executor utility.  Bringing 
 docker support to LinuxContainerExecutor lets us use all of this 
 functionality when running docker containers under YARN, while not requiring 
 users and admins to configure and use a different ContainerExecutor. 
 There are several aspects here that need to be worked through :
 * Mechanism(s) to let clients request docker-specific functionality - we 
 could initially implement this via environment variables without impacting 
 the client API.
 * Security - both docker daemon as well as application
 * Docker image localization
 * Running a docker container via container-executor as a specified user
 * “Isolate” the docker container in terms of CPU/network/disk/etc
 * Communicating with and/or signaling the running container (ensure correct 
 pid handling)
 * Figure out workarounds for certain performance-sensitive scenarios like 
 HDFS short-circuit reads 
 * All of these need to be achieved without changing the current behavior of 
 LinuxContainerExecutor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1965) Interrupted exception when closing YarnClient

2015-06-25 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-1965:
--
Attachment: YARN-1965-v2.patch

Patch with a correction for whitespace. 
The fix logs the InterruptedException in the IPC Client as a warning, and the 
current thread is interrupted once the exception is caught. Also, some cleanup 
code is added in TestIPC so that the client executor count is decremented 
after each test.
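
A minimal sketch of the described handling (illustrative, not the actual 
Client.java change): log the interrupt as a warning without a stack trace and 
restore the interrupt status.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only: the warn-and-reinterrupt pattern described above.
public class ExecutorShutdownSketch {
  private static final Logger LOG = LoggerFactory.getLogger(ExecutorShutdownSketch.class);

  static void awaitQuietly(ExecutorService clientExecutor) {
    clientExecutor.shutdown();
    try {
      clientExecutor.awaitTermination(1, TimeUnit.MINUTES);
    } catch (InterruptedException ie) {
      // WARN without the stack trace, then restore the interrupt status.
      LOG.warn("Interrupted while waiting for clientExecutor to stop");
      clientExecutor.shutdownNow();
      Thread.currentThread().interrupt();
    }
  }
}
{code}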

 Interrupted exception when closing YarnClient
 -

 Key: YARN-1965
 URL: https://issues.apache.org/jira/browse/YARN-1965
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 2.3.0
Reporter: Oleg Zhurakousky
Assignee: Kuhu Shukla
Priority: Minor
  Labels: newbie
 Attachments: YARN-1965-v2.patch, YARN-1965.patch


 It's more of a nuisance than a bug, but nevertheless: 
 {code}
 16:16:48,709 ERROR pool-1-thread-1 ipc.Client:195 - Interrupted while waiting 
 for clientExecutorto stop
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2072)
   at 
 java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468)
   at 
 org.apache.hadoop.ipc.Client$ClientExecutorServiceFactory.unrefAndCleanup(Client.java:191)
   at org.apache.hadoop.ipc.Client.stop(Client.java:1235)
   at org.apache.hadoop.ipc.ClientCache.stopClient(ClientCache.java:100)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.close(ProtobufRpcEngine.java:251)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.close(ApplicationClientProtocolPBClientImpl.java:112)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:621)
   at 
 org.apache.hadoop.io.retry.DefaultFailoverProxyProvider.close(DefaultFailoverProxyProvider.java:57)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.close(RetryInvocationHandler.java:206)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStop(YarnClientImpl.java:124)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
 . . .
 {code}
 It happens sporadically when stopping YarnClient. 
 Looking at the code in Client's 'unrefAndCleanup', it's not immediately obvious 
 why and who throws the interrupt, but in any event it should not be logged as 
 ERROR; a WARN with no stack trace is probably enough.
 Also, for consistency and correctness, you may want to interrupt the current 
 thread as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601413#comment-14601413
 ] 

Hudson commented on YARN-3745:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/])
YARN-3745. SerializedException should also try to instantiate internal 
(devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java


 SerializedException should also try to instantiate internal exception with 
 the default constructor
 --

 Key: YARN-3745
 URL: https://issues.apache.org/jira/browse/YARN-3745
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lavkesh Lahngir
Assignee: Lavkesh Lahngir
 Fix For: 2.8.0

 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, 
 YARN-3745.patch


 While deserialising a SerializedException, it tries to create the internal 
 exception in instantiateException() with cn = 
 cls.getConstructor(String.class).
 If cls does not have a constructor with a String parameter, it throws 
 NoSuchMethodException, for example for the ClosedChannelException class.
 We should also try to instantiate the exception with the default constructor 
 so that the inner exception can be propagated.
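 A hedged sketch of the fallback described above (not the actual 
 SerializedExceptionPBImpl code): try the (String) constructor first, then 
 fall back to the no-arg constructor and set the cause separately.
 {code}
 import java.lang.reflect.Constructor;

 // Illustrative sketch: instantiate a Throwable subclass, preferring a
 // (String) constructor but falling back to the default constructor
 // (e.g. for ClosedChannelException).
 public final class InstantiateExceptionSketch {
   static Throwable instantiate(Class<? extends Throwable> cls, String msg,
       Throwable cause) throws Exception {
     Throwable t;
     try {
       Constructor<? extends Throwable> cn = cls.getConstructor(String.class);
       t = cn.newInstance(msg);
     } catch (NoSuchMethodException e) {
       // No (String) constructor: fall back to the default constructor.
       Constructor<? extends Throwable> cn = cls.getConstructor();
       t = cn.newInstance();
     }
     if (cause != null) {
       t.initCause(cause);
     }
     return t;
   }
 }
 {code}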



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3827) Migrate YARN native build to new CMake framework

2015-06-25 Thread Alan Burlison (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Burlison updated YARN-3827:

Attachment: YARN-3827.001.patch

 Migrate YARN native build to new CMake framework
 

 Key: YARN-3827
 URL: https://issues.apache.org/jira/browse/YARN-3827
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: build
Affects Versions: 2.7.0
Reporter: Alan Burlison
Assignee: Alan Burlison
 Attachments: YARN-3827.001.patch


 As per HADOOP-12036, the CMake infrastructure should be refactored and made 
 common across all Hadoop components. This bug covers the migration of YARN to 
 the new CMake infrastructure. This change will also add support for building 
 YARN Native components on Solaris.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601369#comment-14601369
 ] 

Hudson commented on YARN-3809:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/])
YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's 
threads all hang. Contributed by Jun Gong (jlowe: rev 
2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/CHANGES.txt


 Failed to launch new attempts because ApplicationMasterLauncher's threads all 
 hang
 --

 Key: YARN-3809
 URL: https://issues.apache.org/jira/browse/YARN-3809
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.7.1

 Attachments: YARN-3809.01.patch, YARN-3809.02.patch, 
 YARN-3809.03.patch


 ApplicationMasterLauncher creates a thread pool of size 10 to handle 
 AMLauncherEventType (LAUNCH and CLEANUP).
 In our cluster, there were many NMs with 10+ AMs running on them, and one shut 
 down for some reason. After the RM marked the NM as LOST, it cleaned up the 
 AMs running on it, so ApplicationMasterLauncher had to handle these 10+ 
 CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and the 
 threads all hung in containerMgrProxy.stopContainers(stopRequest) because the 
 NM was down and the default RPC timeout is 15 minutes. This means that for 15 
 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, 
 so new attempts failed to launch because of the timeout.
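 A minimal sketch of the mitigation direction (the property name and default 
 below are placeholders for illustration, not necessarily what the committed 
 patch uses): make the launcher pool size configurable so CLEANUP storms 
 cannot starve LAUNCH events for a full RPC timeout.
 {code}
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;

 // Illustrative sketch: a launcher pool whose size comes from configuration
 // instead of a hard-coded 10. The property name is a hypothetical placeholder.
 public class LauncherPoolSketch {
   static final String POOL_SIZE_PROP = "yarn.example.amlauncher.thread-count";
   static final int DEFAULT_POOL_SIZE = 50;

   static ExecutorService createLauncherPool() {
     // Read the size from a system property here purely for illustration.
     int size = Integer.getInteger(POOL_SIZE_PROP, DEFAULT_POOL_SIZE);
     return Executors.newFixedThreadPool(size);
   }
 }
 {code}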



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3830) AbstractYarnScheduler.createReleaseCache may try to clean a null attempt

2015-06-25 Thread nijel (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nijel updated YARN-3830:

Attachment: YARN-3830_3.patch

Sorry for the small mistake; the line-length issue is corrected.

The test failure is not related to this patch. I verified it locally and it 
passes.

 AbstractYarnScheduler.createReleaseCache may try to clean a null attempt
 

 Key: YARN-3830
 URL: https://issues.apache.org/jira/browse/YARN-3830
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: nijel
Assignee: nijel
 Attachments: YARN-3830_1.patch, YARN-3830_2.patch, YARN-3830_3.patch


 org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.createReleaseCache()
 {code}
 protected void createReleaseCache() {
 // Cleanup the cache after nm expire interval.
 new Timer().schedule(new TimerTask() {
   @Override
   public void run() {
 for (SchedulerApplication<T> app : applications.values()) {
   T attempt = app.getCurrentAppAttempt();
   synchronized (attempt) {
 for (ContainerId containerId : attempt.getPendingRelease()) {
   RMAuditLogger.logFailure(
 {code}
 Here the attempt can be null since the attempt is created later, so a 
 NullPointerException will occur:
 {code}
 2015-06-19 09:29:16,195 | ERROR | Timer-3 | Thread Thread[Timer-3,5,main] 
 threw an Exception. | YarnUncaughtExceptionHandler.java:68
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler$1.run(AbstractYarnScheduler.java:457)
   at java.util.TimerThread.mainLoop(Timer.java:555)
   at java.util.TimerThread.run(Timer.java:505)
 {code}
 This will skip the other applications in this run.
 We can add a null check and continue with the other applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601412#comment-14601412
 ] 

Hudson commented on YARN-3832:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/])
YARN-3832. Resource Localization fails on a cluster due to existing cache 
directories. Contributed by Brahma Reddy Battula (jlowe: rev 
8d58512d6e6d9fe93784a9de2af0056bcc316d96)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* hadoop-yarn-project/CHANGES.txt


 Resource Localization fails on a cluster due to existing cache directories
 --

 Key: YARN-3832
 URL: https://issues.apache.org/jira/browse/YARN-3832
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Ranga Swamy
Assignee: Brahma Reddy Battula
Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3832.patch


  *We have found that resource localization fails on a cluster with the 
 following error.* 
  
 We got this error in the hadoop-2.7.0 release, even though it was fixed in 
 2.6.0 (YARN-2624):
 {noformat}
 Application application_1434703279149_0057 failed 2 times due to AM Container 
 for appattempt_1434703279149_0057_02 exited with exitCode: -1000
 For more detailed output, check application tracking 
 page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then,
  click on links to logs of each attempt.
 Diagnostics: Rename cannot overwrite non empty destination directory 
 /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
 java.io.IOException: Rename cannot overwrite non empty destination directory 
 /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
 at 
 org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735)
 at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244)
 at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
 at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Failing this attempt. Failing the application.
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601407#comment-14601407
 ] 

Hudson commented on YARN-3826:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/])
YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: 
rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java


 Race condition in ResourceTrackerService leads to wrong diagnostics messages
 

 Key: YARN-3826
 URL: https://issues.apache.org/jira/browse/YARN-3826
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Fix For: 2.8.0

 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, 
 YARN-3826.03.patch


 Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which 
 can be called concurrently, the static {{resync}} and {{shutdown}} responses 
 may carry wrong diagnostics messages in some cases.
 On the other hand, these static members hardly save any memory, since the 
 normal heartbeat responses are created for each heartbeat anyway.
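 A minimal sketch of the fix direction described above (illustrative, not the 
 actual ResourceTrackerService change; the Response class below stands in for 
 NodeHeartbeatResponse): build a fresh response per heartbeat instead of 
 mutating a shared static one.
 {code}
 // Illustrative only: contrast a shared mutable response with a per-call one.
 public class HeartbeatResponseSketch {
   static class Response {
     private final String action;
     private final String diagnostics;
     Response(String action, String diagnostics) {
       this.action = action;
       this.diagnostics = diagnostics;
     }
     @Override public String toString() { return action + ": " + diagnostics; }
   }

   // Race-prone pattern: a single static response whose diagnostics get
   // overwritten by concurrent heartbeats, e.g.
   //   static final Response RESYNC = ...; RESYNC.setDiagnosticsMessage(...);

   // Safer pattern: create the (cheap) response object per heartbeat.
   static Response resync(String nodeId, String reason) {
     return new Response("RESYNC", "Node " + nodeId + ": " + reason);
   }
 }
 {code}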



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601411#comment-14601411
 ] 

Hudson commented on YARN-3790:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/])
YARN-3790. usedResource from rootQueue metrics may get stale data for FS 
scheduler after recovering the container (Zhihai Xu via rohithsharmaks) 
(rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 
2df00d53d13d16628b6bde5e05133d239f138f52)
* hadoop-yarn-project/CHANGES.txt


 usedResource from rootQueue metrics may get stale data for FS scheduler after 
 recovering the container
 --

 Key: YARN-3790
 URL: https://issues.apache.org/jira/browse/YARN-3790
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler, test
Reporter: Rohith Sharma K S
Assignee: zhihai xu
 Fix For: 2.8.0

 Attachments: YARN-3790.000.patch


 Failure trace is as follows
 {noformat}
 Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
 testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
   Time elapsed: 6.502 sec   FAILURE!
 java.lang.AssertionError: expected:6144 but was:8192
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at org.junit.Assert.assertEquals(Assert.java:555)
   at org.junit.Assert.assertEquals(Assert.java:542)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3360) Add JMX metrics to TimelineDataManager

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601414#comment-14601414
 ] 

Hudson commented on YARN-3360:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/])
YARN-3360. Add JMX metrics to TimelineDataManager (Jason Lowe via jeagles) 
(jeagles: rev 4c659ddbf7629aae92e66a5b54893e9c1c68dfb0)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestTimelineDataManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManagerMetrics.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java


 Add JMX metrics to TimelineDataManager
 --

 Key: YARN-3360
 URL: https://issues.apache.org/jira/browse/YARN-3360
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Jason Lowe
  Labels: BB2015-05-TBR
 Fix For: 3.0.0, 2.8.0

 Attachments: YARN-3360.001.patch, YARN-3360.002.patch, 
 YARN-3360.003.patch


 The TimelineDataManager currently has no metrics, outside of the standard JVM 
 metrics.  It would be very useful to at least log basic counts of method 
 calls, time spent in those calls, and number of entities/events involved.
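 A simplified sketch of the kind of instrumentation described (plain counters 
 rather than whatever metrics framework the actual patch uses; class and field 
 names are illustrative):
 {code}
 import java.util.List;
 import java.util.concurrent.Callable;
 import java.util.concurrent.atomic.AtomicLong;

 // Simplified stand-in for per-call metrics: counts calls, total time and
 // entities returned. A real implementation would expose these over JMX.
 public class TimelineCallMetricsSketch {
   private final AtomicLong getEntitiesOps = new AtomicLong();
   private final AtomicLong getEntitiesTimeMs = new AtomicLong();
   private final AtomicLong entitiesReturned = new AtomicLong();

   public <T> List<T> timeGetEntities(Callable<List<T>> call) throws Exception {
     long start = System.nanoTime();
     List<T> result = call.call();
     getEntitiesOps.incrementAndGet();
     getEntitiesTimeMs.addAndGet((System.nanoTime() - start) / 1_000_000L);
     entitiesReturned.addAndGet(result.size());
     return result;
   }
 }
 {code}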



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601409#comment-14601409
 ] 

Hudson commented on YARN-3809:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/])
YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's 
threads all hang. Contributed by Jun Gong (jlowe: rev 
2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml


 Failed to launch new attempts because ApplicationMasterLauncher's threads all 
 hang
 --

 Key: YARN-3809
 URL: https://issues.apache.org/jira/browse/YARN-3809
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.7.1

 Attachments: YARN-3809.01.patch, YARN-3809.02.patch, 
 YARN-3809.03.patch


 ApplicationMasterLauncher creates a thread pool of size 10 to handle 
 AMLauncherEventType (LAUNCH and CLEANUP).
 In our cluster, there were many NMs with 10+ AMs running on them, and one shut 
 down for some reason. After the RM marked the NM as LOST, it cleaned up the 
 AMs running on it, so ApplicationMasterLauncher had to handle these 10+ 
 CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and the 
 threads all hung in containerMgrProxy.stopContainers(stopRequest) because the 
 NM was down and the default RPC timeout is 15 minutes. This means that for 15 
 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, 
 so new attempts failed to launch because of the timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3827) Migrate YARN native build to new CMake framework

2015-06-25 Thread Alan Burlison (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Burlison updated YARN-3827:

Attachment: (was: YARN-3827.001.patch)

 Migrate YARN native build to new CMake framework
 

 Key: YARN-3827
 URL: https://issues.apache.org/jira/browse/YARN-3827
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: build
Affects Versions: 2.7.0
Reporter: Alan Burlison
Assignee: Alan Burlison

 As per HADOOP-12036, the CMake infrastructure should be refactored and made 
 common across all Hadoop components. This bug covers the migration of YARN to 
 the new CMake infrastructure. This change will also add support for building 
 YARN Native components on Solaris.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601373#comment-14601373
 ] 

Hudson commented on YARN-3745:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/])
YARN-3745. SerializedException should also try to instantiate internal 
(devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java


 SerializedException should also try to instantiate internal exception with 
 the default constructor
 --

 Key: YARN-3745
 URL: https://issues.apache.org/jira/browse/YARN-3745
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lavkesh Lahngir
Assignee: Lavkesh Lahngir
 Fix For: 2.8.0

 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, 
 YARN-3745.patch


 While deserialising a SerializedException, it tries to create the internal 
 exception in instantiateException() with cn = 
 cls.getConstructor(String.class).
 If cls does not have a constructor with a String parameter, it throws 
 NoSuchMethodException, for example for the ClosedChannelException class.
 We should also try to instantiate the exception with the default constructor 
 so that the inner exception can be propagated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601367#comment-14601367
 ] 

Hudson commented on YARN-3826:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/])
YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: 
rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java
* hadoop-yarn-project/CHANGES.txt


 Race condition in ResourceTrackerService leads to wrong diagnostics messages
 

 Key: YARN-3826
 URL: https://issues.apache.org/jira/browse/YARN-3826
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Fix For: 2.8.0

 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, 
 YARN-3826.03.patch


 Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which 
 can be called concurrently, the static {{resync}} and {{shutdown}} responses 
 may carry wrong diagnostics messages in some cases.
 On the other hand, these static members hardly save any memory, since the 
 normal heartbeat responses are created for each heartbeat anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3360) Add JMX metrics to TimelineDataManager

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601374#comment-14601374
 ] 

Hudson commented on YARN-3360:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/])
YARN-3360. Add JMX metrics to TimelineDataManager (Jason Lowe via jeagles) 
(jeagles: rev 4c659ddbf7629aae92e66a5b54893e9c1c68dfb0)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestTimelineDataManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManagerMetrics.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* hadoop-yarn-project/CHANGES.txt


 Add JMX metrics to TimelineDataManager
 --

 Key: YARN-3360
 URL: https://issues.apache.org/jira/browse/YARN-3360
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Jason Lowe
  Labels: BB2015-05-TBR
 Fix For: 3.0.0, 2.8.0

 Attachments: YARN-3360.001.patch, YARN-3360.002.patch, 
 YARN-3360.003.patch


 The TimelineDataManager currently has no metrics, outside of the standard JVM 
 metrics.  It would be very useful to at least log basic counts of method 
 calls, time spent in those calls, and number of entities/events involved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601372#comment-14601372
 ] 

Hudson commented on YARN-3832:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/])
YARN-3832. Resource Localization fails on a cluster due to existing cache 
directories. Contributed by Brahma Reddy Battula (jlowe: rev 
8d58512d6e6d9fe93784a9de2af0056bcc316d96)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java


 Resource Localization fails on a cluster due to existing cache directories
 --

 Key: YARN-3832
 URL: https://issues.apache.org/jira/browse/YARN-3832
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Ranga Swamy
Assignee: Brahma Reddy Battula
Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3832.patch


  *We have found that resource localization fails on a cluster with the 
 following error.* 
  
 We got this error in the hadoop-2.7.0 release, even though it was fixed in 
 2.6.0 (YARN-2624):
 {noformat}
 Application application_1434703279149_0057 failed 2 times due to AM Container 
 for appattempt_1434703279149_0057_02 exited with exitCode: -1000
 For more detailed output, check application tracking 
 page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then,
  click on links to logs of each attempt.
 Diagnostics: Rename cannot overwrite non empty destination directory 
 /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
 java.io.IOException: Rename cannot overwrite non empty destination directory 
 /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
 at 
 org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735)
 at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244)
 at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
 at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Failing this attempt. Failing the application.
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601371#comment-14601371
 ] 

Hudson commented on YARN-3790:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/])
YARN-3790. usedResource from rootQueue metrics may get stale data for FS 
scheduler after recovering the container (Zhihai Xu via rohithsharmaks) 
(rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* hadoop-yarn-project/CHANGES.txt
Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 
2df00d53d13d16628b6bde5e05133d239f138f52)
* hadoop-yarn-project/CHANGES.txt


 usedResource from rootQueue metrics may get stale data for FS scheduler after 
 recovering the container
 --

 Key: YARN-3790
 URL: https://issues.apache.org/jira/browse/YARN-3790
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler, test
Reporter: Rohith Sharma K S
Assignee: zhihai xu
 Fix For: 2.8.0

 Attachments: YARN-3790.000.patch


 Failure trace is as follows
 {noformat}
 Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
 testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
   Time elapsed: 6.502 sec   FAILURE!
 java.lang.AssertionError: expected:6144 but was:8192
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at org.junit.Assert.assertEquals(Assert.java:555)
   at org.junit.Assert.assertEquals(Assert.java:542)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations

2015-06-25 Thread Jonathan Yaniv (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Yaniv updated YARN-3656:
-
Attachment: YARN-3656-v1.2.patch

 LowCost: A Cost-Based Placement Agent for YARN Reservations
 ---

 Key: YARN-3656
 URL: https://issues.apache.org/jira/browse/YARN-3656
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.6.0
Reporter: Ishai Menache
Assignee: Jonathan Yaniv
  Labels: capacity-scheduler, resourcemanager
 Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, 
 YARN-3656-v1.2.patch, YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf


 YARN-1051 enables SLA support by allowing users to reserve cluster capacity 
 ahead of time. YARN-1710 introduced a greedy agent for placing user 
 reservations. The greedy agent makes fast placement decisions but at the cost 
 of ignoring the cluster committed resources, which might result in blocking 
 the cluster resources for certain periods of time, and in turn rejecting some 
 arriving jobs.
 We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” 
 the demand of the job throughout the allowed time-window according to a 
 global, load-based cost function. 
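 A conceptual sketch of "spreading demand according to a load-based cost" 
 (this illustrates the general idea only; it is not the LowCost algorithm from 
 the attached paper): place each unit of demand in the currently least-loaded 
 step of the allowed window.
 {code}
 // Conceptual illustration only: spread a job's demand over its allowed
 // window [start, end) by repeatedly placing one unit in the step with the
 // lowest current load (the cost function here is simply the current load).
 public class CostSpreadSketch {
   static int[] spread(double[] load, int start, int end, int demandUnits) {
     int[] placed = new int[load.length];
     if (start >= end) {
       return placed;  // empty window: nothing can be placed
     }
     for (int u = 0; u < demandUnits; u++) {
       int best = start;
       for (int t = start; t < end; t++) {
         if (load[t] < load[best]) {
           best = t;
         }
       }
       load[best] += 1.0;
       placed[best] += 1;
     }
     return placed;
   }
 }
 {code}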



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1965) Interrupted exception when closing YarnClient

2015-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601394#comment-14601394
 ] 

Hadoop QA commented on YARN-1965:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  17m  4s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 50s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  0s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m  8s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 36s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 51s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  22m 21s | Tests passed in 
hadoop-common. |
| | |  62m 51s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741865/YARN-1965-v2.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / b381f88 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8345/artifact/patchprocess/testrun_hadoop-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8345/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8345/console |


This message was automatically generated.

 Interrupted exception when closing YarnClient
 -

 Key: YARN-1965
 URL: https://issues.apache.org/jira/browse/YARN-1965
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 2.3.0
Reporter: Oleg Zhurakousky
Assignee: Kuhu Shukla
Priority: Minor
  Labels: newbie
 Attachments: YARN-1965-v2.patch, YARN-1965.patch


 It's more of a nuisance than a bug, but nevertheless: 
 {code}
 16:16:48,709 ERROR pool-1-thread-1 ipc.Client:195 - Interrupted while waiting 
 for clientExecutorto stop
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2072)
   at 
 java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468)
   at 
 org.apache.hadoop.ipc.Client$ClientExecutorServiceFactory.unrefAndCleanup(Client.java:191)
   at org.apache.hadoop.ipc.Client.stop(Client.java:1235)
   at org.apache.hadoop.ipc.ClientCache.stopClient(ClientCache.java:100)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.close(ProtobufRpcEngine.java:251)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.close(ApplicationClientProtocolPBClientImpl.java:112)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:621)
   at 
 org.apache.hadoop.io.retry.DefaultFailoverProxyProvider.close(DefaultFailoverProxyProvider.java:57)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.close(RetryInvocationHandler.java:206)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStop(YarnClientImpl.java:124)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
 . . .
 {code}
 It happens sporadically when stopping YarnClient. 
 Looking at the code in Client's 'unrefAndCleanup', it's not immediately obvious 
 why and who throws the interrupt, but in any event it should not be logged as 
 ERROR; a WARN with no stack trace is probably enough.
 Also, for consistency and correctness, you may want to interrupt the current 
 thread as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2004) Priority scheduling support in Capacity scheduler

2015-06-25 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-2004:
--
Attachment: 0007-YARN-2004.patch

Rebased the patch against the latest trunk. Also made changes as per the 
OrderingPolicy in CS.

 Priority scheduling support in Capacity scheduler
 -

 Key: YARN-2004
 URL: https://issues.apache.org/jira/browse/YARN-2004
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Sunil G
Assignee: Sunil G
 Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, 
 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, 
 0006-YARN-2004.patch, 0007-YARN-2004.patch


 Based on the priority of the application, the Capacity Scheduler should be 
 able to give preference to an application while scheduling.
 Comparator<FiCaSchedulerApp> applicationComparator can be changed as below 
 (see the sketch after this description):
 
 1. Check for application priority. If a priority is available, then return 
 the higher-priority job.
 2. Otherwise continue with the existing logic, such as App ID comparison and 
 then timestamp comparison.
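 A hedged sketch of such a comparator (the AppInfo interface and its accessors 
 are assumptions for illustration, not the real FiCaSchedulerApp API): compare 
 by priority first, then fall back to the existing application-id ordering.
 {code}
 import java.util.Comparator;

 // Illustrative sketch only; the accessors below are assumed for the example.
 public class PriorityComparatorSketch {
   interface AppInfo {
     int getPriority();        // higher value = higher priority (assumption)
     long getApplicationId();  // monotonically increasing submission id
   }

   static final Comparator<AppInfo> APPLICATION_COMPARATOR = (a1, a2) -> {
     // 1. Prefer the higher-priority application.
     if (a1.getPriority() != a2.getPriority()) {
       return Integer.compare(a2.getPriority(), a1.getPriority());
     }
     // 2. Otherwise fall back to the existing ordering (application id).
     return Long.compare(a1.getApplicationId(), a2.getApplicationId());
   };
 }
 {code}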



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3827) Migrate YARN native build to new CMake framework

2015-06-25 Thread Alan Burlison (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Burlison updated YARN-3827:

Attachment: YARN-3827.001.patch

 Migrate YARN native build to new CMake framework
 

 Key: YARN-3827
 URL: https://issues.apache.org/jira/browse/YARN-3827
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: build
Affects Versions: 2.7.0
Reporter: Alan Burlison
Assignee: Alan Burlison

 As per HADOOP-12036, the CMake infrastructure should be refactored and made 
 common across all Hadoop components. This bug covers the migration of YARN to 
 the new CMake infrastructure. This change will also add support for building 
 YARN Native components on Solaris.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3827) Migrate YARN native build to new CMake framework

2015-06-25 Thread Alan Burlison (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Burlison updated YARN-3827:

Attachment: (was: YARN-3827.001.patch)

 Migrate YARN native build to new CMake framework
 

 Key: YARN-3827
 URL: https://issues.apache.org/jira/browse/YARN-3827
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: build
Affects Versions: 2.7.0
Reporter: Alan Burlison
Assignee: Alan Burlison

 As per HADOOP-12036, the CMake infrastructure should be refactored and made 
 common across all Hadoop components. This bug covers the migration of YARN to 
 the new CMake infrastructure. This change will also add support for building 
 YARN Native components on Solaris.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601447#comment-14601447
 ] 

Hudson commented on YARN-3790:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/])
YARN-3790. usedResource from rootQueue metrics may get stale data for FS 
scheduler after recovering the container (Zhihai Xu via rohithsharmaks) 
(rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* hadoop-yarn-project/CHANGES.txt
Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 
2df00d53d13d16628b6bde5e05133d239f138f52)
* hadoop-yarn-project/CHANGES.txt


 usedResource from rootQueue metrics may get stale data for FS scheduler after 
 recovering the container
 --

 Key: YARN-3790
 URL: https://issues.apache.org/jira/browse/YARN-3790
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler, test
Reporter: Rohith Sharma K S
Assignee: zhihai xu
 Fix For: 2.8.0

 Attachments: YARN-3790.000.patch


 Failure trace is as follows
 {noformat}
 Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
 testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
   Time elapsed: 6.502 sec   FAILURE!
 java.lang.AssertionError: expected:6144 but was:8192
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at org.junit.Assert.assertEquals(Assert.java:555)
   at org.junit.Assert.assertEquals(Assert.java:542)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601448#comment-14601448
 ] 

Hudson commented on YARN-3832:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/])
YARN-3832. Resource Localization fails on a cluster due to existing cache 
directories. Contributed by Brahma Reddy Battula (jlowe: rev 
8d58512d6e6d9fe93784a9de2af0056bcc316d96)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java


 Resource Localization fails on a cluster due to existing cache directories
 --

 Key: YARN-3832
 URL: https://issues.apache.org/jira/browse/YARN-3832
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Ranga Swamy
Assignee: Brahma Reddy Battula
Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3832.patch


  *We have found that resource localization fails on a cluster with the following 
 error.* 
  
 Got this error in the hadoop-2.7.0 release, although it was fixed in 2.6.0 (YARN-2624):
 {noformat}
 Application application_1434703279149_0057 failed 2 times due to AM Container 
 for appattempt_1434703279149_0057_02 exited with exitCode: -1000
 For more detailed output, check application tracking 
 page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then,
  click on links to logs of each attempt.
 Diagnostics: Rename cannot overwrite non empty destination directory 
 /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
 java.io.IOException: Rename cannot overwrite non empty destination directory 
 /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
 at 
 org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735)
 at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244)
 at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
 at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Failing this attempt. Failing the application.
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601443#comment-14601443
 ] 

Hudson commented on YARN-3826:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/])
YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: 
rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* hadoop-yarn-project/CHANGES.txt


 Race condition in ResourceTrackerService leads to wrong diagnostics messages
 

 Key: YARN-3826
 URL: https://issues.apache.org/jira/browse/YARN-3826
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Fix For: 2.8.0

 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, 
 YARN-3826.03.patch


 Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which 
 can be called concurrently, the static {{resync}} and {{shutdown}} may have 
 wrong diagnostics messages in some cases.
 On the other hand, these static members can hardly save any memory, since the 
 normal heartbeat responses are created for each heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3360) Add JMX metrics to TimelineDataManager

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601450#comment-14601450
 ] 

Hudson commented on YARN-3360:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/])
YARN-3360. Add JMX metrics to TimelineDataManager (Jason Lowe via jeagles) 
(jeagles: rev 4c659ddbf7629aae92e66a5b54893e9c1c68dfb0)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestTimelineDataManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManagerMetrics.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java


 Add JMX metrics to TimelineDataManager
 --

 Key: YARN-3360
 URL: https://issues.apache.org/jira/browse/YARN-3360
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Jason Lowe
  Labels: BB2015-05-TBR
 Fix For: 3.0.0, 2.8.0

 Attachments: YARN-3360.001.patch, YARN-3360.002.patch, 
 YARN-3360.003.patch


 The TimelineDataManager currently has no metrics, outside of the standard JVM 
 metrics.  It would be very useful to at least log basic counts of method 
 calls, time spent in those calls, and number of entities/events involved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations

2015-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601472#comment-14601472
 ] 

Hadoop QA commented on YARN-3656:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  14m 56s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 2 new or modified test files. |
| {color:green}+1{color} | javac |   7m 32s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 32s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 24s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  3s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 26s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  50m 50s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  87m 16s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741868/YARN-3656-v1.2.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / b381f88 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8346/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8346/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8346/console |


This message was automatically generated.

 LowCost: A Cost-Based Placement Agent for YARN Reservations
 ---

 Key: YARN-3656
 URL: https://issues.apache.org/jira/browse/YARN-3656
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.6.0
Reporter: Ishai Menache
Assignee: Jonathan Yaniv
  Labels: capacity-scheduler, resourcemanager
 Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, 
 YARN-3656-v1.2.patch, YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf


 YARN-1051 enables SLA support by allowing users to reserve cluster capacity 
 ahead of time. YARN-1710 introduced a greedy agent for placing user 
 reservations. The greedy agent makes fast placement decisions, but at the cost 
 of ignoring the cluster's committed resources, which might result in blocking 
 the cluster resources for certain periods of time and in turn rejecting some 
 arriving jobs.
 We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” 
 the demand of the job throughout the allowed time-window according to a 
 global, load-based cost function. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.

2015-06-25 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601479#comment-14601479
 ] 

Ming Ma commented on YARN-221:
--

Here is the scenario: a) no applications want to override the default; b) 
administrators of the cluster want to make a cluster-wide global change from a 
sample rate of 20 percent to 50 percent.

 NM should provide a way for AM to tell it not to aggregate logs.
 

 Key: YARN-221
 URL: https://issues.apache.org/jira/browse/YARN-221
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager
Reporter: Robert Joseph Evans
Assignee: Ming Ma
 Attachments: YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, 
 YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch


 The NodeManager should provide a way for an AM to tell it that either the 
 logs should not be aggregated, that they should be aggregated with a high 
 priority, or that they should be aggregated but with a lower priority.  The 
 AM should be able to do this in the ContainerLaunch context to provide a 
 default value, but should also be able to update the value when the container 
 is released.
 This would allow the NM to not aggregate logs in some cases, and to avoid 
 connecting to the NN at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.

2015-06-25 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601616#comment-14601616
 ] 

Xuan Gong commented on YARN-221:


bq. Here is the scenario: a) no applications want to override the default; b) 
administrators of the cluster want to make a cluster-wide global change from a 
sample rate of 20 percent to 50 percent.

OK. This makes sense. Thanks for the explanation. 

 NM should provide a way for AM to tell it not to aggregate logs.
 

 Key: YARN-221
 URL: https://issues.apache.org/jira/browse/YARN-221
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager
Reporter: Robert Joseph Evans
Assignee: Ming Ma
 Attachments: YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, 
 YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch


 The NodeManager should provide a way for an AM to tell it that either the 
 logs should not be aggregated, that they should be aggregated with a high 
 priority, or that they should be aggregated but with a lower priority.  The 
 AM should be able to do this in the ContainerLaunch context to provide a 
 default value, but should also be able to update the value when the container 
 is released.
 This would allow the NM to not aggregate logs in some cases, and to avoid 
 connecting to the NN at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3724) Use POSIX nftw(3) instead of fts(3)

2015-06-25 Thread Alan Burlison (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601434#comment-14601434
 ] 

Alan Burlison commented on YARN-3724:
-

See also the discussion in 
http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201506.mbox/%3C558BCA3A.1020602%40oracle.com%3E.
 The use of fts(3) should be replaced by nftw(3)

 Use POSIX nftw(3) instead of fts(3)
 ---

 Key: YARN-3724
 URL: https://issues.apache.org/jira/browse/YARN-3724
 Project: Hadoop YARN
  Issue Type: Sub-task
 Environment: Solaris 11.2
Reporter: Malcolm Kavalsky
Assignee: Alan Burlison
   Original Estimate: 24h
  Remaining Estimate: 24h

 Compiling the YARN Node Manager results in "fts not found". On Solaris we 
 have an alternative, ftw, with similar functionality.
 This is isolated to a single file, container-executor.c.
 Note that this will just fix the compilation error. A more serious issue is 
 that Solaris does not support cgroups as Linux does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-06-25 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601475#comment-14601475
 ] 

Sunil G commented on YARN-3849:
---

Looping [~rohithsharma] and [~leftnoteasy]

Since we use the Dominant Resource Calculator, the piece of code below in 
ProportionalPreemptionPolicy looks doubtful:

{code}
  // When we have no more resource need to obtain, remove from map.
  if (Resources.lessThanOrEqual(rc, clusterResource, toObtainByPartition,
  Resources.none())) {
resourceToObtainByPartitions.remove(nodePartition);
  }
{code}

Assume toObtainByPartition is (12, 1) as (memory, cores). After another round of 
preemption, this will become (10, 0).
If the above check hits with this value, it is supposed to return TRUE, but the 
method returns FALSE.

The reason is that, due to dominance, if any resource dimension is non-zero then the 
dominant (maximum) share is non-zero, so the comparison does not evaluate as less than or equal.

{code}
// Just use 'dominant' resource
return (dominant) ?
    Math.max(
        (float) resource.getMemory() / clusterResource.getMemory(),
        (float) resource.getVirtualCores() / clusterResource.getVirtualCores()
    )
    :
    Math.min(
        (float) resource.getMemory() / clusterResource.getMemory(),
        (float) resource.getVirtualCores() / clusterResource.getVirtualCores()
    );
{code}

If resource.getVirtualCores() is ZERO and resource.getMemory() is non-zero, 
then this check will return a positive value.
We feel that this has to be checked beforehand: if one dimension is ZERO, we 
should treat lhs as less than rhs.
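
A minimal sketch of the pre-check being proposed (illustrative only; the helper class and method names are made up and this is not taken from any attached patch):

{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class PreemptionTargetCheck {
  // Consider the remaining demand exhausted once either dimension has dropped
  // to zero, instead of relying on the dominant-share comparison alone.
  static boolean noMoreResourceToObtain(ResourceCalculator rc,
      Resource clusterResource, Resource toObtainByPartition) {
    // e.g. (10, 0): memory still outstanding, but cores already satisfied
    if (toObtainByPartition.getMemory() <= 0
        || toObtainByPartition.getVirtualCores() <= 0) {
      return true;
    }
    // fall back to the existing dominant-share comparison
    return Resources.lessThanOrEqual(rc, clusterResource,
        toObtainByPartition, Resources.none());
  }
}
{code}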

 Too much of preemption activity causing continuos killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical

 Two queues are used. Each queue has been given a capacity of 0.5. The Dominant 
 Resource policy is used.
 1. An app is submitted in QueueA, which consumes the full cluster capacity.
 2. After submitting an app in QueueB, there is some demand, which invokes 
 preemption in QueueA.
 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed 
 that all containers other than the AM are getting killed in QueueA.
 4. Now the app in QueueB tries to take over the cluster with the current free 
 space. But there is some updated demand from the app in QueueA, which lost 
 its containers earlier, and preemption is now kicked off in QueueB.
 The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the 
 apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3838) Rest API failing when ip configured in RM address in secure https mode

2015-06-25 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3838:
---
Attachment: 0002-YARN-3838.patch

Updated the patch, since the new util is not required to be added. 

 Rest API failing when ip configured in RM address in secure https mode
 --

 Key: YARN-3838
 URL: https://issues.apache.org/jira/browse/YARN-3838
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-HADOOP-12096.patch, 0001-YARN-3810.patch, 
 0001-YARN-3838.patch, 0002-YARN-3810.patch, 0002-YARN-3838.patch


 Steps to reproduce
 ===
 1. Configure hadoop.http.authentication.kerberos.principal as below:
 {code:xml}
   <property>
     <name>hadoop.http.authentication.kerberos.principal</name>
     <value>HTTP/_h...@hadoop.com</value>
   </property>
 {code}
 2. In the RM web address, also configure the IP.
 3. Start up the RM.
 Call the REST API for the RM: {{curl -i -k --insecure --negotiate -u : https://<IP>/ws/v1/cluster/info}}
 *Actual*
 The REST API fails:
 {code}
 2015-06-16 19:03:49,845 DEBUG 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
 Authentication exception: GSSException: No valid credentials provided 
 (Mechanism level: Failed to find any Kerberos credentails)
 org.apache.hadoop.security.authentication.client.AuthenticationException: 
 GSSException: No valid credentials provided (Mechanism level: Failed to find 
 any Kerberos credentails)
   at 
 org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.authenticate(KerberosAuthenticationHandler.java:399)
   at 
 org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:348)
   at 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:519)
   at 
 org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-06-25 Thread Sunil G (JIRA)
Sunil G created YARN-3849:
-

 Summary: Too much of preemption activity causing continuos killing 
of containers across queues
 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical


Two queues are used. Each queue has been given a capacity of 0.5. The Dominant 
Resource policy is used.

1. An app is submitted in QueueA, which consumes the full cluster capacity.
2. After submitting an app in QueueB, there is some demand, which invokes 
preemption in QueueA.
3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed 
that all containers other than the AM are getting killed in QueueA.
4. Now the app in QueueB tries to take over the cluster with the current free 
space. But there is some updated demand from the app in QueueA, which lost its 
containers earlier, and preemption is now kicked off in QueueB.

The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the apps 
complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601461#comment-14601461
 ] 

Hudson commented on YARN-3826:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/])
YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: 
rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java
* hadoop-yarn-project/CHANGES.txt


 Race condition in ResourceTrackerService leads to wrong diagnostics messages
 

 Key: YARN-3826
 URL: https://issues.apache.org/jira/browse/YARN-3826
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Fix For: 2.8.0

 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, 
 YARN-3826.03.patch


 Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which 
 can be called concurrently, the static {{resync}} and {{shutdown}} may have 
 wrong diagnostics messages in some cases.
 On the other hand, these static members can hardly save any memory, since the 
 normal heartbeat responses are created for each heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601463#comment-14601463
 ] 

Hudson commented on YARN-3809:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/])
YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's 
threads all hang. Contributed by Jun Gong (jlowe: rev 
2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml


 Failed to launch new attempts because ApplicationMasterLauncher's threads all 
 hang
 --

 Key: YARN-3809
 URL: https://issues.apache.org/jira/browse/YARN-3809
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.7.1

 Attachments: YARN-3809.01.patch, YARN-3809.02.patch, 
 YARN-3809.03.patch


 ApplicationMasterLauncher creates a thread pool whose size is 10 to deal with 
 AMLauncherEventType (LAUNCH and CLEANUP).
 In our cluster, there were many NMs with 10+ AMs running on them, and one shut 
 down for some reason. After the RM found the NM LOST, it cleaned up the AMs running 
 on it. Then ApplicationMasterLauncher needed to handle these 10+ CLEANUP events. 
 ApplicationMasterLauncher's thread pool got filled up, and its threads all hung 
 in the call containerMgrProxy.stopContainers(stopRequest) because the NM was 
 down; the default RPC timeout is 15 mins. It means that for 15 mins 
 ApplicationMasterLauncher could not handle new events such as LAUNCH, so new 
 attempts failed to launch because of the timeout.
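
 Based on the file list above (YarnConfiguration.java and yarn-default.xml), the fix 
 appears to make the launcher pool size configurable. A sketch of what such a setting 
 could look like, assuming the property is named 
 {{yarn.resourcemanager.amlauncher.thread-count}} (name and value shown here are an 
 assumption, not verified against the committed patch):
 {code:xml}
 <property>
   <name>yarn.resourcemanager.amlauncher.thread-count</name>
   <!-- hypothetical example: raise the pool size so CLEANUP events for a lost NM
        cannot starve LAUNCH events for the full RPC timeout -->
   <value>50</value>
 </property>
 {code}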



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601467#comment-14601467
 ] 

Hudson commented on YARN-3745:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/])
YARN-3745. SerializedException should also try to instantiate internal 
(devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java
* hadoop-yarn-project/CHANGES.txt


 SerializedException should also try to instantiate internal exception with 
 the default constructor
 --

 Key: YARN-3745
 URL: https://issues.apache.org/jira/browse/YARN-3745
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lavkesh Lahngir
Assignee: Lavkesh Lahngir
 Fix For: 2.8.0

 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, 
 YARN-3745.patch


 While deserialising a SerializedException, it tries to create the internal 
 exception in instantiateException() with cn = 
 cls.getConstructor(String.class).
 If cls does not have a constructor with a String parameter, it throws 
 NoSuchMethodException, 
 for example for the ClosedChannelException class.  
 We should also try to instantiate the exception with the default constructor so that 
 the inner exception can be propagated.
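
 A minimal sketch of the fallback described above (illustrative only; this is not the 
 attached patch, and the helper class name is made up):
 {code}
 import java.lang.reflect.Constructor;

 public final class ExceptionInstantiation {
   // Try the (String) constructor first; fall back to the no-arg constructor.
   static Throwable instantiate(Class<? extends Throwable> cls, String message)
       throws Exception {
     try {
       Constructor<? extends Throwable> cn = cls.getConstructor(String.class);
       return cn.newInstance(message);
     } catch (NoSuchMethodException e) {
       // e.g. java.nio.channels.ClosedChannelException only has a no-arg constructor
       Constructor<? extends Throwable> cn = cls.getConstructor();
       return cn.newInstance();
     }
   }

   public static void main(String[] args) throws Exception {
     System.out.println(
         instantiate(java.nio.channels.ClosedChannelException.class, "ignored"));
   }
 }
 {code}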



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601466#comment-14601466
 ] 

Hudson commented on YARN-3832:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/])
YARN-3832. Resource Localization fails on a cluster due to existing cache 
directories. Contributed by Brahma Reddy Battula (jlowe: rev 
8d58512d6e6d9fe93784a9de2af0056bcc316d96)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* hadoop-yarn-project/CHANGES.txt


 Resource Localization fails on a cluster due to existing cache directories
 --

 Key: YARN-3832
 URL: https://issues.apache.org/jira/browse/YARN-3832
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Ranga Swamy
Assignee: Brahma Reddy Battula
Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3832.patch


  *We have found that resource localization fails on a cluster with the following 
 error.* 
  
 Got this error in the hadoop-2.7.0 release, although it was fixed in 2.6.0 (YARN-2624):
 {noformat}
 Application application_1434703279149_0057 failed 2 times due to AM Container 
 for appattempt_1434703279149_0057_02 exited with exitCode: -1000
 For more detailed output, check application tracking 
 page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then,
  click on links to logs of each attempt.
 Diagnostics: Rename cannot overwrite non empty destination directory 
 /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
 java.io.IOException: Rename cannot overwrite non empty destination directory 
 /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
 at 
 org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735)
 at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244)
 at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
 at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Failing this attempt. Failing the application.
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601465#comment-14601465
 ] 

Hudson commented on YARN-3790:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/])
YARN-3790. usedResource from rootQueue metrics may get stale data for FS 
scheduler after recovering the container (Zhihai Xu via rohithsharmaks) 
(rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* hadoop-yarn-project/CHANGES.txt
Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 
2df00d53d13d16628b6bde5e05133d239f138f52)
* hadoop-yarn-project/CHANGES.txt


 usedResource from rootQueue metrics may get stale data for FS scheduler after 
 recovering the container
 --

 Key: YARN-3790
 URL: https://issues.apache.org/jira/browse/YARN-3790
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler, test
Reporter: Rohith Sharma K S
Assignee: zhihai xu
 Fix For: 2.8.0

 Attachments: YARN-3790.000.patch


 Failure trace is as follows
 {noformat}
 Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
 testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
   Time elapsed: 6.502 sec   FAILURE!
 java.lang.AssertionError: expected:<6144> but was:<8192>
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at org.junit.Assert.assertEquals(Assert.java:555)
   at org.junit.Assert.assertEquals(Assert.java:542)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3360) Add JMX metrics to TimelineDataManager

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601468#comment-14601468
 ] 

Hudson commented on YARN-3360:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/])
YARN-3360. Add JMX metrics to TimelineDataManager (Jason Lowe via jeagles) 
(jeagles: rev 4c659ddbf7629aae92e66a5b54893e9c1c68dfb0)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManagerMetrics.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TestAHSWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestTimelineDataManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java


 Add JMX metrics to TimelineDataManager
 --

 Key: YARN-3360
 URL: https://issues.apache.org/jira/browse/YARN-3360
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Jason Lowe
  Labels: BB2015-05-TBR
 Fix For: 3.0.0, 2.8.0

 Attachments: YARN-3360.001.patch, YARN-3360.002.patch, 
 YARN-3360.003.patch


 The TimelineDataManager currently has no metrics, outside of the standard JVM 
 metrics.  It would be very useful to at least log basic counts of method 
 calls, time spent in those calls, and number of entities/events involved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3724) Use POSIX nftw(3) instead of fts(3)

2015-06-25 Thread Alan Burlison (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Burlison updated YARN-3724:

Summary: Use POSIX nftw(3) instead of fts(3)  (was: Native compilation on 
Solaris fails on Yarn due to use of FTS)

 Use POSIX nftw(3) instead of fts(3)
 ---

 Key: YARN-3724
 URL: https://issues.apache.org/jira/browse/YARN-3724
 Project: Hadoop YARN
  Issue Type: Sub-task
 Environment: Solaris 11.2
Reporter: Malcolm Kavalsky
Assignee: Alan Burlison
   Original Estimate: 24h
  Remaining Estimate: 24h

 Compiling the YARN Node Manager results in "fts not found". On Solaris we 
 have an alternative, ftw, with similar functionality.
 This is isolated to a single file, container-executor.c.
 Note that this will just fix the compilation error. A more serious issue is 
 that Solaris does not support cgroups as Linux does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601445#comment-14601445
 ] 

Hudson commented on YARN-3809:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/])
YARN-3809. Failed to launch new attempts because ApplicationMasterLauncher's 
threads all hang. Contributed by Jun Gong (jlowe: rev 
2a20dd9b61ba3833460cbda0e8c3e8b6366fc3ab)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml


 Failed to launch new attempts because ApplicationMasterLauncher's threads all 
 hang
 --

 Key: YARN-3809
 URL: https://issues.apache.org/jira/browse/YARN-3809
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.7.1

 Attachments: YARN-3809.01.patch, YARN-3809.02.patch, 
 YARN-3809.03.patch


 ApplicationMasterLauncher creates a thread pool whose size is 10 to deal with 
 AMLauncherEventType (LAUNCH and CLEANUP).
 In our cluster, there were many NMs with 10+ AMs running on them, and one shut 
 down for some reason. After the RM found the NM LOST, it cleaned up the AMs running 
 on it. Then ApplicationMasterLauncher needed to handle these 10+ CLEANUP events. 
 ApplicationMasterLauncher's thread pool got filled up, and its threads all hung 
 in the call containerMgrProxy.stopContainers(stopRequest) because the NM was 
 down; the default RPC timeout is 15 mins. It means that for 15 mins 
 ApplicationMasterLauncher could not handle new events such as LAUNCH, so new 
 attempts failed to launch because of the timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601449#comment-14601449
 ] 

Hudson commented on YARN-3745:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/])
YARN-3745. SerializedException should also try to instantiate internal 
(devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java


 SerializedException should also try to instantiate internal exception with 
 the default constructor
 --

 Key: YARN-3745
 URL: https://issues.apache.org/jira/browse/YARN-3745
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lavkesh Lahngir
Assignee: Lavkesh Lahngir
 Fix For: 2.8.0

 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, 
 YARN-3745.patch


 While deserialising a SerializedException, it tries to create the internal 
 exception in instantiateException() with cn = 
 cls.getConstructor(String.class).
 If cls does not have a constructor with a String parameter, it throws 
 NoSuchMethodException, 
 for example for the ClosedChannelException class.  
 We should also try to instantiate the exception with the default constructor so that 
 the inner exception can be propagated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3409) Add constraint node labels

2015-06-25 Thread chong chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601495#comment-14601495
 ] 

chong chen commented on YARN-3409:
--

Any update on this? 

 Add constraint node labels
 --

 Key: YARN-3409
 URL: https://issues.apache.org/jira/browse/YARN-3409
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, capacityscheduler, client
Reporter: Wangda Tan
Assignee: Wangda Tan

 Specifying only one label for each node (in other words, partitioning a cluster) is a way to 
 determine how the resources of a special set of nodes can be shared by a 
 group of entities (like teams, departments, etc.). Partitions of a cluster 
 have the following characteristics:
 - The cluster is divided into several disjoint sub-clusters.
 - ACLs/priority can apply to a partition (only the market team has 
 priority to use the partition).
 - Percentages of capacity can apply to a partition (the market team has 40% 
 minimum capacity and the dev team has 60% minimum capacity of the partition).
 Constraints are orthogonal to partitions; they describe attributes of a 
 node’s hardware/software just for affinity. Some examples of constraints:
 - glibc version
 - JDK version
 - Type of CPU (x86_64/i686)
 - Type of OS (windows, linux, etc.)
 With this, an application can ask for a resource that has (glibc.version = 
 2.20 && JDK.version = 8u20 && x86_64).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3838) Rest API failing when ip configured in RM address in secure https mode

2015-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601565#comment-14601565
 ] 

Hadoop QA commented on YARN-3838:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m  7s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 35s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 54s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 34s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   1m 57s | Tests passed in 
hadoop-yarn-common. |
| | |  40m 19s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741889/0002-YARN-3838.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / bc43390 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8348/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8348/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8348/console |


This message was automatically generated.

 Rest API failing when ip configured in RM address in secure https mode
 --

 Key: YARN-3838
 URL: https://issues.apache.org/jira/browse/YARN-3838
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-HADOOP-12096.patch, 0001-YARN-3810.patch, 
 0001-YARN-3838.patch, 0002-YARN-3810.patch, 0002-YARN-3838.patch


 Steps to reproduce
 ===
 1. Configure hadoop.http.authentication.kerberos.principal as below:
 {code:xml}
   <property>
     <name>hadoop.http.authentication.kerberos.principal</name>
     <value>HTTP/_h...@hadoop.com</value>
   </property>
 {code}
 2. In the RM web address, also configure the IP.
 3. Start up the RM.
 Call the REST API for the RM: {{curl -i -k --insecure --negotiate -u : https://<IP>/ws/v1/cluster/info}}
 *Actual*
 The REST API fails:
 {code}
 2015-06-16 19:03:49,845 DEBUG 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
 Authentication exception: GSSException: No valid credentials provided 
 (Mechanism level: Failed to find any Kerberos credentails)
 org.apache.hadoop.security.authentication.client.AuthenticationException: 
 GSSException: No valid credentials provided (Mechanism level: Failed to find 
 any Kerberos credentails)
   at 
 org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.authenticate(KerberosAuthenticationHandler.java:399)
   at 
 org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:348)
   at 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:519)
   at 
 org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3830) AbstractYarnScheduler.createReleaseCache may try to clean a null attempt

2015-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601538#comment-14601538
 ] 

Hadoop QA commented on YARN-3830:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 58s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 34s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 34s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 45s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 24s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  50m 49s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  88m 36s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741871/YARN-3830_3.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / bc43390 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8347/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8347/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8347/console |


This message was automatically generated.

 AbstractYarnScheduler.createReleaseCache may try to clean a null attempt
 

 Key: YARN-3830
 URL: https://issues.apache.org/jira/browse/YARN-3830
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: nijel
Assignee: nijel
 Attachments: YARN-3830_1.patch, YARN-3830_2.patch, YARN-3830_3.patch


 org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.createReleaseCache()
 {code}
 protected void createReleaseCache() {
 // Cleanup the cache after nm expire interval.
 new Timer().schedule(new TimerTask() {
   @Override
   public void run() {
 for (SchedulerApplicationT app : applications.values()) {
   T attempt = app.getCurrentAppAttempt();
   synchronized (attempt) {
 for (ContainerId containerId : attempt.getPendingRelease()) {
   RMAuditLogger.logFailure(
 {code}
 Here the attempt can be null, since the attempt is created later, so a 
 NullPointerException will be thrown:
 {code}
 2015-06-19 09:29:16,195 | ERROR | Timer-3 | Thread Thread[Timer-3,5,main] 
 threw an Exception. | YarnUncaughtExceptionHandler.java:68
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler$1.run(AbstractYarnScheduler.java:457)
   at java.util.TimerThread.mainLoop(Timer.java:555)
   at java.util.TimerThread.run(Timer.java:505)
 {code}
 This will skip the other applications in this run.
 We can add a null check and continue with the other applications.
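
 A minimal standalone sketch of the null check being suggested (the classes below are 
 toy stand-ins for the scheduler types, not the real code or the attached patch):
 {code}
 import java.util.Map;
 import java.util.Timer;
 import java.util.TimerTask;
 import java.util.concurrent.ConcurrentHashMap;

 public class ReleaseCacheNullCheckSketch {
   static class AppAttempt {}
   static class SchedulerApp {
     volatile AppAttempt currentAttempt;            // may still be null right after submission
     AppAttempt getCurrentAppAttempt() { return currentAttempt; }
   }

   static final Map<String, SchedulerApp> applications = new ConcurrentHashMap<>();

   static Timer createReleaseCache(long nmExpireIntervalMs) {
     Timer timer = new Timer();
     timer.schedule(new TimerTask() {
       @Override
       public void run() {
         for (SchedulerApp app : applications.values()) {
           AppAttempt attempt = app.getCurrentAppAttempt();
           if (attempt == null) {
             // guard: the attempt is created later; skip this app but keep going
             continue;
           }
           synchronized (attempt) {
             // ... clean up pending releases, as in the original code ...
           }
         }
       }
     }, nmExpireIntervalMs);
     return timer;
   }

   public static void main(String[] args) throws InterruptedException {
     applications.put("app_1", new SchedulerApp());  // attempt not created yet
     Timer timer = createReleaseCache(100);          // would NPE without the guard
     Thread.sleep(300);                              // let the task run once
     timer.cancel();
     System.out.println("no NullPointerException; other apps would still be processed");
   }
 }
 {code}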



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.

2015-06-25 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601755#comment-14601755
 ] 

Ming Ma commented on YARN-221:
--

Thanks. [~vinodkv] and others, any additional suggestions for the design?

 NM should provide a way for AM to tell it not to aggregate logs.
 

 Key: YARN-221
 URL: https://issues.apache.org/jira/browse/YARN-221
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager
Reporter: Robert Joseph Evans
Assignee: Ming Ma
 Attachments: YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, 
 YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch


 The NodeManager should provide a way for an AM to tell it that either the 
 logs should not be aggregated, that they should be aggregated with a high 
 priority, or that they should be aggregated but with a lower priority.  The 
 AM should be able to do this in the ContainerLaunch context to provide a 
 default value, but should also be able to update the value when the container 
 is released.
 This would allow the NM to not aggregate logs in some cases, and to avoid 
 connecting to the NN at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1965) Interrupted exception when closing YarnClient

2015-06-25 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601701#comment-14601701
 ] 

Mit Desai commented on YARN-1965:
-

Thanks for the patch [~kshukla]. I will review it shortly.

 Interrupted exception when closing YarnClient
 -

 Key: YARN-1965
 URL: https://issues.apache.org/jira/browse/YARN-1965
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 2.3.0
Reporter: Oleg Zhurakousky
Assignee: Kuhu Shukla
Priority: Minor
  Labels: newbie
 Attachments: YARN-1965-v2.patch, YARN-1965.patch


 It's more of a nuisance than a bug, but nevertheless: 
 {code}
 16:16:48,709 ERROR pool-1-thread-1 ipc.Client:195 - Interrupted while waiting 
 for clientExecutorto stop
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2072)
   at 
 java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468)
   at 
 org.apache.hadoop.ipc.Client$ClientExecutorServiceFactory.unrefAndCleanup(Client.java:191)
   at org.apache.hadoop.ipc.Client.stop(Client.java:1235)
   at org.apache.hadoop.ipc.ClientCache.stopClient(ClientCache.java:100)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.close(ProtobufRpcEngine.java:251)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.close(ApplicationClientProtocolPBClientImpl.java:112)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:621)
   at 
 org.apache.hadoop.io.retry.DefaultFailoverProxyProvider.close(DefaultFailoverProxyProvider.java:57)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.close(RetryInvocationHandler.java:206)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStop(YarnClientImpl.java:124)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
 . . .
 {code}
 It happens sporadically when stopping YarnClient. 
 Looking at the code in Client's 'unrefAndCleanup', it's not immediately obvious 
 why and by whom the interrupt is thrown, but in any event it should not be logged as 
 ERROR. Probably a WARN with no stack trace.
 Also, for consistency and correctness, you may want to interrupt the current 
 thread as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3853) Add docker container runtime support to LinuxContainterExecutor

2015-06-25 Thread Sidharta Seethana (JIRA)
Sidharta Seethana created YARN-3853:
---

 Summary: Add docker container runtime support to 
LinuxContainterExecutor
 Key: YARN-3853
 URL: https://issues.apache.org/jira/browse/YARN-3853
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Sidharta Seethana
Assignee: Sidharta Seethana


Create a new DockerContainerRuntime that implements support for docker 
containers via container-executor. LinuxContainerExecutor should default to its 
current behavior when launching containers but switch to docker when requested. 
Until a first-class ‘container type’ mechanism/API is available on the client 
side, we could potentially implement this via environment variables.
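
A sketch of what the environment-variable approach could look like from the client side. The variable names here are hypothetical and not defined by this JIRA; only the ContainerLaunchContext API usage is standard YARN:

{code}
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

public class DockerRuntimeEnvSketch {
  public static void main(String[] args) {
    Map<String, String> env = new HashMap<>();
    // Hypothetical variable names, used only for illustration.
    env.put("YARN_CONTAINER_RUNTIME_TYPE", "docker");
    env.put("YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "centos:7");

    ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
        null,                          // local resources
        env,                           // environment selecting the runtime
        Arrays.asList("sleep 60"),     // container command
        null, null, null);             // service data, tokens, ACLs
    System.out.println(ctx.getEnvironment());
  }
}
{code}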



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3852) Add docker container support to container-executor

2015-06-25 Thread Sidharta Seethana (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sidharta Seethana updated YARN-3852:

Target Version/s: 2.8.0

 Add docker container support to container-executor 
 ---

 Key: YARN-3852
 URL: https://issues.apache.org/jira/browse/YARN-3852
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: yarn
Reporter: Sidharta Seethana
Assignee: Abin Shahab

 For security reasons, we need to ensure that access to the docker daemon and 
 the ability to run docker containers are restricted to privileged users (i.e. 
 users running applications should not have direct access to docker). In order 
 to ensure the node manager can run docker commands, we need to add docker 
 support to the container-executor binary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-25 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602006#comment-14602006
 ] 

Naganarasimha G R commented on YARN-3644:
-

As long as the refactoring is taken care of in YARN-3847, I don't mind! I will try to 
review the patch as soon as possible.

 Node manager shuts down if unable to connect with RM
 

 Key: YARN-3644
 URL: https://issues.apache.org/jira/browse/YARN-3644
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Srikanth Sundarrajan
Assignee: Raju Bairishetti
 Attachments: YARN-3644.001.patch, YARN-3644.001.patch, 
 YARN-3644.002.patch, YARN-3644.patch


 When NM is unable to connect to RM, NM shuts itself down.
 {code}
   } catch (ConnectException e) {
     // catch and throw the exception if we have tried for the MAX wait time to connect to RM
     dispatcher.getEventHandler().handle(
         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
     throw new YarnRuntimeException(e);
 {code}
 In large clusters, if the RM is down for maintenance for a longer period, all 
 the NMs shut themselves down, requiring additional work to bring the NMs back up.
 Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects: 
 non-connection failures are then retried infinitely by all YarnClients (via 
 RMProxy).
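For illustration only (this is not the attached patch), that side effect could in 
principle be avoided with an exception-dependent retry policy, so that only 
connection failures get a long retry window while other failures keep a short one:
{code}
// Sketch only: an exception-dependent retry policy, not the attached patch.
import java.net.ConnectException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class RmConnectRetrySketch {
  static RetryPolicy buildPolicy() {
    // Non-connection failures: give up after a few attempts, as today.
    RetryPolicy shortRetry =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(5, 30, TimeUnit.SECONDS);
    // Connection refused (e.g. RM down for maintenance): keep trying much longer.
    RetryPolicy longRetry =
        RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
            TimeUnit.HOURS.toMillis(6), TimeUnit.SECONDS.toMillis(30),
            TimeUnit.MILLISECONDS);
    Map<Class<? extends Exception>, RetryPolicy> byException = new HashMap<>();
    byException.put(ConnectException.class, longRetry);
    return RetryPolicies.retryByException(shortRetry, byException);
  }
}
{code}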



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3850) Container logs can be lost if disk is full

2015-06-25 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3850:
---
Attachment: YARN-3850.01.patch

 Container logs can be lost if disk is full
 --

 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker
 Attachments: YARN-3850.01.patch


 *Container logs* can be lost if a disk has gone bad (i.e. become more than 90% full).
 When an application finishes, we upload logs after aggregation by calling 
 {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
 the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in 
 the disk-full case returns nothing, so none of the container logs are 
 aggregated and uploaded.
 On application finish we also call 
 {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the 
 application directory containing the container logs, because it calls 
 {{LocalDirsHandlerService#getLogDirsForCleanup}} and that returns the full 
 disks as well.
 So we are left with neither the aggregated logs for the app nor the individual 
 container logs for the app.
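In other words, the aggregator needs to read from every readable log dir (full ones 
included) while new writes still go only to good dirs. A minimal sketch of that split 
(method names here are assumptions for illustration, not necessarily what the attached 
patch adds):
{code}
// Sketch of the intended behaviour only; names are assumptions, not the patch.
import java.util.ArrayList;
import java.util.List;

public class LogDirSelectionSketch {

  /** Dirs that are healthy and below the utilisation threshold. */
  private final List<String> goodLogDirs = new ArrayList<>();
  /** Dirs that are full (e.g. above 90%) but still readable. */
  private final List<String> fullLogDirs = new ArrayList<>();

  /** Aggregation should see full-but-readable dirs too, otherwise logs on a
   *  full disk are skipped by aggregation and then deleted by the cleanup. */
  List<String> getLogDirsForRead() {
    List<String> dirs = new ArrayList<>(goodLogDirs);
    dirs.addAll(fullLogDirs);
    return dirs;
  }

  /** New writes still go only to good dirs. */
  List<String> getLogDirsForWrite() {
    return new ArrayList<>(goodLogDirs);
  }
}
{code}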



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3855) If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup

2015-06-25 Thread Jian He (JIRA)
Jian He created YARN-3855:
-

 Summary: If acl is enabled and http.authentication.type is simple, 
user cannot view the app page in default setup
 Key: YARN-3855
 URL: https://issues.apache.org/jira/browse/YARN-3855
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He


If all ACLs (admin acl, queue-admin-acls, etc.) are set up properly and 
http.authentication.type is 'simple' in secure mode, the user cannot view the 
application web page in the default setup because the incoming user is always 
considered to be dr.who, and the user cannot pass user.name to indicate the 
incoming user name, because AuthenticationFilterInitializer is not enabled by 
default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3855) If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup

2015-06-25 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-3855:
--
Description: If all ACLs (admin acl, queue-admin-acls etc.) are setup 
properly and http.authentication.type is 'simple' in secure mode , user 
cannot view the application web page in default setup because the incoming user 
is always considered as dr.who . User also cannot pass user.name to 
indicate the incoming user name, because AuthenticationFilterInitializer is not 
enabled by default.  (was: If all ACLs (admin acl, queue-admin-acls etc.) are 
setup properly and http.authentication.type is 'simple' in secure mode , user 
cannot view the application web page in default setup because the incoming user 
is always considered as dr.who and user cannot pass user.name to indicate 
the incoming user name, because AuthenticationFilterInitializer is not enabled 
by default.)

 If acl is enabled and http.authentication.type is simple, user cannot view 
 the app page in default setup
 

 Key: YARN-3855
 URL: https://issues.apache.org/jira/browse/YARN-3855
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He

 If all ACLs (admin acl, queue-admin-acls, etc.) are set up properly and 
 http.authentication.type is 'simple' in secure mode, the user cannot view the 
 application web page in the default setup because the incoming user is always 
 considered to be dr.who. The user also cannot pass user.name to indicate the 
 incoming user name, because AuthenticationFilterInitializer is not enabled by 
 default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3855) If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup

2015-06-25 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-3855:
--
Description: If all ACLs (admin acl, queue-admin-acls etc.) are setup 
properly and http.authentication.type is 'simple' in secure mode , user 
cannot view the application web page in default setup because the incoming user 
is always considered as dr.who . User also cannot pass user.name to 
indicate the incoming user name, because AuthenticationFilterInitializer is not 
enabled by default. This is inconvenient from user's perspective.   (was: If 
all ACLs (admin acl, queue-admin-acls etc.) are setup properly and 
http.authentication.type is 'simple' in secure mode , user cannot view the 
application web page in default setup because the incoming user is always 
considered as dr.who . User also cannot pass user.name to indicate the 
incoming user name, because AuthenticationFilterInitializer is not enabled by 
default.)

 If acl is enabled and http.authentication.type is simple, user cannot view 
 the app page in default setup
 

 Key: YARN-3855
 URL: https://issues.apache.org/jira/browse/YARN-3855
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He

 If all ACLs (admin acl, queue-admin-acls, etc.) are set up properly and 
 http.authentication.type is 'simple' in secure mode, the user cannot view the 
 application web page in the default setup because the incoming user is always 
 considered to be dr.who. The user also cannot pass user.name to indicate the 
 incoming user name, because AuthenticationFilterInitializer is not enabled by 
 default. This is inconvenient from the user's perspective. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3850) Container logs can be lost if disk is full

2015-06-25 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602112#comment-14602112
 ] 

Varun Saxena commented on YARN-3850:


The following also seems to be a problem:
{{RecoveredContainerLaunch#locatePidFile}}

 Container logs can be lost if disk is full
 --

 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker
 Attachments: YARN-3850.01.patch


 *Container logs* can be lost if a disk has gone bad (i.e. become more than 90% full).
 When an application finishes, we upload logs after aggregation by calling 
 {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
 the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in 
 the disk-full case returns nothing, so none of the container logs are 
 aggregated and uploaded.
 On application finish we also call 
 {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the 
 application directory containing the container logs, because it calls 
 {{LocalDirsHandlerService#getLogDirsForCleanup}} and that returns the full 
 disks as well.
 So we are left with neither the aggregated logs for the app nor the individual 
 container logs for the app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3850) Container logs can be lost if disk is full

2015-06-25 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602118#comment-14602118
 ] 

Varun Saxena commented on YARN-3850:


Should I raise a separate JIRA for this, or fix it as part of this one?

 Container logs can be lost if disk is full
 --

 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker
 Attachments: YARN-3850.01.patch


 *Container logs* can be lost if a disk has gone bad (i.e. become more than 90% full).
 When an application finishes, we upload logs after aggregation by calling 
 {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
 the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in 
 the disk-full case returns nothing, so none of the container logs are 
 aggregated and uploaded.
 On application finish we also call 
 {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the 
 application directory containing the container logs, because it calls 
 {{LocalDirsHandlerService#getLogDirsForCleanup}} and that returns the full 
 disks as well.
 So we are left with neither the aggregated logs for the app nor the individual 
 container logs for the app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3851) Add support for container runtimes in YARN

2015-06-25 Thread Sidharta Seethana (JIRA)
Sidharta Seethana created YARN-3851:
---

 Summary: Add support for container runtimes in YARN 
 Key: YARN-3851
 URL: https://issues.apache.org/jira/browse/YARN-3851
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: yarn
Reporter: Sidharta Seethana
Assignee: Sidharta Seethana


We need the ability to support different container types within the same 
executor. Container runtimes are lower-level implementations for supporting 
specific container engines (e.g. docker). These are meant to be independent of 
the executors themselves: a given executor (e.g. LinuxContainerExecutor) could 
potentially switch between different container runtimes depending on what a 
client/application is requesting. An executor continues to provide higher-level 
functionality that could be specific to an operating system; for example, 
LinuxContainerExecutor continues to handle cgroups, users, diagnostic events, 
etc. 
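For illustration, the kind of abstraction being described could look roughly like the 
interface below (names are placeholders, not the eventual YARN API):
{code}
// Illustrative only: names are assumptions, not the eventual YARN API.
public interface ContainerRuntimeSketch {

  /** Prepare anything the runtime needs before launch (e.g. pull/load an image). */
  void prepareContainer(String containerId) throws Exception;

  /** Launch the container via the engine this runtime wraps (plain process, docker, ...). */
  void launchContainer(String containerId, String command) throws Exception;

  /** Deliver a signal (e.g. kill) through the runtime. */
  void signalContainer(String containerId, int signal) throws Exception;
}
{code}
An executor such as LinuxContainerExecutor would keep its OS-level duties (cgroups, 
users, diagnostics) and delegate these calls to whichever runtime the request selects.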



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3851) Add support for container runtimes in YARN

2015-06-25 Thread Sidharta Seethana (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sidharta Seethana updated YARN-3851:

Target Version/s: 2.8.0

 Add support for container runtimes in YARN 
 ---

 Key: YARN-3851
 URL: https://issues.apache.org/jira/browse/YARN-3851
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: yarn
Reporter: Sidharta Seethana
Assignee: Sidharta Seethana

 We need the ability to support different container types within the same 
 executor. Container runtimes are lower-level implementations for supporting 
 specific container engines (e.g. docker). These are meant to be independent of 
 the executors themselves: a given executor (e.g. LinuxContainerExecutor) could 
 potentially switch between different container runtimes depending on what a 
 client/application is requesting. An executor continues to provide higher-level 
 functionality that could be specific to an operating system; for example, 
 LinuxContainerExecutor continues to handle cgroups, users, diagnostic events, 
 etc. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3850) Container logs can be lost if disk is full

2015-06-25 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602004#comment-14602004
 ] 

Jason Lowe commented on YARN-3850:
--

After thinking about this I was wondering if the ShuffleHandler had a similar 
issue, since it too is looking for places to read files.  It looks like it 
might not be affected in the same way, since it doesn't use 
LocalDirsHandlerService and just uses the underlying LocalDirAllocator.  I 
don't think the latter will auto-update the list of bad/good directories, since 
it doesn't appear to update unless something tries to write through it or the 
conf is updated.

I think it could be problematic in that the ShuffleHandler will likely continue 
to search disks that later go bad, or fail to search disks that were bad/full on 
startup and later became good.  If we start persisting bad/full disks across NM 
restart, then it seems likely a map task could deposit shuffle data on a disk 
that the ShuffleHandler will fail to search because of its stale view of the 
disks at startup.  What do you think?  This should be addressed as a separate 
JIRA if it is a problem, but I'm trying to think of other places in the NM where 
we would have a similar bug of only searching good dirs for reading rather than 
also checking the full disks.

 Container logs can be lost if disk is full
 --

 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker

 *Container logs* can be lost if a disk has gone bad (i.e. become more than 90% full).
 When an application finishes, we upload logs after aggregation by calling 
 {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
 the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in 
 the disk-full case returns nothing, so none of the container logs are 
 aggregated and uploaded.
 On application finish we also call 
 {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the 
 application directory containing the container logs, because it calls 
 {{LocalDirsHandlerService#getLogDirsForCleanup}} and that returns the full 
 disks as well.
 So we are left with neither the aggregated logs for the app nor the individual 
 container logs for the app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3855) If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup

2015-06-25 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602040#comment-14602040
 ] 

Jian He commented on YARN-3855:
---

Today, RMAuthenticationFilterInitializer is always added in non-secure mode. 
The proposal is to always add RMAuthenticationFilterInitializer in secure mode 
as well, so that if http.authentication.type is 'simple', the user can pass 
user.name to indicate the incoming user name.

 If acl is enabled and http.authentication.type is simple, user cannot view 
 the app page in default setup
 

 Key: YARN-3855
 URL: https://issues.apache.org/jira/browse/YARN-3855
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He

 If all ACLs (admin acl, queue-admin-acls, etc.) are set up properly and 
 http.authentication.type is 'simple' in secure mode, the user cannot view the 
 application web page in the default setup because the incoming user is always 
 considered to be dr.who, and the user cannot pass user.name to indicate the 
 incoming user name, because AuthenticationFilterInitializer is not enabled by 
 default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3705) forcemanual transitionToStandby in RM-HA automatic-failover mode should change elector state

2015-06-25 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602101#comment-14602101
 ] 

Xuan Gong commented on YARN-3705:
-

[~iwasakims] Thanks for working on this.
Here is one issue with this patch.
If we call resetLeaderElection inside rmadmin.transitionToStandby(), it will 
cause an infinite loop.

Basically: resetLeaderElection -> terminate and recreate the zk client -> rejoin 
the leader election -> transitionToStandby -> resetLeaderElection.

Could you check this, please?

 forcemanual transitionToStandby in RM-HA automatic-failover mode should 
 change elector state
 

 Key: YARN-3705
 URL: https://issues.apache.org/jira/browse/YARN-3705
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
 Attachments: YARN-3705.001.patch, YARN-3705.002.patch, 
 YARN-3705.003.patch, YARN-3705.004.patch, YARN-3705.005.patch


 Executing {{rmadmin -transitionToStandby --forcemanual}} in 
 automatic-failover.enabled mode makes the ResourceManager standby while keeping 
 the state of the ActiveStandbyElector. It should make the elector quit and 
 rejoin, in order to enable other candidates to be promoted; otherwise, 
 forcemanual transition should not be allowed in automatic-failover mode, in 
 order to avoid confusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3850) Container logs can be lost if disk is full

2015-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602115#comment-14602115
 ] 

Hadoop QA commented on YARN-3850:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 27s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 58s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 55s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 21s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 42s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 36s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 15s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   6m 17s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  45m  6s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12741960/YARN-3850.01.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / aa5b15b |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8351/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8351/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8351/console |


This message was automatically generated.

 Container logs can be lost if disk is full
 --

 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker
 Attachments: YARN-3850.01.patch


 *Container logs* can be lost if a disk has gone bad (i.e. become more than 90% full).
 When an application finishes, we upload logs after aggregation by calling 
 {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
 the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in 
 the disk-full case returns nothing, so none of the container logs are 
 aggregated and uploaded.
 On application finish we also call 
 {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the 
 application directory containing the container logs, because it calls 
 {{LocalDirsHandlerService#getLogDirsForCleanup}} and that returns the full 
 disks as well.
 So we are left with neither the aggregated logs for the app nor the individual 
 container logs for the app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3852) Add docker container support to container-executor

2015-06-25 Thread Sidharta Seethana (JIRA)
Sidharta Seethana created YARN-3852:
---

 Summary: Add docker container support to container-executor 
 Key: YARN-3852
 URL: https://issues.apache.org/jira/browse/YARN-3852
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Sidharta Seethana
Assignee: Abin Shahab


For security reasons, we need to ensure that access to the docker daemon and 
the ability to run docker containers is restricted to privileged users (i.e. 
users running applications should not have direct access to docker). To 
ensure the node manager can run docker commands, we need to add docker 
support to the container-executor binary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3854) Add localization support for docker images

2015-06-25 Thread Sidharta Seethana (JIRA)
Sidharta Seethana created YARN-3854:
---

 Summary: Add localization support for docker images
 Key: YARN-3854
 URL: https://issues.apache.org/jira/browse/YARN-3854
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Sidharta Seethana
Assignee: Sidharta Seethana


We need the ability to localize images from HDFS and load them for use when 
launching docker containers. 
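A minimal sketch of that flow, assuming images are shipped as {{docker save}} 
tarballs on HDFS (all paths and names below are placeholders):
{code}
// Sketch only: copy an image tarball from HDFS and load it into the local
// docker daemon. Paths and the tarball convention are assumptions.
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DockerImageLocalizerSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    Path remote = new Path("hdfs:///apps/images/myapp-image.tar"); // placeholder
    File local = new File("/tmp/myapp-image.tar");                 // placeholder

    // Localize the image tarball from HDFS to the node.
    fs.copyToLocalFile(remote, new Path(local.getAbsolutePath()));

    // Load it so it is available when launching docker containers.
    Process p = new ProcessBuilder("docker", "load", "-i", local.getAbsolutePath())
        .inheritIO().start();
    if (p.waitFor() != 0) {
      throw new RuntimeException("docker load failed");
    }
  }
}
{code}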



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3850) Container logs can be lost if disk is full

2015-06-25 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3850:
---
Attachment: YARN-3850.01.patch

 Container logs can be lost if disk is full
 --

 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker
 Attachments: YARN-3850.01.patch


 *Container logs* can be lost if a disk has gone bad (i.e. become more than 90% full).
 When an application finishes, we upload logs after aggregation by calling 
 {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
 the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in 
 the disk-full case returns nothing, so none of the container logs are 
 aggregated and uploaded.
 On application finish we also call 
 {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the 
 application directory containing the container logs, because it calls 
 {{LocalDirsHandlerService#getLogDirsForCleanup}} and that returns the full 
 disks as well.
 So we are left with neither the aggregated logs for the app nor the individual 
 container logs for the app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3850) Container logs can be lost if disk is full

2015-06-25 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3850:
---
Attachment: (was: YARN-3850.01.patch)

 Container logs can be lost if disk is full
 --

 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker

 *Container logs* can be lost if a disk has gone bad (i.e. become more than 90% full).
 When an application finishes, we upload logs after aggregation by calling 
 {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
 the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in 
 the disk-full case returns nothing, so none of the container logs are 
 aggregated and uploaded.
 On application finish we also call 
 {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the 
 application directory containing the container logs, because it calls 
 {{LocalDirsHandlerService#getLogDirsForCleanup}} and that returns the full 
 disks as well.
 So we are left with neither the aggregated logs for the app nor the individual 
 container logs for the app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3850) Container logs can be lost if disk is full

2015-06-25 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602045#comment-14602045
 ] 

Varun Saxena commented on YARN-3850:


Yes, this also looks like a problem. We should not use LocalDirAllocator for 
the ShuffleHandler.
I will look for other areas where a similar problem can happen and will update 
here if I find something.

 Container logs can be lost if disk is full
 --

 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker
 Attachments: YARN-3850.01.patch


 *Container logs* can be lost if a disk has gone bad (i.e. become more than 90% full).
 When an application finishes, we upload logs after aggregation by calling 
 {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
 the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in 
 the disk-full case returns nothing, so none of the container logs are 
 aggregated and uploaded.
 On application finish we also call 
 {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the 
 application directory containing the container logs, because it calls 
 {{LocalDirsHandlerService#getLogDirsForCleanup}} and that returns the full 
 disks as well.
 So we are left with neither the aggregated logs for the app nor the individual 
 container logs for the app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3850) Container logs can be lost if disk is full

2015-06-25 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602044#comment-14602044
 ] 

Varun Saxena commented on YARN-3850:


Yes, this also looks like a problem. We should not use LocalDirAllocator for 
the ShuffleHandler.
I will look for other areas where a similar problem can happen and will update 
here if I find something.

 Container logs can be lost if disk is full
 --

 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker
 Attachments: YARN-3850.01.patch


 *Container logs* can be lost if a disk has gone bad (i.e. become more than 90% full).
 When an application finishes, we upload logs after aggregation by calling 
 {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
 the eligible directories via {{LocalDirsHandlerService#getLogDirs}}, which in 
 the disk-full case returns nothing, so none of the container logs are 
 aggregated and uploaded.
 On application finish we also call 
 {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}, which deletes the 
 application directory containing the container logs, because it calls 
 {{LocalDirsHandlerService#getLogDirsForCleanup}} and that returns the full 
 disks as well.
 So we are left with neither the aggregated logs for the app nor the individual 
 container logs for the app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

