[jira] [Commented] (YARN-3850) NM fails to read files from full disks which can lead to container logs being lost and other issues

2015-06-26 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602466#comment-14602466
 ] 

Varun Saxena commented on YARN-3850:


There seems to be some issue with the whitespace check. The line it shows in the result 
doesn't have any whitespace. The one below it has, but that hasn't been added by me.

 NM fails to read files from full disks which can lead to container logs being 
 lost and other issues
 ---

 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation, nodemanager
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker
 Attachments: YARN-3850.01.patch, YARN-3850.02.patch


 *Container logs* can be lost if the disk has become full (~90% full).
 When an application finishes, we upload logs after aggregation by calling 
 {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
 the eligible directories via a call to 
 {{LocalDirsHandlerService#getLogDirs}}, which in the disk-full case would 
 return nothing. So none of the container logs are aggregated and uploaded.
 But on application finish, we also call 
 {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the 
 application directory which contains the container logs, because it calls 
 {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks 
 as well.
 So we are left with neither the aggregated logs for the app nor the individual 
 container logs for the app.
 In addition to this, there are 2 more issues:
 # {{ContainerLogsUtils#getContainerLogDirs}} does not consider full disks, so 
 the NM will fail to serve up logs from full disks via its web interfaces.
 # {{RecoveredContainerLaunch#locatePidFile}} also does not consider full 
 disks, so it is possible that on container recovery the PID file is not found.
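To make the asymmetry concrete, here is a minimal illustrative sketch (not the actual 
{{LocalDirsHandlerService}} code; the fields shown are hypothetical) of why cleanup sees 
the full disks while log aggregation and the web UI do not:

{code}
import java.util.ArrayList;
import java.util.List;

class DirsHandlerSketch {
  // Directories on disks that are still below the disk-full threshold.
  private final List<String> goodLogDirs = new ArrayList<String>();
  // Directories on disks that crossed the threshold (e.g. ~90% full).
  private final List<String> fullLogDirs = new ArrayList<String>();

  // Used by log aggregation and the web UI: full disks are excluded, so with
  // every disk full this returns an empty list and no logs are read.
  List<String> getLogDirs() {
    return new ArrayList<String>(goodLogDirs);
  }

  // Used by post-aggregation cleanup: full disks are included, so the app
  // directories (and the container logs inside them) are still deleted.
  List<String> getLogDirsForCleanup() {
    List<String> all = new ArrayList<String>(goodLogDirs);
    all.addAll(fullLogDirs);
    return all;
  }
}
{code}

The upload and serving paths skip the logs while the cleanup path still deletes them, which 
is exactly the data loss described above; a read path that, like cleanup, also considers full 
disks would avoid it.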



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3508) Preemption processing occuring on the main RM dispatcher

2015-06-26 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602474#comment-14602474
 ] 

Varun Saxena commented on YARN-3508:


[~leftnoteasy], OK, if that's the consensus, I will do so.

 Preemption processing occuring on the main RM dispatcher
 

 Key: YARN-3508
 URL: https://issues.apache.org/jira/browse/YARN-3508
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Varun Saxena
 Attachments: YARN-3508.002.patch, YARN-3508.01.patch


 We recently saw the RM for a large cluster lag far behind on the 
 AsyncDispatcher event queue.  The AsyncDispatcher thread was consistently 
 blocked on the highly-contended CapacityScheduler lock trying to dispatch 
 preemption-related events for the RMContainerPreemptEventDispatcher.  Preemption 
 processing should occur on the scheduler event dispatcher thread or a 
 separate thread to avoid delaying the processing of other events in the 
 primary dispatcher queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create a new session without occurrence of SESSIONEXPIRED

2015-06-26 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602483#comment-14602483
 ] 

Tsuyoshi Ozawa commented on YARN-3798:
--

Sure. After fixing this, I'd like to release 2.7.2 soon.

 ZKRMStateStore shouldn't create a new session without occurrence of 
 SESSIONEXPIRED
 ---

 Key: YARN-3798
 URL: https://issues.apache.org/jira/browse/YARN-3798
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Blocker
 Attachments: RM.log, YARN-3798-2.7.002.patch, 
 YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch


 RM goes down with a NoNode exception during creation of the znode for an app attempt
 *Please find the exception logs*
 {code}
 2015-06-09 10:09:44,732 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 ZKRMStateStore Session connected
 2015-06-09 10:09:44,732 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 ZKRMStateStore Session restored
 2015-06-09 10:09:44,886 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 Exception while executing a ZK operation.
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
   at java.lang.Thread.run(Thread.java:745)
 2015-06-09 10:09:44,887 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
 out ZK retries. Giving up!
 2015-06-09 10:09:44,887 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
 updating appAttempt: appattempt_1433764310492_7152_01
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
   at 
 

[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create a new session without occurrence of SESSIONEXPIRED

2015-06-26 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602487#comment-14602487
 ] 

Tsuyoshi Ozawa commented on YARN-3798:
--

[~zxu], thanks for your explanation.

Based on your log, when ZKRMStateStore meets SessionMovedException, I think we 
should close the session and fail over to another RM as a workaround, since we 
cannot recover from the exception. If we close and open a new session without 
fencing, the same issue Bibin reported will come up.

I'll create a patch to go into standby mode when ZKRMStateStore meets 
SessionMovedException. Please let me know if I am missing something.

 ZKRMStateStore shouldn't create a new session without occurrence of 
 SESSIONEXPIRED
 ---

 Key: YARN-3798
 URL: https://issues.apache.org/jira/browse/YARN-3798
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Blocker
 Attachments: RM.log, YARN-3798-2.7.002.patch, 
 YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch


 RM goes down with a NoNode exception during creation of the znode for an app attempt
 *Please find the exception logs*
 {code}
 2015-06-09 10:09:44,732 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 ZKRMStateStore Session connected
 2015-06-09 10:09:44,732 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 ZKRMStateStore Session restored
 2015-06-09 10:09:44,886 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 Exception while executing a ZK operation.
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
   at java.lang.Thread.run(Thread.java:745)
 2015-06-09 10:09:44,887 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
 out ZK retries. Giving up!
 2015-06-09 10:09:44,887 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
 updating appAttempt: appattempt_1433764310492_7152_01
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at 

[jira] [Created] (YARN-3856) YARN should allocate the container that is closest to the data

2015-06-26 Thread jaehoon ko (JIRA)
jaehoon ko created YARN-3856:


 Summary: YARN should allocate the container that is closest to the data
 Key: YARN-3856
 URL: https://issues.apache.org/jira/browse/YARN-3856
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Hadoop cluster with multi-level network hierarchy
Reporter: jaehoon ko


Currently, given a Container request for a host, ResourceManager allocates a 
Container with the following priorities (RMContainerAllocator.java):
 - Requested host
 - a host in the same rack as the requested host
 - any host

This can lead to a sub-optimal allocation if the Hadoop cluster is deployed on 
multi-level networked hosts (which is typical). For example, let's suppose a 
network architecture with one core switch, two aggregate switches, four ToR 
switches, and 8 hosts. Each switch has two downlinks. Rack IDs of hosts are as 
follows:
h1, h2: /c/a1/t1
h3, h4: /c/a1/t2
h5, h6: /c/a2/t3
h7, h8: /c/a2/t4

To allocate a container for data on h1, Hadoop first tries h1 itself, then h2, 
then any of h3 ~ h8. Clearly, h3 or h4 is better than h5~h8 in terms of 
network distance and bandwidth. However, the current implementation chooses one 
of h3~h8 with equal probability.

This limitation is even more obvious when considering Hadoop clusters deployed 
on VMs or containers. In this case, only the VMs or containers running on the same 
physical host are considered rack local, and actual rack-local hosts are chosen 
with the same probability as far hosts.

The root cause of this limitation is that RMContainerAllocator.java performs 
exact matching on the rack ID to find a rack-local host. Alternatively, we can 
perform longest-prefix matching to find the closest host. Using the same network 
architecture as above, with longest-prefix matching, hosts are selected with 
the following priorities (see the sketch below):
 h1
 h2
 h3 or h4
 h5 or h6 or h7 or h8
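A minimal sketch of the longest-prefix idea described above, assuming rack IDs are the 
'/'-separated paths from the example; the class and method names are hypothetical 
illustrations, not the existing RMContainerAllocator API:

{code}
import java.util.Comparator;
import java.util.List;

public class RackDistance {

  // Number of leading rack-path components two rack IDs share after the root,
  // e.g. "/c/a1/t1" vs "/c/a1/t2" -> 2, and "/c/a1/t1" vs "/c/a2/t3" -> 1.
  static int commonPrefixLength(String rackA, String rackB) {
    String[] a = rackA.replaceFirst("^/", "").split("/");
    String[] b = rackB.replaceFirst("^/", "").split("/");
    int n = 0;
    while (n < a.length && n < b.length && a[n].equals(b[n])) {
      n++;
    }
    return n;
  }

  // Order candidate racks so that the ones sharing the longest prefix with the
  // requested host's rack come first (i.e. shortest network distance first).
  static void sortByNetworkDistance(List<String> candidateRacks, String requestedRack) {
    candidateRacks.sort(Comparator
        .comparingInt((String r) -> commonPrefixLength(r, requestedRack))
        .reversed());
  }
}
{code}

With this ordering, h3/h4 (shared prefix /c/a1, length 2) would be preferred over h5~h8 
(shared prefix /c, length 1), which reproduces the priority list above.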




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2369) Environment variable handling assumes values should be appended

2015-06-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602500#comment-14602500
 ] 

Hadoop QA commented on YARN-2369:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  20m 12s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 33s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 48s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 25s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   2m 14s | The applied patch generated  1 
new checkstyle issues (total was 176, now 173). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   5m 56s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |  22m  5s | Tests passed in 
hadoop-common. |
| {color:green}+1{color} | mapreduce tests |   0m 46s | Tests passed in 
hadoop-mapreduce-client-common. |
| {color:green}+1{color} | mapreduce tests |   1m 42s | Tests passed in 
hadoop-mapreduce-client-core. |
| {color:green}+1{color} | yarn tests |   1m 56s | Tests passed in 
hadoop-yarn-common. |
| | |  75m 45s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12742041/YARN-2369-6.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 8ef07f7 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8355/artifact/patchprocess/diffcheckstylehadoop-common.txt
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8355/artifact/patchprocess/testrun_hadoop-common.txt
 |
| hadoop-mapreduce-client-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8355/artifact/patchprocess/testrun_hadoop-mapreduce-client-common.txt
 |
| hadoop-mapreduce-client-core test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8355/artifact/patchprocess/testrun_hadoop-mapreduce-client-core.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8355/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8355/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8355/console |


This message was automatically generated.

 Environment variable handling assumes values should be appended
 ---

 Key: YARN-2369
 URL: https://issues.apache.org/jira/browse/YARN-2369
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.2.0
Reporter: Jason Lowe
Assignee: Dustin Cote
 Attachments: YARN-2369-1.patch, YARN-2369-2.patch, YARN-2369-3.patch, 
 YARN-2369-4.patch, YARN-2369-5.patch, YARN-2369-6.patch


 When processing environment variables for a container context, the code 
 assumes that the value should be appended to any pre-existing value in the 
 environment.  This may be desired behavior for handling path-like environment 
 variables such as PATH, LD_LIBRARY_PATH, CLASSPATH, etc., but it is a 
 non-intuitive and harmful way to handle any variable that does not have 
 path-like semantics.
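A minimal sketch of the append behavior being described, using a plain Map-based 
environment; addToEnv is a hypothetical helper for illustration, not the actual YARN 
code path:

{code}
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class EnvAppendDemo {

  // Append semantics: the new value is concatenated onto any existing value.
  // Reasonable for PATH-like variables, surprising for everything else.
  static void addToEnv(Map<String, String> env, String key, String value) {
    String existing = env.get(key);
    env.put(key, existing == null ? value : existing + File.pathSeparator + value);
  }

  public static void main(String[] args) {
    Map<String, String> env = new HashMap<String, String>();
    env.put("JAVA_HOME", "/usr/lib/jvm/default");
    addToEnv(env, "CLASSPATH", "/opt/app/lib/*");  // appending is what you want here
    addToEnv(env, "JAVA_HOME", "/opt/jdk8");       // appending is not what you want here
    System.out.println(env.get("JAVA_HOME"));      // "/usr/lib/jvm/default:/opt/jdk8"
  }
}
{code}

For a variable like JAVA_HOME the intended semantics are almost certainly "replace", 
which is the non-intuitive case the description calls out.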



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3856) YARN should allocate the container that is closest to the data

2015-06-26 Thread jaehoon ko (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jaehoon ko updated YARN-3856:
-
Attachment: YARN-3856.001.patch

This patch changes RMContainerAllocator's behaviour so that longest-prefix 
matching on the rack ID is performed to find a rack-local host.

 YARN should allocate the container that is closest to the data
 -

 Key: YARN-3856
 URL: https://issues.apache.org/jira/browse/YARN-3856
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Hadoop cluster with multi-level network hierarchy
Reporter: jaehoon ko
 Attachments: YARN-3856.001.patch


 Currently, given a Container request for a host, ResourceManager allocates a 
 Container with the following priorities (RMContainerAllocator.java):
  - Requested host
  - a host in the same rack as the requested host
  - any host
 This can lead to a sub-optimal allocation if the Hadoop cluster is deployed on 
 multi-level networked hosts (which is typical). For example, let's suppose a 
 network architecture with one core switch, two aggregate switches, four ToR 
 switches, and 8 hosts. Each switch has two downlinks. Rack IDs of hosts are 
 as follows:
 h1, h2: /c/a1/t1
 h3, h4: /c/a1/t2
 h5, h6: /c/a2/t3
 h7, h8: /c/a2/t4
 To allocate a container for data on h1, Hadoop first tries h1 itself, then 
 h2, then any of h3 ~ h8. Clearly, h3 or h4 is better than h5~h8 in terms of 
 network distance and bandwidth. However, the current implementation chooses 
 one of h3~h8 with equal probability.
 This limitation is even more obvious when considering Hadoop clusters deployed 
 on VMs or containers. In this case, only the VMs or containers running on the 
 same physical host are considered rack local, and actual rack-local hosts are 
 chosen with the same probability as far hosts.
 The root cause of this limitation is that RMContainerAllocator.java performs 
 exact matching on the rack ID to find a rack-local host. Alternatively, we can 
 perform longest-prefix matching to find the closest host. Using the same 
 network architecture as above, with longest-prefix matching, hosts are 
 selected with the following priorities:
  h1
  h2
  h3 or h4
  h5 or h6 or h7 or h8



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3855) If ACLs are enabled and http.authentication.type is simple, the user cannot view the app page in the default setup

2015-06-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602512#comment-14602512
 ] 

Hadoop QA commented on YARN-3855:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 35s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 26s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 34s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 20s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 10s | The applied patch generated  5 
new checkstyle issues (total was 53, now 51). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 11s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests |   5m 44s | Tests passed in 
hadoop-mapreduce-client-hs. |
| {color:green}+1{color} | yarn tests |  50m 34s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  95m 41s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12742043/YARN-3855.2.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 8ef07f7 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8354/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| hadoop-mapreduce-client-hs test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8354/artifact/patchprocess/testrun_hadoop-mapreduce-client-hs.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8354/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8354/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8354/console |


This message was automatically generated.

 If ACLs are enabled and http.authentication.type is simple, the user cannot view 
 the app page in the default setup
 

 Key: YARN-3855
 URL: https://issues.apache.org/jira/browse/YARN-3855
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-3855.1.patch, YARN-3855.2.patch


 If all ACLs (admin ACL, queue-admin ACLs, etc.) are set up properly and 
 http.authentication.type is 'simple' in secure mode, the user cannot view the 
 application web page in the default setup because the incoming user is always 
 considered to be dr.who. The user also cannot pass user.name to indicate the 
 incoming user name, because AuthenticationFilterInitializer is not enabled by 
 default. This is inconvenient from the user's perspective.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3856) YARN should allocate the container that is closest to the data

2015-06-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602550#comment-14602550
 ] 

Hadoop QA commented on YARN-3856:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  16m 54s | Pre-patch trunk has 1 extant 
Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 35s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 36s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 16s | The applied patch generated  
19 new checkstyle issues (total was 0, now 19). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   2m 41s | The patch appears to introduce 1 
new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests |   9m  3s | Tests passed in 
hadoop-mapreduce-client-app. |
| {color:green}+1{color} | yarn tests |   1m 56s | Tests passed in 
hadoop-yarn-common. |
| | |  51m 33s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-common |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12742067/YARN-3856.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 8ef07f7 |
| Pre-patch Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8356/artifact/patchprocess/trunkFindbugsWarningshadoop-mapreduce-client-app.html
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8356/artifact/patchprocess/diffcheckstylehadoop-yarn-common.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8356/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
 |
| hadoop-mapreduce-client-app test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8356/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8356/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8356/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8356/console |


This message was automatically generated.

 YARN should allocate the container that is closest to the data
 -

 Key: YARN-3856
 URL: https://issues.apache.org/jira/browse/YARN-3856
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Hadoop cluster with multi-level network hierarchy
Reporter: jaehoon ko
 Attachments: YARN-3856.001.patch


 Currently, given a Container request for a host, ResourceManager allocates a 
 Container with the following priorities (RMContainerAllocator.java):
  - Requested host
  - a host in the same rack as the requested host
  - any host
 This can lead to a sub-optimal allocation if the Hadoop cluster is deployed on 
 multi-level networked hosts (which is typical). For example, let's suppose a 
 network architecture with one core switch, two aggregate switches, four ToR 
 switches, and 8 hosts. Each switch has two downlinks. Rack IDs of hosts are 
 as follows:
 h1, h2: /c/a1/t1
 h3, h4: /c/a1/t2
 h5, h6: /c/a2/t3
 h7, h8: /c/a2/t4
 To allocate a container for data on h1, Hadoop first tries h1 itself, then 
 h2, then any of h3 ~ h8. Clearly, h3 or h4 is better than h5~h8 in terms of 
 network distance and bandwidth. However, the current implementation chooses 
 one of h3~h8 with equal probability.
 This limitation is even more obvious when considering Hadoop clusters deployed 
 on VMs or containers. In this case, only the VMs or containers running on the 
 same physical host are considered rack local, and actual rack-local hosts are 
 chosen with the same probability as far hosts.
 The root cause of this limitation is that RMContainerAllocator.java performs 
 exact matching on the rack ID to find a rack-local host. 

[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create a new session without occurrence of SESSIONEXPIRED

2015-06-26 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602546#comment-14602546
 ] 

zhihai xu commented on YARN-3798:
-

[~ozawa], thanks for the information.
For SessionMovedException, most likely we can work around it by increasing the 
session timeout. For example, if we increase the session timeout from 10 seconds 
to 30 seconds, the connection timeout will be increased from 3.3 seconds to 10 
seconds, which is calculated by {{connectTimeout = negotiatedSessionTimeout 
/ hostProvider.size();}}. The above SessionMovedException can't happen, because 
the Leader processed the request from the client after 5 seconds, which is less 
than the 10-second timeout.
One question: for SessionExpiredException, we will close and open a new session 
without fencing. Why won't the issue Bibin reported come up for 
SessionExpiredException?
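A quick sketch of the arithmetic above, assuming a 3-node ZooKeeper ensemble (so 
{{hostProvider.size()}} is 3); it just replays the quoted formula rather than calling the 
actual ZooKeeper client code:

{code}
public class ConnectTimeoutMath {
  public static void main(String[] args) {
    int ensembleSize = 3;                        // hostProvider.size()
    int[] sessionTimeoutsMs = {10000, 30000};    // negotiated session timeouts
    for (int sessionTimeoutMs : sessionTimeoutsMs) {
      // connectTimeout = negotiatedSessionTimeout / hostProvider.size()
      int connectTimeoutMs = sessionTimeoutMs / ensembleSize;
      System.out.println(sessionTimeoutMs + " ms session -> "
          + connectTimeoutMs + " ms connect timeout");
    }
    // Prints roughly 3333 ms and 10000 ms, matching the 3.3 s and 10 s above.
  }
}
{code}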

 ZKRMStateStore shouldn't create a new session without occurrence of 
 SESSIONEXPIRED
 ---

 Key: YARN-3798
 URL: https://issues.apache.org/jira/browse/YARN-3798
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Blocker
 Attachments: RM.log, YARN-3798-2.7.002.patch, 
 YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch


 RM goes down with a NoNode exception during creation of the znode for an app attempt
 *Please find the exception logs*
 {code}
 2015-06-09 10:09:44,732 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 ZKRMStateStore Session connected
 2015-06-09 10:09:44,732 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 ZKRMStateStore Session restored
 2015-06-09 10:09:44,886 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 Exception while executing a ZK operation.
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
   at java.lang.Thread.run(Thread.java:745)
 2015-06-09 10:09:44,887 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
 out ZK retries. Giving up!
 2015-06-09 10:09:44,887 ERROR 
 

[jira] [Commented] (YARN-3850) NM fails to read files from full disks which can lead to container logs being lost and other issues

2015-06-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603086#comment-14603086
 ] 

Hudson commented on YARN-3850:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8072 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8072/])
YARN-3850. NM fails to read files from full disks which can lead to container 
logs being lost and other issues. Contributed by Varun Saxena (jlowe: rev 
40b256949ad6f6e0dbdd248f2d257b05899f4332)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/TestContainerLogsPage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/ContainerLogsUtils.java


 NM fails to read files from full disks which can lead to container logs being 
 lost and other issues
 ---

 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation, nodemanager
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker
 Fix For: 2.7.1

 Attachments: YARN-3850.01.patch, YARN-3850.02.patch


 *Container logs* can be lost if the disk has become full (~90% full).
 When an application finishes, we upload logs after aggregation by calling 
 {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
 the eligible directories via a call to 
 {{LocalDirsHandlerService#getLogDirs}}, which in the disk-full case would 
 return nothing. So none of the container logs are aggregated and uploaded.
 But on application finish, we also call 
 {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the 
 application directory which contains the container logs, because it calls 
 {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks 
 as well.
 So we are left with neither the aggregated logs for the app nor the individual 
 container logs for the app.
 In addition to this, there are 2 more issues:
 # {{ContainerLogsUtils#getContainerLogDirs}} does not consider full disks, so 
 the NM will fail to serve up logs from full disks via its web interfaces.
 # {{RecoveredContainerLaunch#locatePidFile}} also does not consider full 
 disks, so it is possible that on container recovery the PID file is not found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3850) NM fails to read files from full disks which can lead to container logs being lost and other issues

2015-06-26 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603068#comment-14603068
 ] 

Jason Lowe commented on YARN-3850:
--

+1 lgtm.  Committing this.

 NM fails to read files from full disks which can lead to container logs being 
 lost and other issues
 ---

 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation, nodemanager
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker
 Attachments: YARN-3850.01.patch, YARN-3850.02.patch


 *Container logs* can be lost if the disk has become full (~90% full).
 When an application finishes, we upload logs after aggregation by calling 
 {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks 
 the eligible directories via a call to 
 {{LocalDirsHandlerService#getLogDirs}}, which in the disk-full case would 
 return nothing. So none of the container logs are aggregated and uploaded.
 But on application finish, we also call 
 {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the 
 application directory which contains the container logs, because it calls 
 {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks 
 as well.
 So we are left with neither the aggregated logs for the app nor the individual 
 container logs for the app.
 In addition to this, there are 2 more issues:
 # {{ContainerLogsUtils#getContainerLogDirs}} does not consider full disks, so 
 the NM will fail to serve up logs from full disks via its web interfaces.
 # {{RecoveredContainerLaunch#locatePidFile}} also does not consider full 
 disks, so it is possible that on container recovery the PID file is not found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603064#comment-14603064
 ] 

Hadoop QA commented on YARN-3644:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  18m 30s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 36s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 34s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 44s | The applied patch generated  1 
new checkstyle issues (total was 211, now 211). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 18s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 25s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   1m 57s | Tests passed in 
hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |   6m 16s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  53m 37s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12742125/YARN-3644.003.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 8ef07f7 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8357/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8357/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8357/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8357/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8357/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8357/console |


This message was automatically generated.

 Node manager shuts down if unable to connect with RM
 

 Key: YARN-3644
 URL: https://issues.apache.org/jira/browse/YARN-3644
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Srikanth Sundarrajan
Assignee: Raju Bairishetti
 Attachments: YARN-3644.001.patch, YARN-3644.001.patch, 
 YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch


 When the NM is unable to connect to the RM, the NM shuts itself down.
 {code}
 } catch (ConnectException e) {
   // catch and throw the exception if tried MAX wait time to connect to RM
   dispatcher.getEventHandler().handle(
       new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
   throw new YarnRuntimeException(e);
 }
 {code}
 In large clusters, if the RM is down for maintenance for a longer period, all the 
 NMs shut themselves down, requiring additional work to bring the NMs back up.
 Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
 effects, where non-connection failures are retried infinitely by all 
 YarnClients (via RMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3409) Add constraint node labels

2015-06-26 Thread Lei Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602942#comment-14602942
 ] 

Lei Guo commented on YARN-3409:
---

[~xinxianyin], topology-related information could be another type of server 
attribute. If we look at YARN-3856, the topology could be more complicated than 
a rack. Node labels may not be a great option when we are facing an environment 
with thousands of nodes.

And for YARN-1042, the complexity is more in the relationship mapping among 
containers, and in how YARN knows the way the AM uses containers, especially 
when we talk about affinity. Node labels may not help in that area.

 Add constraint node labels
 --

 Key: YARN-3409
 URL: https://issues.apache.org/jira/browse/YARN-3409
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, capacityscheduler, client
Reporter: Wangda Tan
Assignee: Wangda Tan

 Specifying only one label for each node (in other words, partitioning a cluster) is 
 a way to determine how the resources of a special set of nodes can be shared by a 
 group of entities (like teams, departments, etc.). Partitions of a cluster 
 have the following characteristics:
 - The cluster is divided into several disjoint sub-clusters.
 - ACL/priority can apply on a partition (e.g. only the market team has 
 priority to use the partition).
 - Percentages of capacity can apply on a partition (the market team has a 40% 
 minimum capacity and the dev team has 60% of the minimum capacity of the partition).
 Constraints are orthogonal to partitions; they describe attributes of a 
 node's hardware/software just for affinity. Some examples of constraints:
 - glibc version
 - JDK version
 - Type of CPU (x86_64/i686)
 - Type of OS (Windows, Linux, etc.)
 With this, an application will be able to ask for resources that have (glibc.version = 
 2.20 && JDK.version = 8u20 && x86_64).
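A minimal sketch of how such a constraint conjunction could be checked against a node's 
attributes; this only illustrates the proposal above using the example attributes from the 
description, and the class and method names are hypothetical rather than an existing 
YARN API:

{code}
import java.util.HashMap;
import java.util.Map;

public class ConstraintMatchSketch {

  // True if the node advertises every attribute/value pair the request asks for,
  // i.e. the conjunction glibc.version = 2.20 && JDK.version = 8u20 && arch = x86_64.
  static boolean matches(Map<String, String> nodeAttributes,
                         Map<String, String> requiredConstraints) {
    for (Map.Entry<String, String> c : requiredConstraints.entrySet()) {
      if (!c.getValue().equals(nodeAttributes.get(c.getKey()))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Map<String, String> node = new HashMap<String, String>();
    node.put("glibc.version", "2.20");
    node.put("JDK.version", "8u20");
    node.put("arch", "x86_64");

    Map<String, String> request = new HashMap<String, String>();
    request.put("glibc.version", "2.20");
    request.put("JDK.version", "8u20");
    request.put("arch", "x86_64");

    System.out.println(matches(node, request)); // true
  }
}
{code}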



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-26 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3644:
---
Attachment: YARN-3644.003.patch

Fixed the test case with the newly added changes in trunk. Overrode the 
unRegisterNodeManager(request) method in the MyResourceTracker8 class.


 Node manager shuts down if unable to connect with RM
 

 Key: YARN-3644
 URL: https://issues.apache.org/jira/browse/YARN-3644
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Srikanth Sundarrajan
Assignee: Raju Bairishetti
 Attachments: YARN-3644.001.patch, YARN-3644.001.patch, 
 YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch


 When the NM is unable to connect to the RM, the NM shuts itself down.
 {code}
 } catch (ConnectException e) {
   // catch and throw the exception if tried MAX wait time to connect to RM
   dispatcher.getEventHandler().handle(
       new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
   throw new YarnRuntimeException(e);
 }
 {code}
 In large clusters, if the RM is down for maintenance for a longer period, all the 
 NMs shut themselves down, requiring additional work to bring the NMs back up.
 Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
 effects, where non-connection failures are retried infinitely by all 
 YarnClients (via RMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues

2015-06-26 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603005#comment-14603005
 ] 

Sunil G commented on YARN-3849:
---

Thank you [~leftnoteasy] and [~ka...@cloudera.com]

[~kasha], we have tested this only in CS, and the issue looks like it is in 
DominantResourceCalculator. I will analyze whether this will happen in Fair.

[~leftnoteasy], I have understood your point. I can explain the scenario 
based on a few key code snippets.
Please feel free to point out any issues in my analysis.

CSQueueUtils#updateUsedCapacity has the below code to calculate 
absoluteUsedCapacity.
{code}
absoluteUsedCapacity =
  Resources.divide(rc, totalPartitionResource, usedResource,
  totalPartitionResource); 
{code}

This results in a call to DominantResourceCalculator:
{code}
public float divide(Resource clusterResource,
    Resource numerator, Resource denominator) {
  return getResourceAsValue(clusterResource, numerator, true) /
      getResourceAsValue(clusterResource, denominator, true);
}
{code}

In our cluster, the resource allocation is as follows:
usedResource: 10 GB, 95 cores
totalPartitionResource: 100 GB, 100 cores

Since we use dominance, absoluteUsedCapacity will come close to 1 even though 
memory is only 10% used.


In ProportionalCapacityPreemptionPolicy, we use the below:
{code}
float absUsed = qc.getAbsoluteUsedCapacity(partitionToLookAt);
Resource current = Resources.multiply(partitionResource, absUsed);
{code} 

So *current - guaranteed* will give us toBePreempted, which will be close to 
50 GB, 45 cores. Actually, here the memory should have been 5 GB.
Now in our cluster, each container is 1 GB, 10 cores.
Hence the *cores* component will be 0 after 5 container kills, but toBePreempted 
will still have memory left.
And as mentioned in the above comment, preemption will continue to kill other 
containers too.
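To make the arithmetic above concrete, here is a small sketch of the dominant-share 
calculation using the numbers from this cluster; it only replays the divide/multiply steps 
quoted in the comment with an assumed 0.5 guaranteed share, not the real Resources or 
DominantResourceCalculator classes:

{code}
public class DominantShareSketch {
  public static void main(String[] args) {
    double usedMemGb = 10, usedCores = 95;
    double totalMemGb = 100, totalCores = 100;

    // DominantResourceCalculator#divide: ratio of the dominant resource.
    double absUsed = Math.max(usedMemGb / totalMemGb, usedCores / totalCores); // 0.95

    // ProportionalCapacityPreemptionPolicy: current = partitionResource * absUsed.
    double currentMemGb = totalMemGb * absUsed;  // 95 GB, although only 10 GB is used
    double currentCores = totalCores * absUsed;  // 95 cores

    // With a 0.5 guaranteed capacity for the queue:
    double toPreemptMemGb = currentMemGb - 0.5 * totalMemGb;  // 45 GB of "excess" memory
    double toPreemptCores = currentCores - 0.5 * totalCores;  // 45 cores of excess

    System.out.printf("absUsed=%.2f, toPreempt=%.0f GB / %.0f cores%n",
        absUsed, toPreemptMemGb, toPreemptCores);
    // The memory term is inflated by the dominant share, so preemption keeps
    // killing containers for "memory" the queue never actually used.
  }
}
{code}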

 Too much preemption activity causing continuous killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical

 Two queues are used. Each queue has been given a capacity of 0.5. The Dominant 
 Resource policy is used.
 1. An app is submitted in QueueA, which consumes the full cluster capacity.
 2. After submitting an app in QueueB, there is some demand, and preemption is 
 invoked in QueueA.
 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed 
 that all containers other than the AM are getting killed in QueueA.
 4. Now the app in QueueB tries to take over the cluster with the current free 
 space. But there is some updated demand from the app in QueueA, which lost 
 its containers earlier, and preemption is kicked off in QueueB now.
 The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the 
 apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues

2015-06-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603373#comment-14603373
 ] 

Wangda Tan commented on YARN-3849:
--

Makes sense. Please try to run the test with/without the change. And if you have 
time, could you add a test for node partition preemption as well?

Thanks,
Wangda

 Too much preemption activity causing continuous killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical

 Two queues are used. Each queue has been given a capacity of 0.5. The Dominant 
 Resource policy is used.
 1. An app is submitted in QueueA, which consumes the full cluster capacity.
 2. After submitting an app in QueueB, there is some demand, and preemption is 
 invoked in QueueA.
 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed 
 that all containers other than the AM are getting killed in QueueA.
 4. Now the app in QueueB tries to take over the cluster with the current free 
 space. But there is some updated demand from the app in QueueA, which lost 
 its containers earlier, and preemption is kicked off in QueueB now.
 The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the 
 apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues

2015-06-26 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603389#comment-14603389
 ] 

Sunil G commented on YARN-3849:
---

Yes, [~leftnoteasy] and [~rohithsharma]. Thank you for the updates.

It seems we cannot give CPU to the tests as of now. We can update that by 
changing buildPolicy. In the meantime, once this is handled, I will add a case 
for node partition too.

 Too much preemption activity causing continuous killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical

 Two queues are used. Each queue has been given a capacity of 0.5. The Dominant 
 Resource policy is used.
 1. An app is submitted in QueueA, which consumes the full cluster capacity.
 2. After submitting an app in QueueB, there is some demand, and preemption is 
 invoked in QueueA.
 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed 
 that all containers other than the AM are getting killed in QueueA.
 4. Now the app in QueueB tries to take over the cluster with the current free 
 space. But there is some updated demand from the app in QueueA, which lost 
 its containers earlier, and preemption is kicked off in QueueB now.
 The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the 
 apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues

2015-06-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603349#comment-14603349
 ] 

Wangda Tan commented on YARN-3849:
--

I understand now; this is a bad issue when DRF is enabled.

Thanks for the explanation from [~sunilg] and [~rohithsharma]. Let me take a look 
at how to solve this issue.

 Too much preemption activity causing continuous killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical

 Two queues are used. Each queue has been given a capacity of 0.5. The Dominant 
 Resource policy is used.
 1. An app is submitted in QueueA, which consumes the full cluster capacity.
 2. After submitting an app in QueueB, there is some demand, and preemption is 
 invoked in QueueA.
 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed 
 that all containers other than the AM are getting killed in QueueA.
 4. Now the app in QueueB tries to take over the cluster with the current free 
 space. But there is some updated demand from the app in QueueA, which lost 
 its containers earlier, and preemption is kicked off in QueueB now.
 The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the 
 apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues

2015-06-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603381#comment-14603381
 ] 

Wangda Tan commented on YARN-3849:
--

Good suggestion, [~rohithsharma], but the more urgent issue to solve now 
is that we currently cannot specify CPU in the tests. I think we can file a separate 
ticket for the parameterized test class.

 Too much of preemption activity causing continuos killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical

 Two queues are used. Each queue has given a capacity of 0.5. Dominant 
 Resource policy is used.
 1. An app is submitted in QueueA which is consuming full cluster capacity
 2. After submitting an app in QueueB, there are some demand  and invoking 
 preemption in QueueA
 3. Instead of killing the excess of 0.5 guaranteed capacity, we observed that 
 all containers other than AM is getting killed in QueueA
 4. Now the app in QueueB is trying to take over cluster with the current free 
 space. But there are some updated demand from the app in QueueA which lost 
 its containers earlier, and preemption is kicked in QueueB now.
 Scenario in step 3 and 4 continuously happening in loop. Thus none of the 
 apps are completing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-06-26 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603382#comment-14603382
 ] 

Rohith Sharma K S commented on YARN-3849:
-

I mean for TestProportionalPreemptionPolicy.

 Too much of preemption activity causing continuos killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical

 Two queues are used. Each queue has given a capacity of 0.5. Dominant 
 Resource policy is used.
 1. An app is submitted in QueueA which is consuming full cluster capacity
 2. After submitting an app in QueueB, there are some demand  and invoking 
 preemption in QueueA
 3. Instead of killing the excess of 0.5 guaranteed capacity, we observed that 
 all containers other than AM is getting killed in QueueA
 4. Now the app in QueueB is trying to take over cluster with the current free 
 space. But there are some updated demand from the app in QueueA which lost 
 its containers earlier, and preemption is kicked in QueueB now.
 Scenario in step 3 and 4 continuously happening in loop. Thus none of the 
 apps are completing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-06-26 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603308#comment-14603308
 ] 

Rohith Sharma K S commented on YARN-3849:
-

Below is the log trace for the issue.

In our cluster there are 3 NodeManagers, each with resource {{memory:327680, vCores:35}}. 
Total cluster resource is {{clusterResource: memory:983040, vCores:105}}, and 
CapacityScheduler is configured with the queues *default* and *QueueA*.


 # Application app-1 is submitted to queue default and runs with 10 containers, 
each with {{resource: memory:1024, vCores:10}}, so the total used is 
{{usedResources=memory:10240, vCores:91}}
{noformat}
default user=spark used=memory:10240, vCores:91 numContainers=10 headroom = 
memory:1024, vCores:10 user-resources=memory:10240, vCores:91
Re-sorting assigned queue: root.default stats: default: capacity=0.5, 
absoluteCapacity=0.5, usedResources=memory:10240, vCores:91, 
usedCapacity=1.733, absoluteUsedCapacity=0.867, numApps=1, 
numContainers=10
{noformat}
*NOTE: Resource allocation is CPU-dominant.*
After the 10 containers are running, the available NodeManager resources are:
{noformat}
linux-174, available: memory:323584, vCores:4
linux-175, available: memory:324608, vCores:5
linux-223, available: memory:324608, vCores:5
{noformat}
# Application app-2 is submitted to QueueA. Its ApplicationMaster container starts 
running and the NodeManager availability becomes {{available: memory:322560, vCores:3}}
 {noformat}
Assigned container container_1435072598099_0002_01_01 of capacity 
memory:1024, vCores:1 on host linux-174:26009, which has 5 containers, 
memory:5120, vCores:32 used and memory:322560, vCores:3 available after 
allocation | SchedulerNode.java:154
linux-174, available: memory:322560, vCores:3
{noformat}
# The preemption policy then does the following calculation:
{noformat}
2015-06-23 23:20:51,127 NAME: QueueA CUR: memory:0, vCores:0 PEN: memory:0, 
vCores:0 GAR: memory:491520, vCores:52 NORM: NaN IDEAL_ASSIGNED: memory:0, 
vCores:0 IDEAL_PREEMPT: memory:0, vCores:0 ACTUAL_PREEMPT: memory:0, 
vCores:0 UNTOUCHABLE: memory:0, vCores:0 PREEMPTABLE: memory:0, vCores:0
2015-06-23 23:20:51,128 NAME: default CUR: memory:851968, vCores:91 PEN: 
memory:0, vCores:0 GAR: memory:491520, vCores:52 NORM: 1.0 IDEAL_ASSIGNED: 
memory:851968, vCores:91 IDEAL_PREEMPT: memory:0, vCores:0 ACTUAL_PREEMPT: 
memory:0, vCores:0 UNTOUCHABLE: memory:0, vCores:0 PREEMPTABLE: 
memory:360448, vCores:39
{noformat}
In the above log, observe that for the queue default *CUR is memory:851968, 
vCores:91*, while the actual usage is *usedResources=memory:10240, vCores:91*. 
Only CPU matches, not memory. CUR is calculated with the following formulas:
#* CUR = {{clusterResource: memory:983040, vCores:105}} * 
{{absoluteUsedCapacity(0.8)}} = {{memory:851968, vCores:91}}
#* GAR = {{clusterResource: memory:983040, vCores:105}} * 
{{absoluteCapacity(0.5)}} = {{memory:491520, vCores:52}}
#* PREEMPTABLE = CUR - GAR = {{memory:360448, vCores:39}}
# App-2 requests containers with {{resource: memory:1024, vCores:10}}. 
The preemption cycle then computes how much needs to be preempted:
{noformat}
2015-06-23 23:21:03,131 | DEBUG | SchedulingMonitor 
(ProportionalCapacityPreemptionPolicy) | 1435072863131:  NAME: default CUR: 
memory:851968, vCores:91 PEN: memory:0, vCores:0 GAR: memory:491520, 
vCores:52 NORM: NaN IDEAL_ASSIGNED: memory:491520, vCores:52 IDEAL_PREEMPT: 
memory:97043, vCores:10 ACTUAL_PREEMPT: memory:0, vCores:0 UNTOUCHABLE: 
memory:0, vCores:0 PREEMPTABLE: memory:360448, vCores:39
{noformat}
Observe *IDEAL_PREEMPT: memory:97043, vCores:10*: app-2 in QueueA needs only 
10 vCores to be preempted, yet 97043 MB of memory is also marked for preemption 
even though memory is amply available. Below are the calculations that produce IDEAL_PREEMPT:
#* totalPreemptionAllowed = clusterResource: memory:983040, vCores:105 * 0.1 
= memory:98304, vCores:10.5
#* totPreemptionNeeded = CUR - IDEAL_ASSIGNED = CUR: memory:851968, vCores:91
#* scalingFactor = Resources.divide(drc, memory:491520, vCores:52, 
memory:98304, vCores:10.5, memory:851968, vCores:91); scalingFactor = 0.114285715
#* toBePreempted = CUR: memory:851968, vCores:91 * 
scalingFactor(0.1139045128455529) = memory:97368, vCores:10
{{resource-to-obtain = memory:97043, vCores:10}}
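As a standalone illustration of the arithmetic above (a sketch in plain Java, not the actual Hadoop classes; the numbers are taken from the log):
{code}
// Standalone illustration of why deriving the queue's current usage from a
// single capacity fraction inflates the non-dominant resource when the
// DominantResourceCalculator is in use.
public class PreemptionFractionDemo {
  public static void main(String[] args) {
    long clusterMem = 983040, clusterVcores = 105;  // cluster resource from the log
    long usedMem = 10240, usedVcores = 91;          // actual usage of queue "default"

    // DRC reports the dominant share: max(memory share, vcore share).
    double memShare = (double) usedMem / clusterMem;          // ~0.0104
    double vcoreShare = (double) usedVcores / clusterVcores;  // ~0.8667
    double absUsedCapacity = Math.max(memShare, vcoreShare);

    // What the policy does today: scale *both* dimensions by that one fraction.
    long curMem = Math.round(clusterMem * absUsedCapacity);       // 851968
    long curVcores = Math.round(clusterVcores * absUsedCapacity); // 91

    System.out.printf("fraction-based CUR   = memory:%d, vCores:%d%n", curMem, curVcores);
    System.out.printf("actual usedResources = memory:%d, vCores:%d%n", usedMem, usedVcores);
    // Memory is overstated roughly 83x, which is what drives the large IDEAL_PREEMPT.
  }
}
{code}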

*So the problem is in one of the below steps:*
# As [~sunilg] said, usedResources=memory:10240, vCores:91, but the preemption 
policy wrongly computes the current used capacity as {{memory:851968, vCores:91}}. 
This is mainly because the policy uses the absoluteCapacity fraction to calculate 
current usage, which always gives a wrong result for one of the resources when the 
DominantResourceCalculator is used. The fraction should not be used with DRC 
(multi-dimensional resources); instead we should take usedResources directly from CSQueue.
# Even bypassing 

[jira] [Updated] (YARN-2005) Blacklisting support for scheduling AMs

2015-06-26 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-2005:

Attachment: YARN-2005.002.patch

Addressed the test failure. An unmanaged AM also executes the AMLaunched 
transition, which was causing the allocate call that removes the AM blacklist. 
Changed it so it does not execute for unmanaged AMs.

 Blacklisting support for scheduling AMs
 ---

 Key: YARN-2005
 URL: https://issues.apache.org/jira/browse/YARN-2005
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Anubhav Dhoot
 Attachments: YARN-2005.001.patch, YARN-2005.002.patch


 It would be nice if the RM supported blacklisting a node for an AM launch 
 after the same node fails a configurable number of AM attempts.  This would 
 be similar to the blacklisting support for scheduling task attempts in the 
 MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-06-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603357#comment-14603357
 ] 

Wangda Tan commented on YARN-3849:
--

I think the correct fix should be: 

Instead of using absUsed to compute current, we should use 
getQueueResourceUsage().getUsed(...) to get the current usage. Adding some tests 
as well should be enough.

{code}
  QueueCapacities qc = curQueue.getQueueCapacities();
  float absUsed = qc.getAbsoluteUsedCapacity(partitionToLookAt);
  float absCap = qc.getAbsoluteCapacity(partitionToLookAt);
  float absMaxCap = qc.getAbsoluteMaximumCapacity(partitionToLookAt);
  boolean preemptionDisabled = curQueue.getPreemptionDisabled();

  Resource current = Resources.multiply(partitionResource, absUsed);
  Resource guaranteed = Resources.multiply(partitionResource, absCap);
  Resource maxCapacity = Resources.multiply(partitionResource, absMaxCap);
{code}
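A rough sketch of that change (hypothetical; it follows the snippet above and the final patch may differ):
{code}
// Take "current" from the queue's tracked usage rather than scaling the
// partition resource by the absolute used-capacity fraction.
Resource current = Resources.clone(
    curQueue.getQueueResourceUsage().getUsed(partitionToLookAt));
Resource guaranteed = Resources.multiply(partitionResource, absCap);
Resource maxCapacity = Resources.multiply(partitionResource, absMaxCap);
{code}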

[~sunilg], do you want to take a shot at this?

 Too much of preemption activity causing continuos killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical

 Two queues are used. Each queue has given a capacity of 0.5. Dominant 
 Resource policy is used.
 1. An app is submitted in QueueA which is consuming full cluster capacity
 2. After submitting an app in QueueB, there are some demand  and invoking 
 preemption in QueueA
 3. Instead of killing the excess of 0.5 guaranteed capacity, we observed that 
 all containers other than AM is getting killed in QueueA
 4. Now the app in QueueB is trying to take over cluster with the current free 
 space. But there are some updated demand from the app in QueueA which lost 
 its containers earlier, and preemption is kicked in QueueB now.
 Scenario in step 3 and 4 continuously happening in loop. Thus none of the 
 apps are completing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3409) Add constraint node labels

2015-06-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603279#comment-14603279
 ] 

Wangda Tan commented on YARN-3409:
--

Thanks for the comments, [~xinxianyin], [~grey].

Actually, you're not the first person who wants to make node labels a uniform 
solution to the problem of locality / affinity / blacklisting, etc.; [~curino] and 
[~vinodkv] have suggested this as well. Personally I think this is a good 
direction, otherwise we will have separate implementations / APIs for each of 
them, which is not clean enough. We're also looking at possibilities to put 
them together in the design doc; hopefully it will not take too much time.



 Add constraint node labels
 --

 Key: YARN-3409
 URL: https://issues.apache.org/jira/browse/YARN-3409
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, capacityscheduler, client
Reporter: Wangda Tan
Assignee: Wangda Tan

 Specify only one label for each node (IAW, partition a cluster) is a way to 
 determinate how resources of a special set of nodes could be shared by a 
 group of entities (like teams, departments, etc.). Partitions of a cluster 
 has following characteristics:
 - Cluster divided to several disjoint sub clusters.
 - ACL/priority can apply on partition (Only market team / marke team has 
 priority to use the partition).
 - Percentage of capacities can apply on partition (Market team has 40% 
 minimum capacity and Dev team has 60% of minimum capacity of the partition).
 Constraints are orthogonal to partition, they’re describing attributes of 
 node’s hardware/software just for affinity. Some example of constraints:
 - glibc version
 - JDK version
 - Type of CPU (x86_64/i686)
 - Type of OS (windows, linux, etc.)
 With this, application can be able to ask for resource has (glibc.version = 
 2.20  JDK.version = 8u20  x86_64).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-06-26 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603362#comment-14603362
 ] 

Sunil G commented on YARN-3849:
---

Thank you [~leftnoteasy] for the pointer.

Yes. It looks to me like the root cause of the issue is the usage of the 
absoluteCapacity fraction in the proportional preemption policy. And we could 
try to directly use the real usage there, as you mentioned.

I will add some tests and post a patch. :)


 Too much of preemption activity causing continuos killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical

 Two queues are used. Each queue has given a capacity of 0.5. Dominant 
 Resource policy is used.
 1. An app is submitted in QueueA which is consuming full cluster capacity
 2. After submitting an app in QueueB, there are some demand  and invoking 
 preemption in QueueA
 3. Instead of killing the excess of 0.5 guaranteed capacity, we observed that 
 all containers other than AM is getting killed in QueueA
 4. Now the app in QueueB is trying to take over cluster with the current free 
 space. But there are some updated demand from the app in QueueA which lost 
 its containers earlier, and preemption is kicked in QueueB now.
 Scenario in step 3 and 4 continuously happening in loop. Thus none of the 
 apps are completing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-26 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603363#comment-14603363
 ] 

Raju Bairishetti commented on YARN-3644:


It seems the checkstyle error was not introduced as part of this patch; the file 
already had more than 2,000 lines :) .
*Checkstyle error:* YarnConfiguration.java:1: File length is 2,036 lines (max 
allowed is 2,000).

 Node manager shuts down if unable to connect with RM
 

 Key: YARN-3644
 URL: https://issues.apache.org/jira/browse/YARN-3644
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Srikanth Sundarrajan
Assignee: Raju Bairishetti
 Attachments: YARN-3644.001.patch, YARN-3644.001.patch, 
 YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch


 When NM is unable to connect to RM, NM shuts itself down.
 {code}
   } catch (ConnectException e) {
 //catch and throw the exception if tried MAX wait time to connect 
 RM
 dispatcher.getEventHandler().handle(
 new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
 throw new YarnRuntimeException(e);
 {code}
 In large clusters, if RM is down for maintenance for longer period, all the 
 NMs shuts themselves down, requiring additional work to bring up the NMs.
 Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side 
 effects, where non connection failures are being retried infinitely by all 
 YarnClients (via RMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-06-26 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603375#comment-14603375
 ] 

Rohith Sharma K S commented on YARN-3849:
-

For the test, how about using a parameterized test class which runs with both 
defaultRC and dominantRC?
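A hypothetical sketch of that idea (class name, fields, and wiring are illustrative; the real test class may be structured differently):
{code}
import java.util.Arrays;
import java.util.Collection;

import org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class TestProportionalCapacityPreemptionPolicyForCalculators {

  // Run every test once with the default and once with the dominant calculator.
  @Parameterized.Parameters
  public static Collection<Object[]> calculators() {
    return Arrays.asList(new Object[][] {
        { new DefaultResourceCalculator() },
        { new DominantResourceCalculator() }
    });
  }

  private final ResourceCalculator rc;

  public TestProportionalCapacityPreemptionPolicyForCalculators(ResourceCalculator rc) {
    this.rc = rc;
  }

  // ... existing preemption tests would build the policy and cluster mocks with `rc`
}
{code}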

 Too much of preemption activity causing continuos killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical

 Two queues are used. Each queue has given a capacity of 0.5. Dominant 
 Resource policy is used.
 1. An app is submitted in QueueA which is consuming full cluster capacity
 2. After submitting an app in QueueB, there are some demand  and invoking 
 preemption in QueueA
 3. Instead of killing the excess of 0.5 guaranteed capacity, we observed that 
 all containers other than AM is getting killed in QueueA
 4. Now the app in QueueB is trying to take over cluster with the current free 
 space. But there are some updated demand from the app in QueueA which lost 
 its containers earlier, and preemption is kicked in QueueB now.
 Scenario in step 3 and 4 continuously happening in loop. Thus none of the 
 apps are completing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-06-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603396#comment-14603396
 ] 

Wangda Tan commented on YARN-3849:
--

Make sense [~sunilg].

 Too much of preemption activity causing continuos killing of containers 
 across queues
 -

 Key: YARN-3849
 URL: https://issues.apache.org/jira/browse/YARN-3849
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Critical

 Two queues are used. Each queue has given a capacity of 0.5. Dominant 
 Resource policy is used.
 1. An app is submitted in QueueA which is consuming full cluster capacity
 2. After submitting an app in QueueB, there are some demand  and invoking 
 preemption in QueueA
 3. Instead of killing the excess of 0.5 guaranteed capacity, we observed that 
 all containers other than AM is getting killed in QueueA
 4. Now the app in QueueB is trying to take over cluster with the current free 
 space. But there are some updated demand from the app in QueueA which lost 
 its containers earlier, and preemption is kicked in QueueB now.
 Scenario in step 3 and 4 continuously happening in loop. Thus none of the 
 apps are completing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode

2015-06-26 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603303#comment-14603303
 ] 

zhihai xu commented on YARN-3857:
-

Hi [~mujunchao], thanks for reporting and working on this issue.
It is a nice catch, and I see why this is a critical issue. In a non-secure cluster, 
the more jobs complete, the more entries with null values are left in 
{{ClientToAMTokenSecretManagerInRM#masterKeys}}. Your patch makes sense to me: 
since we only call {{unRegisterApplication}} in secure mode, we should also 
call {{registerApplication}} only in secure mode to match {{unRegisterApplication}}.
Could you add a test case to your patch? You can do something similar to 
{{TestRMAppAttemptTransitions#testGetClientToken}} for non-secure mode.
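For illustration, the guard could look roughly like this (a sketch only, not the actual patch; names such as {{rmContext}} and {{clientTokenMasterKey}} are assumed from context):
{code}
// Only track the ClientToAMToken master key when security is enabled, so the
// registerApplication call stays symmetric with unRegisterApplication.
if (UserGroupInformation.isSecurityEnabled()) {
  rmContext.getClientToAMTokenSecretManager()
      .registerApplication(applicationAttemptId, clientTokenMasterKey);
}
{code}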

 Memory leak in ResourceManager with SIMPLE mode
 ---

 Key: YARN-3857
 URL: https://issues.apache.org/jira/browse/YARN-3857
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: mujunchao
Priority: Critical
 Attachments: hadoop-yarn-server-resourcemanager.patch


  We register the ClientTokenMasterKey to avoid client may hold an invalid 
 ClientToken after RM restarts. In SIMPLE mode, we register 
 PairApplicationAttemptId, null ,  But we never remove it from HashMap, as 
 unregister only runing while in Security mode, so memory leak coming. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-26 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603309#comment-14603309
 ] 

zhihai xu commented on YARN-2871:
-

Hi [~xgong], the latest patch passed the Jenkins test. Could you review it? Thanks.

 TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
 -

 Key: YARN-2871
 URL: https://issues.apache.org/jira/browse/YARN-2871
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: zhihai xu
Priority: Minor
 Attachments: YARN-2871.000.patch, YARN-2871.001.patch


 From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
 {code}
 Failed tests:
   TestRMRestart.testRMRestartGetApplicationList:957
 rMAppManager.logApplicationSummary(
 isA(org.apache.hadoop.yarn.api.records.ApplicationId)
 );
 Wanted 3 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
 But was 2 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.

2015-06-26 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3695:
---
Attachment: YARN-3695.patch

 ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
 --

 Key: YARN-3695
 URL: https://issues.apache.org/jira/browse/YARN-3695
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Assignee: Raju Bairishetti
 Attachments: YARN-3695.patch


 YARN-3646 fix the retry forever policy in RMProxy that it only applies on 
 limited exceptions rather than all exceptions. Here, we may need the same fix 
 for ServerProxy (NMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3858) Distributed shell app master becomes unresponsive sometimes even when there hasn't been an AM restart

2015-06-26 Thread Alok Lal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603614#comment-14603614
 ] 

Alok Lal commented on YARN-3858:


As can be seen from the log, the distributed shell app did not finish even 
though all containers had finished successfully.

 Distributed shell app master becomes unresponsive sometimes even when there 
 hasn't been an AM restart
 -

 Key: YARN-3858
 URL: https://issues.apache.org/jira/browse/YARN-3858
 Project: Hadoop YARN
  Issue Type: Bug
 Environment: secure CentOS 6
Reporter: Alok Lal
 Attachments: yarn-yarn-resourcemanager-c7-jun24-10.log


 Attached is the resource manager log.  This was on a 10-node cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3858) Distributed shell app master becomes unresponsive sometimes even when there hasn't been an AM restart

2015-06-26 Thread Alok Lal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alok Lal updated YARN-3858:
---
Attachment: yarn-yarn-resourcemanager-c7-jun24-10.log

 Distributed shell app master becomes unresponsive sometimes even when there 
 hasn't been an AM restart
 -

 Key: YARN-3858
 URL: https://issues.apache.org/jira/browse/YARN-3858
 Project: Hadoop YARN
  Issue Type: Bug
 Environment: secure CentOS 6
Reporter: Alok Lal
 Attachments: yarn-yarn-resourcemanager-c7-jun24-10.log


 Attached is the resource manager log.  This was on a 10-node cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3858) Distributed shell app master becomes unresponsive sometimes even when there hasn't been an AM restart

2015-06-26 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-3858:

Assignee: Varun Vasudev

 Distributed shell app master becomes unresponsive sometimes even when there 
 hasn't been an AM restart
 -

 Key: YARN-3858
 URL: https://issues.apache.org/jira/browse/YARN-3858
 Project: Hadoop YARN
  Issue Type: Bug
 Environment: secure CentOS 6
Reporter: Alok Lal
Assignee: Varun Vasudev
 Attachments: yarn-yarn-resourcemanager-c7-jun24-10.log


 Attached is the resource manager log.  This was on a 10-node cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs

2015-06-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603541#comment-14603541
 ] 

Hadoop QA commented on YARN-2005:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  17m 20s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 3 new or modified test files. |
| {color:green}+1{color} | javac |   7m 39s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 38s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 30s | The applied patch generated  1 
new checkstyle issues (total was 211, now 211). |
| {color:green}+1{color} | whitespace |   0m  5s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 35s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   3m 49s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | tools/hadoop tests |   0m 52s | Tests passed in 
hadoop-sls. |
| {color:green}+1{color} | yarn tests |   0m 22s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |  51m  4s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  95m  8s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12742183/YARN-2005.002.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 60b858b |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8358/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-sls test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8358/artifact/patchprocess/testrun_hadoop-sls.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8358/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8358/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8358/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8358/console |


This message was automatically generated.

 Blacklisting support for scheduling AMs
 ---

 Key: YARN-2005
 URL: https://issues.apache.org/jira/browse/YARN-2005
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Anubhav Dhoot
 Attachments: YARN-2005.001.patch, YARN-2005.002.patch


 It would be nice if the RM supported blacklisting a node for an AM launch 
 after the same node fails a configurable number of AM attempts.  This would 
 be similar to the blacklisting support for scheduling task attempts in the 
 MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1965) Interrupted exception when closing YarnClient

2015-06-26 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603103#comment-14603103
 ] 

Mit Desai commented on YARN-1965:
-

Overall the patch looks good. A few minor nits:
* There should be a space between () and { here
{{public static final ExecutorService getClientExecutor(){}}
* In testStandAloneClient(), we need spaces near the brackets. Change 
{{}finally{}} to {{} finally {}}
* In testConnectionIdleTimeouts(), we need space near the brackets. Change 
{{}finally{}} to {{} finally {}}
* testInterrupted needs to be indented.
* In doErrorTest and testRTEDuringConnectionSetup, stopping the client before the 
server makes more sense; swap the stop calls in the finally block (see the sketch below).
* In testSocketFactoryException and testIpcConnectTimeout, {{client.stop()}} should 
be within the finally block.
* Is there a need to move {{Client client = new Client(LongWritable.class, 
conf, spyFactory);}} in testRTEDuringConnectionSetup?
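For the stop-ordering nits, a minimal sketch of the intended pattern (names as used in the review above; the test bodies are elided):
{code}
try {
  // ... test body exercising the client/server ...
} finally {
  client.stop();   // stop the client first
  server.stop();   // then stop the server
}
{code}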

 Interrupted exception when closing YarnClient
 -

 Key: YARN-1965
 URL: https://issues.apache.org/jira/browse/YARN-1965
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 2.3.0
Reporter: Oleg Zhurakousky
Assignee: Kuhu Shukla
Priority: Minor
  Labels: newbie
 Attachments: YARN-1965-v2.patch, YARN-1965.patch


 Its more of a nuisance then a bug, but nevertheless 
 {code}
 16:16:48,709 ERROR pool-1-thread-1 ipc.Client:195 - Interrupted while waiting 
 for clientExecutorto stop
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2072)
   at 
 java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468)
   at 
 org.apache.hadoop.ipc.Client$ClientExecutorServiceFactory.unrefAndCleanup(Client.java:191)
   at org.apache.hadoop.ipc.Client.stop(Client.java:1235)
   at org.apache.hadoop.ipc.ClientCache.stopClient(ClientCache.java:100)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.close(ProtobufRpcEngine.java:251)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.close(ApplicationClientProtocolPBClientImpl.java:112)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:621)
   at 
 org.apache.hadoop.io.retry.DefaultFailoverProxyProvider.close(DefaultFailoverProxyProvider.java:57)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.close(RetryInvocationHandler.java:206)
   at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStop(YarnClientImpl.java:124)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
 . . .
 {code}
 It happens sporadically when stopping YarnClient. 
 Looking at the code in Client's 'unrefAndCleanup' its not immediately obvious 
 why and who throws the interrupt but in any event it should not be logged as 
 ERROR. Probably a WARN with no stack trace.
 Also, for consistency and correctness you may want to Interrupt current 
 thread as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.

2015-06-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603555#comment-14603555
 ] 

Hadoop QA commented on YARN-3695:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  20m 33s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   8m 32s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 51s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 14s | The applied patch generated  1 
new checkstyle issues (total was 3, now 4). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 40s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 47s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   1m 58s | Tests passed in 
hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |   6m 18s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  54m 11s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12742186/YARN-3695.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / aa07dea |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8359/artifact/patchprocess/diffcheckstylehadoop-yarn-common.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8359/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8359/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8359/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8359/console |


This message was automatically generated.

 ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
 --

 Key: YARN-3695
 URL: https://issues.apache.org/jira/browse/YARN-3695
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Assignee: Raju Bairishetti
 Attachments: YARN-3695.patch


 YARN-3646 fix the retry forever policy in RMProxy that it only applies on 
 limited exceptions rather than all exceptions. Here, we may need the same fix 
 for ServerProxy (NMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler

2015-06-26 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603603#comment-14603603
 ] 

Eric Payne commented on YARN-2004:
--

Thanks, [~sunilg], for this fix.

- {{SchedulerApplicationAttempt.java}}:
{code}
  if (!getApplicationPriority().equals(
  ((SchedulerApplicationAttempt) other).getApplicationPriority())) {
return getApplicationPriority().compareTo(
((SchedulerApplicationAttempt) other).getApplicationPriority());
  }
{code}
-- Can {{getApplicationPriority}} return null? I see that 
{{SchedulerApplicationAttempt}} initializes {{appPriority}} to null.

- {{CapacityScheduler.java}}:
{code}
  if (!a1.getApplicationPriority().equals(a2.getApplicationPriority())) {
return a1.getApplicationPriority().compareTo(
a2.getApplicationPriority());
  }
{code}
-- Same question here about {{getApplicationPriority}} returning null (a null-safe sketch follows below).
-- Also, can {{updateApplicationPriority}} call 
{{authenticateApplicationPriority}}? Seems like duplicate code to me.
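If {{appPriority}} can indeed still be null at comparison time, a null-safe guard could look roughly like this (illustrative only, not the actual patch):
{code}
// Fall back to the existing ordering when either attempt has no priority yet.
Priority p1 = a1.getApplicationPriority();
Priority p2 = a2.getApplicationPriority();
if (p1 != null && p2 != null && !p1.equals(p2)) {
  return p1.compareTo(p2);
}
// otherwise fall through to the existing ApplicationId / timestamp comparison
{code}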


 Priority scheduling support in Capacity scheduler
 -

 Key: YARN-2004
 URL: https://issues.apache.org/jira/browse/YARN-2004
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Sunil G
Assignee: Sunil G
 Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, 
 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, 
 0006-YARN-2004.patch, 0007-YARN-2004.patch


 Based on the priority of the application, Capacity Scheduler should be able 
 to give preference to application while doing scheduling.
 ComparatorFiCaSchedulerApp applicationComparator can be changed as below.   
 
 1.Check for Application priority. If priority is available, then return 
 the highest priority job.
 2.Otherwise continue with existing logic such as App ID comparison and 
 then TimeStamp comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3858) Distributed shell app master becomes unresponsive sometimes even when there hasn't been an AM restart

2015-06-26 Thread Alok Lal (JIRA)
Alok Lal created YARN-3858:
--

 Summary: Distributed shell app master becomes unresponsive 
sometimes even when there hasn't been an AM restart
 Key: YARN-3858
 URL: https://issues.apache.org/jira/browse/YARN-3858
 Project: Hadoop YARN
  Issue Type: Bug
 Environment: secure CentOS 6
Reporter: Alok Lal


Attached is the resource manager log.  This was on a 10-node cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]

2015-06-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603425#comment-14603425
 ] 

Wangda Tan commented on YARN-2003:
--

Hi [~sunilg],
Thanks for updating, some comments:

1) API for YARNScheduler:
- updateApplicationPriority: we don't need to pass user/queueName, since the scheduler 
should already know them. authenticate is different, since the scheduler may not have 
the application information at that time.
- It may be better to throw YarnException instead of IOException.
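A hypothetical signature sketch for this point (the final API in the patch may well differ):
{code}
// Scheduler resolves user/queue itself; throws YarnException rather than IOException.
void updateApplicationPriority(Priority newPriority, ApplicationId applicationId)
    throws YarnException;
{code}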

2) RMAppManager:
- Is this check necessary: {{rmContext.getScheduler() != null}}? If it is only for 
test cases, I think it's better to fix the tests.

 Support to process Job priority from Submission Context in 
 AppAttemptAddedSchedulerEvent [RM side]
 --

 Key: YARN-2003
 URL: https://issues.apache.org/jira/browse/YARN-2003
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Sunil G
Assignee: Sunil G
  Labels: BB2015-05-TBR
 Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 
 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 
 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 
 0008-YARN-2003.patch, 0009-YARN-2003.patch, 0011-YARN-2003.patch, 
 0012-YARN-2003.patch


 AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from 
 Submission Context and store.
 Later this can be used by Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-26 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603484#comment-14603484
 ] 

Masatake Iwasaki commented on YARN-2871:


Thanks for the investigation, [~zxu]. I found that 
[org.mockito.Mockito.timeout|http://docs.mockito.googlecode.com/hg/1.8.5/org/mockito/Mockito.html#timeout(int)]
 is used in some other tests using Mockito. It could be used here too.
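For example, the verification from the reported failure could be wrapped with a timeout roughly like this (a sketch; the exact mock, expected count, and wait time come from the test):
{code}
// Wait up to 5s for the third logApplicationSummary call instead of failing immediately.
verify(rMAppManager, timeout(5000).times(3)).logApplicationSummary(
    isA(ApplicationId.class));
{code}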

 TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
 -

 Key: YARN-2871
 URL: https://issues.apache.org/jira/browse/YARN-2871
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: zhihai xu
Priority: Minor
 Attachments: YARN-2871.000.patch, YARN-2871.001.patch


 From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
 {code}
 Failed tests:
   TestRMRestart.testRMRestartGetApplicationList:957
 rMAppManager.logApplicationSummary(
 isA(org.apache.hadoop.yarn.api.records.ApplicationId)
 );
 Wanted 3 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
 But was 2 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler

2015-06-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603516#comment-14603516
 ] 

Wangda Tan commented on YARN-2004:
--

Thanks for updating, [~sunilg].

A quick comment before posting others, I think most of the code to check/update 
application priority can be reused by other schedulers. [~kasha], could you 
take a quick look at this patch to see if it is also needed for Fair Scheduler?

 Priority scheduling support in Capacity scheduler
 -

 Key: YARN-2004
 URL: https://issues.apache.org/jira/browse/YARN-2004
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Sunil G
Assignee: Sunil G
 Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, 
 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, 
 0006-YARN-2004.patch, 0007-YARN-2004.patch


 Based on the priority of the application, Capacity Scheduler should be able 
 to give preference to application while doing scheduling.
 ComparatorFiCaSchedulerApp applicationComparator can be changed as below.   
 
 1.Check for Application priority. If priority is available, then return 
 the highest priority job.
 2.Otherwise continue with existing logic such as App ID comparison and 
 then TimeStamp comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-26 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2871:

Attachment: YARN-2871.002.patch

 TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
 -

 Key: YARN-2871
 URL: https://issues.apache.org/jira/browse/YARN-2871
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: zhihai xu
Priority: Minor
 Attachments: YARN-2871.000.patch, YARN-2871.001.patch, 
 YARN-2871.002.patch


 From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
 {code}
 Failed tests:
   TestRMRestart.testRMRestartGetApplicationList:957
 rMAppManager.logApplicationSummary(
 isA(org.apache.hadoop.yarn.api.records.ApplicationId)
 );
 Wanted 3 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
 But was 2 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-26 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603538#comment-14603538
 ] 

zhihai xu commented on YARN-2871:
-

[~iwasakims], thanks for the suggestion, it should work. I uploaded a new patch 
YARN-2871.002.patch based on your suggestion. Please review it.

 TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
 -

 Key: YARN-2871
 URL: https://issues.apache.org/jira/browse/YARN-2871
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: zhihai xu
Priority: Minor
 Attachments: YARN-2871.000.patch, YARN-2871.001.patch, 
 YARN-2871.002.patch


 From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
 {code}
 Failed tests:
   TestRMRestart.testRMRestartGetApplicationList:957
 rMAppManager.logApplicationSummary(
 isA(org.apache.hadoop.yarn.api.records.ApplicationId)
 );
 Wanted 3 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
 But was 2 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS

2015-06-26 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602627#comment-14602627
 ] 

Naganarasimha G R commented on YARN-3045:
-

Thanks for reviewing, [~djp] and [~sjlee0]; sorry for the late response, I was 
a little held up.
Thanks for confirming the consolidation, [~djp]; I will try to get that done in the next 
patch.
bq. if need separated event queue later to make sure container metrics boom
I have already created an async dispatcher for timeline publishing; if required, we can 
create another dispatcher for container metrics only. Is this what you meant?

bq. For corner case that NM publisher delay too long time (queue is busy) to 
publish event, it still get chance to fail (very low chance should be 
acceptable here).
OK, I will leave the lifecycle management of the app collector out of this JIRA. Maybe 
we can handle it (including the multiple-attempt case mentioned by [~sangjin]) in 
another JIRA.

bq. APPLICATION_CREATED_EVENT might be seeing the race condition
Yes, there seems to be another race condition, but this time it is not between the 
source and the test but within the source itself.
{quote}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:276)
{quote}
I had seen this only once earlier and was not able to get the logs; now I can 
analyze this further.

bq. I'm a bit puzzled by the hashCode override; is it necessary? 
My mistake, I think it is residual code from the initial version, which I may have 
added while trying out the multi-async dispatcher where events of one app need to go 
to one handler. It is not required any more; I will remove it.

I will take care of the other comments from [~sjlee0] and will try to provide the patch at 
the earliest.

 [Event producers] Implement NM writing container lifecycle events to ATS
 

 Key: YARN-3045
 URL: https://issues.apache.org/jira/browse/YARN-3045
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Naganarasimha G R
 Attachments: YARN-3045-YARN-2928.002.patch, 
 YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, 
 YARN-3045.20150420-1.patch


 Per design in YARN-2928, implement NM writing container lifecycle events and 
 container system metrics to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode

2015-06-26 Thread mujunchao (JIRA)
mujunchao created YARN-3857:
---

 Summary: Memory leak in ResourceManager with SIMPLE mode
 Key: YARN-3857
 URL: https://issues.apache.org/jira/browse/YARN-3857
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: mujunchao
Priority: Critical


 We register the ClientTokenMasterKey to prevent a client from holding an invalid 
ClientToken after the RM restarts. In SIMPLE mode we register a 
Pair<ApplicationAttemptId, null>, but we never remove it from the HashMap, because 
unregister only runs in secure mode, so a memory leak results. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode

2015-06-26 Thread mujunchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mujunchao updated YARN-3857:

Remaining Estimate: (was: 24h)
 Original Estimate: (was: 24h)

 Memory leak in ResourceManager with SIMPLE mode
 ---

 Key: YARN-3857
 URL: https://issues.apache.org/jira/browse/YARN-3857
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: mujunchao
Priority: Critical

  We register the ClientTokenMasterKey to avoid client may hold an invalid 
 ClientToken after RM restarts. In SIMPLE mode, we register 
 PairApplicationAttemptId, null ,  But we never remove it from HashMap, as 
 unregister only runing while in Security mode, so memory leak coming. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3409) Add constraint node labels

2015-06-26 Thread Xianyin Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603869#comment-14603869
 ] 

Xianyin Xin commented on YARN-3409:
---

Thanks for the comments, [~grey]. IMO, topology may be hard to model with node 
labels, as node labels describe the attributes of a node while topology is an 
attribute of the whole cluster. You remind me that YARN-1042 may not be as 
simple as I thought.

Looking forward to your design doc, [~leftnoteasy].

 Add constraint node labels
 --

 Key: YARN-3409
 URL: https://issues.apache.org/jira/browse/YARN-3409
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, capacityscheduler, client
Reporter: Wangda Tan
Assignee: Wangda Tan

 Specify only one label for each node (IAW, partition a cluster) is a way to 
 determinate how resources of a special set of nodes could be shared by a 
 group of entities (like teams, departments, etc.). Partitions of a cluster 
 has following characteristics:
 - Cluster divided to several disjoint sub clusters.
 - ACL/priority can apply on partition (Only market team / marke team has 
 priority to use the partition).
 - Percentage of capacities can apply on partition (Market team has 40% 
 minimum capacity and Dev team has 60% of minimum capacity of the partition).
 Constraints are orthogonal to partition, they’re describing attributes of 
 node’s hardware/software just for affinity. Some example of constraints:
 - glibc version
 - JDK version
 - Type of CPU (x86_64/i686)
 - Type of OS (windows, linux, etc.)
 With this, application can be able to ask for resource has (glibc.version = 
 2.20  JDK.version = 8u20  x86_64).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1449) Protocol changes in NM side to support change container resource

2015-06-26 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603802#comment-14603802
 ] 

MENG DING commented on YARN-1449:
-

The patch is way too big for review. I will split it into several JIRAs. 

 Protocol changes in NM side to support change container resource
 

 Key: YARN-1449
 URL: https://issues.apache.org/jira/browse/YARN-1449
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Wangda Tan (No longer used)
Assignee: MENG DING
 Attachments: YARN-1449.1.patch, YARN-1449.2.patch, yarn-1449.1.patch, 
 yarn-1449.3.patch, yarn-1449.4.patch, yarn-1449.5.patch


 As described in YARN-1197, we need add API/implementation changes,
 1) Add a changeContainersResources method in ContainerManagementProtocol
 2) Can get succeed/failed increased/decreased containers in response of 
 changeContainersResources
 3) Add a new decreased containers field in NodeStatus which can help NM 
 notify RM such changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603679#comment-14603679
 ] 

Hadoop QA commented on YARN-2871:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |   6m 43s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 48s | There were no new javac warning 
messages. |
| {color:green}+1{color} | release audit |   0m 18s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 46s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 24s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  51m  2s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  70m 10s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12742218/YARN-2871.002.patch |
| Optional Tests | javac unit findbugs checkstyle |
| git revision | trunk / aa07dea |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8360/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8360/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8360/console |


This message was automatically generated.

 TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
 -

 Key: YARN-2871
 URL: https://issues.apache.org/jira/browse/YARN-2871
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: zhihai xu
Priority: Minor
 Attachments: YARN-2871.000.patch, YARN-2871.001.patch, 
 YARN-2871.002.patch


 From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
 {code}
 Failed tests:
   TestRMRestart.testRMRestartGetApplicationList:957
 rMAppManager.logApplicationSummary(
 isA(org.apache.hadoop.yarn.api.records.ApplicationId)
 );
 Wanted 3 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
 But was 2 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.

2015-06-26 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603710#comment-14603710
 ] 

Jian He commented on YARN-3695:
---

looks good, +1

 ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
 --

 Key: YARN-3695
 URL: https://issues.apache.org/jira/browse/YARN-3695
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Assignee: Raju Bairishetti
 Attachments: YARN-3695.patch


 YARN-3646 fixed the retry-forever policy in RMProxy so that it only applies to 
 a limited set of exceptions rather than all exceptions. We may need the same 
 fix for ServerProxy (NMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1449) Protocol changes in NM side to support change container resource

2015-06-26 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1449:

Attachment: YARN-1449.2.patch

Attaching an updated patch for review, which includes:

* All protocol changes (AM-RM, AM-NM, NM-RM) as described in the design doc 
(see YARN-1197). 
* ContainerManager logic
* NodeStatusUpdater logic
* NodeManager recovery logic
* New and updated unit test cases

The ContainersMonitor logic is covered in YARN-1643, and a patch will be posted 
for review early next week.

 Protocol changes in NM side to support change container resource
 

 Key: YARN-1449
 URL: https://issues.apache.org/jira/browse/YARN-1449
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Wangda Tan (No longer used)
Assignee: MENG DING
 Attachments: YARN-1449.1.patch, YARN-1449.2.patch, yarn-1449.1.patch, 
 yarn-1449.3.patch, yarn-1449.4.patch, yarn-1449.5.patch


 As described in YARN-1197, we need to add the following API/implementation changes:
 1) Add a changeContainersResources method to ContainerManagementProtocol
 2) Return the succeeded/failed increased/decreased containers in the response 
 of changeContainersResources
 3) Add a new decreased-containers field to NodeStatus so that the NM can 
 notify the RM of such changes
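
For illustration only, here is a minimal sketch of the three changes above. The request/response and interface names below are placeholders of mine, not the actual YARN-1449 types; only the changeContainersResources method name and the idea of reporting succeeded/failed containers come from the description.

{code}
import java.util.List;

// Hypothetical shapes that mirror the three points above; these are not the
// real YARN-1449 classes, just an illustration of the protocol additions.
interface ChangeContainersResourcesRequest { }

interface ChangeContainersResourcesResponse {
  List<String> getSucceededContainers();  // containers whose resources changed
  List<String> getFailedContainers();     // containers the NM could not change
}

// 1)/2) A new AM-NM protocol method whose response reports per-container results.
interface ContainerResourceChangeSketch {
  ChangeContainersResourcesResponse changeContainersResources(
      ChangeContainersResourcesRequest request) throws Exception;
}

// 3) NodeStatus gains a decreased-containers field so the NM can notify the RM.
interface NodeStatusSketch {
  List<String> getDecreasedContainers();
}
{code}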



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs

2015-06-26 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603767#comment-14603767
 ] 

Anubhav Dhoot commented on YARN-2005:
-

The checkstyle error is unavoidable (preexisting).
[~jlowe] [~sunilg], this follows the discussion here and is ready for your 
review. [~jianhe] [~kasha], I would appreciate your review as well.

 Blacklisting support for scheduling AMs
 ---

 Key: YARN-2005
 URL: https://issues.apache.org/jira/browse/YARN-2005
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Anubhav Dhoot
 Attachments: YARN-2005.001.patch, YARN-2005.002.patch


 It would be nice if the RM supported blacklisting a node for an AM launch 
 after the same node fails a configurable number of AM attempts.  This would 
 be similar to the blacklisting support for scheduling task attempts in the 
 MapReduce AM but for scheduling AM attempts on the RM side.
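
Purely as an illustration of the idea (not the attached patches), a node-local failure counter with a configurable threshold could look like the sketch below; the class name and threshold handling are assumptions of mine.

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified sketch of AM-launch blacklisting; not the YARN-2005 patch.
class AmLaunchBlacklistSketch {
  private final int maxAmFailuresPerNode;               // configurable threshold
  private final Map<String, Integer> amFailures = new HashMap<>();
  private final Set<String> blacklistedNodes = new HashSet<>();

  AmLaunchBlacklistSketch(int maxAmFailuresPerNode) {
    this.maxAmFailuresPerNode = maxAmFailuresPerNode;
  }

  // Called when an AM attempt fails on a node.
  void onAmAttemptFailed(String nodeId) {
    int failures = amFailures.merge(nodeId, 1, Integer::sum);
    if (failures >= maxAmFailuresPerNode) {
      blacklistedNodes.add(nodeId);   // skip this node for future AM launches
    }
  }

  boolean isBlacklistedForAm(String nodeId) {
    return blacklistedNodes.contains(nodeId);
  }
}
{code}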



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3705) forcemanual transitionToStandby in RM-HA automatic-failover mode should change elector state

2015-06-26 Thread Masatake Iwasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated YARN-3705:
---
Attachment: YARN-3705.006.patch

bq. If we call resetLeaderElection inside the rmadmin.transitionToStandby(), it 
will cause an infinite loop.

You are right. I need to make sure that resetLeaderElection is not called when 
EmbeddedElectorService#becomeStandby calls transitionToStandby. Thanks for the 
good catch, [~xgong].

I attached 006. I verified with a patched jar that starting RM-HA manually does 
not cause the loop, but it is difficult to cover this case in a unit test.
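
For what it's worth, one way to express the guard is sketched below. This is a rough illustration only, not the 006 patch; the class and the flag are mine. The idea is to mark transitions that originate from the elector and skip the election reset for them.

{code}
// Rough sketch of the guard described above; not the actual RM/elector code.
class StandbyTransitionSketch {
  private final ThreadLocal<Boolean> fromElector =
      ThreadLocal.withInitial(() -> Boolean.FALSE);

  void becomeStandby() {                // called by the embedded elector
    fromElector.set(Boolean.TRUE);
    try {
      transitionToStandby();
    } finally {
      fromElector.set(Boolean.FALSE);
    }
  }

  void transitionToStandby() {          // also reachable via rmadmin --forcemanual
    // ... switch RM services to standby ...
    if (!fromElector.get()) {
      resetLeaderElection();            // quit and rejoin so another RM can go active
    }
  }

  private void resetLeaderElection() {
    // quit the election and rejoin; omitted in this sketch
  }
}
{code}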

 forcemanual transitionToStandby in RM-HA automatic-failover mode should 
 change elector state
 

 Key: YARN-3705
 URL: https://issues.apache.org/jira/browse/YARN-3705
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
 Attachments: YARN-3705.001.patch, YARN-3705.002.patch, 
 YARN-3705.003.patch, YARN-3705.004.patch, YARN-3705.005.patch, 
 YARN-3705.006.patch


 Executing {{rmadmin -transitionToStandby --forcemanual}} in 
 automatic-failover.enabled mode makes the ResourceManager standby while keeping 
 the state of ActiveStandbyElector. It should make the elector quit and rejoin 
 so that other candidates can be promoted; otherwise, forcemanual transition 
 should not be allowed in automatic-failover mode, in order to avoid 
 confusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3850) NM fails to read files from full disks which can lead to container logs being lost and other issues

2015-06-26 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603728#comment-14603728
 ] 

Varun Saxena commented on YARN-3850:


Thanks for the review and the commit, [~jlowe].

 NM fails to read files from full disks which can lead to container logs being 
 lost and other issues
 ---

 Key: YARN-3850
 URL: https://issues.apache.org/jira/browse/YARN-3850
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation, nodemanager
Affects Versions: 2.7.0
Reporter: Varun Saxena
Assignee: Varun Saxena
Priority: Blocker
 Fix For: 2.7.1

 Attachments: YARN-3850.01.patch, YARN-3850.02.patch


 *Container logs* can be lost if disk has become full(~90% full).
 When application finishes, we upload logs after aggregation by calling 
 {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turns 
 checks the eligible directories on call to 
 {{LocalDirsHandlerService#getLogDirs}} which in case of disk full would 
 return nothing. So none of the container logs are aggregated and uploaded.
 But on application finish, we also call 
 {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the 
 application directory which contains container logs. This is because it calls 
 {{LocalDirsHandlerService#getLogDirsForCleanup}} which returns the full disks 
 as well.
 So we are left with neither aggregated logs for the app nor the individual 
 container logs for the app.
 In addition to this, there are 2 more issues :
 # {{ContainerLogsUtil#getContainerLogDirs}} does not consider full disks so 
 NM will fail to serve up logs from full disks from its web interfaces.
 # {{RecoveredContainerLaunch#locatePidFile}} also does not consider full 
 disks so it is possible that on container recovery, PID file is not found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-26 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603727#comment-14603727
 ] 

Masatake Iwasaki commented on YARN-2871:


I'm +1(non-binding) on this.

 TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
 -

 Key: YARN-2871
 URL: https://issues.apache.org/jira/browse/YARN-2871
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: zhihai xu
Priority: Minor
 Attachments: YARN-2871.000.patch, YARN-2871.001.patch, 
 YARN-2871.002.patch


 From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
 {code}
 Failed tests:
   TestRMRestart.testRMRestartGetApplicationList:957
 rMAppManager.logApplicationSummary(
 isA(org.apache.hadoop.yarn.api.records.ApplicationId)
 );
 Wanted 3 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
 But was 2 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-26 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603917#comment-14603917
 ] 

Xuan Gong commented on YARN-2871:
-

+1 LGTM. Checking this in

 TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
 -

 Key: YARN-2871
 URL: https://issues.apache.org/jira/browse/YARN-2871
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: zhihai xu
Priority: Minor
 Attachments: YARN-2871.000.patch, YARN-2871.001.patch, 
 YARN-2871.002.patch


 From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
 {code}
 Failed tests:
   TestRMRestart.testRMRestartGetApplicationList:957
 rMAppManager.logApplicationSummary(
 isA(org.apache.hadoop.yarn.api.records.ApplicationId)
 );
 Wanted 3 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
 But was 2 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-26 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603919#comment-14603919
 ] 

Xuan Gong commented on YARN-2871:
-

Committed into trunk/branch-2. Thanks, zhihai.

 TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
 -

 Key: YARN-2871
 URL: https://issues.apache.org/jira/browse/YARN-2871
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: zhihai xu
Priority: Minor
 Fix For: 2.8.0

 Attachments: YARN-2871.000.patch, YARN-2871.001.patch, 
 YARN-2871.002.patch


 From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
 {code}
 Failed tests:
   TestRMRestart.testRMRestartGetApplicationList:957
 rMAppManager.logApplicationSummary(
 isA(org.apache.hadoop.yarn.api.records.ApplicationId)
 );
 Wanted 3 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
 But was 2 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.

2015-06-26 Thread Raju Bairishetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raju Bairishetti updated YARN-3695:
---
Attachment: YARN-3695.01.patch

[~jianhe] Thanks for the review.

Moved the Precondition checks before creating the RetryPolicy, so that we avoid 
creating a policy when the connection timeout values are invalid.
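
A minimal sketch of that ordering, assuming Guava Preconditions and one of Hadoop's RetryPolicies factories; the specific policy and the parameter names here are illustrative, not the patch.

{code}
import java.util.concurrent.TimeUnit;
import com.google.common.base.Preconditions;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

// Validate the timeout configuration first, then build the policy, so an
// invalid configuration fails fast instead of producing an unused policy.
final class RetryPolicySketch {
  static RetryPolicy createRetryPolicy(long maxWaitMs, long retryIntervalMs) {
    Preconditions.checkArgument(maxWaitMs >= 0,
        "Max wait time must not be negative: %s", maxWaitMs);
    Preconditions.checkArgument(retryIntervalMs > 0,
        "Retry interval must be positive: %s", retryIntervalMs);
    // Only created once the inputs are known to be valid.
    return RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        maxWaitMs, retryIntervalMs, TimeUnit.MILLISECONDS);
  }
}
{code}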

 ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
 --

 Key: YARN-3695
 URL: https://issues.apache.org/jira/browse/YARN-3695
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Assignee: Raju Bairishetti
 Attachments: YARN-3695.01.patch, YARN-3695.patch


 YARN-3646 fixed the retry-forever policy in RMProxy so that it only applies to 
 a limited set of exceptions rather than all exceptions. We may need the same 
 fix for ServerProxy (NMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603945#comment-14603945
 ] 

Hudson commented on YARN-2871:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8076 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8076/])
YARN-2871. TestRMRestart#testRMRestartGetApplicationList sometime fails (xgong: 
rev fe6c1bd73aee188ed58df4d33bbc2d2fe0779a97)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java


 TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
 -

 Key: YARN-2871
 URL: https://issues.apache.org/jira/browse/YARN-2871
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: zhihai xu
Priority: Minor
 Fix For: 2.8.0

 Attachments: YARN-2871.000.patch, YARN-2871.001.patch, 
 YARN-2871.002.patch


 From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
 {code}
 Failed tests:
   TestRMRestart.testRMRestartGetApplicationList:957
 rMAppManager.logApplicationSummary(
 isA(org.apache.hadoop.yarn.api.records.ApplicationId)
 );
 Wanted 3 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
 But was 2 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk

2015-06-26 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603957#comment-14603957
 ] 

zhihai xu commented on YARN-2871:
-

Thanks [~iwasakims] for the review! Thanks [~xgong] for the review and for 
committing the patch!

 TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
 -

 Key: YARN-2871
 URL: https://issues.apache.org/jira/browse/YARN-2871
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: zhihai xu
Priority: Minor
 Fix For: 2.8.0

 Attachments: YARN-2871.000.patch, YARN-2871.001.patch, 
 YARN-2871.002.patch


 From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746):
 {code}
 Failed tests:
   TestRMRestart.testRMRestartGetApplicationList:957
 rMAppManager.logApplicationSummary(
 isA(org.apache.hadoop.yarn.api.records.ApplicationId)
 );
 Wanted 3 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957)
 But was 2 times:
 - at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.

2015-06-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603971#comment-14603971
 ] 

Hadoop QA commented on YARN-3695:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  17m 51s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 49s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 56s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 30s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 36s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 46s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   1m 57s | Tests passed in 
hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |   6m 18s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  50m 44s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12742286/YARN-3695.01.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / fe6c1bd |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8361/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8361/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8361/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8361/console |


This message was automatically generated.

 ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
 --

 Key: YARN-3695
 URL: https://issues.apache.org/jira/browse/YARN-3695
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Assignee: Raju Bairishetti
 Attachments: YARN-3695.01.patch, YARN-3695.patch


 YARN-3646 fixed the retry-forever policy in RMProxy so that it only applies to 
 a limited set of exceptions rather than all exceptions. We may need the same 
 fix for ServerProxy (NMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3837) javadocs of TimelineAuthenticationFilterInitializer give wrong prefix for auth options

2015-06-26 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602822#comment-14602822
 ] 

Bibin A Chundatt commented on YARN-3837:


The whitespace error does not seem correct to me.

 javadocs of TimelineAuthenticationFilterInitializer give wrong prefix for 
 auth options
 --

 Key: YARN-3837
 URL: https://issues.apache.org/jira/browse/YARN-3837
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.8.0
Reporter: Steve Loughran
Assignee: Bibin A Chundatt
Priority: Minor
 Attachments: 0001-YARN-3837.patch, 0002-YARN-3837.patch

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 The javadocs for {{TimelineAuthenticationFilterInitializer}} talk about the 
 prefix {{yarn.timeline-service.authentication.}}, but the code uses 
 {{yarn.timeline-service.http-authentication.}} as the prefix.
 It is best to use {{@value}} and let the javadocs sort it out for themselves.
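
As a hedged illustration of the {{@value}} suggestion (the class and constant names below are placeholders, not the actual TimelineAuthenticationFilterInitializer members):

{code}
// Hypothetical example of letting javadoc quote the constant via {@value}.
public class TimelineAuthPrefixExample {

  /** Configuration prefix for timeline HTTP authentication options. */
  public static final String PREFIX =
      "yarn.timeline-service.http-authentication.";

  /**
   * Reads options from keys starting with {@value #PREFIX}; because the
   * javadoc references the constant, it cannot drift from the code.
   */
  public void initFilter() {
    // ... read configuration keys beginning with PREFIX ...
  }
}
{code}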



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode

2015-06-26 Thread mujunchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mujunchao updated YARN-3857:

Attachment: hadoop-yarn-server-resourcemanager.patch

Never register the key in 
org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.masterKeys
 when the SecretKey is null.

 Memory leak in ResourceManager with SIMPLE mode
 ---

 Key: YARN-3857
 URL: https://issues.apache.org/jira/browse/YARN-3857
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: mujunchao
Priority: Critical
 Attachments: hadoop-yarn-server-resourcemanager.patch


  We register the ClientTokenMasterKey to prevent the client from holding an 
 invalid ClientToken after the RM restarts. In SIMPLE mode, we register the pair 
 (ApplicationAttemptId, null), but we never remove it from the HashMap, because 
 unregistering only runs in secure mode, so the memory leak follows.
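
A simplified sketch of the leak pattern being described (not the actual ClientToAMTokenSecretManagerInRM code; the names are illustrative):

{code}
import java.util.HashMap;
import java.util.Map;
import javax.crypto.SecretKey;

// Entries are added in both modes but only removed in secure mode,
// so SIMPLE-mode entries accumulate forever.
class MasterKeyLeakSketch {
  private final Map<String, SecretKey> masterKeys = new HashMap<>();
  private final boolean securityEnabled;

  MasterKeyLeakSketch(boolean securityEnabled) {
    this.securityEnabled = securityEnabled;
  }

  void registerApplication(String attemptId, SecretKey key) {
    masterKeys.put(attemptId, key);      // key is null in SIMPLE mode
  }

  void unregisterApplication(String attemptId) {
    if (securityEnabled) {               // never true in SIMPLE mode: the leak
      masterKeys.remove(attemptId);
    }
  }
}
{code}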



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages

2015-06-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602747#comment-14602747
 ] 

Hudson commented on YARN-3826:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #240 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/240/])
YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: 
rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java


 Race condition in ResourceTrackerService leads to wrong diagnostics messages
 

 Key: YARN-3826
 URL: https://issues.apache.org/jira/browse/YARN-3826
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Fix For: 2.8.0

 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, 
 YARN-3826.03.patch


 Since we call {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be 
 called concurrently, the static {{resync}} and {{shutdown}} responses may carry 
 wrong diagnostics messages in some cases.
 On the other hand, these static members hardly save any memory, since the 
 normal heartbeat responses are created for each heartbeat anyway.
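
To illustrate the race in isolation (this is not the ResourceTrackerService code, just a minimal sketch): a shared static response mutated per heartbeat can be overwritten by a concurrent heartbeat, whereas building a fresh response per heartbeat avoids it.

{code}
// Minimal sketch of the race described above; not the actual RM code.
class HeartbeatResponseSketch {
  private String diagnostics;

  void setDiagnosticsMessage(String msg) { this.diagnostics = msg; }
  String getDiagnosticsMessage() { return diagnostics; }

  // Racy: every heartbeat mutates the same shared instance.
  static final HeartbeatResponseSketch SHARED_RESYNC = new HeartbeatResponseSketch();

  static HeartbeatResponseSketch racyResync(String nodeSpecificMessage) {
    SHARED_RESYNC.setDiagnosticsMessage(nodeSpecificMessage);
    return SHARED_RESYNC;   // a concurrent heartbeat may overwrite this message
  }

  // Safe: build a fresh response per heartbeat, as the description suggests.
  static HeartbeatResponseSketch safeResync(String nodeSpecificMessage) {
    HeartbeatResponseSketch response = new HeartbeatResponseSketch();
    response.setDiagnosticsMessage(nodeSpecificMessage);
    return response;
  }
}
{code}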



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor

2015-06-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602750#comment-14602750
 ] 

Hudson commented on YARN-3745:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #240 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/240/])
YARN-3745. SerializedException should also try to instantiate internal 
(devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java


 SerializedException should also try to instantiate internal exception with 
 the default constructor
 --

 Key: YARN-3745
 URL: https://issues.apache.org/jira/browse/YARN-3745
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lavkesh Lahngir
Assignee: Lavkesh Lahngir
 Fix For: 2.8.0

 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, 
 YARN-3745.patch


 While deserializing a SerializedException, instantiateException() tries to 
 create the internal exception with cn = cls.getConstructor(String.class).
 If cls does not have a constructor with a String parameter, this throws 
 NoSuchMethodException, for example for the ClosedChannelException class.
 We should also try to instantiate the exception with the default constructor 
 so that the inner exception can be propagated.
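
A sketch of the described fallback using plain Java reflection (the helper name is mine, not the SerializedExceptionPBImpl code):

{code}
import java.lang.reflect.Constructor;

// Try the (String) constructor first; fall back to the default constructor
// for classes like ClosedChannelException that do not accept a message.
final class ExceptionInstantiationSketch {
  static Throwable instantiate(Class<? extends Throwable> cls, String message)
      throws Exception {
    try {
      Constructor<? extends Throwable> cn = cls.getConstructor(String.class);
      return cn.newInstance(message);
    } catch (NoSuchMethodException e) {
      Constructor<? extends Throwable> cn = cls.getConstructor();
      return cn.newInstance();      // at least the exception type is propagated
    }
  }
}
{code}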



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor

2015-06-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602765#comment-14602765
 ] 

Hudson commented on YARN-3745:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #970 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/970/])
YARN-3745. SerializedException should also try to instantiate internal 
(devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java


 SerializedException should also try to instantiate internal exception with 
 the default constructor
 --

 Key: YARN-3745
 URL: https://issues.apache.org/jira/browse/YARN-3745
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lavkesh Lahngir
Assignee: Lavkesh Lahngir
 Fix For: 2.8.0

 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, 
 YARN-3745.patch


 While deserializing a SerializedException, instantiateException() tries to 
 create the internal exception with cn = cls.getConstructor(String.class).
 If cls does not have a constructor with a String parameter, this throws 
 NoSuchMethodException, for example for the ClosedChannelException class.
 We should also try to instantiate the exception with the default constructor 
 so that the inner exception can be propagated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages

2015-06-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602762#comment-14602762
 ] 

Hudson commented on YARN-3826:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #970 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/970/])
YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: 
rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java


 Race condition in ResourceTrackerService leads to wrong diagnostics messages
 

 Key: YARN-3826
 URL: https://issues.apache.org/jira/browse/YARN-3826
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Fix For: 2.8.0

 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, 
 YARN-3826.03.patch


 Since we call {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be 
 called concurrently, the static {{resync}} and {{shutdown}} responses may carry 
 wrong diagnostics messages in some cases.
 On the other hand, these static members hardly save any memory, since the 
 normal heartbeat responses are created for each heartbeat anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)