[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-29 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151393#comment-14151393
 ] 

Jian He commented on YARN-2617:
---

[~hex108], thanks for reporting this issue!
- Took a look at the patch. For Apps at the New/Init state, we still should not 
prematurely remove the containers from the context. I think we should explicitly 
check whether apps are at the 
FINISHING_CONTAINERS_WAIT/APPLICATION_RESOURCES_CLEANINGUP/FINISHED state. If 
they are, we are safe to remove the containers from the context.
- Also, the following code
{code}
  if (!this.context.getApplications().containsKey(applicationId)) {
context.getContainers().remove(containerId);
continue;
  }
{code}
 needs to be moved inside the following check {{ if 
(containerStatus.getState().equals(ContainerState.COMPLETE))}}, so that if the app 
is at one of the states mentioned above, we don't remove its containers from the 
context until the container reaches the COMPLETE state (see the sketch below). Does this make sense to you?
- Could you add a unit test for your change too?
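A minimal sketch of the suggested restructuring (variable names follow the snippet quoted above; the surrounding NM heartbeat code is only assumed here, not taken from the actual patch):
{code}
// Sketch only: drop a completed container from the NM context when its app is
// gone or already winding down, instead of re-reporting it on every heartbeat.
if (containerStatus.getState().equals(ContainerState.COMPLETE)) {
  Application app = this.context.getApplications().get(applicationId);
  if (app == null
      || EnumSet.of(ApplicationState.FINISHING_CONTAINERS_WAIT,
                    ApplicationState.APPLICATION_RESOURCES_CLEANINGUP,
                    ApplicationState.FINISHED).contains(app.getApplicationState())) {
    this.context.getContainers().remove(containerId);
    continue;
  }
}
{code}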

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.patch


 We ([~chenchun]) were testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce PI job. The NM continuously 
 reported completed containers whose Application had already finished, even 
 after the AM had finished. 
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to clean 
 up already completed applications. However, it only removes the appId from 
 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; the NM might not 
 receive this event for a long time, or might never receive it. 
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 seconds by default, 
 before it is scheduled to delete the Application logs and send the event.
 * For LogAggregationService, it might fail (e.g. if the user does not have HDFS 
 write permission), and then it will not send the event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-29 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2617:
--
Assignee: Jun Gong

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.patch


 We ([~chenchun]) were testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce PI job. The NM continuously 
 reported completed containers whose Application had already finished, even 
 after the AM had finished. 
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to clean 
 up already completed applications. However, it only removes the appId from 
 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; the NM might not 
 receive this event for a long time, or might never receive it. 
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 seconds by default, 
 before it is scheduled to delete the Application logs and send the event.
 * For LogAggregationService, it might fail (e.g. if the user does not have HDFS 
 write permission), and then it will not send the event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-29 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151396#comment-14151396
 ] 

Jian He commented on YARN-2617:
---

I just added you to the contributor list, thanks again for your contribution!

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.patch


 We ([~chenchun]) were testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce PI job. The NM continuously 
 reported completed containers whose Application had already finished, even 
 after the AM had finished. 
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to clean 
 up already completed applications. However, it only removes the appId from 
 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; the NM might not 
 receive this event for a long time, or might never receive it. 
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 seconds by default, 
 before it is scheduled to delete the Application logs and send the event.
 * For LogAggregationService, it might fail (e.g. if the user does not have HDFS 
 write permission), and then it will not send the event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-29 Thread Remus Rusanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated YARN-2198:
---
Attachment: .YARN-2198.delta.10.patch

delta.10 is the delta from YARN-1972 corresponding to .trunk.10.

 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
 YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
 YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
 YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
 YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
 YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch


 YARN-1972 introduces a Secure Windows Container Executor. However this 
 executor requires a the process launching the container to be LocalSystem or 
 a member of the a local Administrators group. Since the process in question 
 is the NodeManager, the requirement translates to the entire NM to run as a 
 privileged account, a very large surface area to review and protect.
 This proposal is to move the privileged operations into a dedicated NT 
 service. The NM can run as a low privilege account and communicate with the 
 privileged NT service when it needs to launch a container. This would reduce 
 the surface exposed to the high privileges. 
 There has to exist a secure, authenticated and authorized channel of 
 communication between the NM and the privileged NT service. Possible 
 alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
 be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
 specific inter-process communication channel that satisfies all requirements 
 and is easy to deploy. The privileged NT service would register and listen on 
 an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
 with libwinutils which would host the LPC client code. The client would 
 connect to the LPC port (NtConnectPort) and send a message requesting a 
 container launch (NtRequestWaitReplyPort). LPC provides authentication and 
 the privileged NT service can use authorization API (AuthZ) to validate the 
 caller.
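 As an illustration of the boundary described above, here is a rough Java sketch of the privileged-operations interface the NM might call through JNI. All names are hypothetical; the real libwinutils surface and the actual LPC message format are not shown here.
 {code}
 import java.io.IOException;

 // Hypothetical interface, for illustration only: the NM stays low-privilege and
 // delegates only these operations to the privileged NT service over LPC.
 public interface PrivilegedContainerOperations {
   /** Connect to the NT service's LPC port (NtConnectPort on the native side). */
   void connect(String lpcPortName) throws IOException;

   /** Ask the service to launch a container process as the given user; the
    *  request/reply travels over LPC (NtRequestWaitReplyPort). */
   int launchContainer(String user, String commandLine, String workDir)
       throws IOException;
 }
 {code}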



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151408#comment-14151408
 ] 

Hadoop QA commented on YARN-2198:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12671751/.YARN-2198.delta.10.patch
  against trunk revision b38e52b.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5168//console

This message is automatically generated.

 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
 YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
 YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
 YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
 YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
 YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch


 YARN-1972 introduces a Secure Windows Container Executor. However this 
 executor requires a the process launching the container to be LocalSystem or 
 a member of the a local Administrators group. Since the process in question 
 is the NodeManager, the requirement translates to the entire NM to run as a 
 privileged account, a very large surface area to review and protect.
 This proposal is to move the privileged operations into a dedicated NT 
 service. The NM can run as a low privilege account and communicate with the 
 privileged NT service when it needs to launch a container. This would reduce 
 the surface exposed to the high privileges. 
 There has to exist a secure, authenticated and authorized channel of 
 communication between the NM and the privileged NT service. Possible 
 alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
 be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
 specific inter-process communication channel that satisfies all requirements 
 and is easy to deploy. The privileged NT service would register and listen on 
 an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
 with libwinutils which would host the LPC client code. The client would 
 connect to the LPC port (NtConnectPort) and send a message requesting a 
 container launch (NtRequestWaitReplyPort). LPC provides authentication and 
 the privileged NT service can use authorization API (AuthZ) to validate the 
 caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-29 Thread q79969786 (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

q79969786 updated YARN-2198:

Description: 
YARN-1972 introduces a Secure Windows Container Executor. However this executor 
requires a process launching the container to be LocalSystem or a member of the 
a local Administrators group. Since the process in question is the NodeManager, 
the requirement translates to the entire NM to run as a privileged account, a 
very large surface area to review and protect.

This proposal is to move the privileged operations into a dedicated NT service. 
The NM can run as a low privilege account and communicate with the privileged 
NT service when it needs to launch a container. This would reduce the surface 
exposed to the high privileges. 

There has to exist a secure, authenticated and authorized channel of 
communication between the NM and the privileged NT service. Possible 
alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be 
to use Windows LPC (Local Procedure Calls), which is a Windows platform 
specific inter-process communication channel that satisfies all requirements 
and is easy to deploy. The privileged NT service would register and listen on 
an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with 
libwinutils which would host the LPC client code. The client would connect to 
the LPC port (NtConnectPort) and send a message requesting a container launch 
(NtRequestWaitReplyPort). LPC provides authentication and the privileged NT 
service can use authorization API (AuthZ) to validate the caller.

  was:
YARN-1972 introduces a Secure Windows Container Executor. However this executor 
requires a the process launching the container to be LocalSystem or a member of 
the a local Administrators group. Since the process in question is the 
NodeManager, the requirement translates to the entire NM to run as a privileged 
account, a very large surface area to review and protect.

This proposal is to move the privileged operations into a dedicated NT service. 
The NM can run as a low privilege account and communicate with the privileged 
NT service when it needs to launch a container. This would reduce the surface 
exposed to the high privileges. 

There has to exist a secure, authenticated and authorized channel of 
communication between the NM and the privileged NT service. Possible 
alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be 
to use Windows LPC (Local Procedure Calls), which is a Windows platform 
specific inter-process communication channel that satisfies all requirements 
and is easy to deploy. The privileged NT service would register and listen on 
an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with 
libwinutils which would host the LPC client code. The client would connect to 
the LPC port (NtConnectPort) and send a message requesting a container launch 
(NtRequestWaitReplyPort). LPC provides authentication and the privileged NT 
service can use authorization API (AuthZ) to validate the caller.


 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
 YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
 YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
 YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
 YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
 YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch


 YARN-1972 introduces a Secure Windows Container Executor. However this 
 executor requires a process launching the container to be LocalSystem or a 
 member of the a local Administrators group. Since the process in question is 
 the NodeManager, the requirement translates to the entire NM to run as a 
 privileged account, a very large surface area to review and protect.
 This proposal is to move the privileged operations into a dedicated NT 
 service. The NM can run as a low privilege account and communicate with the 
 privileged NT service when it needs to launch a container. This would reduce 
 the surface exposed to the high privileges. 
 There has to exist a secure, authenticated and authorized channel of 
 communication between the NM and the privileged NT service. Possible 
 alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
 be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
 specific 

[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-29 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151466#comment-14151466
 ] 

Jun Gong commented on YARN-2617:


[~jianhe], thank you for the review!

{quote}
I think we should explicitly check if apps are at 
FINISHING_CONTAINERS_WAIT/APPLICATION_RESOURCES_CLEANINGUP/FINISHED state. 
{quote}
My concern is that we will need to modify this code whenever we add a new state to 
ApplicationImpl. It is OK if that is not a problem. BTW: is there any case 
where an APP has containers but the APP is not in the RUNNING state?

{quote}
The code needs to be moved inside the following check {{ if 
(containerStatus.getState().equals(ContainerState.COMPLETE))}} ...
{quote}
OK. I will change it.

And I will add a unit test.
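Regarding the concern above about future ApplicationImpl states, one option (sketched here with a hypothetical helper, not existing NM code) is to keep the finishing/finished states in a single EnumSet so a new state only has to be added in one place:
{code}
// Hypothetical helper, for illustration only.
private static final EnumSet<ApplicationState> APP_STOPPING_STATES =
    EnumSet.of(ApplicationState.FINISHING_CONTAINERS_WAIT,
               ApplicationState.APPLICATION_RESOURCES_CLEANINGUP,
               ApplicationState.FINISHED);

private static boolean isAppStoppingOrGone(Application app) {
  return app == null || APP_STOPPING_STATES.contains(app.getApplicationState());
}
{code}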

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.patch


 We ([~chenchun]) were testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce PI job. The NM continuously 
 reported completed containers whose Application had already finished, even 
 after the AM had finished. 
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to clean 
 up already completed applications. However, it only removes the appId from 
 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; the NM might not 
 receive this event for a long time, or might never receive it. 
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 seconds by default, 
 before it is scheduled to delete the Application logs and send the event.
 * For LogAggregationService, it might fail (e.g. if the user does not have HDFS 
 write permission), and then it will not send the event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-29 Thread Remus Rusanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated YARN-2198:
---
Description: 
YARN-1972 introduces a Secure Windows Container Executor. However this executor 
requires the process launching the container to be LocalSystem or a member of 
the a local Administrators group. Since the process in question is the 
NodeManager, the requirement translates to the entire NM to run as a privileged 
account, a very large surface area to review and protect.

This proposal is to move the privileged operations into a dedicated NT service. 
The NM can run as a low privilege account and communicate with the privileged 
NT service when it needs to launch a container. This would reduce the surface 
exposed to the high privileges. 

There has to exist a secure, authenticated and authorized channel of 
communication between the NM and the privileged NT service. Possible 
alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be 
to use Windows LPC (Local Procedure Calls), which is a Windows platform 
specific inter-process communication channel that satisfies all requirements 
and is easy to deploy. The privileged NT service would register and listen on 
an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with 
libwinutils which would host the LPC client code. The client would connect to 
the LPC port (NtConnectPort) and send a message requesting a container launch 
(NtRequestWaitReplyPort). LPC provides authentication and the privileged NT 
service can use authorization API (AuthZ) to validate the caller.

  was:
YARN-1972 introduces a Secure Windows Container Executor. However this executor 
requires a process launching the container to be LocalSystem or a member of the 
a local Administrators group. Since the process in question is the NodeManager, 
the requirement translates to the entire NM to run as a privileged account, a 
very large surface area to review and protect.

This proposal is to move the privileged operations into a dedicated NT service. 
The NM can run as a low privilege account and communicate with the privileged 
NT service when it needs to launch a container. This would reduce the surface 
exposed to the high privileges. 

There has to exist a secure, authenticated and authorized channel of 
communication between the NM and the privileged NT service. Possible 
alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be 
to use Windows LPC (Local Procedure Calls), which is a Windows platform 
specific inter-process communication channel that satisfies all requirements 
and is easy to deploy. The privileged NT service would register and listen on 
an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with 
libwinutils which would host the LPC client code. The client would connect to 
the LPC port (NtConnectPort) and send a message requesting a container launch 
(NtRequestWaitReplyPort). LPC provides authentication and the privileged NT 
service can use authorization API (AuthZ) to validate the caller.


 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
 YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
 YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
 YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
 YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
 YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch


 YARN-1972 introduces a Secure Windows Container Executor. However this 
 executor requires the process launching the container to be LocalSystem or a 
 member of the a local Administrators group. Since the process in question is 
 the NodeManager, the requirement translates to the entire NM to run as a 
 privileged account, a very large surface area to review and protect.
 This proposal is to move the privileged operations into a dedicated NT 
 service. The NM can run as a low privilege account and communicate with the 
 privileged NT service when it needs to launch a container. This would reduce 
 the surface exposed to the high privileges. 
 There has to exist a secure, authenticated and authorized channel of 
 communication between the NM and the privileged NT service. Possible 
 alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
 be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
 specific 

[jira] [Updated] (YARN-2493) [YARN-796] API changes for users

2014-09-29 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2493:
-
Attachment: YARN-2493.patch

Hi [~vinodkv],
Thanks for your careful review; all the comments make sense to me. I attached a 
new patch according to your suggestions.

Wangda

 [YARN-796] API changes for users
 

 Key: YARN-2493
 URL: https://issues.apache.org/jira/browse/YARN-2493
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2493.patch, YARN-2493.patch, YARN-2493.patch, 
 YARN-2493.patch


 This JIRA includes API changes for users of YARN-796, like changes in 
 {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common 
 part of YARN-796.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2493) [YARN-796] API changes for users

2014-09-29 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2493:
-
Attachment: (was: YARN-2493.patch)

 [YARN-796] API changes for users
 

 Key: YARN-2493
 URL: https://issues.apache.org/jira/browse/YARN-2493
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2493.patch, YARN-2493.patch, YARN-2493.patch, 
 YARN-2493.patch


 This JIRA includes API changes for users of YARN-796, like changes in 
 {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common 
 part of YARN-796.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2493) [YARN-796] API changes for users

2014-09-29 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2493:
-
Attachment: YARN-2493.patch

 [YARN-796] API changes for users
 

 Key: YARN-2493
 URL: https://issues.apache.org/jira/browse/YARN-2493
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2493.patch, YARN-2493.patch, YARN-2493.patch, 
 YARN-2493.patch


 This JIRA includes API changes for users of YARN-796, like changes in 
 {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common 
 part of YARN-796.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2493) [YARN-796] API changes for users

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151528#comment-14151528
 ] 

Hadoop QA commented on YARN-2493:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671760/YARN-2493.patch
  against trunk revision b38e52b.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

  {color:red}-1 javac{color}.  The applied patch generated 1281 javac 
compiler warnings (more than the trunk's current 1265 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5169//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5169//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5169//console

This message is automatically generated.

 [YARN-796] API changes for users
 

 Key: YARN-2493
 URL: https://issues.apache.org/jira/browse/YARN-2493
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2493.patch, YARN-2493.patch, YARN-2493.patch, 
 YARN-2493.patch


 This JIRA includes API changes for users of YARN-796, like changes in 
 {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common 
 part of YARN-796.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated

2014-09-29 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2312:
-
Attachment: YARN-2312.2-2.patch

Let me attach the same patch again.

 Marking ContainerId#getId as deprecated
 ---

 Key: YARN-2312
 URL: https://issues.apache.org/jira/browse/YARN-2312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
 YARN-2312.2-2.patch, YARN-2312.2.patch


 After YARN-2229, {{ContainerId#getId}} will only return a partial value of the 
 container id: the sequence number without the epoch. We should 
 mark {{ContainerId#getId}} as deprecated and use 
 {{ContainerId#getContainerId}} instead.
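 A small sketch of the intended migration (this assumes the post-YARN-2229 API, where {{getContainerId()}} returns the full 64-bit id including the epoch; the helper name here is made up for illustration):
 {code}
 // Prefer the full id; the deprecated getId() loses the epoch bits.
 static long idForLogging(ContainerId cid) {
   return cid.getContainerId();  // epoch + sequence number (long)
   // deprecated alternative, sequence number only: cid.getId()
 }
 {code}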



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-29 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151658#comment-14151658
 ] 

Wangda Tan commented on YARN-2494:
--

Hi [~vinodkv] and [~cwelch],
Thanks for the reply! I am still working on addressing your last comments and will 
upload a patch soon.

Regarding the method names of NodeLabelManager, the following suggestion makes 
sense to me:
bq. What I really want is to convey is that these are just system recognized 
nodelabels as opposed to node-labels that are actually mapped against a node. 
How about addToNodeLabelsCollection(), removeFromNodeLabelsCollection(), 
addLabelsToNode() and removeLabelsFromNode(). The point about 
addToNodeLabelsCollection() is that it clearly conveys that there is a 
NodeLabelsCollection - a set of node-labels known by the system.

And regarding
bq. Once you have the store abstraction, this will be less of a problem? 
Clearly NodeLabelsManager is not something that the client needs access to?
I think it still has a problem: even if we have the store abstraction, we still need 
some logic to guarantee that the labels being added are valid (e.g. we need to check 
that a label exists in the collection, and that a label exists on a node when we try 
to remove labels from that node). That means we would need to put a larger chunk of 
logic into the store abstraction -- it isn't a simple store abstraction if we do this.
I suggest keeping it in common so that the major node-label logic lives together.

Thanks,
Wangda

 [YARN-796] Node label manager API and storage implementations
 -

 Key: YARN-2494
 URL: https://issues.apache.org/jira/browse/YARN-2494
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, 
 YARN-2494.patch, YARN-2494.patch, YARN-2494.patch


 This JIRA includes the APIs and storage implementations of the node label manager.
 NodeLabelManager is an abstract class used to manage labels of nodes in the 
 cluster. It has APIs to query/modify
 - Nodes according to a given label
 - Labels according to a given hostname
 - Add/remove labels
 - Set labels of nodes in the cluster
 - Persist/recover changes of labels/labels-on-nodes to/from storage
 And it has two implementations to store modifications
 - Memory based storage: it will not persist changes, so all labels will be 
 lost when the RM restarts
 - FileSystem based storage: it will persist/recover to/from a FileSystem (like 
 HDFS), and all labels and labels-on-nodes will be recovered upon RM restart
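 A rough sketch of the API shape under discussion, using the method names suggested in the comment above (signatures and exception types are assumptions, not the actual patch):
 {code}
 import java.io.IOException;
 import java.util.Map;
 import java.util.Set;

 public abstract class NodeLabelManager {
   // The cluster-wide collection of labels known to the system.
   public abstract void addToNodeLabelsCollection(Set<String> labels)
       throws IOException;
   public abstract void removeFromNodeLabelsCollection(Set<String> labels)
       throws IOException;

   // Which of those labels are attached to particular nodes.
   public abstract void addLabelsToNode(Map<String, Set<String>> nodeToLabels)
       throws IOException;
   public abstract void removeLabelsFromNode(Map<String, Set<String>> nodeToLabels)
       throws IOException;
 }
 {code}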



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151668#comment-14151668
 ] 

Hadoop QA commented on YARN-2312:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671773/YARN-2312.2-2.patch
  against trunk revision b38e52b.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 16 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.mapred.pipes.TestPipeApplication
  org.apache.hadoop.mapreduce.lib.input.TestMRCJCFileInputFormat
  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5170//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5170//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5170//console

This message is automatically generated.

 Marking ContainerId#getId as deprecated
 ---

 Key: YARN-2312
 URL: https://issues.apache.org/jira/browse/YARN-2312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
 YARN-2312.2-2.patch, YARN-2312.2.patch


 After YARN-2229, {{ContainerId#getId}} will only return a partial value of the 
 container id: the sequence number without the epoch. We should 
 mark {{ContainerId#getId}} as deprecated and use 
 {{ContainerId#getContainerId}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-29 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong updated YARN-2617:
---
Attachment: YARN-2617.2.patch

 NM does not need to send finished container whose APP is not running to RM
 --

 Key: YARN-2617
 URL: https://issues.apache.org/jira/browse/YARN-2617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Fix For: 2.6.0

 Attachments: YARN-2617.2.patch, YARN-2617.patch


 We ([~chenchun]) were testing RM work-preserving restart and found the 
 following logs when we ran a simple MapReduce PI job. The NM continuously 
 reported completed containers whose Application had already finished, even 
 after the AM had finished. 
 {code}
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:42,228 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:43,230 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 2014-09-26 17:00:44,233 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Null container completed...
 {code}
 In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to clean 
 up already completed applications. However, it only removes the appId from 
 'app.context.getApplications()' when ApplicationImpl receives the event 
 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; the NM might not 
 receive this event for a long time, or might never receive it. 
 * For NonAggregatingLogHandler, it waits for 
 YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 seconds by default, 
 before it is scheduled to delete the Application logs and send the event.
 * For LogAggregationService, it might fail (e.g. if the user does not have HDFS 
 write permission), and then it will not send the event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151706#comment-14151706
 ] 

Jason Lowe commented on YARN-1769:
--

+1 lgtm.  Committing this.

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact that there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required, and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Any time it hits the limit on the number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is that currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits, 
 which would then stop you from looking further at that node.
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 
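 The proposed improvement in the last paragraph can be illustrated with a toy, self-contained example (plain numbers stand in for scheduler state; this is not CapacityScheduler code):
 {code}
 /** Toy illustration: when a node with enough free space appears, convert an
  *  existing reservation into a real allocation instead of waiting. */
 public class ReservationSwapSketch {
   public static void main(String[] args) {
     int requestMb = 4096;           // large container being requested
     int reservedMb = 4096;          // already reserved on some other node
     int incomingNodeFreeMb = 6144;  // free space reported by a node heartbeat

     if (incomingNodeFreeMb >= requestMb) {
       if (reservedMb > 0) {
         reservedMb = 0;             // release the reservation held elsewhere
       }
       System.out.println("Allocate " + requestMb + "MB on the incoming node");
     } else {
       System.out.println("Keep the reservation and continue waiting");
     }
   }
 }
 {code}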



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151726#comment-14151726
 ] 

Hudson commented on YARN-1769:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6135 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6135/])
YARN-1769. CapacityScheduler: Improve reservations. Contributed by Thomas 
Graves (jlowe: rev 9c22065109a77681bc2534063eabe8692fbcb3cd)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestChildQueueOrder.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java
* hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestParentQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java


 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Fix For: 2.6.0

 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact that there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required, and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Any time it hits the limit on the number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is that currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits, 
 which would then stop you from looking further at that node.
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades

2014-09-29 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151796#comment-14151796
 ] 

Junping Du commented on YARN-2613:
--

Thanks [~jianhe] for the patch. I am reviewing your patch, and some initial 
comments are below. More comments may come later.
{code}
-  public static final int DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS =
+  public static final long DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS =
   15 * 60 * 1000;
+  public static final int DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS =
+  15 * 60 * 1000;
+  public static final long DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS
+  = 10 * 1000;
{code}
I think it is better to be consistent about using int or long for time intervals 
and waits. IMO, int should be fine, as it supports up to 2^31 
milliseconds, roughly 25 days.

{code}
-//TO DO: after HADOOP-9576,  IOException can be changed to EOFException
-exceptionToPolicyMap.put(IOException.class, retryPolicy);
{code}
Do we have a plan to get HADOOP-9576 in? If yes, shall we keep the TODO comment 
here?
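For reference, a sketch of how the two new settings quoted above could feed a fixed-sleep retry policy (the config key constants are assumed from the diff; the actual NMProxy wiring in the patch may differ):
{code}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NmRetryPolicySketch {
  // Retry connecting to the NM for up to the max wait, sleeping a fixed
  // interval between attempts.
  static RetryPolicy createNmRetryPolicy(Configuration conf) {
    long maxWaitMs = conf.getLong(YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
        YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS);
    long retryIntervalMs = conf.getLong(
        YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
        YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
    return RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        maxWaitMs, retryIntervalMs, TimeUnit.MILLISECONDS);
  }
}
{code}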

 NMClient doesn't have retries for supporting rolling-upgrades
 -

 Key: YARN-2613
 URL: https://issues.apache.org/jira/browse/YARN-2613
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2613.1.patch, YARN-2613.2.patch


 While NM is rolling upgrade, client should retry NM until it comes up. This 
 jira is to add a NMProxy (similar to RMProxy) with retry implementation to 
 support rolling upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YARN-2606:

Attachment: YARN-2606.patch

Refining the patch to remove the unwanted serviceInit(), as all the work is done 
in serviceStart().
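A minimal sketch of the ordering being aimed for (not the actual ApplicationHistoryServer code; the keytab/principal config constants are assumptions): do the Kerberos login in serviceStart() before anything touches HDFS.
{code}
@Override
protected void serviceStart() throws Exception {
  if (UserGroupInformation.isSecurityEnabled()) {
    SecurityUtil.login(getConfig(),
        YarnConfiguration.TIMELINE_SERVICE_KEYTAB,      // assumed config constant
        YarnConfiguration.TIMELINE_SERVICE_PRINCIPAL);  // assumed config constant
  }
  // Only after the login succeeds, start the HDFS-backed history store.
  super.serviceStart();
}
{code}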

 Application History Server tries to access hdfs before doing secure login
 -

 Key: YARN-2606
 URL: https://issues.apache.org/jira/browse/YARN-2606
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Mit Desai
Assignee: Mit Desai
 Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch


 While testing the Application Timeline Server, the server would not come up 
 in a secure cluster, as it would keep trying to access HDFS without having 
 done the secure login. It would repeatedly try authenticating and finally hit 
 a stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YARN-2606:

Attachment: YARN-2606.patch

Yet some more refining. Attached updated patch.

 Application History Server tries to access hdfs before doing secure login
 -

 Key: YARN-2606
 URL: https://issues.apache.org/jira/browse/YARN-2606
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Mit Desai
Assignee: Mit Desai
 Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
 YARN-2606.patch


 While testing the Application Timeline Server, the server would not come up 
 in a secure cluster, as it would keep trying to access HDFS without having 
 done the secure login. It would repeatedly try authenticating and finally hit 
 a stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151851#comment-14151851
 ] 

Hadoop QA commented on YARN-2606:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671803/YARN-2606.patch
  against trunk revision 4666440.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5171//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5171//console

This message is automatically generated.

 Application History Server tries to access hdfs before doing secure login
 -

 Key: YARN-2606
 URL: https://issues.apache.org/jira/browse/YARN-2606
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Mit Desai
Assignee: Mit Desai
 Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
 YARN-2606.patch


 While testing the Application Timeline Server, the server would not come up 
 in a secure cluster, as it would keep trying to access HDFS without having 
 done the secure login. It would repeatedly try authenticating and finally hit 
 a stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades

2014-09-29 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151914#comment-14151914
 ] 

Jian He commented on YARN-2613:
---

bq. I think it is better to be consistent about using int or long
Good catch, I changed the one for RMProxy, but missed this. 
bq. Do we have a plan to get HADOOP-9576 in? If yes, shall we keep the TODO 
comment here?
I forgot my initial intent in adding this comment. Since I now follow 
FailoverOnNetworkExceptionRetry for the exception-retry policy, I think 
we may not need this for now.

 NMClient doesn't have retries for supporting rolling-upgrades
 -

 Key: YARN-2613
 URL: https://issues.apache.org/jira/browse/YARN-2613
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2613.1.patch, YARN-2613.2.patch


 While NM is rolling upgrade, client should retry NM until it comes up. This 
 jira is to add a NMProxy (similar to RMProxy) with retry implementation to 
 support rolling upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-29 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151977#comment-14151977
 ] 

Karthik Kambatla commented on YARN-2179:


[~vinodkv] - do you have any further comments on this? 

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
 YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, 
 YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, 
 YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2618) Add API support for disk I/O resources

2014-09-29 Thread Wei Yan (JIRA)
Wei Yan created YARN-2618:
-

 Summary: Add API support for disk I/O resources
 Key: YARN-2618
 URL: https://issues.apache.org/jira/browse/YARN-2618
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wei Yan
Assignee: Wei Yan


Subtask of YARN-2139. Add API support for introducing disk I/O as the third 
resource type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2619) NodeManager: Add cgroups support for disk I/O isolation

2014-09-29 Thread Wei Yan (JIRA)
Wei Yan created YARN-2619:
-

 Summary: NodeManager: Add cgroups support for disk I/O isolation
 Key: YARN-2619
 URL: https://issues.apache.org/jira/browse/YARN-2619
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wei Yan
Assignee: Wei Yan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2620) FairScheduler: Add disk I/O resource to the DRF implementation

2014-09-29 Thread Wei Yan (JIRA)
Wei Yan created YARN-2620:
-

 Summary: FairScheduler: Add disk I/O resource to the DRF 
implementation
 Key: YARN-2620
 URL: https://issues.apache.org/jira/browse/YARN-2620
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wei Yan
Assignee: Wei Yan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2610) Hamlet should close table tags

2014-09-29 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2610:
---
Summary: Hamlet should close table tags  (was: Hamlet doesn't close table 
tags)

 Hamlet should close table tags
 --

 Key: YARN-2610
 URL: https://issues.apache.org/jira/browse/YARN-2610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: supportability
 Attachments: YARN-2610-01.patch, YARN-2610-02.patch


 Revisiting a subset of MAPREDUCE-2993.
 The th, td, thead, tfoot, tr tags are not configured to close 
 properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
 table tags tend to wreak havoc with a lot of HTML processors (although not 
 usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-29 Thread Remus Rusanu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152042#comment-14152042
 ] 

Remus Rusanu commented on YARN-2198:


The last QA -1 is for delta.10.patch, which is not a trunk diff.

 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
 YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
 YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
 YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
 YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
 YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch


 YARN-1972 introduces a Secure Windows Container Executor. However, this 
 executor requires the process launching the container to be LocalSystem or a 
 member of the local Administrators group. Since the process in question is 
 the NodeManager, the requirement translates to the entire NM running as a 
 privileged account, a very large surface area to review and protect.
 This proposal is to move the privileged operations into a dedicated NT 
 service. The NM can run as a low privilege account and communicate with the 
 privileged NT service when it needs to launch a container. This would reduce 
 the surface exposed to the high privileges. 
 There has to exist a secure, authenticated and authorized channel of 
 communication between the NM and the privileged NT service. Possible 
 alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
 be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
 specific inter-process communication channel that satisfies all requirements 
 and is easy to deploy. The privileged NT service would register and listen on 
 an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
 with libwinutils which would host the LPC client code. The client would 
 connect to the LPC port (NtConnectPort) and send a message requesting a 
 container launch (NtRequestWaitReplyPort). LPC provides authentication and 
 the privileged NT service can use authorization API (AuthZ) to validate the 
 caller.
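
 As a purely hypothetical sketch of what the NM-side JNI surface into libwinutils' LPC client might look like (none of these names come from an actual patch):
{code}
// Hypothetical JNI bridge into libwinutils' LPC client code.
public final class WinutilsLpcClientSketch {
  static {
    System.loadLibrary("winutils");  // assumed native library name
  }

  // Connect to the privileged NT service's LPC port (NtConnectPort under the hood).
  public native long connectPort(String portName);

  // Ask the service to launch a container; returns the child process id.
  public native long requestContainerLaunch(long portHandle, String user,
                                            String containerId, String commandLine);

  public native void closePort(long portHandle);
}
{code}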



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152075#comment-14152075
 ] 

Hadoop QA commented on YARN-2606:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671811/YARN-2606.patch
  against trunk revision b3d5d26.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5172//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5172//console

This message is automatically generated.

 Application History Server tries to access hdfs before doing secure login
 -

 Key: YARN-2606
 URL: https://issues.apache.org/jira/browse/YARN-2606
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Mit Desai
Assignee: Mit Desai
 Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
 YARN-2606.patch


 While testing the Application Timeline Server, the server would not come up 
 in a secure cluster, as it would keep trying to access hdfs without having 
 done the secure login. It would repeatedly try authenticating and finally hit 
 stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-29 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152078#comment-14152078
 ] 

Craig Welch commented on YARN-2494:
---

Not to dither about names, but Collection is still not terribly clear to me 
(overly generic). I was previously thinking about Cluster as the 
differentiator, so:

addToClusterNodeLabels(), removeFromClusterNodeLabels(), addLabelsToNode() and 
removeLabelsFromNode(). 

I think this conveys the different notions of what the operations are applying 
to in a pretty clear way.  Thoughts?
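
For concreteness, the proposal would read roughly like the following hypothetical interface excerpt (signatures are illustrative guesses, not the patch):
{code}
import java.util.Map;
import java.util.Set;

// Hypothetical excerpt showing only the proposed method names.
public interface NodeLabelManagerOperationsSketch {
  // Cluster-wide collection of valid labels.
  void addToClusterNodeLabels(Set<String> labels);
  void removeFromClusterNodeLabels(Set<String> labels);

  // Labels attached to individual nodes (host -> labels).
  void addLabelsToNode(Map<String, Set<String>> nodeToLabels);
  void removeLabelsFromNode(Map<String, Set<String>> nodeToLabels);
}
{code}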

 [YARN-796] Node label manager API and storage implementations
 -

 Key: YARN-2494
 URL: https://issues.apache.org/jira/browse/YARN-2494
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, 
 YARN-2494.patch, YARN-2494.patch, YARN-2494.patch


 This JIRA includes the APIs and storage implementations of the node label manager.
 NodeLabelManager is an abstract class used to manage labels of nodes in the 
 cluster; it has APIs to query/modify
 - Nodes according to given label
 - Labels according to given hostname
 - Add/remove labels
 - Set labels of nodes in the cluster
 - Persist/recover changes of labels/labels-on-nodes to/from storage
 And it has two implementations to store modifications
 - Memory based storage: It will not persist changes, so all labels will be 
 lost when the RM restarts
 - FileSystem based storage: It will persist/recover to/from FileSystem (like 
 HDFS), and all labels and labels-on-nodes will be recovered upon RM restart



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152077#comment-14152077
 ] 

Jonathan Eagles commented on YARN-2606:
---

+1. Will commit at the end of the day in case any one else has comments.

 Application History Server tries to access hdfs before doing secure login
 -

 Key: YARN-2606
 URL: https://issues.apache.org/jira/browse/YARN-2606
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Mit Desai
Assignee: Mit Desai
 Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
 YARN-2606.patch


 While testing the Application Timeline Server, the server would not come up 
 in a secure cluster, as it would keep trying to access hdfs without having 
 done the secure login. It would repeatedly try authenticating and finally hit 
 stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-29 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152084#comment-14152084
 ] 

Vinod Kumar Vavilapalli commented on YARN-2179:
---

Looking now..

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
 YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, 
 YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, 
 YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-29 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152103#comment-14152103
 ] 

Vinod Kumar Vavilapalli commented on YARN-2179:
---

Looks so much better now. One minor suggestion - in the test, instead of 
overriding all of YarnClient, you could simply mock it to override behaviour of 
only those methods that you are interested in.

+1 otherwise.
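
A minimal sketch of that suggestion, assuming the test only cares about something like YarnClient#getApplications (Mockito shown for illustration; the actual test may stub different methods):
{code}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.util.Collections;
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class MockYarnClientSketch {
  static YarnClient newMockedClient() throws Exception {
    // Only the calls the test cares about are stubbed; lifecycle methods on the
    // mock are no-ops, so the AbstractService wiring is bypassed entirely.
    YarnClient client = mock(YarnClient.class);
    List<ApplicationReport> running = Collections.emptyList();
    when(client.getApplications()).thenReturn(running);
    return client;
  }
}
{code}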

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
 YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, 
 YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, 
 YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2014-09-29 Thread Jian Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152137#comment-14152137
 ] 

Jian Fang commented on YARN-1680:
-

Hi, any update on the fix? We saw quite a few jobs fail due to this issue.

 availableResources sent to applicationMaster in heartbeat should exclude 
 blacklistedNodes free memory.
 --

 Key: YARN-1680
 URL: https://issues.apache.org/jira/browse/YARN-1680
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
 Environment: SuSE 11 SP2 + Hadoop-2.3 
Reporter: Rohith
Assignee: Chen He
 Attachments: YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch


 There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster 
 slow start is set to 1.
 A job is running and its reducer tasks occupy 29GB of the cluster. One 
 NodeManager (NM-4) became unstable (3 maps got killed), so the MRAppMaster 
 blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running 
 in the cluster.
 The MRAppMaster does not preempt the reducers because the headroom used for 
 reducer preemption includes the blacklisted node's memory. This makes jobs 
 hang forever (the ResourceManager does not assign any new containers on 
 blacklisted nodes, but returns an availableResource that counts the whole 
 cluster's free memory, including blacklisted nodes).
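
 To make the requested adjustment concrete, a small hypothetical sketch (the real change belongs in the scheduler's headroom computation, not a standalone helper):
{code}
import java.util.Collection;
import org.apache.hadoop.yarn.api.records.Resource;

// Hypothetical helper: shrink the headroom reported to the AM by the free
// resources on nodes that the AM has blacklisted, since it cannot use them.
public final class HeadroomSketch {
  static Resource excludeBlacklisted(Resource headroom,
                                     Collection<Resource> blacklistedNodeFreeResources) {
    int memory = headroom.getMemory();
    int vcores = headroom.getVirtualCores();
    for (Resource free : blacklistedNodeFreeResources) {
      memory -= free.getMemory();
      vcores -= free.getVirtualCores();
    }
    return Resource.newInstance(Math.max(0, memory), Math.max(0, vcores));
  }
}
{code}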



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2056) Disable preemption at Queue level

2014-09-29 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152153#comment-14152153
 ] 

Eric Payne commented on YARN-2056:
--

[~leftnoteasy]. Thanks again for helping to review this patch. Have you had a 
chance to look over the updated changes?

 Disable preemption at Queue level
 -

 Key: YARN-2056
 URL: https://issues.apache.org/jira/browse/YARN-2056
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Mayank Bansal
Assignee: Eric Payne
 Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, 
 YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, 
 YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, 
 YARN-2056.201409232329.txt, YARN-2056.201409242210.txt


 We need to be able to disable preemption at individual queue level



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2014-09-29 Thread Chen He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152154#comment-14152154
 ] 

Chen He commented on YARN-1680:
---

Thank you for reminding me, [~john.jian.fang]. I will post the updated patch 
by the end of tomorrow.

 availableResources sent to applicationMaster in heartbeat should exclude 
 blacklistedNodes free memory.
 --

 Key: YARN-1680
 URL: https://issues.apache.org/jira/browse/YARN-1680
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
 Environment: SuSE 11 SP2 + Hadoop-2.3 
Reporter: Rohith
Assignee: Chen He
 Attachments: YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch


 There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster 
 slow start is set to 1.
 A job is running and its reducer tasks occupy 29GB of the cluster. One 
 NodeManager (NM-4) became unstable (3 maps got killed), so the MRAppMaster 
 blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running 
 in the cluster.
 The MRAppMaster does not preempt the reducers because the headroom used for 
 reducer preemption includes the blacklisted node's memory. This makes jobs 
 hang forever (the ResourceManager does not assign any new containers on 
 blacklisted nodes, but returns an availableResource that counts the whole 
 cluster's free memory, including blacklisted nodes).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2014-09-29 Thread Jian Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152156#comment-14152156
 ] 

Jian Fang commented on YARN-1680:
-

Thanks. Looking forward to your patch.

 availableResources sent to applicationMaster in heartbeat should exclude 
 blacklistedNodes free memory.
 --

 Key: YARN-1680
 URL: https://issues.apache.org/jira/browse/YARN-1680
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.2.0, 2.3.0
 Environment: SuSE 11 SP2 + Hadoop-2.3 
Reporter: Rohith
Assignee: Chen He
 Attachments: YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch


 There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster 
 slow start is set to 1.
 A job is running and its reducer tasks occupy 29GB of the cluster. One 
 NodeManager (NM-4) became unstable (3 maps got killed), so the MRAppMaster 
 blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running 
 in the cluster.
 The MRAppMaster does not preempt the reducers because the headroom used for 
 reducer preemption includes the blacklisted node's memory. This makes jobs 
 hang forever (the ResourceManager does not assign any new containers on 
 blacklisted nodes, but returns an availableResource that counts the whole 
 cluster's free memory, including blacklisted nodes).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-09-29 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152160#comment-14152160
 ] 

Mit Desai commented on YARN-2610:
-

Why is the change specific to some tags and not the others?

 Hamlet should close table tags
 --

 Key: YARN-2610
 URL: https://issues.apache.org/jira/browse/YARN-2610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: supportability
 Attachments: YARN-2610-01.patch, YARN-2610-02.patch


 Revisiting a subset of MAPREDUCE-2993.
 The th, td, thead, tfoot, tr tags are not configured to close 
 properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
 table tags tend to wreak havoc with a lot of HTML processors (although not 
 usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-09-29 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152176#comment-14152176
 ] 

Ray Chiang commented on YARN-2610:
--

I would have been fine with changing all the tags to close cleanly, except for 
the feedback from MAPREDUCE-2993.  So, I limited these changes to just the 
table rendering ones, which tend to cause the most problems anyhow.

Or is there some table related tag that I missed?

 Hamlet should close table tags
 --

 Key: YARN-2610
 URL: https://issues.apache.org/jira/browse/YARN-2610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: supportability
 Attachments: YARN-2610-01.patch, YARN-2610-02.patch


 Revisiting a subset of MAPREDUCE-2993.
 The th, td, thead, tfoot, tr tags are not configured to close 
 properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
 table tags tend to wreak havoc with a lot of HTML processors (although not 
 usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-09-29 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152206#comment-14152206
 ] 

Karthik Kambatla commented on YARN-2610:


I just ran all YARN tests with the latest patch to be safe. None of the test 
failures are related.

+1. I'll commit this later today if no one objects. 

 Hamlet should close table tags
 --

 Key: YARN-2610
 URL: https://issues.apache.org/jira/browse/YARN-2610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: supportability
 Attachments: YARN-2610-01.patch, YARN-2610-02.patch


 Revisiting a subset of MAPREDUCE-2993.
 The th, td, thead, tfoot, tr tags are not configured to close 
 properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
 table tags tend to wreak havoc with a lot of HTML processors (although not 
 usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-09-29 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152219#comment-14152219
 ] 

Mit Desai commented on YARN-2610:
-

[~rchiang], I did not see the comments on MAPREDUCE-2993 before. I just 
wanted to know the reason behind leaving some tags open.
The patch looks good to me.
+1 (non-binding)

 Hamlet should close table tags
 --

 Key: YARN-2610
 URL: https://issues.apache.org/jira/browse/YARN-2610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: supportability
 Attachments: YARN-2610-01.patch, YARN-2610-02.patch


 Revisiting a subset of MAPREDUCE-2993.
 The th, td, thead, tfoot, tr tags are not configured to close 
 properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
 table tags tend to wreak havoc with a lot of HTML processors (although not 
 usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-29 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152270#comment-14152270
 ] 

Vinod Kumar Vavilapalli commented on YARN-2494:
---

bq. I think it still has problem: Even if we have store abstraction, we still 
need some logic to guarantee labels being added are valid (e.g. we need check 
if a label existed in collection, and label existed in node when we trying to 
remove some labels from a node).
Then that validation code needs to get pulled out into a common layer. My goal 
is not to put the entire NodelabelsManager in yarn-common - it just doesn't 
belong there.

bq. How about addToNodeLabelsCollection(), removeFromNodeLabelsCollection(), 
addLabelsToNode() and removeLabelsFromNode()
bq. addToClusterNodeLabels(), removeFromClusterNodeLabels(), addLabelsToNode() 
and removeLabelsFromNode(). 
[~leftnoteasy], [~cwelch], I'm okay with either of the above. Or should we call 
it {{ClusterNodeLabelsCollection}}? :)

 [YARN-796] Node label manager API and storage implementations
 -

 Key: YARN-2494
 URL: https://issues.apache.org/jira/browse/YARN-2494
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, 
 YARN-2494.patch, YARN-2494.patch, YARN-2494.patch


 This JIRA includes the APIs and storage implementations of the node label manager.
 NodeLabelManager is an abstract class used to manage labels of nodes in the 
 cluster; it has APIs to query/modify
 - Nodes according to given label
 - Labels according to given hostname
 - Add/remove labels
 - Set labels of nodes in the cluster
 - Persist/recover changes of labels/labels-on-nodes to/from storage
 And it has two implementations to store modifications
 - Memory based storage: It will not persist changes, so all labels will be 
 lost when the RM restarts
 - FileSystem based storage: It will persist/recover to/from FileSystem (like 
 HDFS), and all labels and labels-on-nodes will be recovered upon RM restart



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-09-29 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152301#comment-14152301
 ] 

Anubhav Dhoot commented on YARN-1879:
-

The patch needs to be updated 

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.2-wip.patch, 
 YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152307#comment-14152307
 ] 

Jason Lowe commented on YARN-90:


Thanks for updating the patch, Varun.

bq. I've changed it to "Disk(s) health report: ". My only concern with this is 
that there might be scripts looking for the "Disk(s) failed" log line for 
monitoring. What do you think?

If that's true then the code should bother to do a diff between the old disk 
list and the new one, logging which disks turned bad using the "Disk(s) failed" 
line and which disks became healthy with some other log message.

bq. Directories are only cleaned up during startup. The code tests for 
existence of the directories and the correct permissions. This does mean that 
container directories left behind for any reason won't get cleaned up until the 
NodeManager is restarted. Is that ok?

This could still be problematic for the NM work-preserving restart case, as we 
could try to delete an entire disk tree with active containers on it due to a 
hiccup when the NM restarts.  I think a better approach is a periodic cleanup 
scan that looks for directories under yarn-local and yarn-logs that shouldn't 
be there.  This could be part of the health check scan or done separately.  
That way we don't have to wait for a disk to turn good or bad to catch leaked 
entities on the disk due to some hiccup.  Sorta like an fsck for the NM state 
on disk.  That is best done as a separate JIRA, as I think this functionality 
is still an incremental improvement without it.

Other comments:

checkDirs unnecessarily calls union(errorDirs, fullDirs) twice.

isDiskFreeSpaceOverLimt is now named backwards, as the code returns true if the 
free space is under the limit.

getLocalDirsForCleanup and getLogDirsForCleanup should have javadoc comments 
like the other methods.

Nit: The union utility function doesn't technically perform a union but rather 
a concatenation, and it'd be a little clearer if the name reflected that.  Also, 
since the function knows how big the ArrayList will be after the operations, it 
should pass that size as a capacity hint to the constructor to avoid 
reallocations.
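
For the last nit, a tiny illustrative sketch of a concatenation helper with a capacity hint (names are made up):
{code}
import java.util.ArrayList;
import java.util.List;

final class DirListUtilSketch {
  // Concatenates two lists; the ArrayList is pre-sized so no reallocation occurs.
  static <T> List<T> concat(List<T> first, List<T> second) {
    List<T> result = new ArrayList<T>(first.size() + second.size());
    result.addAll(first);
    result.addAll(second);
    return result;
  }
}
{code}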


 NodeManager should identify failed disks becoming good back again
 -

 Key: YARN-90
 URL: https://issues.apache.org/jira/browse/YARN-90
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Ravi Gummadi
Assignee: Varun Vasudev
 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
 YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
 apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, 
 apache-yarn-90.5.patch, apache-yarn-90.6.patch


 MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
 down, it is marked as failed forever. To reuse that disk (after it becomes 
 good), NodeManager needs restart. This JIRA is to improve NodeManager to 
 reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2446) Using TimelineNamespace to shield the entities of a user

2014-09-29 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152313#comment-14152313
 ] 

Zhijie Shen commented on YARN-2446:
---

bq. Get domains API: If callerUGI is not the owner or the admin of the domain, 
we need to hide the details from him, and only allow him to see the ID: Why is 
that, I think we should just not allow non-owners to see anything. Is there a 
user-case for this?

bq. Based on the above decision, 
TestTimelineWebServices.testGetDomainsYarnACLsEnabled() should be changed to 
either validate that only IDs are visible or nothing is visible.

The rationale before was to let users check whether the namespace ID is 
occupied before putting one. I talked to Vinod offline; since it cannot solve 
the race condition of multiple put requests anyway, let's simplify the 
behavior as suggested above. It's not related to the code in this patch. Let me 
file a separate Jira for it.

bq. Shouldn't the server completely own DEFAULT_DOMAIN_ID, instead of letting 
anyone create it with potentially arbitrary permission?

Yes, DEFAULT_DOMAIN_ID is owned by the timeline server. When 
TimelineDataManager is constructed, if the default domain has not been created 
before, the timeline server will create one. Users cannot create or 
modify the domain with DEFAULT_DOMAIN_ID.

bq. testGetEntitiesWithYarnACLsEnabled()

The test cases seem to be problematic. I've updated these test cases and added 
validation of the cross-domain entity relationship.

One more issue I've noticed is that, after this patch, we should make the RM 
put the application metrics into a secured domain instead of the default one. 
Will file a Jira for it as well.
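
As a rough sketch of the owner check being discussed (helper names are hypothetical; the real logic lives in the timeline ACLs handling):
{code}
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.records.timeline.TimelineDomain;

final class DomainAccessSketch {
  // Hypothetical: only the owner (or an admin) may see a domain's details.
  static boolean canViewDomain(UserGroupInformation callerUgi, TimelineDomain domain,
                               boolean callerIsAdmin) {
    if (callerIsAdmin) {
      return true;
    }
    return callerUgi != null
        && callerUgi.getShortUserName().equals(domain.getOwner());
  }
}
{code}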

 Using TimelineNamespace to shield the entities of a user
 

 Key: YARN-2446
 URL: https://issues.apache.org/jira/browse/YARN-2446
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2446.1.patch, YARN-2446.2.patch, YARN-2446.3.patch


 Given YARN-2102 adds TimelineNamespace, we can make use of it to shield the 
 entities, preventing them from being accessed or affected by other users' 
 operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2446) Using TimelineNamespace to shield the entities of a user

2014-09-29 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2446:
--
Attachment: YARN-2446.3.patch

 Using TimelineNamespace to shield the entities of a user
 

 Key: YARN-2446
 URL: https://issues.apache.org/jira/browse/YARN-2446
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2446.1.patch, YARN-2446.2.patch, YARN-2446.3.patch


 Given YARN-2102 adds TimelineNamespace, we can make use of it to shield the 
 entities, preventing them from being accessed or affected by other users' 
 operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2621) Simplify the output when the user doesn't have the access for getDomain(s)

2014-09-29 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2621:
-

 Summary: Simplify the output when the user doesn't have the access 
for getDomain(s) 
 Key: YARN-2621
 URL: https://issues.apache.org/jira/browse/YARN-2621
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen


Per discussion in 
[YARN-2446|https://issues.apache.org/jira/browse/YARN-2446?focusedCommentId=14151272page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14151272],
 we should simply reject the user if it doesn't have access to the domain(s), 
instead of returning the entity without detailed information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2622) RM should put the application related timeline data into a secured domain

2014-09-29 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2622:
--
Component/s: timelineserver

 RM should put the application related timeline data into a secured domain
 -

 Key: YARN-2622
 URL: https://issues.apache.org/jira/browse/YARN-2622
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the 
 application related timeline data is put into the default domain. It is not 
 secured. We should let the RM choose a secured domain in which to put the 
 system metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2622) RM should put the application related timeline data into a secured domain

2014-09-29 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2622:
-

 Summary: RM should put the application related timeline data into 
a secured domain
 Key: YARN-2622
 URL: https://issues.apache.org/jira/browse/YARN-2622
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen


After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the 
application related timeline data is put into the default domain. It is not 
secured. We should let the RM choose a secured domain in which to put the system metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2622) RM should put the application related timeline data into a secured domain

2014-09-29 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2622:
--
Affects Version/s: 2.6.0

 RM should put the application related timeline data into a secured domain
 -

 Key: YARN-2622
 URL: https://issues.apache.org/jira/browse/YARN-2622
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the 
 application related timeline data is put into the default domain. It is not 
 secured. We should let the RM choose a secured domain in which to put the 
 system metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2622) RM should put the application related timeline data into a secured domain

2014-09-29 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2622:
--
Target Version/s: 2.6.0

 RM should put the application related timeline data into a secured domain
 -

 Key: YARN-2622
 URL: https://issues.apache.org/jira/browse/YARN-2622
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the 
 application related timeline data is put into the default domain. It is not 
 secured. We should let the RM choose a secured domain in which to put the 
 system metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152348#comment-14152348
 ] 

Jonathan Eagles commented on YARN-2606:
---

Committed to trunk and branch-2

 Application History Server tries to access hdfs before doing secure login
 -

 Key: YARN-2606
 URL: https://issues.apache.org/jira/browse/YARN-2606
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Mit Desai
Assignee: Mit Desai
 Fix For: 2.6.0

 Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
 YARN-2606.patch


 While testing the Application Timeline Server, the server would not come up 
 in a secure cluster, as it would keep trying to access hdfs without having 
 done the secure login. It would repeatedly try authenticating and finally hit 
 stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152352#comment-14152352
 ] 

Hudson commented on YARN-2606:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #6146 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6146/])
YARN-2606. Application History Server tries to access hdfs before doing secure 
login (Mit Desai via jeagles) (jeagles: rev 
e10eeaabce2a21840cfd5899493c9d2d4fe2e322)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestFileSystemApplicationHistoryStore.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java


 Application History Server tries to access hdfs before doing secure login
 -

 Key: YARN-2606
 URL: https://issues.apache.org/jira/browse/YARN-2606
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Mit Desai
Assignee: Mit Desai
 Fix For: 2.6.0

 Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
 YARN-2606.patch


 While testing the Application Timeline Server, the server would not come up 
 in a secure cluster, as it would keep trying to access hdfs without having 
 done the secure login. It would repeatedly try authenticating and finally hit 
 stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-29 Thread Chris Trezzo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152355#comment-14152355
 ] 

Chris Trezzo commented on YARN-2179:


[~vinodkv] Mocking YarnClient seems to be tricky due to it being an 
AbstractService. Would extending YarnClientImpl and only overriding methods I 
need to stub be a more reasonable approach? For this approach I would need to 
make the serviceStart and serviceStop methods in YarnClientImpl publicly 
visible for testing. It is still a little tricky due to the serviceStart and 
serviceStop methods of YarnClientImpl using ClientRMProxy. That is originally 
why I decided to just create a different dummy YarnClient implementation. Any 
thoughts on these alternative approaches, or am I just missing an easy way to 
mock YarnClient (which is highly possible)?

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
 YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, 
 YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, 
 YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2446) Using TimelineNamespace to shield the entities of a user

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152362#comment-14152362
 ] 

Hadoop QA commented on YARN-2446:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671870/YARN-2446.3.patch
  against trunk revision 7f0efe9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5173//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5173//console

This message is automatically generated.

 Using TimelineNamespace to shield the entities of a user
 

 Key: YARN-2446
 URL: https://issues.apache.org/jira/browse/YARN-2446
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2446.1.patch, YARN-2446.2.patch, YARN-2446.3.patch


 Given YARN-2102 adds TimelineNamespace, we can make use of it to shield the 
 entities, preventing them from being accessed or affected by other users' 
 operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.

2014-09-29 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152360#comment-14152360
 ] 

Karthik Kambatla commented on YARN-2566:


We should probably have the same mechanism of picking directories in both the 
default and linux container-executors. It appears LCE picks these at random. 
Can we do the same here? I understand picking directories at random might 
result in a skew due to not-so-random randomness or different applications 
localizing different sizes of data. 

Maybe, in the future, we could pick the directory with the most available space? 
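
A minimal sketch of the "most available space" idea, purely illustrative (the executor would presumably work off the NM's configured local dirs rather than raw strings):
{code}
import java.io.File;
import java.util.List;

final class LocalDirChooserSketch {
  // Pick the local dir with the most usable space instead of always the first one.
  static File pickDirWithMostSpace(List<String> localDirs) {
    File best = null;
    long bestSpace = -1L;
    for (String dir : localDirs) {
      File f = new File(dir);
      long usable = f.getUsableSpace();  // 0 if the path does not exist
      if (usable > bestSpace) {
        bestSpace = usable;
        best = f;
      }
    }
    return best;  // null only if localDirs is empty
  }
}
{code}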

 IOException happen in startLocalizer of DefaultContainerExecutor due to not 
 enough disk space for the first localDir.
 -

 Key: YARN-2566
 URL: https://issues.apache.org/jira/browse/YARN-2566
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2566.000.patch, YARN-2566.001.patch


 startLocalizer in DefaultContainerExecutor will only use the first localDir 
 to copy the token file. If the copy fails for the first localDir due to not 
 enough disk space there, the localization will fail even though there is 
 plenty of disk space in other localDirs. We see the following error 
 for this case:
 {code}
 2014-09-13 23:33:25,171 WARN 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to 
 create app directory 
 /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
 java.io.IOException: mkdir of 
 /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
   at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
   at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
   at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
 2014-09-13 23:33:25,185 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Localizer failed
 java.io.FileNotFoundException: File 
 file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 
 does not exist
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
   at 
 org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344)
   at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
   at 
 org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
   at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
   at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
   at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
   at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
 2014-09-13 23:33:25,186 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
  Container container_1410663092546_0004_01_01 transitioned from 
 LOCALIZING to 

[jira] [Updated] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization

2014-09-29 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YARN-2387:

Attachment: YARN-2387.patch

Updated the patch

 Resource Manager crashes with NPE due to lack of synchronization
 

 Key: YARN-2387
 URL: https://issues.apache.org/jira/browse/YARN-2387
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.5.0
Reporter: Mit Desai
Assignee: Mit Desai
Priority: Blocker
 Attachments: YARN-2387.patch, YARN-2387.patch


 We recently came across a 0.23 RM crashing with an NPE. Here is the 
 stacktrace for it.
 {noformat}
 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
 handling event type NODE_UPDATE to the scheduler
 java.lang.NullPointerException
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
 at
 org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
 at
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
 at java.lang.Thread.run(Thread.java:722)
 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
 {noformat}
 On investigating the issue we found that ContainerStatusPBImpl has 
 methods that are called by different threads and are not synchronized. Even 
 the 2.x code looks the same.
 We need to make these methods synchronized so that we do not encounter this 
 problem in the future.
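
 The fix described above amounts to serializing access to the shared builder state, roughly like this simplified stand-in (the real class wraps protobuf builders; the types below are placeholders):
{code}
// Simplified stand-in for a PBImpl-style record: every method that touches the
// shared builder/proto state is synchronized, so a scheduler thread calling
// toString() cannot race another thread merging local fields into the builder.
final class SyncedStatusSketch {
  private final StringBuilder builder = new StringBuilder();  // stands in for the proto builder
  private String diagnostics = "";

  synchronized void setDiagnostics(String d) {
    diagnostics = d;
  }

  private synchronized void mergeLocalToBuilder() {
    builder.setLength(0);
    builder.append(diagnostics);
  }

  synchronized String getProto() {
    mergeLocalToBuilder();
    return builder.toString();
  }

  @Override
  public synchronized String toString() {
    return getProto();
  }
}
{code}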



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2623) Linux container executor only use the first local directory to copy token file in container-executor.c.

2014-09-29 Thread zhihai xu (JIRA)
zhihai xu created YARN-2623:
---

 Summary: Linux container executor only use the first local 
directory to copy token file in container-executor.c.
 Key: YARN-2623
 URL: https://issues.apache.org/jira/browse/YARN-2623
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
 Environment: Linux container executor only use the first local 
directory to copy token file in container-executor.c.
Reporter: zhihai xu
Assignee: zhihai xu


The Linux container executor only uses the first local directory to copy the 
token file in container-executor.c. If it fails to copy the token file to the 
first local directory, a localization failure event will happen, even though it 
could copy the token file to another local directory successfully. The correct 
behavior would be to copy the token file to the next local directory if the 
first one fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.

2014-09-29 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152428#comment-14152428
 ] 

zhihai xu commented on YARN-2566:
-

For the Linux container executor, this is done in the C file container-executor.c.
It also picks the first directory to copy the token file to;
see the following code in container-executor.c:
{code}
  char *primary_app_dir = NULL;
  for(nm_root=local_dirs; *nm_root != NULL; ++nm_root) {
char *app_dir = get_app_directory(*nm_root, user, app_id);
if (app_dir == NULL) {
  // try the next one
} else if (mkdirs(app_dir, permissions) != 0) {
  free(app_dir);
} else if (primary_app_dir == NULL) {
  primary_app_dir = app_dir;
} else {
  free(app_dir);
}
  }
  char *cred_file_name = concatenate("%s/%s", "cred file", 2,
                                     primary_app_dir,
                                     basename(nmPrivate_credentials_file_copy));
  if (copy_file(cred_file, nmPrivate_credentials_file,
                cred_file_name, S_IRUSR|S_IWUSR) != 0){
free(nmPrivate_credentials_file_copy);
return -1;
  }
{code}

I created a new jira YARN-2623 for LCE.


 IOException happen in startLocalizer of DefaultContainerExecutor due to not 
 enough disk space for the first localDir.
 -

 Key: YARN-2566
 URL: https://issues.apache.org/jira/browse/YARN-2566
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-2566.000.patch, YARN-2566.001.patch


 startLocalizer in DefaultContainerExecutor will only use the first localDir 
 to copy the token file. If the copy fails for the first localDir due to not 
 enough disk space there, the localization will fail even though there is 
 plenty of disk space in other localDirs. We see the following error 
 for this case:
 {code}
 2014-09-13 23:33:25,171 WARN 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to 
 create app directory 
 /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
 java.io.IOException: mkdir of 
 /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
   at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
   at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
   at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
 2014-09-13 23:33:25,185 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Localizer failed
 java.io.FileNotFoundException: File 
 file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 
 does not exist
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
   at 
 org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344)
   at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
   at 
 org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
   at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
   at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
   at 

[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152446#comment-14152446
 ] 

Hadoop QA commented on YARN-2387:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671880/YARN-2387.patch
  against trunk revision c88c6c5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5174//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5174//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5174//console

This message is automatically generated.

 Resource Manager crashes with NPE due to lack of synchronization
 

 Key: YARN-2387
 URL: https://issues.apache.org/jira/browse/YARN-2387
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.5.0
Reporter: Mit Desai
Assignee: Mit Desai
Priority: Blocker
 Attachments: YARN-2387.patch, YARN-2387.patch


 We recently came across a 0.23 RM crashing with an NPE. Here is the 
 stacktrace for it.
 {noformat}
 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
 handling event type NODE_UPDATE to the scheduler
 java.lang.NullPointerException
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
 at
 org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
 at
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
 at java.lang.Thread.run(Thread.java:722)
 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
 {noformat}
 On investigating the issue we found that the ContainerStatusPBImpl has 
 methods that are called by different threads and are not synchronized. The 
 2.X code looks the same.
 We need to make these methods synchronized so that we do not encounter this 
 problem in the future.
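For readers following along, a minimal, hypothetical sketch of the kind of change being discussed (it is not the attached patch, and the class below is a simplified stand-in for ContainerStatusPBImpl): the methods that touch the shared builder become synchronized, so a concurrent toString()/getProto() cannot race with a setter.

{code}
// Simplified stand-in for a PB-backed record; not the actual Hadoop class.
class StatusPBSketch {
  private final StringBuilder builder = new StringBuilder();
  private String diagnostics;

  public synchronized void setDiagnostics(String diagnostics) {
    this.diagnostics = diagnostics;
  }

  // Merge and build under the same lock, so a concurrent toString() cannot
  // observe the builder mid-mutation (the NPE scenario in the stack trace above).
  public synchronized String getProto() {
    if (diagnostics != null) {
      builder.append(diagnostics);
      diagnostics = null; // merged into the builder, as a mergeLocalToBuilder would
    }
    return builder.toString();
  }

  @Override
  public synchronized String toString() {
    return getProto();
  }
}
{code}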



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-09-29 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152483#comment-14152483
 ] 

Anubhav Dhoot commented on YARN-1879:
-

Nit in ProtocolHATestBase:

bq. method will be re-entry
method will be re-entered

bq. the entire logic test.
the entire logic of the test?

bq. APIs that added trigger flag.
APIs that added Idempotent/AtMostOnce annotation?

Looks good otherwise.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.2-wip.patch, 
 YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager

2014-09-29 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152486#comment-14152486
 ] 

Zhijie Shen commented on YARN-2527:
---

The patch almost looks good to me, in particular the additional test cases for 
ApplicationACLsManager. Just one nit:

1. The logic here is a bit counter-intuitive. Can we just assign 
acls.get(applicationAccessType) to applicationACL only when it is not null?
{code}
  applicationACL = acls.get(applicationAccessType);
  if (applicationACL == null) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("ACL not found for access-type " + applicationAccessType
          + " for application " + applicationId + " owned by "
          + applicationOwner + ". Using default ["
          + YarnConfiguration.DEFAULT_YARN_APP_ACL + "]");
    }
    applicationACL = DEFAULT_YARN_APP_ACL;
{code}
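A small, self-contained sketch of the restructuring being suggested (String and System.out stand in for AccessControlList and LOG.debug, and the default value is a placeholder): only overwrite the default when the lookup actually returns an ACL, so the null branch is just the logging.

{code}
import java.util.Map;

class AclLookupSketch {
  static final String DEFAULT_YARN_APP_ACL = "*"; // placeholder for the real default

  static String resolveAcl(Map<String, String> acls, String accessType) {
    String applicationACL = DEFAULT_YARN_APP_ACL;
    String configured = (acls == null) ? null : acls.get(accessType);
    if (configured != null) {
      applicationACL = configured; // assign only when an ACL was actually found
    } else {
      System.out.println("ACL not found for access-type " + accessType
          + ". Using default [" + DEFAULT_YARN_APP_ACL + "]");
    }
    return applicationACL;
  }
}
{code}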

 NPE in ApplicationACLsManager
 -

 Key: YARN-2527
 URL: https://issues.apache.org/jira/browse/YARN-2527
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: Benoy Antony
Assignee: Benoy Antony
 Attachments: YARN-2527.patch, YARN-2527.patch


 NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error.
 The relevant stacktrace snippet from the ResourceManager logs is as below
 {code}
 Caused by: java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101)
 at 
 org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
 at 
 org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
 at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
 {code}
 This issue was reported by [~miguenther].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2301) Improve yarn container command

2014-09-29 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152503#comment-14152503
 ] 

Naganarasimha G R commented on YARN-2301:
-

Attaching patch with corrected test cases.

 Improve yarn container command
 --

 Key: YARN-2301
 URL: https://issues.apache.org/jira/browse/YARN-2301
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Naganarasimha G R
  Labels: usability
 Attachments: YARN-2301.01.patch


 While running the yarn container -list <Application Attempt ID> command, some 
 observations:
 1) the scheme (e.g. http/https) before LOG-URL is missing
 2) the start-time is printed as milliseconds (e.g. 1405540544844). Better to 
 print it in a time format (see the sketch after this description).
 3) finish-time is 0 if the container is not yet finished. Maybe print N/A instead.
 4) May have an option to run as yarn container -list <appId> OR yarn 
 application -list-containers <appId> as well.
 As the attempt Id is not shown on the console, this makes it easier for the user 
 to just copy the appId and run it; it may also be useful for container-preserving 
 AM restart.
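On point 2 above, a tiny sketch of the formatting being asked for (the actual patch may choose a different format): convert the raw millisecond timestamp into a readable date before printing it.

{code}
import java.text.SimpleDateFormat;
import java.util.Date;

class StartTimeFormatSketch {
  public static void main(String[] args) {
    long startTimeMillis = 1405540544844L; // the example value from the report
    SimpleDateFormat fmt = new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy");
    // Prints a readable timestamp (exact output depends on the local time zone)
    System.out.println(fmt.format(new Date(startTimeMillis)));
  }
}
{code}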



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user

2014-09-29 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152504#comment-14152504
 ] 

Craig Welch commented on YARN-1063:
---

When looking this over to pick up context for YARN-2198, I noticed a couple of things:

libwinutils.c CreateLogonForUser - confusing name, makes me think a new
account is being created - CreateLogonTokenForUser?  LogonUser?

TestWinUtils - can we add testing specific to security?

 Winutils needs ability to create task as domain user
 

 Key: YARN-1063
 URL: https://issues.apache.org/jira/browse/YARN-1063
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
 Environment: Windows
Reporter: Kyle Leckie
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, 
 YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch


 h1. Summary:
 Securing a Hadoop cluster requires constructing some form of security 
 boundary around the processes executed in YARN containers. Isolation based on 
 Windows user isolation seems most feasible. This approach is similar to the 
 approach taken by the existing LinuxContainerExecutor. The current patch to 
 winutils.exe adds the ability to create a process as a domain user. 
 h1. Alternative Methods considered:
 h2. Process rights limited by security token restriction:
 On Windows access decisions are made by examining the security token of a 
 process. It is possible to spawn a process with a restricted security token. 
 Any of the rights granted by SIDs of the default token may be restricted. It 
 is possible to see this in action by examining the security token of a 
 sandboxed process launched by a web browser. Typically the launched process 
 will have a fully restricted token and need to access machine resources 
 through a dedicated broker process that enforces a custom security policy. 
 This broker process mechanism would break compatibility with the typical 
 Hadoop container process. The Container process must be able to utilize 
 standard function calls for disk and network IO. I performed some work 
 looking at ways to ACL the local files to the specific launched process without 
 granting rights to other processes launched on the same machine, but found 
 this to be an overly complex solution. 
 h2. Relying on APP containers:
 Recent versions of windows have the ability to launch processes within an 
 isolated container. Application containers are supported for execution of 
 WinRT based executables. This method was ruled out due to the lack of 
 official support for standard windows APIs. At some point in the future 
 windows may support functionality similar to BSD jails or Linux containers, 
 at that point support for containers should be added.
 h1. Create As User Feature Description:
 h2. Usage:
 A new sub command was added to the set of task commands. Here is the syntax:
 winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE]
 Some notes:
 * The username specified is in the format of user@domain
 * The machine executing this command must be joined to the domain of the user 
 specified
 * The domain controller must allow the account executing the command access 
 to the user information. For this, join the account to the predefined group 
 labeled "Pre-Windows 2000 Compatible Access".
 * The account running the command must have several rights on the local 
 machine. These can be managed manually using secpol.msc: 
 ** Act as part of the operating system - SE_TCB_NAME
 ** Replace a process-level token - SE_ASSIGNPRIMARYTOKEN_NAME
 ** Adjust memory quotas for a process - SE_INCREASE_QUOTA_NAME
 * The launched process will not have rights to the desktop so will not be 
 able to display any information or create UI.
 * The launched process will have no network credentials. Any access of 
 network resources that requires domain authentication will fail.
 h2. Implementation:
 Winutils performs the following steps:
 # Enable the required privileges for the current process.
 # Register as a trusted process with the Local Security Authority (LSA).
 # Create a new logon for the user passed on the command line.
 # Load/Create a profile on the local machine for the new logon.
 # Create a new environment for the new logon.
 # Launch the new process in a job with the task name specified and using the 
 created logon.
 # Wait for the JOB to exit.
 h2. Future work:
 The following work was scoped out of this check in:
 * Support for non-domain users or machines that are not domain joined.
 * Support for privilege isolation by running the task launcher in a high 
 privilege service with access over an ACLed named pipe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-29 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152523#comment-14152523
 ] 

Craig Welch commented on YARN-2198:
---

pom.xml - I don’t see a /etc/hadoop or a wsce-site.xml, missed?

RawLocalFileSystem

Is someone from HDFS looking at this?

protected boolean mkOneDir(File p2f) throws IOException - nit, generalize arg 
name pls

return (parent == null || parent2f.exists() || mkdirs(parent)) &&
  (mkOneDir(p2f) || p2f.isDirectory());

so, I don't get this logic; I believe it will fail if the path exists and is 
not a directory.  Why not just do mkdirs(p2f) if p2f doesn't exist? Seems much 
simpler, and drops the need for mkOneDir.
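A minimal sketch of the simpler flow suggested above (illustrative only; the real RawLocalFileSystem method also deals with permissions and the parent chain):

{code}
import java.io.File;
import java.io.IOException;

class MkdirsSketch {
  // Create the directory (and any missing parents) only when it does not exist;
  // fail explicitly when the path exists but is not a directory.
  static boolean mkdirsIfMissing(File p2f) throws IOException {
    if (p2f.exists()) {
      if (!p2f.isDirectory()) {
        throw new IOException(p2f + " exists but is not a directory");
      }
      return true;
    }
    return p2f.mkdirs();
  }
}
{code}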

NativeIO

Elevated class - I believe this is Windows specific, WindowsElevated or 
ElevatedWindows?  Why doesn't it extend Windows - I don't think secure and 
insecure windows should become wholly dissimilar

createTaskAsUser, killTask, ProcessStub:

These aren't really io, I think they should be factored out to their own 
process-specific class


 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
 YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
 YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
 YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
 YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
 YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch


 YARN-1972 introduces a Secure Windows Container Executor. However this 
 executor requires the process launching the container to be LocalSystem or a 
 member of the a local Administrators group. Since the process in question is 
 the NodeManager, the requirement translates to the entire NM to run as a 
 privileged account, a very large surface area to review and protect.
 This proposal is to move the privileged operations into a dedicated NT 
 service. The NM can run as a low privilege account and communicate with the 
 privileged NT service when it needs to launch a container. This would reduce 
 the surface exposed to the high privileges. 
 There has to exist a secure, authenticated and authorized channel of 
 communication between the NM and the privileged NT service. Possible 
 alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
 be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
 specific inter-process communication channel that satisfies all requirements 
 and is easy to deploy. The privileged NT service would register and listen on 
 an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
 with libwinutils which would host the LPC client code. The client would 
 connect to the LPC port (NtConnectPort) and send a message requesting a 
 container launch (NtRequestWaitReplyPort). LPC provides authentication and 
 the privileged NT service can use authorization API (AuthZ) to validate the 
 caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS

2014-09-29 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152532#comment-14152532
 ] 

Zhijie Shen commented on YARN-2583:
---

Some thoughts about the log deletion service of LRS:

1. I'm not sure if it's good to do normal log deletion in 
AggregatedLogDeletionService, while deleting rolling logs in 
AppLogAggregatorImpl. AggregatedLogDeletionService (inside JHS) will still try 
to delete the whole log dir while the LRS is still running.

2. Usually we do retention by time instead of by size, and it's inconsistent 
between AggregatedLogDeletionService and AppLogAggregatorImpl. While 
AggregatedLogDeletionService keeps all the logs newer than T1, 
AppLogAggregatorImpl may have already deleted logs newer than T1 to limit the 
number of logs of the LRS. It's going to be unpredictable after what time the 
logs should be still available for access.

3. Another problem w.r.t. NM_LOG_AGGREGATION_RETAIN_RETENTION_SIZE_PER_APP is 
that the config favors the longer rollingIntervalSeconds. For example, with 
NM_LOG_AGGREGATION_RETAIN_RETENTION_SIZE_PER_APP = 10, if an LRS sets 
rollingIntervalSeconds = 1D, after 10D it is still going to keep all the logs. 
However, if the LRS sets rollingIntervalSeconds = 0.5D, after 10D it can only 
keep the last 5D's logs, even though the amount of generated logs is the same 
(a small worked example follows below).

4. Assuming we want to do deletion in AppLogAggregatorImpl, should we do deletion 
first and uploading next, to avoid the number of logs temporarily going beyond the 
cap?
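To make point 3 concrete, a tiny sketch of the arithmetic (the cap value and intervals are the ones from the example above; variable names are abbreviated, not the real config keys):

{code}
class RetentionWindowSketch {
  public static void main(String[] args) {
    int retainedLogFilesPerApp = 10;   // size-based cap from the config
    double daysRunning = 10.0;
    for (double intervalDays : new double[] {1.0, 0.5}) {
      // With one rolled file per interval, the retained window is cap * interval,
      // bounded by how long the app has been running.
      double retainedDays =
          Math.min(daysRunning, retainedLogFilesPerApp * intervalDays);
      System.out.println("rollingInterval=" + intervalDays + "D -> last "
          + retainedDays + "D of logs kept");
    }
  }
}
{code}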

 Modify the LogDeletionService to support Log aggregation for LRS
 

 Key: YARN-2583
 URL: https://issues.apache.org/jira/browse/YARN-2583
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2583.1.patch


 Currently, AggregatedLogDeletionService will delete old logs from HDFS. It 
 checks the cut-off-time, and if all logs for this application are older than 
 the cut-off-time, the app-log-dir in HDFS will be deleted. This will not 
 work for LRS. We expect an LRS application to keep running for a long time. 
 Two different scenarios: 
 1) If we configured the rollingIntervalSeconds, new log files will always be 
 uploaded to HDFS. The number of log files for this application will 
 become larger and larger, and no log files will be deleted.
 2) If we did not configure the rollingIntervalSeconds, the log file can only 
 be uploaded to HDFS after the application is finished. It is very possible 
 that the logs are uploaded after the cut-off-time. This will cause problems 
 because at that time the app-log-dir for this application in HDFS has already 
 been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2301) Improve yarn container command

2014-09-29 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-2301:

Attachment: YARN-2303.patch

Attaching patch for the unit test failures.


 Improve yarn container command
 --

 Key: YARN-2301
 URL: https://issues.apache.org/jira/browse/YARN-2301
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Naganarasimha G R
  Labels: usability
 Attachments: YARN-2301.01.patch, YARN-2303.patch


 While running the yarn container -list <Application Attempt ID> command, some 
 observations:
 1) the scheme (e.g. http/https) before LOG-URL is missing
 2) the start-time is printed as milliseconds (e.g. 1405540544844). Better to 
 print it in a time format.
 3) finish-time is 0 if the container is not yet finished. Maybe print N/A instead.
 4) May have an option to run as yarn container -list <appId> OR yarn 
 application -list-containers <appId> as well.
 As the attempt Id is not shown on the console, this makes it easier for the user 
 to just copy the appId and run it; it may also be useful for container-preserving 
 AM restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store

2014-09-29 Thread Mayank Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152559#comment-14152559
 ] 

Mayank Bansal commented on YARN-2320:
-

I think overall it looks ok, however I have to run it.

Some small comments:

Shouldn't we use N/A in convertToApplicationAttemptReport instead of null?
Similarly for convertToApplicationReport?
Similarly for convertToContainerReport?

 Removing old application history store after we store the history data to 
 timeline store
 

 Key: YARN-2320
 URL: https://issues.apache.org/jira/browse/YARN-2320
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2320.1.patch, YARN-2320.2.patch


 After YARN-2033, we should deprecate the application history store set. There's 
 no need to maintain two sets of store interfaces. In addition, we should 
 conclude the outstanding JIRAs under YARN-321 about the application history 
 store.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2301) Improve yarn container command

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152579#comment-14152579
 ] 

Hadoop QA commented on YARN-2301:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671912/YARN-2303.patch
  against trunk revision c88c6c5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5175//console

This message is automatically generated.

 Improve yarn container command
 --

 Key: YARN-2301
 URL: https://issues.apache.org/jira/browse/YARN-2301
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Naganarasimha G R
  Labels: usability
 Attachments: YARN-2301.01.patch, YARN-2303.patch


 While running the yarn container -list <Application Attempt ID> command, some 
 observations:
 1) the scheme (e.g. http/https) before LOG-URL is missing
 2) the start-time is printed as milliseconds (e.g. 1405540544844). Better to 
 print it in a time format.
 3) finish-time is 0 if the container is not yet finished. Maybe print N/A instead.
 4) May have an option to run as yarn container -list <appId> OR yarn 
 application -list-containers <appId> as well.
 As the attempt Id is not shown on the console, this makes it easier for the user 
 to just copy the appId and run it; it may also be useful for container-preserving 
 AM restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2468) Log handling for LRS

2014-09-29 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2468:

Attachment: YARN-2468.9.patch

 Log handling for LRS
 

 Key: YARN-2468
 URL: https://issues.apache.org/jira/browse/YARN-2468
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
 YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
 YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, 
 YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, 
 YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, 
 YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.patch


 Currently, when an application is finished, the NM will start to do the log 
 aggregation. But for long-running service applications, this is not ideal. 
 The problems we have are:
 1) LRS applications are expected to run for a long time (weeks, months).
 2) Currently, all the container logs (from one NM) will be written into a 
 single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2179) Initial cache manager structure and context

2014-09-29 Thread Chris Trezzo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo updated YARN-2179:
---
Attachment: YARN-2179-trunk-v10.patch

[~vinodkv] [~kasha]

Attached is v10.

Here is a new approach where I extend YarnClientImpl, stub out the service 
init/start/stop methods and mock the relevant methods to test. Does this seem 
like a cleaner approach to you guys?

I tried to do a straight mocking without extending the abstract class, but 
continually ran into the issue that AbstractService.stateModel is initialized 
in the constructor. This creates a problem when trying to stub 
AbstractService.getServiceState(), which is required for the AbstractService to 
work with a CompositeService.

Let me know if you don't like this approach or you know of an easier method and 
I can readjust the patch. Thanks!
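For illustration, a minimal, self-contained sketch of the shape of that approach (the classes below are simplified stand-ins for YarnClientImpl/AbstractService, not the actual v10 patch): extend the class, no-op the service lifecycle methods, and override only the methods the test exercises.

{code}
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for an AbstractService-style base class.
abstract class FakeClientService {
  protected void serviceInit() throws Exception {}
  protected void serviceStart() throws Exception {}
  protected void serviceStop() throws Exception {}
  abstract List<String> getApplications();
}

// The test subclass stubs out the lifecycle so no RPC or state machine is touched,
// and returns canned data from the one method under test.
class StubbedClient extends FakeClientService {
  @Override protected void serviceInit() {}
  @Override protected void serviceStart() {}
  @Override protected void serviceStop() {}

  @Override
  List<String> getApplications() {
    return Arrays.asList("application_1_0001", "application_1_0002");
  }
}
{code}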

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, 
 YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, 
 YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, 
 YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2468) Log handling for LRS

2014-09-29 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152599#comment-14152599
 ] 

Xuan Gong commented on YARN-2468:
-

bq. Why is the test in TestAggregatedLogsBlock ignored?

We will have YARN-2583 for web UI related changes. This test would fail right 
now, so I added @Ignore.

bq. pendingUploadFiles is really not neded to be a class field. Rename 
getNumOfLogFilesToUpload() to be getPendingLogFilesToUploadForThisContainer() 
and return the set of pending files. LogValue.write() can then take SetFile 
pendingLogFilesToUpload as one of the arguments.

I would like to check how many log files we can upload this time. If the number 
is 0, we can skip this cycle. This check also happens before 
LogKey.write(); otherwise, we would write the key but without a value.
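A rough sketch of that check (illustrative; the set names mirror the comment below, while the method names are made up): compute the pending set first and skip the whole key/value write when it is empty.

{code}
import java.io.File;
import java.util.HashSet;
import java.util.Set;

class PendingUploadSketch {
  // Pending = everything that exists now minus what was uploaded in earlier cycles.
  static Set<File> pendingLogFiles(Set<File> allExistingFiles,
      Set<File> alreadyUploadedFiles) {
    Set<File> pending = new HashSet<>(allExistingFiles);
    pending.removeAll(alreadyUploadedFiles);
    return pending;
  }

  static void maybeUpload(Set<File> allExistingFiles, Set<File> alreadyUploadedFiles) {
    Set<File> pending = pendingLogFiles(allExistingFiles, alreadyUploadedFiles);
    if (pending.isEmpty()) {
      return; // skip this cycle entirely: no LogKey is written without a LogValue
    }
    // ... write the LogKey, then the LogValue with exactly these pending files ...
  }
}
{code}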

bq. If deletion of previously uploaded file takes a while and the file remains 
by the time of the next cycle, we will upload it again? It seems to be, let's 
validate this via a test-case.

No, it will not. That is why I save information such as 
allExistingFiles, alreadyUploadedFiles, etc. We use those to check whether 
the logs have been uploaded before.

bq. testLogAggregationServiceWithInterval: doLogAggregationOutOfBand + 
Thread.sleep() is unreliable. Use a clock and refactor AppLogAggregatorImpl to 
have the cyclic aggregation directly callable via a method.

The Thread.sleep() is not used to trigger the logAggregation. It is used to 
make sure the logs have been uploaded into the remote directory. But I deleted 
those Thread.sleep() calls from the testcases.

 Log handling for LRS
 

 Key: YARN-2468
 URL: https://issues.apache.org/jira/browse/YARN-2468
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
 YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
 YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, 
 YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, 
 YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, 
 YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.patch


 Currently, when an application is finished, the NM will start to do the log 
 aggregation. But for long-running service applications, this is not ideal. 
 The problems we have are:
 1) LRS applications are expected to run for a long time (weeks, months).
 2) Currently, all the container logs (from one NM) will be written into a 
 single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2468) Log handling for LRS

2014-09-29 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152600#comment-14152600
 ] 

Xuan Gong commented on YARN-2468:
-

New patch addressed all other comments

 Log handling for LRS
 

 Key: YARN-2468
 URL: https://issues.apache.org/jira/browse/YARN-2468
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
 YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
 YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, 
 YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, 
 YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, 
 YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.patch


 Currently, when an application is finished, the NM will start to do the log 
 aggregation. But for long-running service applications, this is not ideal. 
 The problems we have are:
 1) LRS applications are expected to run for a long time (weeks, months).
 2) Currently, all the container logs (from one NM) will be written into a 
 single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2624) Resource Localization fails on a secure cluster until nm are restarted

2014-09-29 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2624:
---

 Summary: Resource Localization fails on a secure cluster until nm 
are restarted
 Key: YARN-2624
 URL: https://issues.apache.org/jira/browse/YARN-2624
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


We have found that resource localization fails on a secure cluster with the following 
error in certain cases. This happens at some indeterminate point, after which it 
will keep failing until the NM is restarted.

{noformat}
INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Failed to download rsrc { { 
hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml,
 1412027745352, FILE, null 
},pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING}
java.io.IOException: Rename cannot overwrite non empty destination directory 
/data/yarn/nm/filecache/27
at 
org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228)
at 
org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152637#comment-14152637
 ] 

Hadoop QA commented on YARN-2179:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12671924/YARN-2179-trunk-v10.patch
  against trunk revision c88c6c5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5176//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5176//console

This message is automatically generated.

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, 
 YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, 
 YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, 
 YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
  currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2621) Simplify the output when the user doesn't have the access for getDomain(s)

2014-09-29 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2621:
--
Attachment: YARN-2621.1.patch

Created a patch to fix the problem.

 Simplify the output when the user doesn't have the access for getDomain(s) 
 ---

 Key: YARN-2621
 URL: https://issues.apache.org/jira/browse/YARN-2621
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2621.1.patch


 Per discussion in 
 [YARN-2446|https://issues.apache.org/jira/browse/YARN-2446?focusedCommentId=14151272page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14151272],
  we should simply reject the user if it doesn't have access to the domain(s), 
 instead of returning the entity without detailed information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2621) Simplify the output when the user doesn't have the access for getDomain(s)

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152695#comment-14152695
 ] 

Hadoop QA commented on YARN-2621:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671931/YARN-2621.1.patch
  against trunk revision 0577eb3.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5177//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5177//console

This message is automatically generated.

 Simplify the output when the user doesn't have the access for getDomain(s) 
 ---

 Key: YARN-2621
 URL: https://issues.apache.org/jira/browse/YARN-2621
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2621.1.patch


 Per discussion in 
 [YARN-2446|https://issues.apache.org/jira/browse/YARN-2446?focusedCommentId=14151272page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14151272],
  we should simply reject the user if it doesn't have access to the domain(s), 
 instead of returning the entity without detailed information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2468) Log handling for LRS

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152717#comment-14152717
 ] 

Hadoop QA commented on YARN-2468:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671923/YARN-2468.9.patch
  against trunk revision 0577eb3.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5178//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5178//console

This message is automatically generated.

 Log handling for LRS
 

 Key: YARN-2468
 URL: https://issues.apache.org/jira/browse/YARN-2468
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
 YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
 YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, 
 YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, 
 YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, 
 YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.patch


 Currently, when an application is finished, the NM will start to do the log 
 aggregation. But for long-running service applications, this is not ideal. 
 The problems we have are:
 1) LRS applications are expected to run for a long time (weeks, months).
 2) Currently, all the container logs (from one NM) will be written into a 
 single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization

2014-09-29 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YARN-2387:

Attachment: YARN-2387.patch

 Resource Manager crashes with NPE due to lack of synchronization
 

 Key: YARN-2387
 URL: https://issues.apache.org/jira/browse/YARN-2387
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.5.0
Reporter: Mit Desai
Assignee: Mit Desai
Priority: Blocker
 Attachments: YARN-2387.patch, YARN-2387.patch, YARN-2387.patch


 We recently came across a 0.23 RM crashing with an NPE. Here is the 
 stacktrace for it.
 {noformat}
 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
 handling event type NODE_UPDATE to the scheduler
 java.lang.NullPointerException
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
 at
 org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
 at
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
 at java.lang.Thread.run(Thread.java:722)
 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
 {noformat}
 On investigating the issue we found that the ContainerStatusPBImpl has 
 methods that are called by different threads and are not synchronized. The 
 2.X code looks the same.
 We need to make these methods synchronized so that we do not encounter this 
 problem in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2468) Log handling for LRS

2014-09-29 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2468:

Attachment: YARN-2468.9.1.patch

 Log handling for LRS
 

 Key: YARN-2468
 URL: https://issues.apache.org/jira/browse/YARN-2468
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
 YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
 YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, 
 YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, 
 YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, 
 YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, 
 YARN-2468.9.1.patch, YARN-2468.9.patch


 Currently, when an application is finished, the NM will start to do the log 
 aggregation. But for long-running service applications, this is not ideal. 
 The problems we have are:
 1) LRS applications are expected to run for a long time (weeks, months).
 2) Currently, all the container logs (from one NM) will be written into a 
 single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152752#comment-14152752
 ] 

Hadoop QA commented on YARN-2387:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671946/YARN-2387.patch
  against trunk revision 0577eb3.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5179//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5179//console

This message is automatically generated.

 Resource Manager crashes with NPE due to lack of synchronization
 

 Key: YARN-2387
 URL: https://issues.apache.org/jira/browse/YARN-2387
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.5.0
Reporter: Mit Desai
Assignee: Mit Desai
Priority: Blocker
 Attachments: YARN-2387.patch, YARN-2387.patch, YARN-2387.patch


 We recently came across a 0.23 RM crashing with an NPE. Here is the 
 stacktrace for it.
 {noformat}
 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
 handling event type NODE_UPDATE to the scheduler
 java.lang.NullPointerException
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
 at
 org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
 at
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
 at
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
 at java.lang.Thread.run(Thread.java:722)
 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
 {noformat}
 On investigating the issue we found that the ContainerStatusPBImpl has 
 methods that are called by different threads and are not synchronized. The 
 2.X code looks the same.
 We need to make these methods synchronized so that we do not encounter this 
 problem in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2545) RMApp should transit to FAILED when AM calls finishApplicationMaster with FAILED

2014-09-29 Thread Hong Zhiguo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152804#comment-14152804
 ] 

Hong Zhiguo commented on YARN-2545:
---

[~leftnoteasy], [~jianhe], [~ozawa], please have a look: should we set the state of 
app/appAttempt to FAILED instead of FINISHED, or just count it as Apps Failed 
instead of Apps Completed?

 RMApp should transit to FAILED when AM calls finishApplicationMaster with 
 FAILED
 

 Key: YARN-2545
 URL: https://issues.apache.org/jira/browse/YARN-2545
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Hong Zhiguo
Assignee: Hong Zhiguo
Priority: Minor

 If AM calls finishApplicationMaster with getFinalApplicationStatus()==FAILED, 
 and then exits, the corresponding RMApp and RMAppAttempt transition to the state 
 FINISHED.
 I think this is wrong and confusing. On RM WebUI, this application is 
 displayed as State=FINISHED, FinalStatus=FAILED, and is counted as Apps 
 Completed, not as Apps Failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)