[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)

2014-10-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179673#comment-14179673
 ] 

Wangda Tan commented on YARN-2495:
--

[~Naganarasimha], [~aw], let me first give you an overview of what we need to do 
to support labels in the capacity scheduler; that should help explain why we need 
central node label validation now. 
With the existing capacity scheduler changes (patch of YARN-2496), we can specify 
which labels each queue can access (to make sure important resources can only be 
used by privileged users) and the proportion of resource per label (the "marketing" 
queue can access 80% of the GPU resource). So if a user wants to leverage the 
capacity scheduler changes, the user *MUST* specify 1) the labels the queue can access 
and 2) the proportion of resource the queue can use for each label (an illustrative 
configuration sketch follows below).
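To make that prerequisite concrete, here is a minimal sketch of the per-queue 
configuration the YARN-2496 change expects, with property prefixes abbreviated for 
readability (the exact keys are an assumption and may differ in the final patch):
{code}
# Illustrative only; prefixes abbreviated, exact property names may differ from the YARN-2496 patch
root.marketing.accessible-node-labels = GPU
root.marketing.accessible-node-labels.GPU.capacity = 80
{code}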
Back to the central node label validation discussion: without it, we cannot make 
the capacity scheduler work for now (a user cannot specify capacity for an unknown 
node label on a queue, etc.). So I still insist on having central node label 
validation for both centralized and distributed node label configuration, at least 
for the 2.6 release. This might change in the future; I suggest moving the work to 
disable central node label validation to a separate task for further discussion.

I've also looked at the WIP patch uploaded by [~Naganarasimha]; thanks for it. 
After a quick glance, a few suggestions:
- According to the comments above, do not change {{CommonNodeLabelsManager}}; move 
the changes that disable central node label validation to a separate patch for 
further discussion. 
- Make this patch contain a {{NodeLabelProvider}} only, and create separate JIRAs 
for {{ScriptNodeLabelProvider}} and for an implementation that reads node labels 
from yarn-site.xml, for easier review (a hypothetical sketch of such a provider is 
below).
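For reference, a hypothetical sketch of what such a {{NodeLabelProvider}} contract 
could look like; the actual interface in the WIP patch may be shaped differently 
(for example as an NM service):
{code}
import java.util.Set;

// Hypothetical sketch only; the real NodeLabelProvider in the WIP patch may differ.
public interface NodeLabelProvider {
  /** Labels this NodeManager should report to the RM when it registers or heartbeats. */
  Set<String> getNodeLabels();
}
{code}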

> Allow admin specify labels in each NM (Distributed configuration)
> -
>
> Key: YARN-2495
> URL: https://issues.apache.org/jira/browse/YARN-2495
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Naganarasimha G R
> Attachments: YARN-2495_20141022.1.patch
>
>
> The target of this JIRA is to allow admins to specify labels on each NM; this covers
> - Users can set labels on each NM (by setting yarn-site.xml or by using a script 
> as suggested by [~aw])
> - The NM will send its labels to the RM via the ResourceTracker API
> - The RM will set the labels in NodeLabelManager when the NM registers/updates labels



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2398) TestResourceTrackerOnHA crashes

2014-10-22 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2398:
-
Attachment: (was: TestResourceTrackerOnHA-output.txt)

> TestResourceTrackerOnHA crashes
> ---
>
> Key: YARN-2398
> URL: https://issues.apache.org/jira/browse/YARN-2398
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jason Lowe
>
> TestResourceTrackerOnHA is currently crashing and failing trunk builds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2398) TestResourceTrackerOnHA crashes

2014-10-22 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179797#comment-14179797
 ] 

Tsuyoshi OZAWA commented on YARN-2398:
--

Rohith, Wangda, yeah, thanks for pointing that out. The log I attached does not 
look related to the issue Jason mentioned. Removing it. 

> TestResourceTrackerOnHA crashes
> ---
>
> Key: YARN-2398
> URL: https://issues.apache.org/jira/browse/YARN-2398
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jason Lowe
>
> TestResourceTrackerOnHA is currently crashing and failing trunk builds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS

2014-10-22 Thread cntic (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179802#comment-14179802
 ] 

cntic commented on YARN-2681:
-

Find the concept at http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf 

> Support bandwidth enforcement for containers while reading from HDFS
> 
>
> Key: YARN-2681
> URL: https://issues.apache.org/jira/browse/YARN-2681
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacityscheduler, nodemanager, resourcemanager
>Affects Versions: 2.5.1
> Environment: Linux
>Reporter: cntic
> Attachments: HADOOP-2681.patch, Traffic Control Design.png
>
>
> To read/write data from HDFS on a data node, applications establish TCP/IP 
> connections with the datanode. The HDFS read can be controlled by setting up the 
> Linux Traffic Control (TC) subsystem on the data node to apply filters on the 
> appropriate connections.
> The current cgroups net_cls concept cannot be applied on the node where the 
> container is launched, nor on the data node, since:
> -   TC handles outgoing bandwidth only, so it cannot be applied on the container 
> node (an HDFS read is incoming data for the container)
> -   Since the HDFS data node is handled by only one process, it is not possible 
> to use net_cls to separate connections from different containers to the datanode.
> Tasks:
> 1) Extend the Resource model to define a bandwidth enforcement rate
> 2) Monitor TCP/IP connections established by the container handling process and 
> its child processes
> 3) Set Linux Traffic Control rules on the data node based on address:port pairs 
> in order to enforce the bandwidth of outgoing data
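As an illustration of task 3 above, a hedged sketch of how per-connection 
enforcement could be issued with Linux Traffic Control from Java; the device name, 
class ids, rate and HTB setup are assumptions, and the attached design document and 
patch remain authoritative:
{code}
import java.io.IOException;

// Hypothetical sketch only: shape outgoing DataNode traffic for one container
// connection with Linux Traffic Control (tc). Device, class ids and rate are assumptions.
public class TrafficControlSketch {
  private static void tc(String... args) throws IOException, InterruptedException {
    String[] cmd = new String[args.length + 1];
    cmd[0] = "tc";
    System.arraycopy(args, 0, cmd, 1, args.length);
    new ProcessBuilder(cmd).inheritIO().start().waitFor();
  }

  public static void main(String[] args) throws Exception {
    String dev = "eth0";               // DataNode NIC (assumption)
    String containerAddr = "10.0.0.5"; // node where the reading container runs
    String containerPort = "49152";    // remote port of the container's HDFS connection
    String rate = "50mbit";            // enforcement rate from the extended Resource model

    // One-time HTB root qdisc on the DataNode, then one class per enforced connection.
    tc("qdisc", "add", "dev", dev, "root", "handle", "1:", "htb");
    tc("class", "add", "dev", dev, "parent", "1:", "classid", "1:10", "htb", "rate", rate);
    // Steer outgoing packets of this address:port pair into the rate-limited class.
    tc("filter", "add", "dev", dev, "parent", "1:", "protocol", "ip", "prio", "1",
       "u32", "match", "ip", "dst", containerAddr + "/32",
       "match", "ip", "dport", containerPort, "0xffff", "flowid", "1:10");
  }
}
{code}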



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179831#comment-14179831
 ] 

Hudson commented on YARN-2721:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #720 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/720/])
YARN-2721. Suppress NodeExist exception thrown by ZKRMStateStore when it 
retries creating znode. Contributed by Jian He. (zjshen: rev 
7e3b5e6f5cb4945b4fab27e8a83d04280df50e17)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java


> Race condition: ZKRMStateStore retry logic may throw NodeExist exception 
> -
>
> Key: YARN-2721
> URL: https://issues.apache.org/jira/browse/YARN-2721
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Fix For: 2.6.0
>
> Attachments: YARN-2721.1.patch
>
>
> Blindly retrying operations in zookeeper will not work for non-idempotent 
> operations (like create znode). The reason is that the client can do a create 
> znode, but the response may not be returned because the server can die or 
> timeout. In case of retrying the create znode, it will throw a NODE_EXISTS 
> exception from the earlier create from the same session.  
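A minimal sketch of the pattern behind the fix, not the actual ZKRMStateStore code: 
when a retried create hits NODE_EXISTS, the node may well have been created by the 
earlier attempt whose response was lost, so it can be treated as success.
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch only; the committed ZKRMStateStore change may differ in detail.
public final class RetriedCreateSketch {
  static void createWithRetry(ZooKeeper zk, String path, byte[] data, int maxRetries)
      throws Exception {
    for (int attempt = 0; ; attempt++) {
      try {
        zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        return;
      } catch (KeeperException.NodeExistsException e) {
        // Possibly created by our own earlier attempt whose response was lost:
        // suppress instead of failing the retried, non-idempotent create.
        return;
      } catch (KeeperException.ConnectionLossException e) {
        if (attempt >= maxRetries) {
          throw e;
        }
        // connection-level failure: retry the create
      }
    }
  }
}
{code}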



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179829#comment-14179829
 ] 

Hudson commented on YARN-2720:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #720 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/720/])
YARN-2720. Windows: Wildcard classpath variables not expanded against resources 
contained in archives. Contributed by Craig Welch. (cnauroth: rev 
6637e3cf95b3a9be8d6b9cd66bc849a0607e8ed5)
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* 
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestFileUtil.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Classpath.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java


> Windows: Wildcard classpath variables not expanded against resources 
> contained in archives
> --
>
> Key: YARN-2720
> URL: https://issues.apache.org/jira/browse/YARN-2720
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Craig Welch
>Assignee: Craig Welch
> Fix For: 2.6.0
>
> Attachments: YARN-2720.2.patch, YARN-2720.3.patch, YARN-2720.4.patch
>
>
> On windows there are limitations to the length of command lines and 
> environment variables which prevent placing all classpath resources into 
> these elements.  Instead, a jar containing only a classpath manifest is 
> created to provide the classpath.  During this process wildcard references 
> are expanded by inspecting the filesystem.  Since archives are extracted to a 
> different location and linked into the final location after the classpath jar 
> is created, resources referred to via wildcards which exist in localized 
> archives  (.zip, tar.gz) are not added to the classpath manifest jar.  Since 
> these entries are removed from the final classpath for the container they are 
> not on the container's classpath as they should be.
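For background, a sketch of the general classpath-manifest-jar technique the 
description refers to (not Hadoop's actual FileUtil code): the classpath lives in 
the jar's Class-Path manifest attribute, so wildcard entries must be expanded to 
concrete paths at the moment the jar is written, which is exactly when archive 
contents are not yet in place.
{code}
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Illustrative sketch of a classpath-only jar; Hadoop's real implementation differs.
public class ClasspathJarSketch {
  public static void writeClasspathJar(String jarPath, Iterable<String> classpathEntries)
      throws IOException {
    Manifest manifest = new Manifest();
    Attributes attrs = manifest.getMainAttributes();
    attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
    // Class-Path is a space-separated list of concrete entries; wildcards such as
    // lib/* are not understood here and must already have been expanded.
    attrs.put(Attributes.Name.CLASS_PATH, String.join(" ", classpathEntries));
    try (JarOutputStream jar = new JarOutputStream(new FileOutputStream(jarPath), manifest)) {
      // The manifest alone carries the classpath; no other entries are required.
    }
  }
}
{code}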



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179835#comment-14179835
 ] 

Hudson commented on YARN-2709:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #720 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/720/])
YARN-2709. Made timeline client getDelegationToken API retry if 
ConnectException happens. Contributed by Li Lu. (zjshen: rev 
b2942762d7f76d510ece5621c71116346a6b12f6)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* hadoop-yarn-project/CHANGES.txt


> Add retry for timeline client getDelegationToken method
> ---
>
> Key: YARN-2709
> URL: https://issues.apache.org/jira/browse/YARN-2709
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, 
> YARN-2709-102114-2.patch, YARN-2709-102114.patch
>
>
> As mentioned in YARN-2673, we need to add retry mechanism to timeline client 
> for secured clusters. This means if the timeline server is not available, a 
> timeline client needs to retry to get a delegation token. 
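A minimal sketch of the retry behaviour being asked for, not the actual 
TimelineClientImpl code; the retry count and interval are assumed parameters:
{code}
import java.net.ConnectException;

// Illustrative sketch only; retries a call while the timeline server is not yet reachable.
public final class TimelineRetrySketch {
  interface Call<T> {
    T run() throws Exception;
  }

  static <T> T retryOnConnectException(Call<T> call, int maxRetries, long retryIntervalMs)
      throws Exception {
    for (int attempt = 0; ; attempt++) {
      try {
        return call.run();
      } catch (ConnectException e) {
        if (attempt >= maxRetries) {
          throw e;  // give up after the configured number of attempts
        }
        Thread.sleep(retryIntervalMs);
      }
    }
  }
}
{code}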



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179832#comment-14179832
 ] 

Hudson commented on YARN-2715:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #720 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/720/])
YARN-2715. Fixed ResourceManager to respect common configurations for proxy 
users/groups beyond just the YARN level config. Contributed by Zhijie Shen. 
(vinodkv: rev c0e034336c85296be6f549d88d137fb2b2b79a15)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMAdminService.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMProxyUsersConf.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java


> Proxy user is problem for RPC interface if 
> yarn.resourcemanager.webapp.proxyuser is not set.
> 
>
> Key: YARN-2715
> URL: https://issues.apache.org/jira/browse/YARN-2715
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch, 
> YARN-2715.4.patch
>
>
> After YARN-2656, if people set hadoop.proxyuser for the client<-->RM RPC 
> interface, it's not going to work, because ProxyUsers#sip is a singleton per 
> daemon. After YARN-2656, the RM has two channels that want to set this 
> configuration: RPC and HTTP. The RPC interface sets it first by reading 
> hadoop.proxyuser, but it is then overwritten by the HTTP interface, which sets it 
> to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist.
> The fix could be similar to what we've done for YARN-2676: make the HTTP 
> interface always source hadoop.proxyuser first, and then 
> yarn.resourcemanager.webapp.proxyuser.
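A sketch of that fix direction, assuming nothing beyond the standard Configuration 
API (the committed patch may wire this differently): copy the common 
hadoop.proxyuser.* settings first, then let yarn.resourcemanager.webapp.proxyuser.* 
override them.
{code}
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

// Illustrative sketch only; the committed YARN-2715 change may differ.
public final class ProxyUserConfSketch {
  static Configuration buildWebAppProxyUserConf(Configuration conf) {
    Configuration result = new Configuration(false);
    // Common settings first, YARN web-app specific settings second, so an empty
    // yarn.resourcemanager.webapp.proxyuser.* no longer wipes out hadoop.proxyuser.*.
    copyWithPrefix(conf, result, "hadoop.proxyuser.");
    copyWithPrefix(conf, result, "yarn.resourcemanager.webapp.proxyuser.");
    return result;
  }

  private static void copyWithPrefix(Configuration src, Configuration dst, String prefix) {
    for (Map.Entry<String, String> e : src) {
      if (e.getKey().startsWith(prefix)) {
        // Re-key under hadoop.proxyuser.* so ProxyUsers sees a single namespace.
        dst.set("hadoop.proxyuser." + e.getKey().substring(prefix.length()), e.getValue());
      }
    }
  }
}
{code}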



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179830#comment-14179830
 ] 

Hudson commented on YARN-90:


SUCCESS: Integrated in Hadoop-Yarn-trunk #720 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/720/])
YARN-90. NodeManager should identify failed disks becoming good again. 
Contributed by Varun Vasudev (jlowe: rev 
6f2028bd1514d90b831f889fd0ee7f2ba5c15000)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDirectoryCollection.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLocalDirsHandlerService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/TestNonAggregatingLogHandler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java


> NodeManager should identify failed disks becoming good again
> 
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Fix For: 2.6.0
>
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
> apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, 
> apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
>
> MAPREDUCE-3121 makes the NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good again), the NodeManager needs a restart. This JIRA is to improve the 
> NodeManager to reuse good disks (which could have been bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2725) Adding retry requests about ZKRMStateStore

2014-10-22 Thread Tsuyoshi OZAWA (JIRA)
Tsuyoshi OZAWA created YARN-2725:


 Summary: Adding retry requests about ZKRMStateStore
 Key: YARN-2725
 URL: https://issues.apache.org/jira/browse/YARN-2725
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Tsuyoshi OZAWA


YARN-2721 found a race condition in the ZK-specific retry semantics. We should add 
tests covering retried requests to ZK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception

2014-10-22 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179842#comment-14179842
 ] 

Tsuyoshi OZAWA commented on YARN-2721:
--

Good job, Jian. Created YARN-2725 for adding tests to cover these cases.

> Race condition: ZKRMStateStore retry logic may throw NodeExist exception 
> -
>
> Key: YARN-2721
> URL: https://issues.apache.org/jira/browse/YARN-2721
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Fix For: 2.6.0
>
> Attachments: YARN-2721.1.patch
>
>
> Blindly retrying operations in zookeeper will not work for non-idempotent 
> operations (like create znode). The reason is that the client can do a create 
> znode, but the response may not be returned because the server can die or 
> timeout. In case of retrying the create znode, it will throw a NODE_EXISTS 
> exception from the earlier create from the same session.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore

2014-10-22 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2725:
-
Summary: Adding test cases of retrying requests about ZKRMStateStore  (was: 
Adding retry requests about ZKRMStateStore)

> Adding test cases of retrying requests about ZKRMStateStore
> ---
>
> Key: YARN-2725
> URL: https://issues.apache.org/jira/browse/YARN-2725
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Tsuyoshi OZAWA
>
> YARN-2721 found a race condition in the ZK-specific retry semantics. We should 
> add tests covering retried requests to ZK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity

2014-10-22 Thread Phil D'Amore (JIRA)
Phil D'Amore created YARN-2726:
--

 Summary: CapacityScheduler should explicitly log when an 
accessible label has no capacity
 Key: YARN-2726
 URL: https://issues.apache.org/jira/browse/YARN-2726
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Reporter: Phil D'Amore
Priority: Minor


Given:

- Node label defined: test-label
- Two queues defined: a, b
- label accessibility and capacity defined as follows (properties 
abbreviated for readability):

root.a.accessible-node-labels = test-label
root.a.accessible-node-labels.test-label.capacity = 100

If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack 
trace with the following error buried within:

"Illegal capacity of -1.0 for label=test-label in queue=root.b"

This of course occurs because test-label is accessible to b due to inheritance 
from the root, and -1 is the UNDEFINED value.  To my mind this might not be 
obvious to the admin, and the error message which results does not help guide 
someone to the source of the issue.

I propose that this situation be updated so that when the capacity on an 
accessible label is undefined, it is explicitly called out instead of falling 
through to the illegal capacity check.  Something like:

{code}
if (capacity == UNDEFINED) {
  throw new IllegalArgumentException("Configuration issue: " + " label=" +
      label + " is accessible from queue=" + queue + " but has no capacity set.");
}
{code}

I'll leave it to better judgement than mine as to whether I'm throwing the 
appropriate exception there.  I think this check should be added to both 
getNodeLabelCapacities and getMaximumNodeLabelCapacities in 
CapacitySchedulerConfiguration.java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2692) ktutil test hanging on some machines/ktutil versions

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179903#comment-14179903
 ] 

Hudson commented on YARN-2692:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6310 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6310/])
YARN-2692 ktutil test hanging on some machines/ktutil versions (stevel) 
(stevel: rev 85a88649c3f3fb7280aa511b2035104bcef28a6f)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/RegistryTestHelper.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/secure/TestSecureLogins.java


> ktutil test hanging on some machines/ktutil versions
> 
>
> Key: YARN-2692
> URL: https://issues.apache.org/jira/browse/YARN-2692
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.6.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Fix For: 2.6.0
>
> Attachments: YARN-2692-001.patch
>
>
> a couple of the registry security tests run native {{ktutil}}; this is 
> primarily to debug the keytab generation. [~cnauroth] reports that some 
> versions of {{kinit}} hang. Fix: rm the tests. [YARN-2689]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179919#comment-14179919
 ] 

Hudson commented on YARN-2721:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1909 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1909/])
YARN-2721. Suppress NodeExist exception thrown by ZKRMStateStore when it 
retries creating znode. Contributed by Jian He. (zjshen: rev 
7e3b5e6f5cb4945b4fab27e8a83d04280df50e17)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
* hadoop-yarn-project/CHANGES.txt


> Race condition: ZKRMStateStore retry logic may throw NodeExist exception 
> -
>
> Key: YARN-2721
> URL: https://issues.apache.org/jira/browse/YARN-2721
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Fix For: 2.6.0
>
> Attachments: YARN-2721.1.patch
>
>
> Blindly retrying operations in zookeeper will not work for non-idempotent 
> operations (like create znode). The reason is that the client can do a create 
> znode, but the response may not be returned because the server can die or 
> timeout. In case of retrying the create znode, it will throw a NODE_EXISTS 
> exception from the earlier create from the same session.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179917#comment-14179917
 ] 

Hudson commented on YARN-2720:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1909 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1909/])
YARN-2720. Windows: Wildcard classpath variables not expanded against resources 
contained in archives. Contributed by Craig Welch. (cnauroth: rev 
6637e3cf95b3a9be8d6b9cd66bc849a0607e8ed5)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestFileUtil.java
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Classpath.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java


> Windows: Wildcard classpath variables not expanded against resources 
> contained in archives
> --
>
> Key: YARN-2720
> URL: https://issues.apache.org/jira/browse/YARN-2720
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Craig Welch
>Assignee: Craig Welch
> Fix For: 2.6.0
>
> Attachments: YARN-2720.2.patch, YARN-2720.3.patch, YARN-2720.4.patch
>
>
> On windows there are limitations to the length of command lines and 
> environment variables which prevent placing all classpath resources into 
> these elements.  Instead, a jar containing only a classpath manifest is 
> created to provide the classpath.  During this process wildcard references 
> are expanded by inspecting the filesystem.  Since archives are extracted to a 
> different location and linked into the final location after the classpath jar 
> is created, resources referred to via wildcards which exist in localized 
> archives  (.zip, tar.gz) are not added to the classpath manifest jar.  Since 
> these entries are removed from the final classpath for the container they are 
> not on the container's classpath as they should be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179921#comment-14179921
 ] 

Hudson commented on YARN-2715:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1909 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1909/])
YARN-2715. Fixed ResourceManager to respect common configurations for proxy 
users/groups beyond just the YARN level config. Contributed by Zhijie Shen. 
(vinodkv: rev c0e034336c85296be6f549d88d137fb2b2b79a15)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMProxyUsersConf.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMAdminService.java


> Proxy user is problem for RPC interface if 
> yarn.resourcemanager.webapp.proxyuser is not set.
> 
>
> Key: YARN-2715
> URL: https://issues.apache.org/jira/browse/YARN-2715
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch, 
> YARN-2715.4.patch
>
>
> After YARN-2656, if people set hadoop.proxyuser for the client<-->RM RPC 
> interface, it's not going to work, because ProxyUsers#sip is a singleton per 
> daemon. After YARN-2656, the RM has two channels that want to set this 
> configuration: RPC and HTTP. The RPC interface sets it first by reading 
> hadoop.proxyuser, but it is then overwritten by the HTTP interface, which sets it 
> to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist.
> The fix could be similar to what we've done for YARN-2676: make the HTTP 
> interface always source hadoop.proxyuser first, and then 
> yarn.resourcemanager.webapp.proxyuser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179918#comment-14179918
 ] 

Hudson commented on YARN-90:


FAILURE: Integrated in Hadoop-Hdfs-trunk #1909 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1909/])
YARN-90. NodeManager should identify failed disks becoming good again. 
Contributed by Varun Vasudev (jlowe: rev 
6f2028bd1514d90b831f889fd0ee7f2ba5c15000)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/TestNonAggregatingLogHandler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDirectoryCollection.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLocalDirsHandlerService.java


> NodeManager should identify failed disks becoming good again
> 
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Fix For: 2.6.0
>
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
> apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, 
> apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
>
> MAPREDUCE-3121 makes the NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good again), the NodeManager needs a restart. This JIRA is to improve the 
> NodeManager to reuse good disks (which could have been bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179924#comment-14179924
 ] 

Hudson commented on YARN-2709:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1909 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1909/])
YARN-2709. Made timeline client getDelegationToken API retry if 
ConnectException happens. Contributed by Li Lu. (zjshen: rev 
b2942762d7f76d510ece5621c71116346a6b12f6)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java


> Add retry for timeline client getDelegationToken method
> ---
>
> Key: YARN-2709
> URL: https://issues.apache.org/jira/browse/YARN-2709
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, 
> YARN-2709-102114-2.patch, YARN-2709-102114.patch
>
>
> As mentioned in YARN-2673, we need to add retry mechanism to timeline client 
> for secured clusters. This means if the timeline server is not available, a 
> timeline client needs to retry to get a delegation token. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2683) document registry config options

2014-10-22 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-2683:
-
Attachment: YARN-2683-002.patch

correct patch as applied to branch-2

> document registry config options
> 
>
> Key: YARN-2683
> URL: https://issues.apache.org/jira/browse/YARN-2683
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-2683-001.patch, YARN-2683-002.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Add to {{yarn-site}} a page on registry configuration parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed

2014-10-22 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179929#comment-14179929
 ] 

Mit Desai commented on YARN-2724:
-

The problem here is that the file length is calculated before the file is even 
opened. The log aggregator reads the length of the log file to be aggregated and 
records it, and only then tries to read the file contents. If the log aggregator 
does not have permission to access the file, it gets "Permission denied", just like 
what is seen here.

Which application were you trying to run when you encountered this error?

My guess is that if there is a specific application where this happens, the NM user 
needs access to the log files created by that application. Since log aggregation is 
done by the NM user, giving it permission to read the generated log files should fix 
this issue.
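To illustrate the mechanism described above (a sketch, not the actual 
AggregatedLogFormat code): the length is taken from the File before it is opened, so 
when the open fails the short error string is written in place of the contents while 
the reader still expects the pre-computed number of bytes.
{code}
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Illustrative sketch of the failure mode only; the real log writer differs.
public class LogAggregationSketch {
  static void appendLog(DataOutputStream out, File logFile) throws IOException {
    out.writeUTF(logFile.getName());                // LogType
    out.writeUTF(Long.toString(logFile.length()));  // LogLength, computed before opening
    try (FileInputStream in = new FileInputStream(logFile)) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);                       // Log Contents
      }
    } catch (IOException e) {
      // Permission denied etc.: far fewer bytes than the declared LogLength get
      // written, which breaks the framing of every entry that follows.
      out.writeUTF("Error aggregating log file. Log file : " + logFile + " (" + e.getMessage() + ")");
    }
  }
}
{code}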

> If an unreadable file is encountered during log aggregation then aggregated 
> file in HDFS badly formed
> -
>
> Key: YARN-2724
> URL: https://issues.apache.org/jira/browse/YARN-2724
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.5.1
>Reporter: Sumit Mohanty
>Assignee: Xuan Gong
>
> Look into the log output snippet. It looks like there is an issue during 
> aggregation when an unreadable file is encountered. Likely, this results in 
> bad encoding.
> {noformat}
> LogType: command-13.json
> LogLength: 13934
> Log Contents:
> Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json
>  (Permission denied)command-3.json13983Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json
>  (Permission denied)
>   
> errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: 
> [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K->15575K(184320K), 
> 0.0488700 secs] 163840K->15575K(1028096K), 0.0492510 secs] [Times: user=0.06 
> sys=0.01, real=0.05 secs]
> 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: 
> [ParNew: 179415K->11865K(184320K), 0.0941310 secs] 179415K->17228K(1028096K), 
> 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs]
> 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 
> 95.187: [ParNew: 175705K->12802K(184320K), 0.0466420 secs] 
> 181068K->18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, 
> real=0.04 secs]
> {noformat}
> Specifically, look at the text after the exception text. There should be two 
> more entries for log files but none exist. This is likely due to the fact 
> that command-13.json is expected to be of length 13934 but it is not, as the 
> file was never read.
> I think, it should have been
> {noformat}
> LogType: command-13.json
> LogLength: 
> Log Contents:
> Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json
>  (Permission denied)command-3.json13983Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json
>  (Permission denied)
> {noformat}
> {noformat}
> LogType: errors-3.txt
> LogLength:0
> Log Contents:
> {noformat}
> {noformat}
> LogType:gc.log
> LogLength:???
> Log Contents:
> ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: 
> [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity

2014-10-22 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R reassigned YARN-2726:
---

Assignee: Naganarasimha G R

> CapacityScheduler should explicitly log when an accessible label has no 
> capacity
> 
>
> Key: YARN-2726
> URL: https://issues.apache.org/jira/browse/YARN-2726
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Phil D'Amore
>Assignee: Naganarasimha G R
>Priority: Minor
>
> Given:
> - Node label defined: test-label
> - Two queues defined: a, b
> - label accessibility and capacity defined as follows (properties 
> abbreviated for readability):
> root.a.accessible-node-labels = test-label
> root.a.accessible-node-labels.test-label.capacity = 100
> If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack 
> trace with the following error buried within:
> "Illegal capacity of -1.0 for label=test-label in queue=root.b"
> This of course occurs because test-label is accessible to b due to 
> inheritance from the root, and -1 is the UNDEFINED value.  To my mind this 
> might not be obvious to the admin, and the error message which results does 
> not help guide someone to the source of the issue.
> I propose that this situation be updated so that when the capacity on an 
> accessible label is undefined, it is explicitly called out instead of falling 
> through to the illegal capacity check.  Something like:
> {code}
> if (capacity == UNDEFINED) {
> throw new IllegalArgumentException("Configuration issue: " + " label=" + 
> label + " is accessible from queue=" + queue + " but has no capacity set.");
> }
> {code}
> I'll leave it to better judgement than mine as to whether I'm throwing the 
> appropriate exception there.  I think this check should be added to both 
> getNodeLabelCapacities and getMaximumNodeLabelCapacities in 
> CapacitySchedulerConfiguration.java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems

2014-10-22 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179953#comment-14179953
 ] 

Steve Loughran commented on YARN-2700:
--

logs
{code}
2014-10-21 03:25:26,022 [NIOServerCxn.Factory:localhost/127.0.0.1:0] INFO  
server.NIOServerCnxnFactory (NIOServerCnxnFactory.java:run(197)) - Accepted 
socket connection from /127.0.0.1:49869
2014-10-21 03:25:26,024 [JUnit-SendThread(127.0.0.1:49864)] DEBUG 
zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(892)) - Session 
establishment request sent on 127.0.0.1/127.0.0.1:49864
Found KeyTab
Found KerberosKey for zookeeper/localh...@example.com
Found KerberosKey for zookeeper/localh...@example.com
Found KerberosKey for zookeeper/localh...@example.com
Found KerberosKey for zookeeper/localh...@example.com
Found KerberosKey for zookeeper/localh...@example.com
2014-10-21 03:25:26,035 [NIOServerCxn.Factory:localhost/127.0.0.1:0] INFO  
server.ZooKeeperServer (ZooKeeperServer.java:processConnectRequest(868)) - 
Client attempting to establish new session at /127.0.0.1:49869
2014-10-21 03:25:26,039 [SyncThread:0] INFO  persistence.FileTxnLog 
(FileTxnLog.java:append(199)) - Creating new log file: log.1
2014-10-21 03:25:26,057 [SyncThread:0] INFO  server.ZooKeeperServer 
(ZooKeeperServer.java:finishSessionInit(617)) - Established session 
0x149323d6882 with negotiated timeout 6 for client /127.0.0.1:49869
2014-10-21 03:25:26,059 [JUnit-SendThread(127.0.0.1:49864)] INFO  
zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session 
establishment complete on server 127.0.0.1/127.0.0.1:49864, sessionid = 
0x149323d6882, negotiated timeout = 6
Found ticket for zookee...@example.com to go to krbtgt/example@example.com 
expiring on Wed Oct 22 03:25:25 PDT 2014
Entered Krb5Context.initSecContext with state=STATE_NEW
Found ticket for zookee...@example.com to go to krbtgt/example@example.com 
expiring on Wed Oct 22 03:25:25 PDT 2014
Service ticket not found in the subject
KrbException: Server not found in Kerberos database (7) - Server not found in 
Kerberos database
at sun.security.krb5.KrbTgsRep.(KrbTgsRep.java:73)
at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:192)
at sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:203)
at 
sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:309)
at 
sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:115)
at 
sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:454)
at 
sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:641)
at 
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248)
at 
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
at 
org.apache.zookeeper.client.ZooKeeperSaslClient$2.run(ZooKeeperSaslClient.java:366)
at 
org.apache.zookeeper.client.ZooKeeperSaslClient$2.run(ZooKeeperSaslClient.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslToken(ZooKeeperSaslClient.java:362)
at 
org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslToken(ZooKeeperSaslClient.java:348)
at 
org.apache.zookeeper.client.ZooKeeperSaslClient.sendSaslPacket(ZooKeeperSaslClient.java:420)
at 
org.apache.zookeeper.client.ZooKeeperSaslClient.initialize(ZooKeeperSaslClient.java:458)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1013)
Caused by: KrbException: Identifier doesn't match expected value (906)
at sun.security.krb5.internal.KDCRep.init(KDCRep.java:143)
at sun.security.krb5.internal.TGSRep.init(TGSRep.java:66)
at sun.security.krb5.internal.TGSRep.(TGSRep.java:61)
at sun.security.krb5.KrbTgsRep.(KrbTgsRep.java:55)
... 18 more
2014-10-21 03:25:26,145 [JUnit-SendThread(127.0.0.1:49864)] ERROR 
client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:createSaslToken(384)) - An 
error: (java.security.PrivilegedActionException: 
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Server not found in Kerberos 
database (7) - Server not found in Kerberos database)]) occurred when 
evaluating Zookeeper Quorum Member's  received SASL token. Zookeeper Client 
will go to AUTH_FAILED state.
2014-10-21 03:25:26,146 [JUnit-SendThread(127.0.0.1:49864)] ERROR 
zookeeper.ClientCnxn (ClientCnxn.java:run(1015)) - SASL authentication with 
Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: 
(java.security.PrivilegedActionException: javax.security.sa

[jira] [Commented] (YARN-2683) document registry config options

2014-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179958#comment-14179958
 ] 

Hadoop QA commented on YARN-2683:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12676323/YARN-2683-002.patch
  against trunk revision 85a8864.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5492//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5492//console

This message is automatically generated.

> document registry config options
> 
>
> Key: YARN-2683
> URL: https://issues.apache.org/jira/browse/YARN-2683
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-2683-001.patch, YARN-2683-002.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Add to {{yarn-site}} a page on registry configuration parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS

2014-10-22 Thread cntic (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cntic updated YARN-2681:

Description: 
To read/write data from HDFS on a data node, applications establish TCP/IP 
connections with the datanode. The HDFS read can be controlled by setting up the 
Linux Traffic Control (TC) subsystem on the data node to apply filters on the 
appropriate connections.

The current cgroups net_cls concept cannot be applied on the node where the 
container is launched, nor on the data node, since:
-   TC handles outgoing bandwidth only, so it cannot be applied on the container 
node (an HDFS read is incoming data for the container)
-   Since the HDFS data node is handled by only one process, it is not possible to 
use net_cls to separate connections from different containers to the datanode.

Tasks:
1) Extend the Resource model to define a bandwidth enforcement rate
2) Monitor TCP/IP connections established by the container handling process and its 
child processes
3) Set Linux Traffic Control rules on the data node based on address:port pairs in 
order to enforce the bandwidth of outgoing data

Concept:
http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf

  was:
To read/write data from HDFS on a data node, applications establish TCP/IP 
connections with the datanode. The HDFS read can be controlled by setting up the 
Linux Traffic Control (TC) subsystem on the data node to apply filters on the 
appropriate connections.

The current cgroups net_cls concept cannot be applied on the node where the 
container is launched, nor on the data node, since:
-   TC handles outgoing bandwidth only, so it cannot be applied on the container 
node (an HDFS read is incoming data for the container)
-   Since the HDFS data node is handled by only one process, it is not possible to 
use net_cls to separate connections from different containers to the datanode.

Tasks:
1) Extend the Resource model to define a bandwidth enforcement rate
2) Monitor TCP/IP connections established by the container handling process and its 
child processes
3) Set Linux Traffic Control rules on the data node based on address:port pairs in 
order to enforce the bandwidth of outgoing data


> Support bandwidth enforcement for containers while reading from HDFS
> 
>
> Key: YARN-2681
> URL: https://issues.apache.org/jira/browse/YARN-2681
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacityscheduler, nodemanager, resourcemanager
>Affects Versions: 2.5.1
> Environment: Linux
>Reporter: cntic
> Attachments: HADOOP-2681.patch, Traffic Control Design.png
>
>
> To read/write data from HDFS on a data node, applications establish TCP/IP 
> connections with the datanode. The HDFS read can be controlled by setting up the 
> Linux Traffic Control (TC) subsystem on the data node to apply filters on the 
> appropriate connections.
> The current cgroups net_cls concept cannot be applied on the node where the 
> container is launched, nor on the data node, since:
> -   TC handles outgoing bandwidth only, so it cannot be applied on the container 
> node (an HDFS read is incoming data for the container)
> -   Since the HDFS data node is handled by only one process, it is not possible 
> to use net_cls to separate connections from different containers to the datanode.
> Tasks:
> 1) Extend the Resource model to define a bandwidth enforcement rate
> 2) Monitor TCP/IP connections established by the container handling process and 
> its child processes
> 3) Set Linux Traffic Control rules on the data node based on address:port pairs 
> in order to enforce the bandwidth of outgoing data
> Concept:
> http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-10-22 Thread Remus Rusanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated YARN-2198:
---
Attachment: YARN-2198.16.patch

.16.patch rebased to current trunk and resolves the conflict from YARN-2720

> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
> YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, 
> YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, 
> YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
> YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
> YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
> YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
> YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However, this 
> executor requires the process launching the container to be LocalSystem or a 
> member of the local Administrators group. Since the process in question is 
> the NodeManager, the requirement translates into the entire NM running as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS

2014-10-22 Thread cntic (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cntic updated YARN-2681:

Attachment: HADOOP-2681.patch

- Fix findbugs warnings
- For testing purposes (see the sketch below): 
  + The TC class rate can be given by reading an HDFS file defined in the YARN 
configuration 
  + The TC class burst can be defined in the configuration; otherwise a default 
value will be set when the TC class is added
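
For illustration, a hedged sketch (not taken from the attached patch) of how a 
TC class with a rate and burst, plus a u32 filter on one container's 
address:port pair, could be installed on the data node. The device name, class 
id, rate/burst values, and the helper method are assumptions for the sketch:
{code}
// Hedged sketch only: shapes traffic from the datanode to one container
// connection using Linux tc (HTB class + u32 filter). Device, class id and
// rate/burst values are illustrative, not values from the patch.
import java.io.IOException;

public class TcRuleSketch {
  public static void enforce(String device, String classId, String rate,
      String burst, String containerHost, int containerPort)
      throws IOException, InterruptedException {
    // Root HTB qdisc (idempotency and error handling omitted for brevity).
    run("tc", "qdisc", "add", "dev", device, "root", "handle", "1:", "htb");
    // One class capped at the enforcement rate for this container.
    run("tc", "class", "add", "dev", device, "parent", "1:", "classid", classId,
        "htb", "rate", rate, "burst", burst);
    // Match outgoing packets addressed to the container's address:port pair.
    run("tc", "filter", "add", "dev", device, "protocol", "ip", "parent", "1:",
        "prio", "1", "u32", "match", "ip", "dst", containerHost,
        "match", "ip", "dport", String.valueOf(containerPort), "0xffff",
        "flowid", classId);
  }

  private static void run(String... cmd) throws IOException, InterruptedException {
    Process p = new ProcessBuilder(cmd).inheritIO().start();
    if (p.waitFor() != 0) {
      throw new IOException("Command failed: " + String.join(" ", cmd));
    }
  }
}
{code}
For example, TcRuleSketch.enforce("eth0", "1:10", "30mbit", "50kb", "10.0.0.5", 45678) 
would cap outgoing data to that connection at roughly 30 Mbit/s.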

> Support bandwidth enforcement for containers while reading from HDFS
> 
>
> Key: YARN-2681
> URL: https://issues.apache.org/jira/browse/YARN-2681
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacityscheduler, nodemanager, resourcemanager
>Affects Versions: 2.5.1
> Environment: Linux
>Reporter: cntic
> Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control 
> Design.png
>
>
> To read/write data from HDFS, applications establish TCP/IP connections with 
> the data node. HDFS reads can be controlled by configuring the Linux Traffic 
> Control (TC) subsystem on the data node to put filters on the appropriate 
> connections.
> The current cgroups net_cls concept cannot be applied on the node where the 
> container is launched, nor on the data node, since:
> -   TC handles outgoing bandwidth only, so it cannot shape HDFS reads on the 
> container node (an HDFS read is incoming data for the container)
> -   Since the HDFS data node is handled by only one process, it is not possible 
> to use net_cls to separate connections from different containers to the 
> datanode.
> Tasks:
> 1) Extend the Resource model to define a bandwidth enforcement rate
> 2) Monitor TCP/IP connections established by the container-handling process and 
> its child processes
> 3) Set Linux Traffic Control rules on the data node, based on address:port 
> pairs, in order to enforce the bandwidth of outgoing data
> Concept:
> http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems

2014-10-22 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-2700:
-
Attachment: YARN-2700-001.patch

Patch. The problem (as explained by Chris Nauroth) is that Windows doesn't 
reverse-DNS 127.0.0.1 to localhost; the principals there need to use the raw IP address.

> TestSecureRMRegistryOperations failing on windows: auth problems
> 
>
> Key: YARN-2700
> URL: https://issues.apache.org/jira/browse/YARN-2700
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, resourcemanager
>Affects Versions: 2.6.0
> Environment: Windows Server, Win7
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-2700-001.patch
>
>
> TestSecureRMRegistryOperations failing on windows: unable to create the root 
> /registry path with permissions problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180012#comment-14180012
 ] 

Hudson commented on YARN-2715:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1934 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1934/])
YARN-2715. Fixed ResourceManager to respect common configurations for proxy 
users/groups beyond just the YARN level config. Contributed by Zhijie Shen. 
(vinodkv: rev c0e034336c85296be6f549d88d137fb2b2b79a15)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMAdminService.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMProxyUsersConf.java


> Proxy user is problem for RPC interface if 
> yarn.resourcemanager.webapp.proxyuser is not set.
> 
>
> Key: YARN-2715
> URL: https://issues.apache.org/jira/browse/YARN-2715
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: YARN-2715.1.patch, YARN-2715.2.patch, YARN-2715.3.patch, 
> YARN-2715.4.patch
>
>
> After YARN-2656, if people set hadoop.proxyuser for the client<-->RM RPC 
> interface, it's not going to work, because ProxyUsers#sip is a singleton per 
> daemon. After YARN-2656, the RM has two channels that want to set this 
> configuration: RPC and HTTP. The RPC interface sets it first by reading 
> hadoop.proxyuser, but it is overwritten by the HTTP interface, which sets it to 
> empty because yarn.resourcemanager.webapp.proxyuser doesn't exist.
> The fix for it could be similar to what we've done for YARN-2676: make the 
> HTTP interface source hadoop.proxyuser first anyway, then 
> yarn.resourcemanager.webapp.proxyuser.
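
A minimal sketch of that overlay idea, using only the standard Configuration 
API; the method name and the simplified property handling are assumptions, not 
the actual YARN-2715 change:
{code}
// Illustrative only: build the effective web-app proxy-user settings by taking
// hadoop.proxyuser.* first and then overlaying yarn.resourcemanager.webapp.proxyuser.*.
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public class ProxyUserConfSketch {
  public static Configuration effectiveProxyUserConf(Configuration conf) {
    Configuration result = new Configuration(false);
    // Common proxy-user settings come first...
    for (Map.Entry<String, String> e
        : conf.getValByRegex("^hadoop\\.proxyuser\\..*").entrySet()) {
      result.set(e.getKey(), e.getValue());
    }
    // ...and the YARN web-specific keys, if present, override them.
    String prefix = "yarn.resourcemanager.webapp.proxyuser.";
    for (Map.Entry<String, String> e
        : conf.getValByRegex("^yarn\\.resourcemanager\\.webapp\\.proxyuser\\..*").entrySet()) {
      result.set("hadoop.proxyuser." + e.getKey().substring(prefix.length()), e.getValue());
    }
    return result;
  }
}
{code}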



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180010#comment-14180010
 ] 

Hudson commented on YARN-90:


FAILURE: Integrated in Hadoop-Mapreduce-trunk #1934 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1934/])
YARN-90. NodeManager should identify failed disks becoming good again. 
Contributed by Varun Vasudev (jlowe: rev 
6f2028bd1514d90b831f889fd0ee7f2ba5c15000)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDirectoryCollection.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLocalDirsHandlerService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/TestNonAggregatingLogHandler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java


> NodeManager should identify failed disks becoming good again
> 
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Fix For: 2.6.0
>
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
> apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, 
> apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
>
> MAPREDUCE-3121 makes the NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), the NodeManager needs a restart. This JIRA is to improve the NodeManager 
> to reuse good disks (which might have been bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180011#comment-14180011
 ] 

Hudson commented on YARN-2721:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1934 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1934/])
YARN-2721. Suppress NodeExist exception thrown by ZKRMStateStore when it 
retries creating znode. Contributed by Jian He. (zjshen: rev 
7e3b5e6f5cb4945b4fab27e8a83d04280df50e17)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java


> Race condition: ZKRMStateStore retry logic may throw NodeExist exception 
> -
>
> Key: YARN-2721
> URL: https://issues.apache.org/jira/browse/YARN-2721
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Fix For: 2.6.0
>
> Attachments: YARN-2721.1.patch
>
>
> Blindly retrying operations in ZooKeeper will not work for non-idempotent 
> operations (like creating a znode). The reason is that the client can create a 
> znode, but the response may not be returned because the server can die or time 
> out. When the create is retried, it will throw a NODE_EXISTS exception caused 
> by the earlier create from the same session.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180009#comment-14180009
 ] 

Hudson commented on YARN-2720:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1934 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1934/])
YARN-2720. Windows: Wildcard classpath variables not expanded against resources 
contained in archives. Contributed by Craig Welch. (cnauroth: rev 
6637e3cf95b3a9be8d6b9cd66bc849a0607e8ed5)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Classpath.java
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java
* 
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestFileUtil.java
* hadoop-yarn-project/CHANGES.txt


> Windows: Wildcard classpath variables not expanded against resources 
> contained in archives
> --
>
> Key: YARN-2720
> URL: https://issues.apache.org/jira/browse/YARN-2720
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Craig Welch
>Assignee: Craig Welch
> Fix For: 2.6.0
>
> Attachments: YARN-2720.2.patch, YARN-2720.3.patch, YARN-2720.4.patch
>
>
> On windows there are limitations to the length of command lines and 
> environment variables which prevent placing all classpath resources into 
> these elements.  Instead, a jar containing only a classpath manifest is 
> created to provide the classpath.  During this process wildcard references 
> are expanded by inspecting the filesystem.  Since archives are extracted to a 
> different location and linked into the final location after the classpath jar 
> is created, resources referred to via wildcards which exist in localized 
> archives (.zip, tar.gz) are not added to the classpath manifest jar.  Since 
> these entries are removed from the final classpath for the container, they are 
> not on the container's classpath as they should be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180015#comment-14180015
 ] 

Hudson commented on YARN-2709:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1934 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1934/])
YARN-2709. Made timeline client getDelegationToken API retry if 
ConnectException happens. Contributed by Li Lu. (zjshen: rev 
b2942762d7f76d510ece5621c71116346a6b12f6)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java


> Add retry for timeline client getDelegationToken method
> ---
>
> Key: YARN-2709
> URL: https://issues.apache.org/jira/browse/YARN-2709
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Li Lu
>Assignee: Li Lu
> Fix For: 2.6.0
>
> Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch, 
> YARN-2709-102114-2.patch, YARN-2709-102114.patch
>
>
> As mentioned in YARN-2673, we need to add a retry mechanism to the timeline 
> client for secured clusters. This means that if the timeline server is not 
> available, a timeline client needs to retry getting a delegation token.
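
A generic illustration of the retry idea (not the actual TimelineClientImpl 
change from the patch); the helper name and its parameters are assumptions:
{code}
// Retry an operation on ConnectException a bounded number of times.
import java.net.ConnectException;
import java.util.concurrent.Callable;

public final class RetrySketch {
  public static <T> T retryOnConnect(Callable<T> op, int maxRetries, long intervalMs)
      throws Exception {
    for (int attempt = 0; ; attempt++) {
      try {
        return op.call();
      } catch (ConnectException e) {
        if (attempt >= maxRetries) {
          throw e;                 // give up after the configured number of retries
        }
        Thread.sleep(intervalMs);  // wait for the timeline server to come back
      }
    }
  }
}
{code}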



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2714) Localizer thread might stuck if NM is OOM

2014-10-22 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180042#comment-14180042
 ] 

Ming Ma commented on YARN-2714:
---

Thanks Zhihai for the information. Yes, setting the RPC timeout at the hadoop 
common layer will address the issue. The other suggestions might still be good 
to have even with an RPC timeout. We can open separate jiras if necessary.

> Localizer thread might stuck if NM is OOM
> -
>
> Key: YARN-2714
> URL: https://issues.apache.org/jira/browse/YARN-2714
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ming Ma
>
> When the NM JVM runs out of memory, it is normally an uncaught exception and the 
> process will exit. But the RPC server used by the node manager catches 
> OutOfMemoryError to give GC a chance to catch up, so the NM doesn't need to exit 
> and can recover from the OutOfMemoryError situation.
> However, in some rare situations when this happens, one of the NM localizer 
> threads doesn't get the RPC response from the node manager and just waits there. 
> The node manager RPC server doesn't respond because the RPC server responder 
> thread swallowed the OutOfMemoryError and didn't process the outstanding RPC 
> response. On the RPC client side, the RPC timeout is set to 0 and the client 
> relies on ping to detect RPC server availability.
> {noformat}
> Thread 481 (LocalizerRunner for container_1413487737702_2948_01_013383):
>   State: WAITING
>   Blocked count: 27
>   Waited count: 84
>   Waiting on org.apache.hadoop.ipc.Client$Call@6be5add3
>   Stack:
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:503)
> org.apache.hadoop.ipc.Client.call(Client.java:1396)
> org.apache.hadoop.ipc.Client.call(Client.java:1363)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> com.sun.proxy.$Proxy36.heartbeat(Unknown Source)
> 
> org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62)
> 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:235)
> 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
> 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:107)
> 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:995)
> {noformat}
> The consequence of this depends on which ContainerExecutor the NM uses. If it 
> uses DefaultContainerExecutor, given that its startLocalizer method is 
> synchronized, it will block other localizer threads. If you use 
> LinuxContainerExecutor, at least other localizer threads can still proceed, 
> but in theory it can slowly drain all available localizer threads.
> There are a couple of ways to fix it. Some of these fixes are complementary.
> 1. Fix it at the hadoop-common layer. It seems the RPC server hosted by worker 
> services such as the NM doesn't really need to catch OutOfMemoryError; the 
> service JVM can just exit. Even for the NN and RM, given we have HA, it might 
> be ok to do so.
> 2. Set an RPC timeout at the HadoopYarnProtoRPC layer so that all YARN clients 
> will time out if the RPC server drops the response.
> 3. Fix it in the YARN localization service. For example,
> a) Fix DefaultContainerExecutor so that synchronization isn't required for the 
> startLocalizer method.
> b) The download executor thread used by ContainerLocalizer currently catches any 
> exceptions. We can fix ContainerLocalizer so that when the download executor 
> thread catches OutOfMemoryError, it exits its host process.
> IMHO, fixing it at the RPC server layer is better as it addresses other 
> scenarios. Appreciate any input others might have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS

2014-10-22 Thread cntic (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cntic updated YARN-2681:

Attachment: yarn-site.xml

1) Configuration for testing HDFS bandwidth enforcement (yarn-site.xml); a 
programmatic sketch of these properties follows below.
- Enable enforcement: yarn.nodemanager.hdfs-bandwidth-enforcement.enable = true
- Port on which the datanodes are listening: 
yarn.nodemanager.hdfs-bandwidth-enforcement.port = 50010
- List of network devices on the data node machine: 
yarn.nodemanager.hdfs-bandwidth-enforcement.devices = lo, eth0
- Interval for checking for new tc config from persistence (ms): 
yarn.nodemanager.hdfs-bandwidth-enforcement.check-tc-config-interval = 1000
- Since only the Resource API has been upgraded to get/set HDFS bandwidth 
enforcement, and the ResourceRequest side has not been implemented yet, for test 
purposes the rate and burst used to define the tc class can be given in the YARN 
configuration file:
   + the test rate will be written to an HDFS file: 
yarn.nodemanager.hdfs-bandwidth-enforcement.test-rate-file = test-rate-file
   + test rate file example content: 30mbps (rate units: kbps, mbps, kbit, 
mbit. See also: http://lartc.org/manpages/tc.txt)
   + test burst: yarn.nodemanager.hdfs-bandwidth-enforcement.test-burst = 50Kb

2) The patch is tested by running TestDFSIO-read
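
For reference, a hedged sketch that sets the properties listed above 
programmatically. The property names and example values are copied from this 
comment; the class name and everything else are assumptions:
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class BandwidthEnforcementConfSketch {
  public static YarnConfiguration testConf() {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setBoolean("yarn.nodemanager.hdfs-bandwidth-enforcement.enable", true);
    conf.setInt("yarn.nodemanager.hdfs-bandwidth-enforcement.port", 50010);
    conf.set("yarn.nodemanager.hdfs-bandwidth-enforcement.devices", "lo, eth0");
    conf.setInt("yarn.nodemanager.hdfs-bandwidth-enforcement.check-tc-config-interval", 1000);
    // Test-only knobs: the rate is taken from an HDFS file, the burst is a plain value.
    conf.set("yarn.nodemanager.hdfs-bandwidth-enforcement.test-rate-file", "test-rate-file");
    conf.set("yarn.nodemanager.hdfs-bandwidth-enforcement.test-burst", "50Kb");
    return conf;
  }
}
{code}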

> Support bandwidth enforcement for containers while reading from HDFS
> 
>
> Key: YARN-2681
> URL: https://issues.apache.org/jira/browse/YARN-2681
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacityscheduler, nodemanager, resourcemanager
>Affects Versions: 2.5.1
> Environment: Linux
>Reporter: cntic
> Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control 
> Design.png, yarn-site.xml
>
>
> To read/write data from HDFS, applications establish TCP/IP connections with 
> the data node. HDFS reads can be controlled by configuring the Linux Traffic 
> Control (TC) subsystem on the data node to put filters on the appropriate 
> connections.
> The current cgroups net_cls concept cannot be applied on the node where the 
> container is launched, nor on the data node, since:
> -   TC handles outgoing bandwidth only, so it cannot shape HDFS reads on the 
> container node (an HDFS read is incoming data for the container)
> -   Since the HDFS data node is handled by only one process, it is not possible 
> to use net_cls to separate connections from different containers to the 
> datanode.
> Tasks:
> 1) Extend the Resource model to define a bandwidth enforcement rate
> 2) Monitor TCP/IP connections established by the container-handling process and 
> its child processes
> 3) Set Linux Traffic Control rules on the data node, based on address:port 
> pairs, in order to enforce the bandwidth of outgoing data
> Concept:
> http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS

2014-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180047#comment-14180047
 ] 

Hadoop QA commented on YARN-2681:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12676351/yarn-site.xml
  against trunk revision 85a8864.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5496//console

This message is automatically generated.

> Support bandwidth enforcement for containers while reading from HDFS
> 
>
> Key: YARN-2681
> URL: https://issues.apache.org/jira/browse/YARN-2681
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacityscheduler, nodemanager, resourcemanager
>Affects Versions: 2.5.1
> Environment: Linux
>Reporter: cntic
> Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control 
> Design.png, yarn-site.xml
>
>
> To read/write data from HDFS, applications establish TCP/IP connections with 
> the data node. HDFS reads can be controlled by configuring the Linux Traffic 
> Control (TC) subsystem on the data node to put filters on the appropriate 
> connections.
> The current cgroups net_cls concept cannot be applied on the node where the 
> container is launched, nor on the data node, since:
> -   TC handles outgoing bandwidth only, so it cannot shape HDFS reads on the 
> container node (an HDFS read is incoming data for the container)
> -   Since the HDFS data node is handled by only one process, it is not possible 
> to use net_cls to separate connections from different containers to the 
> datanode.
> Tasks:
> 1) Extend the Resource model to define a bandwidth enforcement rate
> 2) Monitor TCP/IP connections established by the container-handling process and 
> its child processes
> 3) Set Linux Traffic Control rules on the data node, based on address:port 
> pairs, in order to enforce the bandwidth of outgoing data
> Concept:
> http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue info including labels of such queue

2014-10-22 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180061#comment-14180061
 ] 

Sunil G commented on YARN-2647:
---

Hi [~mayank_bansal]

Sorry for the delay here, I have done some groundwork on this.

I was porting the mapred queue CLI changes to YARN, namely using 
*GetQueueInfoRequest* and *GetQueueInfoResponse*. I have added the node label 
related information to the response object and want to take it back to the 
client.

Now for APIs, YarnClientImpl already has APIs like getQueueInfo and 
getQueueAclsInfo. I want to merge all of these under a "yarn queue" command 
followed by the queue name. The option can be *queue-acl*, *node-label*, or 
*all* (which prints all the information in queueInfo).

I may need one more day to upload this patch; kindly suggest whether the 
approach is fine, and also whether it is needed before that.
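
For context, the existing client APIs mentioned above can already be driven as 
in the sketch below; the proposed yarn queue CLI would surface the same data. 
The queue name "default" is only an example value:
{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.api.records.QueueUserACLInfo;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class QueueInfoSketch {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      QueueInfo info = client.getQueueInfo("default");          // per-queue details
      System.out.println(info.getQueueName() + " capacity=" + info.getCapacity());
      List<QueueUserACLInfo> acls = client.getQueueAclsInfo();  // ACLs for the current user
      System.out.println("acl entries: " + acls.size());
    } finally {
      client.stop();
    }
  }
}
{code}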

> Add yarn queue CLI to get queue info including labels of such queue
> ---
>
> Key: YARN-2647
> URL: https://issues.apache.org/jira/browse/YARN-2647
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Wangda Tan
>Assignee: Sunil G
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180066#comment-14180066
 ] 

Hadoop QA commented on YARN-2198:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12676333/YARN-2198.16.patch
  against trunk revision 85a8864.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  org.apache.hadoop.metrics2.impl.TestMetricsSystemImpl

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5493//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5493//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5493//console

This message is automatically generated.

> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
> YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, 
> YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, 
> YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
> YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
> YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
> YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
> YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However, this 
> executor requires the process launching the container to be LocalSystem or a 
> member of the local Administrators group. Since the process in question is 
> the NodeManager, the requirement translates into the entire NM running as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems

2014-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180069#comment-14180069
 ] 

Hadoop QA commented on YARN-2700:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12676344/YARN-2700-001.patch
  against trunk revision 85a8864.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5495//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5495//console

This message is automatically generated.

> TestSecureRMRegistryOperations failing on windows: auth problems
> 
>
> Key: YARN-2700
> URL: https://issues.apache.org/jira/browse/YARN-2700
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, resourcemanager
>Affects Versions: 2.6.0
> Environment: Windows Server, Win7
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-2700-001.patch
>
>
> TestSecureRMRegistryOperations failing on windows: unable to create the root 
> /registry path with permissions problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS

2014-10-22 Thread cntic (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cntic updated YARN-2681:

Attachment: yarn-site.xml.example

> Support bandwidth enforcement for containers while reading from HDFS
> 
>
> Key: YARN-2681
> URL: https://issues.apache.org/jira/browse/YARN-2681
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacityscheduler, nodemanager, resourcemanager
>Affects Versions: 2.5.1
> Environment: Linux
>Reporter: cntic
> Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control 
> Design.png, yarn-site.xml.example
>
>
> To read/write data from HDFS, applications establish TCP/IP connections with 
> the data node. HDFS reads can be controlled by configuring the Linux Traffic 
> Control (TC) subsystem on the data node to put filters on the appropriate 
> connections.
> The current cgroups net_cls concept cannot be applied on the node where the 
> container is launched, nor on the data node, since:
> -   TC handles outgoing bandwidth only, so it cannot shape HDFS reads on the 
> container node (an HDFS read is incoming data for the container)
> -   Since the HDFS data node is handled by only one process, it is not possible 
> to use net_cls to separate connections from different containers to the 
> datanode.
> Tasks:
> 1) Extend the Resource model to define a bandwidth enforcement rate
> 2) Monitor TCP/IP connections established by the container-handling process and 
> its child processes
> 3) Set Linux Traffic Control rules on the data node, based on address:port 
> pairs, in order to enforce the bandwidth of outgoing data
> Concept:
> http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems

2014-10-22 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-2700:

Hadoop Flags: Reviewed

+1 for the patch, pending Jenkins.  Thanks for the fix, Steve.

> TestSecureRMRegistryOperations failing on windows: auth problems
> 
>
> Key: YARN-2700
> URL: https://issues.apache.org/jira/browse/YARN-2700
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, resourcemanager
>Affects Versions: 2.6.0
> Environment: Windows Server, Win7
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-2700-001.patch
>
>
> TestSecureRMRegistryOperations failing on windows: unable to create the root 
> /registry path with permissions problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS

2014-10-22 Thread cntic (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cntic updated YARN-2681:

Attachment: (was: yarn-site.xml)

> Support bandwidth enforcement for containers while reading from HDFS
> 
>
> Key: YARN-2681
> URL: https://issues.apache.org/jira/browse/YARN-2681
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacityscheduler, nodemanager, resourcemanager
>Affects Versions: 2.5.1
> Environment: Linux
>Reporter: cntic
> Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control 
> Design.png
>
>
> To read/write data from HDFS, applications establish TCP/IP connections with 
> the data node. HDFS reads can be controlled by configuring the Linux Traffic 
> Control (TC) subsystem on the data node to put filters on the appropriate 
> connections.
> The current cgroups net_cls concept cannot be applied on the node where the 
> container is launched, nor on the data node, since:
> -   TC handles outgoing bandwidth only, so it cannot shape HDFS reads on the 
> container node (an HDFS read is incoming data for the container)
> -   Since the HDFS data node is handled by only one process, it is not possible 
> to use net_cls to separate connections from different containers to the 
> datanode.
> Tasks:
> 1) Extend the Resource model to define a bandwidth enforcement rate
> 2) Monitor TCP/IP connections established by the container-handling process and 
> its child processes
> 3) Set Linux Traffic Control rules on the data node, based on address:port 
> pairs, in order to enforce the bandwidth of outgoing data
> Concept:
> http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems

2014-10-22 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180086#comment-14180086
 ] 

Chris Nauroth commented on YARN-2700:
-

bq. ...pending Jenkins...

Never mind.  It looks like Jenkins and I had a race condition commenting.  :-)  
You have a full +1 from me now.

> TestSecureRMRegistryOperations failing on windows: auth problems
> 
>
> Key: YARN-2700
> URL: https://issues.apache.org/jira/browse/YARN-2700
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, resourcemanager
>Affects Versions: 2.6.0
> Environment: Windows Server, Win7
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-2700-001.patch
>
>
> TestSecureRMRegistryOperations failing on windows: unable to create the root 
> /registry path with permissions problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS

2014-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180089#comment-14180089
 ] 

Hadoop QA commented on YARN-2681:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12676356/yarn-site.xml.example
  against trunk revision 85a8864.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5497//console

This message is automatically generated.

> Support bandwidth enforcement for containers while reading from HDFS
> 
>
> Key: YARN-2681
> URL: https://issues.apache.org/jira/browse/YARN-2681
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacityscheduler, nodemanager, resourcemanager
>Affects Versions: 2.5.1
> Environment: Linux
>Reporter: cntic
> Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control 
> Design.png, yarn-site.xml.example
>
>
> To read/write data from HDFS, applications establish TCP/IP connections with 
> the data node. HDFS reads can be controlled by configuring the Linux Traffic 
> Control (TC) subsystem on the data node to put filters on the appropriate 
> connections.
> The current cgroups net_cls concept cannot be applied on the node where the 
> container is launched, nor on the data node, since:
> -   TC handles outgoing bandwidth only, so it cannot shape HDFS reads on the 
> container node (an HDFS read is incoming data for the container)
> -   Since the HDFS data node is handled by only one process, it is not possible 
> to use net_cls to separate connections from different containers to the 
> datanode.
> Tasks:
> 1) Extend the Resource model to define a bandwidth enforcement rate
> 2) Monitor TCP/IP connections established by the container-handling process and 
> its child processes
> 3) Set Linux Traffic Control rules on the data node, based on address:port 
> pairs, in order to enforce the bandwidth of outgoing data
> Concept:
> http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor

2014-10-22 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180119#comment-14180119
 ] 

zhihai xu commented on YARN-2701:
-

One nit in the addendum patch:
Can we change
{code}
  if (stat(path, &sb) == 0) {
if (check_dir(path, sb.st_mode, perm, 1) == -1) {
  return -1;
}
return 0;
  }
{code}
to
{code}
  if (stat(path, &sb) == 0) {
return check_dir(path, sb.st_mode, perm, 1);
  }
{code}

> Potential race condition in startLocalizer when using LinuxContainerExecutor  
> --
>
> Key: YARN-2701
> URL: https://issues.apache.org/jira/browse/YARN-2701
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, 
> YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch, 
> YARN-2701.addendum.1.patch
>
>
> When using LinuxContainerExecutor do startLocalizer, we are using native code 
> container-executor.c. 
> {code}
>  if (stat(npath, &sb) != 0) {
>if (mkdir(npath, perm) != 0) {
> {code}
> We are using check and create method to create the appDir under /usercache. 
> But if there are two containers trying to do this at the same time, race 
> condition may happen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-10-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180130#comment-14180130
 ] 

Jason Lowe commented on YARN-2010:
--

We recently ran into a case where an application tried to recover with an 
expired token and the InvalidToken exception thrown by the delegation token 
secret manager for this application prevented the RM from coming up.
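
To illustrate the kind of resilience being asked for here (purely a sketch, not 
the pending patch; the interface and names are assumptions): recovery of a 
single bad app should be contained instead of aborting the RM's transition to 
active.
{code}
// Hypothetical sketch: keep recovering the remaining apps even if one fails.
public class RecoverySketch {
  interface AppRecoverer { void recover(String appId) throws Exception; }

  static void recoverAll(Iterable<String> appIds, AppRecoverer recoverer) {
    for (String appId : appIds) {
      try {
        recoverer.recover(appId);
      } catch (Exception e) {
        // Don't take the whole RM down because one app (e.g. one with an
        // expired/invalid token) cannot be recovered.
        System.err.println("Failed to recover " + appId + ", skipping: " + e);
      }
    }
  }
}
{code}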

> RM can't transition to active if it can't recover an app attempt
> 
>
> Key: YARN-2010
> URL: https://issues.apache.org/jira/browse/YARN-2010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: bc Wong
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, 
> yarn-2010-3.patch, yarn-2010-3.patch
>
>
> If the RM fails to recover an app attempt, it won't come up. We should make 
> it more resilient.
> Specifically, the underlying error is that the app was submitted before 
> Kerberos security got turned on. Makes sense for the app to fail in this 
> case. But YARN should still start.
> {noformat}
> 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Exception handling the winning of election 
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
> Active 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>  
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
>  
> ... 4 more 
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
>  
> ... 5 more 
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
> ... 8 more 
> Caused by: java.lang.IllegalArgumentException: Missing argument 
> at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) 
> at 
> org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
>  
> ... 13 more 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS

2014-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180134#comment-14180134
 ] 

Hadoop QA commented on YARN-2681:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12676343/HADOOP-2681.patch
  against trunk revision 85a8864.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 3 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5494//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5494//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5494//console

This message is automatically generated.

> Support bandwidth enforcement for containers while reading from HDFS
> 
>
> Key: YARN-2681
> URL: https://issues.apache.org/jira/browse/YARN-2681
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacityscheduler, nodemanager, resourcemanager
>Affects Versions: 2.5.1
> Environment: Linux
>Reporter: cntic
> Attachments: HADOOP-2681.patch, HADOOP-2681.patch, Traffic Control 
> Design.png, yarn-site.xml.example
>
>
> To read/write data from HDFS, applications establish TCP/IP connections with 
> the data node. HDFS reads can be controlled by configuring the Linux Traffic 
> Control (TC) subsystem on the data node to put filters on the appropriate 
> connections.
> The current cgroups net_cls concept cannot be applied on the node where the 
> container is launched, nor on the data node, since:
> -   TC handles outgoing bandwidth only, so it cannot shape HDFS reads on the 
> container node (an HDFS read is incoming data for the container)
> -   Since the HDFS data node is handled by only one process, it is not possible 
> to use net_cls to separate connections from different containers to the 
> datanode.
> Tasks:
> 1) Extend the Resource model to define a bandwidth enforcement rate
> 2) Monitor TCP/IP connections established by the container-handling process and 
> its child processes
> 3) Set Linux Traffic Control rules on the data node, based on address:port 
> pairs, in order to enforce the bandwidth of outgoing data
> Concept:
> http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-10-22 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180142#comment-14180142
 ] 

Karthik Kambatla commented on YARN-2010:


I should have an updated patch with tests later today. Would be nice to fix 
this for 2.6. 

> RM can't transition to active if it can't recover an app attempt
> 
>
> Key: YARN-2010
> URL: https://issues.apache.org/jira/browse/YARN-2010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: bc Wong
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, 
> yarn-2010-3.patch, yarn-2010-3.patch
>
>
> If the RM fails to recover an app attempt, it won't come up. We should make 
> it more resilient.
> Specifically, the underlying error is that the app was submitted before 
> Kerberos security got turned on. Makes sense for the app to fail in this 
> case. But YARN should still start.
> {noformat}
> 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Exception handling the winning of election 
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
> Active 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>  
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
>  
> ... 4 more 
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
>  
> ... 5 more 
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
> ... 8 more 
> Caused by: java.lang.IllegalArgumentException: Missing argument 
> at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) 
> at 
> org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
>  
> ... 13 more 
> {noformat}
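
One possible direction for the resilience asked for above, as a hedged sketch only (the identifiers below are illustrative and are not taken from the actual YARN-2010 patch): catch per-application recovery failures, fail that application, and let the RM continue transitioning to active.
{code}
// Hypothetical sketch only: appStates, recoverApplication and LOG stand in for
// the real RM recovery code. A failure to recover one application fails that
// application, not the whole transition to active.
for (Map.Entry<ApplicationId, ApplicationStateData> entry : appStates.entrySet()) {
  try {
    recoverApplication(entry.getValue());   // may throw, e.g. on bad credentials
  } catch (Exception e) {
    LOG.error("Failed to recover " + entry.getKey() + ", marking it as failed", e);
    // continue with the remaining applications so the RM still becomes active
  }
}
{code}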



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails

2014-10-22 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180144#comment-14180144
 ] 

Ming Ma commented on YARN-2578:
---

Yeah, it is more than just * -> RM; it could also be * -> NM and * -> AM. Agreed 
that it is better to fix this at the hadoop common layer. From HDFS-4858, it 
looks like the concern with fixing it in hadoop common is test coverage.

Is there any follow-up on hadoop common? Perhaps we can change the hadoop common 
layer so that the rpc timeout stays off by default, but if ping is set to false, 
the rpc timeout is set to the ping interval in the code Karthik refers to. That 
way YARN and MR don't need to change and people can experiment with the rpc 
timeout. After enough test coverage, we can then flip the ping default to 
false.
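
A minimal sketch of that proposal, assuming only the existing ipc.client.ping and ipc.ping.interval settings (illustrative, not the actual hadoop common change):
{code}
import org.apache.hadoop.conf.Configuration;

public final class RpcTimeoutSketch {
  // If ping is disabled, reuse the ping interval as the effective rpc/socket
  // timeout; otherwise keep today's behavior (no rpc timeout, ping keeps the
  // connection alive).
  static int effectiveRpcTimeout(Configuration conf) {
    boolean doPing = conf.getBoolean("ipc.client.ping", true);
    if (!doPing) {
      return conf.getInt("ipc.ping.interval", 60000); // default 60s
    }
    return 0; // 0 means "no rpc timeout"
  }
}
{code}
With that in place, setting ipc.client.ping=false would be enough for callers such as the NM heartbeat to get a bounded timeout without any YARN/MR code change.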

> NM does not failover timely if RM node network connection fails
> ---
>
> Key: YARN-2578
> URL: https://issues.apache.org/jira/browse/YARN-2578
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.1
>Reporter: Wilfred Spiegelenburg
> Attachments: YARN-2578.patch
>
>
> The NM does not fail over correctly when the network cable of the RM is 
> unplugged or the failure is simulated by a "service network stop" or a 
> firewall that drops all traffic on the node. The RM fails over to the standby 
> node when the failure is detected, as expected. The NM should then re-register 
> with the new active RM. This re-registration takes a long time (15 minutes or 
> more). Until then the cluster has no nodes for processing and applications 
> are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
> node 1: ZK, NN, JN, ZKFC, DN, RM, NM
> node 2: ZK, NN, JN, ZKFC, DN, RM, NM
> node 3: ZK, JN, DN, NM
> - start all services make sure they are in good health
> - kill the network connection of the RM that is active using one of the 
> network kills from above
> - observe the NN and RM failover
> - the DN's fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and traces show no change at all
> The stack traces of the NM all show the same set of threads. The main thread 
> which should be used in the re-registration is the "Node Status Updater". This 
> thread is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in 
> Object.wait() [0x7f5a51fc1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at java.lang.Object.wait(Object.java:503)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>   - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the 
> ResourceTrackerPBClientImpl. The generated proxy does not time out and we 
> should be using a version which takes the RPC timeout (from the 
> configuration) as a parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor

2014-10-22 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2701:

Attachment: YARN-2701.addendum.2.patch

> Potential race condition in startLocalizer when using LinuxContainerExecutor  
> --
>
> Key: YARN-2701
> URL: https://issues.apache.org/jira/browse/YARN-2701
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, 
> YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch, 
> YARN-2701.addendum.1.patch, YARN-2701.addendum.2.patch
>
>
> When LinuxContainerExecutor performs startLocalizer, we go through the native 
> code in container-executor.c. 
> {code}
>  if (stat(npath, &sb) != 0) {
>    if (mkdir(npath, perm) != 0) {
> {code}
> We use a check-then-create approach to create the appDir under /usercache. 
> But if two containers try to do this at the same time, a race 
> condition may happen.
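
For illustration only (the real code is the native container-executor.c, not Java): the race exists because both localizers can pass the stat() check before either mkdir() succeeds. Creating the directory unconditionally and treating "already exists" as success avoids the window; a minimal Java sketch of that pattern:
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class AppDirSketch {
  // Equivalent in spirit to mkdir() that treats EEXIST as success:
  // Files.createDirectories is a no-op when the directory already exists.
  static void ensureAppDir(String npath) throws IOException {
    Files.createDirectories(Paths.get(npath));
  }
}
{code}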



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor

2014-10-22 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180174#comment-14180174
 ] 

Xuan Gong commented on YARN-2701:
-

[~zxu] Thanks for reviewing this patch again. New patch addressed your comment.

> Potential race condition in startLocalizer when using LinuxContainerExecutor  
> --
>
> Key: YARN-2701
> URL: https://issues.apache.org/jira/browse/YARN-2701
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, 
> YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch, 
> YARN-2701.addendum.1.patch, YARN-2701.addendum.2.patch
>
>
> When LinuxContainerExecutor performs startLocalizer, we go through the native 
> code in container-executor.c. 
> {code}
>  if (stat(npath, &sb) != 0) {
>    if (mkdir(npath, perm) != 0) {
> {code}
> We use a check-then-create approach to create the appDir under /usercache. 
> But if two containers try to do this at the same time, a race 
> condition may happen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port

2014-10-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180206#comment-14180206
 ] 

Wangda Tan commented on YARN-2723:
--

[~Naganarasimha], are there any updates on this patch? It should be a one-line 
fix with a new test; if you haven't started on it, I can take it over.
Thanks.

> rmadmin -replaceLabelsOnNode does not correctly parse port
> --
>
> Key: YARN-2723
> URL: https://issues.apache.org/jira/browse/YARN-2723
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Phil D'Amore
>Assignee: Naganarasimha G R
>
> There is an off-by-one issue in RMAdminCLI.java (line 457):
> port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")));
> should probably be:
> port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1));
> Currently attempting to add a label to a node with a port specified looks 
> like this:
> [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode 
> node.example.com:45454,test-label
> replaceLabelsOnNode: For input string: ":45454"
> Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 
> node2:port,label1,label2]]
> It appears to be trying to parse the ':' as part of the integer because the 
> substring index is off.
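
A minimal, self-contained illustration of the off-by-one and the +1 fix (hypothetical example class, not part of the patch):
{code}
public final class NodeIdParseExample {
  public static void main(String[] args) {
    String nodeIdStr = "node.example.com:45454";
    int idx = nodeIdStr.indexOf(":");
    String host = nodeIdStr.substring(0, idx);
    // Without the +1 the substring is ":45454" and Integer.valueOf throws
    // NumberFormatException: For input string: ":45454"
    int port = Integer.valueOf(nodeIdStr.substring(idx + 1));
    System.out.println(host + " -> " + port);   // prints node.example.com -> 45454
  }
}
{code}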



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port

2014-10-22 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180244#comment-14180244
 ] 

Naganarasimha G R commented on YARN-2723:
-

It's almost done; I will attach the patch in an hour.

> rmadmin -replaceLabelsOnNode does not correctly parse port
> --
>
> Key: YARN-2723
> URL: https://issues.apache.org/jira/browse/YARN-2723
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Phil D'Amore
>Assignee: Naganarasimha G R
>
> There is an off-by-one issue in RMAdminCLI.java (line 457):
> port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")));
> should probably be:
> port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1));
> Currently attempting to add a label to a node with a port specified looks 
> like this:
> [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode 
> node.example.com:45454,test-label
> replaceLabelsOnNode: For input string: ":45454"
> Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 
> node2:port,label1,label2]]
> It appears to be trying to parse the ':' as part of the integer because the 
> substring index is off.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port

2014-10-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180248#comment-14180248
 ] 

Wangda Tan commented on YARN-2723:
--

Thanks :)

> rmadmin -replaceLabelsOnNode does not correctly parse port
> --
>
> Key: YARN-2723
> URL: https://issues.apache.org/jira/browse/YARN-2723
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Phil D'Amore
>Assignee: Naganarasimha G R
>
> There is an off-by-one issue in RMAdminCLI.java (line 457):
> port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")));
> should probably be:
> port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1));
> Currently attempting to add a label to a node with a port specified looks 
> like this:
> [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode 
> node.example.com:45454,test-label
> replaceLabelsOnNode: For input string: ":45454"
> Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 
> node2:port,label1,label2]]
> It appears to be trying to parse the ':' as part of the integer because the 
> substring index is off.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue info including labels of such queue

2014-10-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180304#comment-14180304
 ] 

Wangda Tan commented on YARN-2647:
--

[~sunilg],
Thanks for working on this item,
bq. Namely using the GetQueueInfoRequest and GetQueueInfoResponse.
I think we may not need an extra PB object; 
{{org.apache.hadoop.yarn.api.records.QueueInfo}} already has the labels and 
default-label-expression fields.

bq. I wanted to merge all these under the "yarn queue" command followed 
by the queue name. 
Agree on merging all of this into "yarn queue". I think by default "yarn queue 
-list" should list all information of all queues, and if the user wants to see 
a specific queue, he/she can use "yarn queue -list <queue-name>". Extra options 
like -queue-acl or -node-label can be applied if he/she wants to see some 
specific field(s) only. 

Wangda

> Add yarn queue CLI to get queue info including labels of such queue
> ---
>
> Key: YARN-2647
> URL: https://issues.apache.org/jira/browse/YARN-2647
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Wangda Tan
>Assignee: Sunil G
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port

2014-10-22 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-2723:

Attachment: YARN-2723.20141023.1.patch

Attaching a patch for this issue.

> rmadmin -replaceLabelsOnNode does not correctly parse port
> --
>
> Key: YARN-2723
> URL: https://issues.apache.org/jira/browse/YARN-2723
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Phil D'Amore
>Assignee: Naganarasimha G R
> Attachments: YARN-2723.20141023.1.patch
>
>
> There is an off-by-one issue in RMAdminCLI.java (line 457):
> port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")));
> should probably be:
> port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1));
> Currently attempting to add a label to a node with a port specified looks 
> like this:
> [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode 
> node.example.com:45454,test-label
> replaceLabelsOnNode: For input string: ":45454"
> Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 
> node2:port,label1,label2]]
> It appears to be trying to parse the ':' as part of the integer because the 
> substring index is off.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2727) In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is being displayed

2014-10-22 Thread Naganarasimha G R (JIRA)
Naganarasimha G R created YARN-2727:
---

 Summary: In RMAdminCLI usage display, instead of 
"yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is being 
displayed
 Key: YARN-2727
 URL: https://issues.apache.org/jira/browse/YARN-2727
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
Priority: Minor


In the org.apache.hadoop.yarn.client.cli.RMAdminCLI usage display, 
"yarn.node-labels.fs-store.uri" is shown instead of 
"yarn.node-labels.fs-store.root-dir".
Some modifications to the description are also needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2727) In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is being displayed

2014-10-22 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-2727:

Attachment: YARN-2727.20141023.1.patch

Attaching the patch.

> In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", 
> "yarn.node-labels.fs-store.uri" is being displayed
> 
>
> Key: YARN-2727
> URL: https://issues.apache.org/jira/browse/YARN-2727
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>Priority: Minor
> Attachments: YARN-2727.20141023.1.patch
>
>
> In the org.apache.hadoop.yarn.client.cli.RMAdminCLI usage display, 
> "yarn.node-labels.fs-store.uri" is shown instead of 
> "yarn.node-labels.fs-store.root-dir".
> Some modifications to the description are also needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-10-22 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180437#comment-14180437
 ] 

Jian He commented on YARN-2198:
---

Not sure if the test failure is related. Re-triggering Jenkins.

> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
> YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, 
> YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, 
> YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
> YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
> YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
> YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
> YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However this 
> executor requires the process launching the container to be LocalSystem or a 
> member of the local Administrators group. Since the process in question is 
> the NodeManager, the requirement translates to the entire NM running as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2727) In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is being displayed

2014-10-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180456#comment-14180456
 ] 

Wangda Tan commented on YARN-2727:
--

[~Naganarasimha],
Thanks for the patch, some comments:
1)
bq. +port = 
Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1));
By convention, please leave a space before and after "+".

2) 
{code}
+// no labels, should fail
+args = new String[] { "-replaceLabelsOnNode" };
+assertTrue(0 != rmAdminCLI.run(args));
+
+// no labels, should fail
+args =
+new String[] { "-replaceLabelsOnNode",
+"-directlyAccessNodeLabelStore" };
+assertTrue(0 != rmAdminCLI.run(args));
{code}
These two checks are already covered by {{testReplaceLabelsOnNode}}.

Thanks,

> In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", 
> "yarn.node-labels.fs-store.uri" is being displayed
> 
>
> Key: YARN-2727
> URL: https://issues.apache.org/jira/browse/YARN-2727
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>Priority: Minor
> Attachments: YARN-2727.20141023.1.patch
>
>
> In the org.apache.hadoop.yarn.client.cli.RMAdminCLI usage display, 
> "yarn.node-labels.fs-store.uri" is shown instead of 
> "yarn.node-labels.fs-store.root-dir".
> Some modifications to the description are also needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2728) Support for disabling the Centralized NodeLabel validation in Distributed Node Label Configuration setup

2014-10-22 Thread Naganarasimha G R (JIRA)
Naganarasimha G R created YARN-2728:
---

 Summary: Support for disabling the Centralized NodeLabel 
validation in Distributed Node Label Configuration setup
 Key: YARN-2728
 URL: https://issues.apache.org/jira/browse/YARN-2728
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Reporter: Naganarasimha G R


Currently, without a central list of valid labels, the Capacity Scheduler will 
not be able to work (a user cannot specify capacity for an unknown node-label 
of a queue, etc.). But without a way to disable the central label validation, 
the Distributed Node Label configuration feature is not complete, so we need 
to support this feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port

2014-10-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180458#comment-14180458
 ] 

Wangda Tan commented on YARN-2723:
--

[~Naganarasimha]
Thanks for the patch, some comments:
1)
bq. + port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1));
By convention, please leave a space before and after "+".
2)
{code}
+// no labels, should fail
+args = new String[] { "-replaceLabelsOnNode" };
+assertTrue(0 != rmAdminCLI.run(args));
+
+// no labels, should fail
+args =
+new String[] { "-replaceLabelsOnNode",
+"-directlyAccessNodeLabelStore" };
+assertTrue(0 != rmAdminCLI.run(args));
{code}
These two checks are already covered by testReplaceLabelsOnNode.
Thanks,

> rmadmin -replaceLabelsOnNode does not correctly parse port
> --
>
> Key: YARN-2723
> URL: https://issues.apache.org/jira/browse/YARN-2723
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Phil D'Amore
>Assignee: Naganarasimha G R
> Attachments: YARN-2723.20141023.1.patch
>
>
> There is an off-by-one issue in RMAdminCLI.java (line 457):
> port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")));
> should probably be:
> port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1));
> Currently attempting to add a label to a node with a port specified looks 
> like this:
> [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode 
> node.example.com:45454,test-label
> replaceLabelsOnNode: For input string: ":45454"
> Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 
> node2:port,label1,label2]]
> It appears to be trying to parse the ':' as part of the integer because the 
> substring index is off.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2727) In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is being displayed

2014-10-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180457#comment-14180457
 ] 

Wangda Tan commented on YARN-2727:
--

Oh sorry, this comment was meant for YARN-2723; please ignore the comment above.

> In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", 
> "yarn.node-labels.fs-store.uri" is being displayed
> 
>
> Key: YARN-2727
> URL: https://issues.apache.org/jira/browse/YARN-2727
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>Priority: Minor
> Attachments: YARN-2727.20141023.1.patch
>
>
> In org.apache.hadoop.yarn.client.cli.RMAdminCLI usage display instead of 
> "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is 
> being used
> And also some modifications for the description



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup

2014-10-22 Thread Naganarasimha G R (JIRA)
Naganarasimha G R created YARN-2729:
---

 Summary: Support script based NodeLabelsProvider Interface in 
Distributed Node Label Configuration Setup
 Key: YARN-2729
 URL: https://issues.apache.org/jira/browse/YARN-2729
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R


Support a script-based NodeLabelsProvider interface in the Distributed Node Label 
Configuration setup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180468#comment-14180468
 ] 

Hudson commented on YARN-2700:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6313 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6313/])
YARN-2700 TestSecureRMRegistryOperations failing on windows: auth problems 
(stevel: rev 90e5ca24fbd3bb2da2a3879cc9b73f0b1d7f3e03)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/secure/AbstractSecureRegistryTest.java


> TestSecureRMRegistryOperations failing on windows: auth problems
> 
>
> Key: YARN-2700
> URL: https://issues.apache.org/jira/browse/YARN-2700
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, resourcemanager
>Affects Versions: 2.6.0
> Environment: Windows Server, Win7
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Fix For: 2.6.0
>
> Attachments: YARN-2700-001.patch
>
>
> TestSecureRMRegistryOperations failing on windows: unable to create the root 
> /registry path with permissions problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup

2014-10-22 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-2729:

Attachment: YARN-2729.20141023-1.patch

Attaching the WIP patch for this part of the issue...

> Support script based NodeLabelsProvider Interface in Distributed Node Label 
> Configuration Setup
> ---
>
> Key: YARN-2729
> URL: https://issues.apache.org/jira/browse/YARN-2729
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
> Attachments: YARN-2729.20141023-1.patch
>
>
> Support a script-based NodeLabelsProvider interface in the Distributed Node Label 
> Configuration setup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2723) rmadmin -replaceLabelsOnNode does not correctly parse port

2014-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180484#comment-14180484
 ] 

Hadoop QA commented on YARN-2723:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12676406/YARN-2723.20141023.1.patch
  against trunk revision d67214f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client:

  org.apache.hadoop.yarn.client.TestResourceTrackerOnHA
  
org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5498//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5498//console

This message is automatically generated.

> rmadmin -replaceLabelsOnNode does not correctly parse port
> --
>
> Key: YARN-2723
> URL: https://issues.apache.org/jira/browse/YARN-2723
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: client
>Reporter: Phil D'Amore
>Assignee: Naganarasimha G R
> Attachments: YARN-2723.20141023.1.patch
>
>
> There is an off-by-one issue in RMAdminCLI.java (line 457):
> port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")));
> should probably be:
> port = Integer.valueOf(nodeIdStr.substring(nodeIdStr.indexOf(":")+1));
> Currently attempting to add a label to a node with a port specified looks 
> like this:
> [yarn@ip-10-0-0-66 ~]$ yarn rmadmin -replaceLabelsOnNode 
> node.example.com:45454,test-label
> replaceLabelsOnNode: For input string: ":45454"
> Usage: yarn rmadmin [-replaceLabelsOnNode [node1:port,label1,label2 
> node2:port,label1,label2]]
> It appears to be trying to parse the ':' as part of the integer because the 
> substring index is off.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180493#comment-14180493
 ] 

Hadoop QA commented on YARN-2198:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12676333/YARN-2198.16.patch
  against trunk revision d67214f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client:

  org.apache.hadoop.yarn.client.TestResourceTrackerOnHA
  
org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA

  The following test timeouts occurred in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client:

org.apache.hadoop.yarnTests

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5499//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5499//console

This message is automatically generated.

> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
> YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, 
> YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, 
> YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
> YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
> YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
> YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
> YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However this 
> executor requires the process launching the container to be LocalSystem or a 
> member of the local Administrators group. Since the process in question is 
> the NodeManager, the requirement translates to the entire NM running as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2727) In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", "yarn.node-labels.fs-store.uri" is being displayed

2014-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180494#comment-14180494
 ] 

Hadoop QA commented on YARN-2727:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12676407/YARN-2727.20141023.1.patch
  against trunk revision d67214f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client:

  
org.apache.hadoop.yarn.client.TestApplicationMasterServiceProtocolOnHA
  org.apache.hadoop.yarn.client.TestGetGroups

  The following test timeouts occurred in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client:

org.apache.hadoop.yarnTests

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5500//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5500//console

This message is automatically generated.

> In RMAdminCLI usage display, instead of "yarn.node-labels.fs-store.root-dir", 
> "yarn.node-labels.fs-store.uri" is being displayed
> 
>
> Key: YARN-2727
> URL: https://issues.apache.org/jira/browse/YARN-2727
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>Priority: Minor
> Attachments: YARN-2727.20141023.1.patch
>
>
> In the org.apache.hadoop.yarn.client.cli.RMAdminCLI usage display, 
> "yarn.node-labels.fs-store.uri" is shown instead of 
> "yarn.node-labels.fs-store.root-dir".
> Some modifications to the description are also needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)

2014-10-22 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-2495:

Attachment: YARN-2495.20141023-1.patch

As per Wangda's comments:
1> raised a new jira for "disable central node label configuration"
2> removed the modifications to CommonNodeLabelsManager from the current jira
3> moved ScriptNodeLabelProvider to a separate jira for easier review

Currently attached a WIP patch (the earlier patch was bifurcated into two jiras, 
YARN-2495 and YARN-2729). Will update with the actual patch at the earliest.

> Allow admin specify labels in each NM (Distributed configuration)
> -
>
> Key: YARN-2495
> URL: https://issues.apache.org/jira/browse/YARN-2495
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Naganarasimha G R
> Attachments: YARN-2495.20141023-1.patch, YARN-2495_20141022.1.patch
>
>
> Target of this JIRA is to allow admin specify labels in each NM, this covers
> - User can set labels in each NM (by setting yarn-site.xml or using script 
> suggested by [~aw])
> - NM will send labels to RM via ResourceTracker API
> - RM will set labels in NodeLabelManager when NM register/update labels



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed

2014-10-22 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2724:
--
Target Version/s:   (was: 2.5.1)

bq. As the log aggregation is done by NM user, giving it the permissions to 
access the generated log file should fix this issue.
Agreed. I guess the problem that YARN should address is to surface the issue 
with aggregation to the end-user - right now it's not clear what really 
happened.

> If an unreadable file is encountered during log aggregation then aggregated 
> file in HDFS badly formed
> -
>
> Key: YARN-2724
> URL: https://issues.apache.org/jira/browse/YARN-2724
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.5.1
>Reporter: Sumit Mohanty
>Assignee: Xuan Gong
>
> Look into the log output snippet. It looks like there is an issue during 
> aggregation when an unreadable file is encountered. Likely, this results in 
> bad encoding.
> {noformat}
> LogType: command-13.json
> LogLength: 13934
> Log Contents:
> Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json
>  (Permission denied)command-3.json13983Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json
>  (Permission denied)
>   
> errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: 
> [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K->15575K(184320K), 
> 0.0488700 secs] 163840K->15575K(1028096K), 0.0492510 secs] [Times: user=0.06 
> sys=0.01, real=0.05 secs]
> 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: 
> [ParNew: 179415K->11865K(184320K), 0.0941310 secs] 179415K->17228K(1028096K), 
> 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs]
> 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 
> 95.187: [ParNew: 175705K->12802K(184320K), 0.0466420 secs] 
> 181068K->18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, 
> real=0.04 secs]
> {noformat}
> Specifically, look at the text after the exception text. There should be two 
> more entries for log files but none exist. This is likely due to the fact 
> that command-13.json is expected to be of length 13934 but it is not, as the 
> file was never read.
> I think, it should have been
> {noformat}
> LogType: command-13.json
> LogLength: 
> Log Contents:
> Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json
>  (Permission denied)command-3.json13983Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json
>  (Permission denied)
> {noformat}
> {noformat}
> LogType: errors-3.txt
> LogLength:0
> Log Contents:
> {noformat}
> {noformat}
> LogType:gc.log
> LogLength:???
> Log Contents:
> ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: 
> [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2503) Changes in RM Web UI to better show labels to end users

2014-10-22 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2503:
-
Description: 
Include but not limited to:
- Show labels of nodes in RM/nodes page
- Show labels of queue in RM/scheduler page

  was:
Include but not limited to:
- Show labels of nodes in RM/nodes page
- Show labels of queue in RM/scheduler page
- Warn user/admin if capacity of queue cannot be guaranteed according to mis 
config of labels.


> Changes in RM Web UI to better show labels to end users
> ---
>
> Key: YARN-2503
> URL: https://issues.apache.org/jira/browse/YARN-2503
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-2503.patch
>
>
> Include but not limited to:
> - Show labels of nodes in RM/nodes page
> - Show labels of queue in RM/scheduler page



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2503) Changes in RM Web UI to better show labels to end users

2014-10-22 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2503:
-
Attachment: YARN-2503-20141022-1.patch

Attached an updated patch

> Changes in RM Web UI to better show labels to end users
> ---
>
> Key: YARN-2503
> URL: https://issues.apache.org/jira/browse/YARN-2503
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>     Attachments: YARN-2503-20141022-1.patch, YARN-2503.patch
>
>
> Include but not limited to:
> - Show labels of nodes in RM/nodes page
> - Show labels of queue in RM/scheduler page



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed

2014-10-22 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180554#comment-14180554
 ] 

Xuan Gong commented on YARN-2724:
-

As [~mitdesai] mentioned, "the problem here is due to calculation of file 
length before even trying to open the file. Log aggregator reads the file 
length of the log file that is to be aggregated and records it. Then it tries 
to go and read the file contents."

For the issue reported by [~sumitmohanty], it is because of a file permission 
problem: we cannot aggregate the log file.

Looking at the code
{code}
final long fileLength = logFile.length();
// Write the logFile Type
out.writeUTF(logFile.getName());

// Write the log length as UTF so that it is printable
out.writeUTF(String.valueOf(fileLength));

// Write the log itself
FileInputStream in = null;
try {
  in = SecureIOUtils.openForRead(logFile, getUser(), null);
  byte[] buf = new byte[65535];
  int len = 0;
  long bytesLeft = fileLength;
  while ((len = in.read(buf)) != -1) {
    // If buffer contents within fileLength, write
    if (len < bytesLeft) {
      out.write(buf, 0, len);
      bytesLeft -= len;
    }
    // else only write contents within fileLength, then exit early
    else {
      out.write(buf, 0, (int) bytesLeft);
      break;
    }
  }
  long newLength = logFile.length();
  if (fileLength < newLength) {
    LOG.warn("Aggregated logs truncated by approximately " +
        (newLength - fileLength) + " bytes.");
  }
  this.uploadedFiles.add(logFile);
} catch (IOException e) {
  String message = "Error aggregating log file. Log file : "
      + logFile.getAbsolutePath() + e.getMessage();
  LOG.error(message, e);
  out.write(message.getBytes());
} finally {
  if (in != null) {
    in.close();
  }
}
{code}
Beyond the permission issue, there are other failures that can cause the 
same problem.

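One hedged sketch of a more robust shape for the unreadable-file case (a hypothetical helper mirroring the types used above, not the committed fix): open the file before writing any header, and on failure write a LogLength that matches the error message actually emitted, so the aggregated file stays parseable.
{code}
private void writeOneLog(DataOutputStream out, File logFile, String user)
    throws IOException {
  FileInputStream in;
  try {
    // Open first: permission problems surface here, before any header is written.
    in = SecureIOUtils.openForRead(logFile, user, null);
  } catch (IOException e) {
    byte[] msg = ("Error aggregating log file. Log file : "
        + logFile.getAbsolutePath() + " : " + e.getMessage()).getBytes();
    out.writeUTF(logFile.getName());
    out.writeUTF(String.valueOf(msg.length)); // LogLength matches the bytes that follow
    out.write(msg);
    return;
  }
  try {
    long fileLength = logFile.length();
    out.writeUTF(logFile.getName());
    out.writeUTF(String.valueOf(fileLength));
    byte[] buf = new byte[65535];
    long bytesLeft = fileLength;
    int len;
    while (bytesLeft > 0 && (len = in.read(buf)) != -1) {
      int toWrite = (int) Math.min(len, bytesLeft);
      out.write(buf, 0, toWrite);
      bytesLeft -= toWrite;
    }
  } finally {
    in.close();
  }
}
{code}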

> If an unreadable file is encountered during log aggregation then aggregated 
> file in HDFS badly formed
> -
>
> Key: YARN-2724
> URL: https://issues.apache.org/jira/browse/YARN-2724
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.5.1
>Reporter: Sumit Mohanty
>Assignee: Xuan Gong
>
> Look into the log output snippet. It looks like there is an issue during 
> aggregation when an unreadable file is encountered. Likely, this results in 
> bad encoding.
> {noformat}
> LogType: command-13.json
> LogLength: 13934
> Log Contents:
> Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json
>  (Permission denied)command-3.json13983Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json
>  (Permission denied)
>   
> errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: 
> [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K->15575K(184320K), 
> 0.0488700 secs] 163840K->15575K(1028096K), 0.0492510 secs] [Times: user=0.06 
> sys=0.01, real=0.05 secs]
> 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: 
> [ParNew: 179415K->11865K(184320K), 0.0941310 secs] 179415K->17228K(1028096K), 
> 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs]
> 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 
> 95.187: [ParNew: 175705K->12802K(184320K), 0.0466420 secs] 
> 181068K->18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, 
> real=0.04 secs]
> {noformat}
> Specifically, look at the text after the exception text. There should be two 
> more entries for log files but none exist. This is likely due to the fact 
> that command-13.json is expected to be of length 13934 but it is not, as the 
> file was never read.
> I think, it should have been
> {noformat}
> LogType: command-13.json
> LogLength: 
> Log Contents:
> Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json
>  (Permission denied)command-3.json13983Error aggregating log file. Log file

[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-10-22 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180575#comment-14180575
 ] 

Jian He commented on YARN-2314:
---

Thanks Jason, I looked at the patch; it looks good overall. Just one thing:
- IIUC, {{mayBeCloseProxy}} can be invoked by MR/NMClient, but 
{{proxy.scheduledForClose}} is always false, so it won't call the following 
stopProxy. If the cache is disabled, this doesn't matter too much as the 
idleTimeout is set to 0. But if the cache is enabled, MR/NMClient won't be 
able to explicitly close the proxy?

Also, can you help me understand one point:
bq. See ClientCache.stopClient for details. Given that the whole point of the 
ContainerManagementProtocolProxy cache is to preserve at least one reference to 
the Client, the IPC Client stop method will never be called in practice and IPC 
client threads will never be explicitly torn down as a result of calling 
stopProxy.
once {{ContainerManagementProtocolProxy#tryCloseProxy}} is called, it will 
internally call {{rpc.stopProxy}}; will that eventually call 
{{ClientCache#stopClient}}?


> ContainerManagementProtocolProxy can create thousands of threads for a large 
> cluster
> 
>
> Key: YARN-2314
> URL: https://issues.apache.org/jira/browse/YARN-2314
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Critical
> Attachments: YARN-2314.patch, YARN-2314v2.patch, 
> disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch, 
> tez-yarn-2314.xlsx
>
>
> ContainerManagementProtocolProxy has a cache of NM proxies, and the size of 
> this cache is configurable.  However the cache can grow far beyond the 
> configured size when running on a large cluster and blow AM address/container 
> limits.  More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed

2014-10-22 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180581#comment-14180581
 ] 

Xuan Gong commented on YARN-2724:
-

So, the exception reported here is caused by the file permission problem: we 
could not aggregate this log file, and the aggregated file is badly formed.
We can fix this issue first in this ticket.

> If an unreadable file is encountered during log aggregation then aggregated 
> file in HDFS badly formed
> -
>
> Key: YARN-2724
> URL: https://issues.apache.org/jira/browse/YARN-2724
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.5.1
>Reporter: Sumit Mohanty
>Assignee: Xuan Gong
>
> Look into the log output snippet. It looks like there is an issue during 
> aggregation when an unreadable file is encountered. Likely, this results in 
> bad encoding.
> {noformat}
> LogType: command-13.json
> LogLength: 13934
> Log Contents:
> Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json
>  (Permission denied)command-3.json13983Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json
>  (Permission denied)
>   
> errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: 
> [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K->15575K(184320K), 
> 0.0488700 secs] 163840K->15575K(1028096K), 0.0492510 secs] [Times: user=0.06 
> sys=0.01, real=0.05 secs]
> 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: 
> [ParNew: 179415K->11865K(184320K), 0.0941310 secs] 179415K->17228K(1028096K), 
> 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs]
> 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 
> 95.187: [ParNew: 175705K->12802K(184320K), 0.0466420 secs] 
> 181068K->18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, 
> real=0.04 secs]
> {noformat}
> Specifically, look at the text after the exception text. There should be two 
> more entries for log files but none exist. This is likely due to the fact 
> that command-13.json is expected to be of length 13934 but it is not, as the 
> file was never read.
> I think, it should have been
> {noformat}
> LogType: command-13.json
> LogLength: 
> Log Contents:
> Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json
>  (Permission denied)command-3.json13983Error aggregating log file. Log file : 
> /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json
>  (Permission denied)
> {noformat}
> {noformat}
> LogType: errors-3.txt
> LogLength:0
> Log Contents:
> {noformat}
> {noformat}
> LogType:gc.log
> LogLength:???
> Log Contents:
> ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: 
> [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor

2014-10-22 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180592#comment-14180592
 ] 

Xuan Gong commented on YARN-2701:
-

[~aw] Do you have any other comments on this patch?

> Potential race condition in startLocalizer when using LinuxContainerExecutor  
> --
>
> Key: YARN-2701
> URL: https://issues.apache.org/jira/browse/YARN-2701
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, 
> YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch, 
> YARN-2701.addendum.1.patch, YARN-2701.addendum.2.patch
>
>
> When LinuxContainerExecutor performs startLocalizer, we go through the native 
> code in container-executor.c. 
> {code}
>  if (stat(npath, &sb) != 0) {
>    if (mkdir(npath, perm) != 0) {
> {code}
> We use a check-then-create approach to create the appDir under /usercache. 
> But if two containers try to do this at the same time, a race 
> condition may happen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity

2014-10-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180594#comment-14180594
 ] 

Wangda Tan commented on YARN-2726:
--

[~tweek], good suggestion! I completely agree with making the error message 
clearer to the admin/user.

> CapacityScheduler should explicitly log when an accessible label has no 
> capacity
> 
>
> Key: YARN-2726
> URL: https://issues.apache.org/jira/browse/YARN-2726
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Phil D'Amore
>Assignee: Naganarasimha G R
>Priority: Minor
>
> Given:
> - Node label defined: test-label
> - Two queues defined: a, b
> - label accessibility and capacity defined as follows (properties 
> abbreviated for readability):
> root.a.accessible-node-labels = test-label
> root.a.accessible-node-labels.test-label.capacity = 100
> If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack 
> trace with the following error buried within:
> "Illegal capacity of -1.0 for label=test-label in queue=root.b"
> This of course occurs because test-label is accessible to b due to 
> inheritance from the root, and -1 is the UNDEFINED value.  To my mind this 
> might not be obvious to the admin, and the error message which results does 
> not help guide someone to the source of the issue.
> I propose that this situation be updated so that when the capacity on an 
> accessible label is undefined, it is explicitly called out instead of falling 
> through to the illegal capacity check.  Something like:
> {code}
> if (capacity == UNDEFINED) {
> throw new IllegalArgumentException("Configuration issue: " + " label=" + 
> label + " is accessible from queue=" + queue + " but has no capacity set.");
> }
> {code}
> I'll leave it to better judgement than mine as to whether I'm throwing the 
> appropriate exception there.  I think this check should be added to both 
> getNodeLabelCapacities and getMaximumNodeLabelCapacities in 
> CapacitySchedulerConfiguration.java.
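
To make the proposed behaviour concrete, a minimal standalone sketch of such a check is shown below; the class and method names are hypothetical, not the actual CapacitySchedulerConfiguration API, which would perform this check inside getNodeLabelCapacities/getMaximumNodeLabelCapacities.
{code}
public class LabelCapacityCheckSketch {
  private static final float UNDEFINED = -1.0f;

  // Fail fast with an explicit message when an accessible label has no capacity set.
  static float checkLabelCapacity(String queue, String label, float configured) {
    if (configured == UNDEFINED) {
      throw new IllegalArgumentException("Configuration issue: label=" + label
          + " is accessible from queue=" + queue + " but has no capacity set.");
    }
    return configured;
  }

  public static void main(String[] args) {
    // root.b inherits accessibility to test-label from the root but never sets a capacity.
    checkLabelCapacity("root.b", "test-label", UNDEFINED); // throws with a clear message
  }
}
{code}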



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity

2014-10-22 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2726:
-
Issue Type: Sub-task  (was: Improvement)
Parent: YARN-2492

> CapacityScheduler should explicitly log when an accessible label has no 
> capacity
> 
>
> Key: YARN-2726
> URL: https://issues.apache.org/jira/browse/YARN-2726
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Phil D'Amore
>Assignee: Naganarasimha G R
>Priority: Minor
>
> Given:
> - Node label defined: test-label
> - Two queues defined: a, b
> - label accessibility and capacity defined as follows (properties 
> abbreviated for readability):
> root.a.accessible-node-labels = test-label
> root.a.accessible-node-labels.test-label.capacity = 100
> If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack 
> trace with the following error buried within:
> "Illegal capacity of -1.0 for label=test-label in queue=root.b"
> This of course occurs because test-label is accessible to b due to 
> inheritance from the root, and -1 is the UNDEFINED value.  To my mind this 
> might not be obvious to the admin, and the error message which results does 
> not help guide someone to the source of the issue.
> I propose that this situation be updated so that when the capacity on an 
> accessible label is undefined, it is explicitly called out instead of falling 
> through to the illegal capacity check.  Something like:
> {code}
> if (capacity == UNDEFINED) {
> throw new IllegalArgumentException("Configuration issue: " + " label=" + 
> label + " is accessible from queue=" + queue + " but has no capacity set.");
> }
> {code}
> I'll leave it to better judgement than mine as to whether I'm throwing the 
> appropriate exception there.  I think this check should be added to both 
> getNodeLabelCapacities and getMaximumNodeLabelCapacities in 
> CapacitySchedulerConfiguration.java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity

2014-10-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180637#comment-14180637
 ] 

Wangda Tan commented on YARN-2726:
--

Converted this to a sub-task of YARN-2492.

> CapacityScheduler should explicitly log when an accessible label has no 
> capacity
> 
>
> Key: YARN-2726
> URL: https://issues.apache.org/jira/browse/YARN-2726
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Phil D'Amore
>Assignee: Naganarasimha G R
>Priority: Minor
>
> Given:
> - Node label defined: test-label
> - Two queues defined: a, b
> - label accessibility and capacity defined as follows (properties 
> abbreviated for readability):
> root.a.accessible-node-labels = test-label
> root.a.accessible-node-labels.test-label.capacity = 100
> If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack 
> trace with the following error buried within:
> "Illegal capacity of -1.0 for label=test-label in queue=root.b"
> This of course occurs because test-label is accessible to b due to 
> inheritance from the root, and -1 is the UNDEFINED value.  To my mind this 
> might not be obvious to the admin, and the error message which results does 
> not help guide someone to the source of the issue.
> I propose that this situation be updated so that when the capacity on an 
> accessible label is undefined, it is explicitly called out instead of falling 
> through to the illegal capacity check.  Something like:
> {code}
> if (capacity == UNDEFINED) {
> throw new IllegalArgumentException("Configuration issue: " + " label=" + 
> label + " is accessible from queue=" + queue + " but has no capacity set.");
> }
> {code}
> I'll leave it to better judgement than mine as to whether I'm throwing the 
> appropriate exception there.  I think this check should be added to both 
> getNodeLabelCapacities and getMaximumNodeLabelCapacities in 
> CapacitySchedulerConfiguration.java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception

2014-10-22 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2721:
---
Priority: Blocker  (was: Major)

> Race condition: ZKRMStateStore retry logic may throw NodeExist exception 
> -
>
> Key: YARN-2721
> URL: https://issues.apache.org/jira/browse/YARN-2721
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: YARN-2721.1.patch
>
>
> Blindly retrying operations in ZooKeeper will not work for non-idempotent 
> operations (like creating a znode). The client can issue a create 
> znode, but the response may never be returned because the server dies or the 
> request times out. When the create is retried, it throws a NODE_EXISTS 
> exception because the earlier create from the same session already succeeded.
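
One common way to make a retried create tolerable is to treat NODE_EXISTS on the same path as success (safe only when the retry writes the same data, as a blind retry of the original request does). A minimal sketch under that assumption is below; it only illustrates the idea and is not necessarily what the attached patch does.
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class IdempotentCreateSketch {
  // Create a znode, treating NODE_EXISTS as success so that a retry after a
  // lost response does not fail spuriously.
  static void createIfAbsent(ZooKeeper zk, String path, byte[] data)
      throws KeeperException, InterruptedException {
    try {
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      // The earlier create from the same session already succeeded; nothing to do.
    }
  }
}
{code}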



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2722) Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle

2014-10-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2722:
--
Attachment: YARN-2722-1.patch

This patch creates a whitelist {"TLSv1.2", "TLSv1.1", "TLSv1"} for the 
SSLFactory. I have verified it with the ShuffleHandler (port 13562).
{code:title=Without fix}
$ openssl s_client -connect localhost:13562 -ssl3
CONNECTED(0003)
depth=0 CN = *.ent.cloudera.com
verify error:num=18:self signed certificate
verify return:1
depth=0 CN = *.ent.cloudera.com
verify return:1
---
Certificate chain
 0 s:/CN=*.ent.cloudera.com
   i:/CN=*.ent.cloudera.com
---
Server certificate
-BEGIN CERTIFICATE-
MIIC2TCCAcGgAwIBAgIERTXzmDANBgkqhkiG9w0BAQsFADAdMRswGQYDVQQDDBIq
LmVudC5jbG91ZGVyYS5jb20wHhcNMTQxMDE0MjEwOTU1WhcNMTUwMTEyMjEwOTU1
WjAdMRswGQYDVQQDDBIqLmVudC5jbG91ZGVyYS5jb20wggEiMA0GCSqGSIb3DQEB
AQUAA4IBDwAwggEKAoIBAQDdd3RIofg6S0jNi1tZPLC/ye4yLz5PLdxpn5Rlmg8p
jORirbyvsLSn82WcfITUUx8Iez9pYLLXBzOqS4nlXwFP1WHDHGJFyuidTOaXm2fr
sZIVYUx0ldzUT6AhSLQ1p81g8Uplv3xA+Bh/SIXU84vKnjH6eU2wJc/0AKS6Jchl
hNr9ZuMEK6Dc34MbjOd0inLNqR2A26wV/tEPhf3UWbpkED9J8DZqevp25hvmYomM
OSoUSyO2hc6Mkj97Cbd8OglbXzG0lFzCgmN0yqFZ7X8pZuOzs2MhnzXtzjUbwvyO
G+1mpQ95Oc1cBdK40Rq/xeE8NwDP6C9JJ8FEz/VuuUZfAgMBAAGjITAfMB0GA1Ud
DgQWBBR/aS6adMIKP9pQbfcNkxyIbRMXJDANBgkqhkiG9w0BAQsFAAOCAQEAktNr
AzECBbO3hZEmjbZ/lnE+9DI7LF8DV1XbwZqd5qXhnnqZde5CryOGsAn76RkizUlo
KH1+8w8WRW8YxCx3863dOKg9yRr8rR5+BedSfG1GeF9PSpRYJ1o5Bv9wLNjI+UM0
E6zq3ObxpLe1QqXwz5Ro5DOIaBN5GRNp6i1B6k6b1aPsJOAaBkuFkR+unBCWnQk7
uMtGb78LaCYU0/8D5fRMTkeChR9gxuwYj7hwt3+CKdKEQ+0Mxbd5/sO8HgGlOcB1
T1xtu/GXoboiwwn6pLm/OksEyxB9TXnSvkc9C/RXQeaSaiEvYksS1LvPkvq27qDU
09EC8C1HkfWd4uOKYA==
-END CERTIFICATE-
subject=/CN=*.ent.cloudera.com
issuer=/CN=*.ent.cloudera.com
---
No client certificate CA names sent
---
SSL handshake has read 1239 bytes and written 288 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-DES-CBC3-SHA
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
Protocol  : SSLv3
Cipher: ECDHE-RSA-DES-CBC3-SHA
Session-ID: 5446E4F74C3341F5AEA8CB827A5745A90AB8BF09765C4EDBBE57174314AEC901
Session-ID-ctx:
Master-Key: 
D6C5A557D188361EB4E25414C6360EC6835143D27572D7A0019213C2AD175852C8F850D21B95DF334EC8B95D9FDB
Key-Arg   : None
PSK identity: None
PSK identity hint: None
SRP username: None
Start Time: 1413932279
Timeout   : 7200 (sec)
Verify return code: 18 (self signed certificate)
---
q
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=UTF-8
name: mapreduce
version: 1.0.0

closed
{code}

{code:title=With Fix}
$ openssl s_client -connect localhost:13562 -ssl3
CONNECTED(0003)
write:errno=104
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 0 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
SSL-Session:
Protocol  : SSLv3
Cipher: 
Session-ID:
Session-ID-ctx:
Master-Key:
Key-Arg   : None
PSK identity: None
PSK identity hint: None
SRP username: None
Start Time: 1414013826
Timeout   : 7200 (sec)
Verify return code: 0 (ok)
---
{code}
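
For reference, the whitelist idea can be expressed with the standard JSSE API roughly as follows (an illustrative sketch only, not the exact SSLFactory change in the patch): keep creating the context with "TLS", but restrict the enabled protocols on each engine so an SSLv3 handshake is rejected.
{code}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class TlsOnlyEngineSketch {
  static SSLEngine newRestrictedEngine() throws Exception {
    SSLContext context = SSLContext.getInstance("TLS");
    context.init(null, null, null); // default key/trust managers, fine for the sketch
    SSLEngine engine = context.createSSLEngine();
    // Only the whitelisted protocol versions may be negotiated.
    engine.setEnabledProtocols(new String[] {"TLSv1.2", "TLSv1.1", "TLSv1"});
    return engine;
  }
}
{code}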

> Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle
> -
>
> Key: YARN-2722
> URL: https://issues.apache.org/jira/browse/YARN-2722
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2722-1.patch
>
>
> We should disable SSLv3 in HttpFS to protect against the POODLEbleed 
> vulnerability.
> See [CVE-2014-3566 
> |http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-3566]
> We have {{context = SSLContext.getInstance("TLS");}} in SSLFactory, but when 
> I checked, I could still connect with SSLv3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2183) Cleaner service for cache manager

2014-10-22 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180676#comment-14180676
 ] 

Karthik Kambatla commented on YARN-2183:


bq. we do have a YARN admin command implemented that lets you run the cleaner 
task on demand (YARN-2189). 
Cool. In that case, I see the merit to keeping runCleanerTask around.

bq. this check is needed to prevent a race (i.e. not allow an on-demand run 
when a scheduled run is in progress).
I understand we need a check to prevent the race. I wonder if we can just 
re-use the existing check in CleanerTask#run instead of an explicit check in 
CleanerService#runCleanerTask? From what I remember, that would make the code 
in CleanerTask#run cleaner as well. (no pun intended)

bq. However, it’s not clear to me whether a dependency from an SCMStore to an 
AppChecker is always a fine requirement for other types of stores.
I poked around a little more, and here is what I think. SharedCacheManager 
creates an instance of AppChecker; the rest of the SCM pieces (Store, 
CleanerService) should just use the same instance. This instance can be passed 
either in the constructor or through an SCMContext similar to RMContext. Or, we 
could add SCM#getAppChecker. 

In its current form, CleanerTask#cleanResourceReferences fetches the references 
from the store, checks if the apps are running, and asks the store to remove 
the references. Moving the whole method to the store would simplify the code 
more. 

The latest patch looks pretty good except for the above two points. One other nit: 
one of {CleanerTask, CleanerService} has unused imports. 
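
A minimal sketch of the "single AppChecker instance, injected via constructors" idea is below; the types and method names are illustrative only, not the actual YARN-2183 classes.
{code}
interface AppChecker {
  boolean isApplicationActive(String appId);
}

class SCMStoreSketch {
  protected final AppChecker appChecker;
  SCMStoreSketch(AppChecker appChecker) { this.appChecker = appChecker; }
}

class CleanerServiceSketch {
  private final AppChecker appChecker;
  CleanerServiceSketch(AppChecker appChecker) { this.appChecker = appChecker; }
}

class SharedCacheManagerSketch {
  void init() {
    AppChecker appChecker = appId -> false;                              // the single instance
    SCMStoreSketch store = new SCMStoreSketch(appChecker);               // shared with the store
    CleanerServiceSketch cleaner = new CleanerServiceSketch(appChecker); // and with the cleaner
  }
}
{code}
An SCMContext holding the AppChecker (similar to RMContext) would achieve the same thing; the point is simply that only one instance is created and every component sees the same one.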

> Cleaner service for cache manager
> -
>
> Key: YARN-2183
> URL: https://issues.apache.org/jira/browse/YARN-2183
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: YARN-2183-trunk-v1.patch, YARN-2183-trunk-v2.patch, 
> YARN-2183-trunk-v3.patch, YARN-2183-trunk-v4.patch, YARN-2183-trunk-v5.patch
>
>
> Implement the cleaner service for the cache manager along with metrics for 
> the service. This service is responsible for cleaning up old resource 
> references in the manager and removing stale entries from the cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2503) Changes in RM Web UI to better show labels to end users

2014-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180683#comment-14180683
 ] 

Hadoop QA commented on YARN-2503:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12676426/YARN-2503-20141022-1.patch
  against trunk revision 7b0f9bb.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5501//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5501//console

This message is automatically generated.

> Changes in RM Web UI to better show labels to end users
> ---
>
> Key: YARN-2503
> URL: https://issues.apache.org/jira/browse/YARN-2503
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>     Attachments: YARN-2503-20141022-1.patch, YARN-2503.patch
>
>
> Include but not limited to:
> - Show labels of nodes in RM/nodes page
> - Show labels of queue in RM/scheduler page



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180690#comment-14180690
 ] 

Hadoop QA commented on YARN-2198:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12676333/YARN-2198.16.patch
  against trunk revision a36399e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5502//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5502//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5502//console

This message is automatically generated.

> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
> YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, 
> YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, 
> YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
> YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
> YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
> YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
> YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However this 
> executor requires the process launching the container to be LocalSystem or a 
> member of a local Administrators group. Since the process in question is 
> the NodeManager, the requirement translates to the entire NM to run as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-10-22 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180695#comment-14180695
 ] 

Jian He commented on YARN-2198:
---

committing 

> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
> YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, 
> YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, 
> YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
> YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
> YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
> YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
> YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However this 
> executor requires the process launching the container to be LocalSystem or a 
> member of a local Administrators group. Since the process in question is 
> the NodeManager, the requirement translates to the entire NM to run as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2722) Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle

2014-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180724#comment-14180724
 ] 

Hadoop QA commented on YARN-2722:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12676452/YARN-2722-1.patch
  against trunk revision a36399e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5503//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5503//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5503//console

This message is automatically generated.

> Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle
> -
>
> Key: YARN-2722
> URL: https://issues.apache.org/jira/browse/YARN-2722
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2722-1.patch
>
>
> We should disable SSLv3 in HttpFS to protect against the POODLEbleed 
> vulnerability.
> See [CVE-2014-3566 
> |http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-3566]
> We have {{context = SSLContext.getInstance("TLS");}} in SSLFactory, but when 
> I checked, I could still connect with SSLv3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2718) Create a CompositeConatainerExecutor that combines DockerContainerExecutor and DefaultContainerExecutor

2014-10-22 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180728#comment-14180728
 ] 

Allen Wittenauer commented on YARN-2718:


I don't think creating some sort of mutant executor is really the proper fix 
here.  I suspect that allowing users to pick which executor to use (from 
an admin-approved list) is probably closer (and quicker!) to the real goal.

But the bigger issue is that this will lead to some very weird and 
unpredictable administrative experiences.  It also means that users will be 
given even more influence over how the NM actually works.  This is a bit of a 
dangerous road to start heading down...

> Create a CompositeConatainerExecutor that combines DockerContainerExecutor 
> and DefaultContainerExecutor
> ---
>
> Key: YARN-2718
> URL: https://issues.apache.org/jira/browse/YARN-2718
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Abin Shahab
>
> There should be a composite container that allows users to run their jobs in 
> DockerContainerExecutor, but switch to DefaultContainerExecutor for debugging 
> purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-10-22 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180733#comment-14180733
 ] 

Jian He commented on YARN-2198:
---

thanks Craig for reviewing the patch !

> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Fix For: 2.6.0
>
> Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
> YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, 
> YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, 
> YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
> YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
> YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
> YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
> YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However this 
> executor requires the process launching the container to be LocalSystem or a 
> member of a local Administrators group. Since the process in question is 
> the NodeManager, the requirement translates to the entire NM to run as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user

2014-10-22 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180735#comment-14180735
 ] 

Jian He commented on YARN-1063:
---

I merged this to 2.6, as this was marked for 2.6

> Winutils needs ability to create task as domain user
> 
>
> Key: YARN-1063
> URL: https://issues.apache.org/jira/browse/YARN-1063
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
> Environment: Windows
>Reporter: Kyle Leckie
>Assignee: Remus Rusanu
>  Labels: security, windows
> Fix For: 2.6.0
>
> Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, 
> YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch
>
>
> h1. Summary:
> Securing a Hadoop cluster requires constructing some form of security 
> boundary around the processes executed in YARN containers. Isolation based on 
> Windows users seems most feasible. This approach is similar to the 
> approach taken by the existing LinuxContainerExecutor. The current patch to 
> winutils.exe adds the ability to create a process as a domain user. 
> h1. Alternative Methods considered:
> h2. Process rights limited by security token restriction:
> On Windows access decisions are made by examining the security token of a 
> process. It is possible to spawn a process with a restricted security token. 
> Any of the rights granted by SIDs of the default token may be restricted. It 
> is possible to see this in action by examining the security token of a 
> sandboxed process launched by a web browser. Typically the launched process 
> will have a fully restricted token and need to access machine resources 
> through a dedicated broker process that enforces a custom security policy. 
> This broker process mechanism would break compatibility with the typical 
> Hadoop container process. The Container process must be able to utilize 
> standard function calls for disk and network IO. I performed some work 
> looking at ways to ACL the local files to the specific launched process without 
> granting rights to other processes launched on the same machine but found 
> this to be an overly complex solution. 
> h2. Relying on APP containers:
> Recent versions of windows have the ability to launch processes within an 
> isolated container. Application containers are supported for execution of 
> WinRT based executables. This method was ruled out due to the lack of 
> official support for standard windows APIs. At some point in the future 
> windows may support functionality similar to BSD jails or Linux containers, 
> at that point support for containers should be added.
> h1. Create As User Feature Description:
> h2. Usage:
> A new sub command was added to the set of task commands. Here is the syntax:
> winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE]
> Some notes:
> * The username specified is in the format of "user@domain"
> * The machine executing this command must be joined to the domain of the user 
> specified
> * The domain controller must allow the account executing the command access 
> to the user information. For this, join the account to the predefined group 
> labeled "Pre-Windows 2000 Compatible Access"
> * The account running the command must have several rights on the local 
> machine. These can be managed manually using secpol.msc: 
> ** "Act as part of the operating system" - SE_TCB_NAME
> ** "Replace a process-level token" - SE_ASSIGNPRIMARYTOKEN_NAME
> ** "Adjust memory quotas for a process" - SE_INCREASE_QUOTA_NAME
> * The launched process will not have rights to the desktop so will not be 
> able to display any information or create UI.
> * The launched process will have no network credentials. Any access of 
> network resources that requires domain authentication will fail.
> h2. Implementation:
> Winutils performs the following steps:
> # Enable the required privileges for the current process.
> # Register as a trusted process with the Local Security Authority (LSA).
> # Create a new logon for the user passed on the command line.
> # Load/Create a profile on the local machine for the new logon.
> # Create a new environment for the new logon.
> # Launch the new process in a job with the task name specified and using the 
> created logon.
> # Wait for the JOB to exit.
> h2. Future work:
> The following work was scoped out of this check in:
> * Support for non-domain users or machine that are not domain joined.
> * Support for privilege isolation by running the task launcher in a high 
> privilege service with access over an ACLed named pipe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-10-22 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2010:
---
Attachment: yarn-2010-4.patch

> RM can't transition to active if it can't recover an app attempt
> 
>
> Key: YARN-2010
> URL: https://issues.apache.org/jira/browse/YARN-2010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: bc Wong
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, 
> yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch
>
>
> If the RM fails to recover an app attempt, it won't come up. We should make 
> it more resilient.
> Specifically, the underlying error is that the app was submitted before 
> Kerberos security got turned on. Makes sense for the app to fail in this 
> case. But YARN should still start.
> {noformat}
> 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Exception handling the winning of election 
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
> Active 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>  
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
>  
> ... 4 more 
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
>  
> ... 5 more 
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
> ... 8 more 
> Caused by: java.lang.IllegalArgumentException: Missing argument 
> at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) 
> at 
> org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
>  
> ... 13 more 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180739#comment-14180739
 ] 

Hudson commented on YARN-2198:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6318 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6318/])
YARN-2198. Remove the need to run NodeManager as privileged account for Windows 
Secure Container Executor. Contributed by Remus Rusanu (jianhe: rev 
3b12fd6cfbf4cc91ef8e8616c7aafa9de006cde5)
* 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/util/ProcessTree.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* hadoop-common-project/hadoop-common/src/main/winutils/hadoopwinutilsvc.idl
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.vcxproj
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java
* hadoop-common-project/hadoop-common/pom.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
* 
hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/yarn/server/nodemanager/windows_secure_container_executor.h
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java
* hadoop-common-project/hadoop-common/src/main/winutils/include/winutils.h
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java
* hadoop-common-project/hadoop-common/src/main/winutils/task.c
* hadoop-common-project/hadoop-common/src/main/winutils/client.c
* hadoop-common-project/hadoop-common/src/main/winutils/config.cpp
* hadoop-common-project/hadoop-common/src/main/native/native.vcxproj
* hadoop-common-project/hadoop-common/src/main/winutils/winutils.vcxproj
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java
* .gitignore
* 
hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/yarn/server/nodemanager/windows_secure_container_executor.c
* hadoop-common-project/hadoop-common/src/main/winutils/winutils.sln
* hadoop-common-project/hadoop-common/src/main/winutils/service.c
* hadoop-common-project/hadoop-common/src/main/winutils/main.c
* hadoop-common-project/hadoop-common/src/main/winutils/chown.c
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java
* 
hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/NativeIO.c
* hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.c
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
* hadoop-common-project/hadoop-common/src/main/winutils/winutils.mc


> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>

[jira] [Updated] (YARN-2010) If RM fails to recover an app, it can never transition to active again

2014-10-22 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2010:
---
Summary: If RM fails to recover an app, it can never transition to active 
again  (was: RM can't transition to active if it can't recover an app attempt)

> If RM fails to recover an app, it can never transition to active again
> --
>
> Key: YARN-2010
> URL: https://issues.apache.org/jira/browse/YARN-2010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: bc Wong
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, 
> yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch
>
>
> If the RM fails to recover an app attempt, it won't come up. We should make 
> it more resilient.
> Specifically, the underlying error is that the app was submitted before 
> Kerberos security got turned on. Makes sense for the app to fail in this 
> case. But YARN should still start.
> {noformat}
> 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Exception handling the winning of election 
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
> Active 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>  
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
>  
> ... 4 more 
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
>  
> ... 5 more 
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
> ... 8 more 
> Caused by: java.lang.IllegalArgumentException: Missing argument 
> at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) 
> at 
> org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
>  
> ... 13 more 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2730) Only one localizer can run on a NodeManager at a time

2014-10-22 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated YARN-2730:
--
Description: 
The synchronized modifier appears to have been added by 
https://issues.apache.org/jira/browse/MAPREDUCE-3537
It could be removed if the Localizer doesn't depend on the current directory.

> Only one localizer can run on a NodeManager at a time
> -
>
> Key: YARN-2730
> URL: https://issues.apache.org/jira/browse/YARN-2730
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Siqi Li
>Priority: Critical
>
> The synchronized modifier appears to have been added by 
> https://issues.apache.org/jira/browse/MAPREDUCE-3537
> It could be removed if the Localizer doesn't depend on the current directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2730) Only one localizer can run on a NodeManager at a time

2014-10-22 Thread Siqi Li (JIRA)
Siqi Li created YARN-2730:
-

 Summary: Only one localizer can run on a NodeManager at a time
 Key: YARN-2730
 URL: https://issues.apache.org/jira/browse/YARN-2730
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Siqi Li
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2010) If RM fails to recover an app, it can never transition to active again

2014-10-22 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180747#comment-14180747
 ] 

Karthik Kambatla commented on YARN-2010:


This JIRA has been open for a while and has gone through several discussions. I'll 
try to consolidate everything here so we can iterate on this quickly.

Sometimes, the RM fails to recover an application. It could be because of 
turning security on, token expiry, or issues connecting to HDFS etc. The causes 
could be classified into (1) transient, (2) specific to one application, and 
(3) permanent and apply to multiple (all) applications. Today, the RM fails to 
transition to Active and ends up in STOPPED state and can never be transitioned 
to Active again. 

Vinod suggested we handle these cases (exceptions) separately, so we can do the 
right thing for each exception. The latest patch (v4) is along these lines - it 
catches a potentially transient issue (ConnectException) and transitions the RM 
to Standby. If the issue were to persist (case 3), the RM would eventually 
run out of failover attempts and crash. For application-specific issues (as 
of now, all other exceptions), we just skip recovering that app.

In addition to this, the patch cleans up RMAppManager#recoverApplication and 
also adds a null-check in RMAppAttempt#recoverAppAttemptCredentials per Jian's 
suggestion. 
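
Roughly, the handling described above can be sketched as follows (class and method names here are hypothetical, not the actual RMAppManager/ResourceManager code): a transient connection problem aborts the transition so the RM goes back to Standby, while any other failure is treated as specific to that application and recovery continues.
{code}
import java.net.ConnectException;

class RecoverySketch {
  // Thrown to abort the transition to Active so the RM falls back to Standby.
  static class TransitionToStandbyException extends RuntimeException {
    TransitionToStandbyException(Throwable cause) { super(cause); }
  }

  void recoverApplications(Iterable<String> appIds) {
    for (String appId : appIds) {
      try {
        recoverApplication(appId); // hypothetical per-app recovery
      } catch (ConnectException e) {
        // Potentially transient (e.g. HDFS unreachable): give up the transition;
        // a later failover attempt can retry recovery.
        throw new TransitionToStandbyException(e);
      } catch (Exception e) {
        // Treated as specific to this application: skip it, but keep recovering
        // the remaining applications so the RM can still become Active.
        System.err.println("Skipping unrecoverable app " + appId + ": " + e);
      }
    }
  }

  private void recoverApplication(String appId) throws Exception {
    // placeholder for the real recovery logic
  }
}
{code}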

> If RM fails to recover an app, it can never transition to active again
> --
>
> Key: YARN-2010
> URL: https://issues.apache.org/jira/browse/YARN-2010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: bc Wong
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, 
> yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch
>
>
> If the RM fails to recover an app attempt, it won't come up. We should make 
> it more resilient.
> Specifically, the underlying error is that the app was submitted before 
> Kerberos security got turned on. Makes sense for the app to fail in this 
> case. But YARN should still start.
> {noformat}
> 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Exception handling the winning of election 
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
> Active 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>  
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
>  
> ... 4 more 
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
>  
> ... 5 more 
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
> ... 8 more 
> Caused by: java.lang.IllegalArgumentException: Missing argument 

[jira] [Updated] (YARN-2730) Only one localizer can run on a NodeManager at a time

2014-10-22 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated YARN-2730:
--
Description: 
We are seeing that when one of the localizerRunners is stuck, the rest of the 
localizerRunners are blocked. We should remove the synchronized modifier.
The synchronized modifier appears to have been added by 
https://issues.apache.org/jira/browse/MAPREDUCE-3537
It could be removed if the Localizer doesn't depend on the current directory.

  was:
The synchronized modifier appears to have been added by 
https://issues.apache.org/jira/browse/MAPREDUCE-3537
It could be removed if the Localizer doesn't depend on the current directory.


> Only one localizer can run on a NodeManager at a time
> -
>
> Key: YARN-2730
> URL: https://issues.apache.org/jira/browse/YARN-2730
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Siqi Li
>Priority: Critical
>
> We are seeing that when one of the localizerRunners is stuck, the rest of the 
> localizerRunners are blocked. We should remove the synchronized modifier.
> The synchronized modifier appears to have been added by 
> https://issues.apache.org/jira/browse/MAPREDUCE-3537
> It could be removed if the Localizer doesn't depend on the current directory.
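
For illustration, the effect of the synchronized modifier can be sketched as follows (a hypothetical class, not the actual LocalizerTracker/LocalizerRunner code): every caller of a synchronized instance method must acquire the same object monitor, so one stuck localization blocks all the others on that NodeManager.
{code}
class LocalizerTrackerSketch {
  // With synchronized, only one container's localization can be started at a
  // time; if this call hangs, every other container's request blocks here.
  synchronized void startLocalizationSynchronized(String containerId) {
    // ... long-running or stuck work holds the monitor ...
  }

  // Dropping the modifier lets localizers for different containers proceed
  // concurrently, but is only safe if the code no longer relies on shared
  // mutable state such as the process's current working directory.
  void startLocalization(String containerId) {
    // ... per-container work ...
  }
}
{code}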



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2730) Only one localizer can run on a NodeManager at a time

2014-10-22 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li reassigned YARN-2730:
-

Assignee: Siqi Li

> Only one localizer can run on a NodeManager at a time
> -
>
> Key: YARN-2730
> URL: https://issues.apache.org/jira/browse/YARN-2730
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Siqi Li
>Assignee: Siqi Li
>Priority: Critical
>
> We are seeing that when one of the localizerRunners is stuck, the rest of the 
> localizerRunners are blocked. We should remove the synchronized modifier.
> The synchronized modifier appears to have been added by 
> https://issues.apache.org/jira/browse/MAPREDUCE-3537
> It could be removed if the Localizer doesn't depend on the current directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2730) Only one localizer can run on a NodeManager at a time

2014-10-22 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated YARN-2730:
--
Affects Version/s: (was: 2.5.0)
   2.4.0

> Only one localizer can run on a NodeManager at a time
> -
>
> Key: YARN-2730
> URL: https://issues.apache.org/jira/browse/YARN-2730
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Siqi Li
>Assignee: Siqi Li
>Priority: Critical
>
> We are seeing that when one of the localizerRunners is stuck, the rest of the 
> localizerRunners are blocked. We should remove the synchronized modifier.
> The synchronized modifier appears to have been added by 
> https://issues.apache.org/jira/browse/MAPREDUCE-3537
> It could be removed if the Localizer doesn't depend on the current directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2010) If RM fails to recover an app, it can never transition to active again

2014-10-22 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2010:
---
Attachment: issue-stack-strace.rtf

> If RM fails to recover an app, it can never transition to active again
> --
>
> Key: YARN-2010
> URL: https://issues.apache.org/jira/browse/YARN-2010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: bc Wong
>Assignee: Karthik Kambatla
>Priority: Critical
> Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, 
> yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch
>
>
> Sometimes, the RM fails to recover an application. It could be because of 
> turning security on, token expiry, or issues connecting to HDFS etc. The 
> causes could be classified into (1) transient, (2) specific to one 
> application, and (3) permanent and apply to multiple (all) applications. 
> Today, the RM fails to transition to Active and ends up in STOPPED state and 
> can never be transitioned to Active again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

