[jira] [Updated] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-17 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2494:
-
Attachment: YARN-2494.patch

 [YARN-796] Node label manager API and storage implementations
 -

 Key: YARN-2494
 URL: https://issues.apache.org/jira/browse/YARN-2494
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, 
 YARN-2494.patch


 This JIRA includes the APIs and storage implementations of the node label manager.
 NodeLabelManager is an abstract class used to manage labels of nodes in the 
 cluster; it has APIs to query/modify
 - Nodes according to a given label
 - Labels according to a given hostname
 - Add/remove labels
 - Set labels of nodes in the cluster
 - Persist/recover changes of labels/labels-on-nodes to/from storage
 And it has two implementations to store modifications
 - Memory-based storage: it does not persist changes, so all labels are lost 
 when the RM restarts
 - FileSystem-based storage: it persists/recovers to/from a FileSystem (like 
 HDFS), so all labels and labels-on-nodes are recovered upon RM restart
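To make the shape of that API concrete, below is a minimal sketch of such an
abstract class. The class name, method names and signatures are assumptions for
illustration only; they are not taken from the attached patch.
{code}
// Minimal sketch -- names and signatures are illustrative assumptions,
// not the actual API in YARN-2494.patch.
import java.io.IOException;
import java.util.Map;
import java.util.Set;

public abstract class NodeLabelManagerSketch {

  /** Hosts currently carrying the given label. */
  public abstract Set<String> getNodesWithLabel(String label);

  /** Labels currently attached to the given host. */
  public abstract Set<String> getLabelsOnNode(String hostname);

  /** Add or remove labels known to the cluster. */
  public abstract void addLabels(Set<String> labels) throws IOException;
  public abstract void removeLabels(Set<String> labels) throws IOException;

  /** Replace the labels of the given nodes. */
  public abstract void setLabelsOnNodes(Map<String, Set<String>> nodeToLabels)
      throws IOException;

  /**
   * Persist one change: a memory-based store can make this a no-op, while a
   * FileSystem-based store would write it out and replay it on RM restart.
   */
  protected abstract void persistChange(Map<String, Set<String>> nodeToLabels)
      throws IOException;

  /** Recover previously persisted labels/labels-on-nodes on RM restart. */
  public abstract void recover() throws IOException;
}
{code}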



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136872#comment-14136872
 ] 

Wangda Tan commented on YARN-2494:
--

Hi [~cwelch],
Thanks for your comments,
bq. the only change is the (?automated) removal of an import, I think you 
should just drop it from the change set.
Good catch, reverted this file.

bq. why force all to lower case? Discussion favored dropping that...
Updated according to our discussion

bq. checks for valid labels, there must be an easier way/stringlib/regex
Good suggestion, updated

bq. also in updateLableResource - it looks like if node1 has label a b and 
queue q1 has label a b it’s resources will be added 2x and removed 2x, while 
present it will have a 2x value (1x too many)
It should not; please check the test 
{{TestNodeLabelManager#testGetQueueResource}}, which covers the case you 
described.

bq. line 603 exception message needs to include “or not present”
I couldn't find an exception message around line 603; could you please check 
again against the latest patch?

bq. pls rename activeNode deactiveNode to activateNode and deactivateNode
Renamed,

Attached a new patch addressing your comments; please review.

Thanks!
Wangda


 [YARN-796] Node label manager API and storage implementations
 -

 Key: YARN-2494
 URL: https://issues.apache.org/jira/browse/YARN-2494
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, 
 YARN-2494.patch


 This JIRA includes the APIs and storage implementations of the node label manager.
 NodeLabelManager is an abstract class used to manage labels of nodes in the 
 cluster; it has APIs to query/modify
 - Nodes according to a given label
 - Labels according to a given hostname
 - Add/remove labels
 - Set labels of nodes in the cluster
 - Persist/recover changes of labels/labels-on-nodes to/from storage
 And it has two implementations to store modifications
 - Memory-based storage: it does not persist changes, so all labels are lost 
 when the RM restarts
 - FileSystem-based storage: it persists/recovers to/from a FileSystem (like 
 HDFS), so all labels and labels-on-nodes are recovered upon RM restart



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId

2014-09-17 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2558:
-
Attachment: YARN-2558.2.patch

[~jianhe], thanks for your suggestion and review. Updated to add a test 
confirming that serialization and deserialization work correctly.

 Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
 --

 Key: YARN-2558
 URL: https://issues.apache.org/jira/browse/YARN-2558
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Priority: Blocker
 Attachments: YARN-2558.1.patch, YARN-2558.2.patch


 We should update ContainerTokenIdentifier#read/write to use 
 {{getContainerId}} instead of {{getId}} to pass all container information 
 correctly.
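For illustration only, a hedged sketch of the intent: since YARN-2182 the RM
restart epoch lives in the upper bits of the 64-bit id returned by
{{getContainerId}}, so serializing that long preserves it, while the old 32-bit
{{getId}} would drop it. The helper class below is hypothetical and is not the
code in the attached patch.
{code}
// Hypothetical helper, for illustration only -- not the code in YARN-2558.2.patch.
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ContainerId;

public final class ContainerIdWriteSketch {
  public static void writeContainerId(DataOutput out, ContainerId cid)
      throws IOException {
    // Full 64-bit container id, including the RM restart epoch in the upper bits.
    out.writeLong(cid.getContainerId());
    // Old form (drops the epoch): out.writeInt(cid.getId());
  }
}
{code}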



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels

2014-09-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136878#comment-14136878
 ] 

Wangda Tan commented on YARN-2496:
--

Hi [~cwelch],
I'm not sure I quite understand this: did you mean we need to calculate the 
consumed resource for each label (or label expression) under each queue? Could 
you give me an example of how that avoids job starvation?
What confuses me is that, if we track resource per label/label-expression, 
shouldn't we also track resource per host/rack (we can ask for resource only on 
a host/rack by specifying relax-locality)?

Thanks,
Wangda

 [YARN-796] Changes for capacity scheduler to support allocate resource 
 respect labels
 -

 Key: YARN-2496
 URL: https://issues.apache.org/jira/browse/YARN-2496
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, 
 YARN-2496.patch


 This JIRA includes:
 - Add/parse a labels option in {{capacity-scheduler.xml}}, similar to other 
 queue options like capacity/maximum-capacity, etc.
 - Include a default-label-expression option in the queue config; if an app 
 doesn't specify a label-expression, the queue's default-label-expression is 
 used.
 - Check that labels can be accessed by the queue when submitting an app with a 
 label-expression to the queue or updating a ResourceRequest with a label-expression
 - Check labels on the NM when trying to allocate a ResourceRequest with a 
 label-expression on that NM
 - Respect labels when calculating headroom/user-limit
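To illustrate the first two items, here is a hedged sketch of setting such
queue options programmatically; the property keys below are assumptions for
illustration and are not the keys defined by this patch.
{code}
// Property names are illustrative assumptions only, not the keys from YARN-2496.patch.
import org.apache.hadoop.conf.Configuration;

public class CapacitySchedulerLabelConfigSketch {
  public static void main(String[] args) {
    Configuration csConf = new Configuration(false);
    // Labels queue root.a may access, alongside the usual capacity options.
    csConf.set("yarn.scheduler.capacity.root.a.labels", "GPU,LARGE_MEM");
    // Applied when an app submitted to root.a specifies no label-expression.
    csConf.set("yarn.scheduler.capacity.root.a.default-label-expression", "GPU");
    System.out.println(csConf.get("yarn.scheduler.capacity.root.a.labels"));
  }
}
{code}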



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136885#comment-14136885
 ] 

Hadoop QA commented on YARN-2494:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669353/YARN-2494.patch
  against trunk revision c0c7e6f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4986//console

This message is automatically generated.

 [YARN-796] Node label manager API and storage implementations
 -

 Key: YARN-2494
 URL: https://issues.apache.org/jira/browse/YARN-2494
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, 
 YARN-2494.patch


 This JIRA includes the APIs and storage implementations of the node label manager.
 NodeLabelManager is an abstract class used to manage labels of nodes in the 
 cluster; it has APIs to query/modify
 - Nodes according to a given label
 - Labels according to a given hostname
 - Add/remove labels
 - Set labels of nodes in the cluster
 - Persist/recover changes of labels/labels-on-nodes to/from storage
 And it has two implementations to store modifications
 - Memory-based storage: it does not persist changes, so all labels are lost 
 when the RM restarts
 - FileSystem-based storage: it persists/recovers to/from a FileSystem (like 
 HDFS), so all labels and labels-on-nodes are recovered upon RM restart



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2505) [YARN-796] Support get/add/remove/change labels in RM REST API

2014-09-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136891#comment-14136891
 ] 

Wangda Tan commented on YARN-2505:
--

Hi Craig,
I've reviewed this patch; some comments:
1) I think it's better to rename /labels/all-nodes-to-labels to 
/labels/nodes-to-labels, because it doesn't always return all nodes-to-labels. 
And I think the filter would be better renamed to node-filter; my feeling is 
it's not very natural to apply a filter on values instead of keys.
Or we can support both node-filter and label-filter.

2) Some lines exceed 80 characters; you can run this regex in vim to check:
/^+.\{80,}

3) The test looks very good to me, thanks!

Regards,
Wangda

 [YARN-796] Support get/add/remove/change labels in RM REST API
 --

 Key: YARN-2505
 URL: https://issues.apache.org/jira/browse/YARN-2505
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Craig Welch
 Attachments: YARN-2505.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136900#comment-14136900
 ] 

Hadoop QA commented on YARN-2558:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669356/YARN-2558.2.patch
  against trunk revision c0c7e6f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4987//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4987//console

This message is automatically generated.

 Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
 --

 Key: YARN-2558
 URL: https://issues.apache.org/jira/browse/YARN-2558
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Priority: Blocker
 Attachments: YARN-2558.1.patch, YARN-2558.2.patch


 We should update ContainerTokenIdentifier#read/write to use 
 {{getContainerId}} instead of {{getId}} to pass all container information 
 correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2561) MR job client cannot reconnect to AM after NM restart.

2014-09-17 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-2561:
-
Attachment: YARN-2561-v2.patch

Updated the patch to fix the test failure.

 MR job client cannot reconnect to AM after NM restart.
 --

 Key: YARN-2561
 URL: https://issues.apache.org/jira/browse/YARN-2561
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Tassapol Athiapinya
Assignee: Junping Du
Priority: Blocker
 Attachments: YARN-2561-v2.patch, YARN-2561.patch


 Work-preserving NM restart is disabled.
 Submit a job, then restart the only NM; the job hangs with connect retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2498) [YARN-796] Respect labels in preemption policy of capacity scheduler

2014-09-17 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2498:
-
Attachment: yarn-2498-implementation-notes.pdf

 [YARN-796] Respect labels in preemption policy of capacity scheduler
 

 Key: YARN-2498
 URL: https://issues.apache.org/jira/browse/YARN-2498
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2498.patch, YARN-2498.patch, 
 yarn-2498-implementation-notes.pdf


 There are 3 stages in ProportionalCapacityPreemptionPolicy:
 # Recursively calculate {{ideal_assigned}} for each queue. This depends on 
 the available resource, the resource used/pending in each queue, and the 
 guaranteed capacity of each queue.
 # Mark to-be-preempted containers: for each over-satisfied queue, mark some 
 of its containers to be preempted.
 # Notify the scheduler about the to-be-preempted containers.
 We need to respect labels in the cluster for both #1 and #2:
 For #1, when there is some resource available in the cluster, we shouldn't 
 assign it to a queue (by increasing {{ideal_assigned}}) if the queue cannot 
 access the corresponding labels.
 For #2, when we decide whether to preempt a container, we need to make sure 
 the resource of this container is *possibly* usable by a queue that is 
 under-satisfied and has pending resource.
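As a worked illustration of the #1 rule, here is a small hedged sketch that
counts only the available resource on labels a queue can access; the class and
method names are assumptions and are not part of
ProportionalCapacityPreemptionPolicy.
{code}
// Illustrative sketch only; not the actual ProportionalCapacityPreemptionPolicy logic.
import java.util.Map;
import java.util.Set;

public final class LabelAwarePreemptionSketch {
  /**
   * Stage #1 rule: resource available on a label the queue cannot access must
   * not be counted towards that queue's ideal_assigned.
   */
  public static long availableMemForQueue(Map<String, Long> availableMemByLabel,
                                          Set<String> queueAccessibleLabels) {
    long total = 0;
    for (Map.Entry<String, Long> e : availableMemByLabel.entrySet()) {
      if (queueAccessibleLabels.contains(e.getKey())) {
        total += e.getValue();
      }
    }
    return total;
  }
}
{code}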



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2498) [YARN-796] Respect labels in preemption policy of capacity scheduler

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136984#comment-14136984
 ] 

Hadoop QA commented on YARN-2498:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12669372/yarn-2498-implementation-notes.pdf
  against trunk revision c0c7e6f.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4989//console

This message is automatically generated.

 [YARN-796] Respect labels in preemption policy of capacity scheduler
 

 Key: YARN-2498
 URL: https://issues.apache.org/jira/browse/YARN-2498
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2498.patch, YARN-2498.patch, 
 yarn-2498-implementation-notes.pdf


 There are 3 stages in ProportionalCapacityPreemptionPolicy:
 # Recursively calculate {{ideal_assigned}} for each queue. This depends on 
 the available resource, the resource used/pending in each queue, and the 
 guaranteed capacity of each queue.
 # Mark to-be-preempted containers: for each over-satisfied queue, mark some 
 of its containers to be preempted.
 # Notify the scheduler about the to-be-preempted containers.
 We need to respect labels in the cluster for both #1 and #2:
 For #1, when there is some resource available in the cluster, we shouldn't 
 assign it to a queue (by increasing {{ideal_assigned}}) if the queue cannot 
 access the corresponding labels.
 For #2, when we decide whether to preempt a container, we need to make sure 
 the resource of this container is *possibly* usable by a queue that is 
 under-satisfied and has pending resource.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2498) [YARN-796] Respect labels in preemption policy of capacity scheduler

2014-09-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136987#comment-14136987
 ] 

Wangda Tan commented on YARN-2498:
--

Attached implementation notes.
[~curino], [~sunilg], [~mayank_bansal], I would appreciate it if you could take 
a look at them.

Thanks a lot!
Wangda

 [YARN-796] Respect labels in preemption policy of capacity scheduler
 

 Key: YARN-2498
 URL: https://issues.apache.org/jira/browse/YARN-2498
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2498.patch, YARN-2498.patch, 
 yarn-2498-implementation-notes.pdf


 There are 3 stages in ProportionalCapacityPreemptionPolicy:
 # Recursively calculate {{ideal_assigned}} for each queue. This depends on 
 the available resource, the resource used/pending in each queue, and the 
 guaranteed capacity of each queue.
 # Mark to-be-preempted containers: for each over-satisfied queue, mark some 
 of its containers to be preempted.
 # Notify the scheduler about the to-be-preempted containers.
 We need to respect labels in the cluster for both #1 and #2:
 For #1, when there is some resource available in the cluster, we shouldn't 
 assign it to a queue (by increasing {{ideal_assigned}}) if the queue cannot 
 access the corresponding labels.
 For #2, when we decide whether to preempt a container, we need to make sure 
 the resource of this container is *possibly* usable by a queue that is 
 under-satisfied and has pending resource.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2562) ContainerId@toString() is unreadable for epoch > 0 after YARN-2182

2014-09-17 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136993#comment-14136993
 ] 

Steve Loughran commented on YARN-2562:
--

+1 for text making it clear what the values are, but please make it lower case 
for consistency

 ContainerId@toString() is unreadable for epoch > 0 after YARN-2182
 -

 Key: YARN-2562
 URL: https://issues.apache.org/jira/browse/YARN-2562
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Tsuyoshi OZAWA
Priority: Critical

 The ContainerId string format is unreadable for RMs that restarted at least 
 once (epoch > 0) after YARN-2182. For example, 
 container_1410901177871_0001_01_05_17.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136998#comment-14136998
 ] 

Hadoop QA commented on YARN-2561:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669364/YARN-2561-v2.patch
  against trunk revision c0c7e6f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4988//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4988//console

This message is automatically generated.

 MR job client cannot reconnect to AM after NM restart.
 --

 Key: YARN-2561
 URL: https://issues.apache.org/jira/browse/YARN-2561
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Tassapol Athiapinya
Assignee: Junping Du
Priority: Blocker
 Attachments: YARN-2561-v2.patch, YARN-2561.patch


 Work-preserving NM restart is disabled.
 Submit a job, then restart the only NM; the job hangs with connect retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed

2014-09-17 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137013#comment-14137013
 ] 

Sunil G commented on YARN-2308:
---

+1 for this approach. I also feel that returning the application state as
FAILED is not a complete solution.




 NPE happened when RM restart after CapacityScheduler queue configuration 
 changed 
 -

 Key: YARN-2308
 URL: https://issues.apache.org/jira/browse/YARN-2308
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.6.0
Reporter: Wangda Tan
Assignee: chang li
Priority: Critical
 Attachments: jira2308.patch, jira2308.patch, jira2308.patch


 I encountered an NPE when the RM restarted:
 {code}
 2014-07-16 07:22:46,957 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
 handling event type APP_ATTEMPT_ADDED to the scheduler
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 And the RM fails to restart.
 This is caused by the queue configuration change: I removed some queues and 
 added new queues. So when the RM restarts, it tries to recover historical 
 applications, and when the queue of any of these applications has been 
 removed, an NPE is raised.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1250) Generic history service should support application-acls

2014-09-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137085#comment-14137085
 ] 

Hudson commented on YARN-1250:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #683 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/683/])
YARN-1250. Generic history service should support application-acls. 
(Contributed by Zhijie Shen) (junping_du: rev 
90a0c03f0a696d32e871a5da4560828edea8cfa9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestAppManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/ApplicationACLsUpdatedEvent.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsPublisher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/metrics/ApplicationMetricsConstants.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsEventType.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
YARN-1250. Addendum (junping_du: rev 0e7d1dbf9ab732dd04dccaacbf273e9ac437eba5)
* hadoop-yarn-project/CHANGES.txt


 Generic history service should support application-acls
 ---

 Key: YARN-1250
 URL: https://issues.apache.org/jira/browse/YARN-1250
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Zhijie Shen
 Fix For: 2.6.0

 Attachments: GenericHistoryACLs.pdf, YARN-1250.1.patch, 
 YARN-1250.2.patch, YARN-1250.3.patch, YARN-1250.4.patch, YARN-1250.5.patch, 
 YARN-1250.6.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2531) CGroups - Admins should be allowed to enforce strict cpu limits

2014-09-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137084#comment-14137084
 ] 

Hudson commented on YARN-2531:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #683 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/683/])
YARN-2531. Added a configuration for admins to be able to override app-configs 
and enforce/not-enforce strict control of per-container cpu usage. Contributed 
by Varun Vasudev. (vinodkv: rev 9f6891d9ef7064d121305ca783eb62586c8aa018)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java


 CGroups - Admins should be allowed to enforce strict cpu limits
 ---

 Key: YARN-2531
 URL: https://issues.apache.org/jira/browse/YARN-2531
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Fix For: 2.6.0

 Attachments: apache-yarn-2531.0.patch


 From YARN-2440 -
 {quote} 
 The other dimension to this is determinism w.r.t performance. Limiting to 
 allocated cores overall (as well as per container later) helps orgs run 
 workloads and reason about them deterministically. One of the examples is 
 benchmarking apps, but deterministic execution is a desired option beyond 
 benchmarks too.
 {quote}
 It would be nice to have an option to let admins enforce strict cpu limits 
 for apps for things like benchmarking, etc. By default this flag should be 
 off so that containers can use available cpu, but an admin can turn the flag 
 on to determine worst-case performance, etc.
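As a hedged illustration of the flag described above, a small sketch follows;
the property name is an assumption about the NM-side switch (off by default),
not a key quoted from the patch.
{code}
// The property name below is an assumption for illustration; check
// yarn-default.xml in the committed patch for the real key and default.
import org.apache.hadoop.conf.Configuration;

public class StrictCpuLimitSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    // Off by default: containers may use spare cpu. Admins flip it on to pin
    // each container to its allocated vcores for deterministic benchmarking.
    conf.setBoolean(
        "yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage",
        true);
    System.out.println(conf.getBoolean(
        "yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage",
        false));
  }
}
{code}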



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2557) Add a parameter attempt_Failures_Validity_Interval in DistributedShell

2014-09-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137087#comment-14137087
 ] 

Hudson commented on YARN-2557:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #683 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/683/])
YARN-2557. Add a parameter attempt_Failures_Validity_Interval into (xgong: 
rev 8e5d6713cf16473d791c028cecc274fd2c7fd10b)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSSleepingAppMaster.java


 Add a parameter attempt_Failures_Validity_Interval in DistributedShell 
 -

 Key: YARN-2557
 URL: https://issues.apache.org/jira/browse/YARN-2557
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications/distributed-shell
Reporter: Xuan Gong
Assignee: Xuan Gong
 Fix For: 2.6.0

 Attachments: YARN-2557.1.patch, YARN-2557.2.patch


 Change Distributed shell to enable attemptFailuresValidityInterval



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2565) ResourceManager fails to start when GenericHistoryService is enabled in secure mode without doing manual kinit as yarn

2014-09-17 Thread Karam Singh (JIRA)
Karam Singh created YARN-2565:
-

 Summary: ResourceManager fails to start when 
GenericHistoryService is enabled in secure mode without doing manual kinit as 
yarn
 Key: YARN-2565
 URL: https://issues.apache.org/jira/browse/YARN-2565
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.6.0
 Environment: Secure cluster with ATS (timeline server enabled) and 
yarn.resourcemanager.system-metrics-publisher.enabled=true
so that RM can send Application history to Timeline Store
Reporter: Karam Singh


Observed that the RM fails to start in secure mode when the GenericHistoryService 
is enabled and the ResourceManager is set to use the Timeline Store.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2565) ResourceManager fails to start when GenericHistoryService is enabled in secure mode without doing manual kinit as yarn

2014-09-17 Thread Karam Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137204#comment-14137204
 ] 

Karam Singh commented on YARN-2565:
---

Observed that the RM fails to start in secure mode when the GenericHistoryService 
is enabled and the ResourceManager is set to use the Timeline Store:
{code}
yarn.resourcemanager.keytab=RM_HOST
yarn.resourcemanager.principal=RM_PRINCIPAL
yarn.timeline-service.enabled=true
yarn.timeline-service.hostname=ATS_HOST
yarn.timeline-service.address=ATS_HOST:10200
yarn.timeline-service.webapp.address=ATS_HOST:8188
yarn.timeline-service.handler-thread-count=10
yarn.timeline-service.ttl-enable=true
yarn.timeline-service.ttl-ms=60480
yarn.timeline-service.leveldb-timeline-store.path=/tm/timeline
yarn.timeline-service.keytab=ATS_KEYTAB
yarn.timeline-service.principal=ATS_PRINCIPAL
yarn.timeline-service.webapp.spnego-principal=ATS_SPNEGO_PRINICPAL
yarn.timeline-service.webapp.spnego-keytab-file=ATS_SPNEGO_KETAB
yarn.timeline-service.http-authentication.type=kerberos
yarn.timeline-service.http-authentication.kerberos.principal=ATS_SPNEGO_PRINICPAL
yarn.timeline-service.http-authentication.kerberos.keytab=ATS_SPNEGO_KETAB
yarn.timeline-service.generic-application-history.enabled=true
yarn.timeline-service.generic-application-history.store-class=''
yarn.resourcemanager.system-metrics-publisher.enabled=true
yarn.resourcemanager.system-metrics-publisher.dispatcher.pool-size=10
{code}

Stop the ResourceManager and the TimelineServer.
Start the TimelineServer and wait until the ATS restarts successfully.
Start the ResourceManager.
The RM fails to start with the following exception:
{code}
2014-09-15 10:58:57,735 WARN  ipc.Client (Client.java:run(675)) - Exception 
encountered while connecting to the server : javax.security.sasl.SaslException: 
GSS initiate failed [Caused by GSSException: No valid credentials provided 
(Mechanism level: Failed to find any Kerberos tgt)]
2014-09-15 10:58:57,740 ERROR 
applicationhistoryservice.FileSystemApplicationHistoryStore 
(FileSystemApplicationHistoryStore.java:serviceInit(132)) - Error when 
initializing FileSystemHistoryStorage
java.io.IOException: Failed on local exception: java.io.IOException: 
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]; Host Details : local host is: RM_HOST; destination host is: 
NN_HOST:8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1423)
at org.apache.hadoop.ipc.Client.call(Client.java:1372)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:219)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:748)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy15.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1918)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1105)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1101)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1101)
at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1413)
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.serviceInit(FileSystemApplicationHistoryStore.java:126)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.serviceInit(RMApplicationHistoryWriter.java:99)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:490)
at 

[jira] [Commented] (YARN-1250) Generic history service should support application-acls

2014-09-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137236#comment-14137236
 ] 

Hudson commented on YARN-1250:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1899 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1899/])
YARN-1250. Generic history service should support application-acls. 
(Contributed by Zhijie Shen) (junping_du: rev 
90a0c03f0a696d32e871a5da4560828edea8cfa9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/ApplicationACLsUpdatedEvent.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsPublisher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsEventType.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/metrics/ApplicationMetricsConstants.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestAppManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisher.java
YARN-1250. Addendum (junping_du: rev 0e7d1dbf9ab732dd04dccaacbf273e9ac437eba5)
* hadoop-yarn-project/CHANGES.txt


 Generic history service should support application-acls
 ---

 Key: YARN-1250
 URL: https://issues.apache.org/jira/browse/YARN-1250
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Zhijie Shen
 Fix For: 2.6.0

 Attachments: GenericHistoryACLs.pdf, YARN-1250.1.patch, 
 YARN-1250.2.patch, YARN-1250.3.patch, YARN-1250.4.patch, YARN-1250.5.patch, 
 YARN-1250.6.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2557) Add a parameter attempt_Failures_Validity_Interval in DistributedShell

2014-09-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137238#comment-14137238
 ] 

Hudson commented on YARN-2557:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1899 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1899/])
YARN-2557. Add a parameter attempt_Failures_Validity_Interval into (xgong: 
rev 8e5d6713cf16473d791c028cecc274fd2c7fd10b)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSSleepingAppMaster.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java
* hadoop-yarn-project/CHANGES.txt


 Add a parameter attempt_Failures_Validity_Interval in DistributedShell 
 -

 Key: YARN-2557
 URL: https://issues.apache.org/jira/browse/YARN-2557
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications/distributed-shell
Reporter: Xuan Gong
Assignee: Xuan Gong
 Fix For: 2.6.0

 Attachments: YARN-2557.1.patch, YARN-2557.2.patch


 Change Distributed shell to enable attemptFailuresValidityInterval



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2531) CGroups - Admins should be allowed to enforce strict cpu limits

2014-09-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137235#comment-14137235
 ] 

Hudson commented on YARN-2531:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1899 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1899/])
YARN-2531. Added a configuration for admins to be able to override app-configs 
and enforce/not-enforce strict control of per-container cpu usage. Contributed 
by Varun Vasudev. (vinodkv: rev 9f6891d9ef7064d121305ca783eb62586c8aa018)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java


 CGroups - Admins should be allowed to enforce strict cpu limits
 ---

 Key: YARN-2531
 URL: https://issues.apache.org/jira/browse/YARN-2531
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Fix For: 2.6.0

 Attachments: apache-yarn-2531.0.patch


 From YARN-2440 -
 {quote} 
 The other dimension to this is determinism w.r.t performance. Limiting to 
 allocated cores overall (as well as per container later) helps orgs run 
 workloads and reason about them deterministically. One of the examples is 
 benchmarking apps, but deterministic execution is a desired option beyond 
 benchmarks too.
 {quote}
 It would be nice to have an option to let admins enforce strict cpu limits 
 for apps for things like benchmarking, etc. By default this flag should be 
 off so that containers can use available cpu, but an admin can turn the flag 
 on to determine worst-case performance, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2531) CGroups - Admins should be allowed to enforce strict cpu limits

2014-09-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137262#comment-14137262
 ] 

Hudson commented on YARN-2531:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1874 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1874/])
YARN-2531. Added a configuration for admins to be able to override app-configs 
and enforce/not-enforce strict control of per-container cpu usage. Contributed 
by Varun Vasudev. (vinodkv: rev 9f6891d9ef7064d121305ca783eb62586c8aa018)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java


 CGroups - Admins should be allowed to enforce strict cpu limits
 ---

 Key: YARN-2531
 URL: https://issues.apache.org/jira/browse/YARN-2531
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Fix For: 2.6.0

 Attachments: apache-yarn-2531.0.patch


 From YARN-2440 -
 {quote} 
 The other dimension to this is determinism w.r.t performance. Limiting to 
 allocated cores overall (as well as per container later) helps orgs run 
 workloads and reason about them deterministically. One of the examples is 
 benchmarking apps, but deterministic execution is a desired option beyond 
 benchmarks too.
 {quote}
 It would be nice to have an option to let admins enforce strict cpu limits 
 for apps for things like benchmarking, etc. By default this flag should be 
 off so that containers can use available cpu, but an admin can turn the flag 
 on to determine worst-case performance, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1250) Generic history service should support application-acls

2014-09-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137263#comment-14137263
 ] 

Hudson commented on YARN-1250:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1874 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1874/])
YARN-1250. Generic history service should support application-acls. 
(Contributed by Zhijie Shen) (junping_du: rev 
90a0c03f0a696d32e871a5da4560828edea8cfa9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestAppManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsEventType.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsPublisher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisher.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/metrics/ApplicationMetricsConstants.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/ApplicationACLsUpdatedEvent.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java
YARN-1250. Addendum (junping_du: rev 0e7d1dbf9ab732dd04dccaacbf273e9ac437eba5)
* hadoop-yarn-project/CHANGES.txt


 Generic history service should support application-acls
 ---

 Key: YARN-1250
 URL: https://issues.apache.org/jira/browse/YARN-1250
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Zhijie Shen
 Fix For: 2.6.0

 Attachments: GenericHistoryACLs.pdf, YARN-1250.1.patch, 
 YARN-1250.2.patch, YARN-1250.3.patch, YARN-1250.4.patch, YARN-1250.5.patch, 
 YARN-1250.6.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2557) Add a parameter attempt_Failures_Validity_Interval in DistributedShell

2014-09-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137265#comment-14137265
 ] 

Hudson commented on YARN-2557:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1874 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1874/])
YARN-2557. Add a parameter attempt_Failures_Validity_Interval into (xgong: 
rev 8e5d6713cf16473d791c028cecc274fd2c7fd10b)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSSleepingAppMaster.java


 Add a parameter attempt_Failures_Validity_Interval in DistributedShell 
 -

 Key: YARN-2557
 URL: https://issues.apache.org/jira/browse/YARN-2557
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications/distributed-shell
Reporter: Xuan Gong
Assignee: Xuan Gong
 Fix For: 2.6.0

 Attachments: YARN-2557.1.patch, YARN-2557.2.patch


 Change Distributed shell to enable attemptFailuresValidityInterval



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.

2014-09-17 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137385#comment-14137385
 ] 

Jason Lowe commented on YARN-2561:
--

Thanks for the patch, Junping!  I'm not sure it's best for the RM to examine 
its local config for NM recovery and assume the same applies to the remote 
nodemanager.  I think it would be better if the RM cross-checked the list of 
running containers reported in the registration request against the containers 
it thinks are running on the node and acted accordingly.  If the NM doesn't 
report a container then we should kill it.

 MR job client cannot reconnect to AM after NM restart.
 --

 Key: YARN-2561
 URL: https://issues.apache.org/jira/browse/YARN-2561
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Tassapol Athiapinya
Assignee: Junping Du
Priority: Blocker
 Attachments: YARN-2561-v2.patch, YARN-2561.patch


 Work-preserving NM restart is disabled.
 Submit a job, then restart the only NM; the job hangs with connect retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager

2014-09-17 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137431#comment-14137431
 ] 

Benoy Antony commented on YARN-2527:


[~zjshen], any comments?
Since the ACLs are provided by the ApplicationMaster, isn't null a valid 
value?
If not, we could log a warning.
Any other suggestions are welcome.

 NPE in ApplicationACLsManager
 -

 Key: YARN-2527
 URL: https://issues.apache.org/jira/browse/YARN-2527
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: Benoy Antony
Assignee: Benoy Antony
 Attachments: YARN-2527.patch, YARN-2527.patch


 An NPE in _ApplicationACLsManager_ can result in a 500 Internal Server Error.
 The relevant stack trace snippet from the ResourceManager logs is below:
 {code}
 Caused by: java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101)
 at 
 org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
 at 
 org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
 at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
 {code}
 This issue was reported by [~miguenther].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue

2014-09-17 Thread Maysam Yabandeh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137496#comment-14137496
 ] 

Maysam Yabandeh commented on YARN-1963:
---

I was wondering what the long-term plan for this jira is. It does not seem to 
have had any activity in the past 4 months; do we have a rough estimate of 
which release this feature is planned for?

 Support priorities across applications within the same queue 
 -

 Key: YARN-1963
 URL: https://issues.apache.org/jira/browse/YARN-1963
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Reporter: Arun C Murthy
Assignee: Sunil G

 It will be very useful to support priorities among applications within the 
 same queue, particularly in production scenarios. It allows for finer-grained 
 controls without having to force admins to create a multitude of queues, plus 
 allows existing applications to continue using existing queues which are 
 usually part of institutional memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2554) Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy

2014-09-17 Thread Jonathan Maron (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Maron updated YARN-2554:
-
Attachment: YARN-2554.1.patch

 Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy
 -

 Key: YARN-2554
 URL: https://issues.apache.org/jira/browse/YARN-2554
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.6.0
Reporter: Jonathan Maron
 Attachments: YARN-2554.1.patch


 If the HTTP policy to enable HTTPS is specified, the RM and AM are 
 initialized with SSL listeners.  The RM has a web app proxy servlet that acts 
 as a proxy for incoming AM requests.  In order to forward the requests to the 
 AM the proxy servlet makes use of HttpClient.  However, the HttpClient 
 utilized is not initialized correctly with the necessary certs to allow for 
 successful one way SSL invocations to the other nodes in the cluster (it is 
 not configured to access/load the client truststore specified in 
 ssl-client.xml).   I imagine SSLFactory.createSSLSocketFactory() could be 
 utilized to create an instance that can be assigned to the HttpClient.
 The symptoms of this issue are:
 AM: Displays unknown_certificate exception
 RM:  Displays an exception such as javax.net.ssl.SSLHandshakeException: 
 sun.security.validator.ValidatorException: PKIX path building failed: 
 sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
 valid certification path to requested target
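As a hedged sketch of the suggestion above: Hadoop's SSLFactory, created in
CLIENT mode, loads the truststore settings from ssl-client.xml. The example
below wires the resulting socket factory into HttpsURLConnection for simplicity
rather than into the proxy servlet's HttpClient, so it only illustrates the
idea, not the actual fix.
{code}
// Simplified sketch: uses HttpsURLConnection instead of the proxy's HttpClient,
// purely to show how ssl-client.xml can be loaded via SSLFactory.
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.ssl.SSLFactory;

public class ProxySslSketch {
  public static void main(String[] args) throws Exception {
    SSLFactory sslFactory = new SSLFactory(SSLFactory.Mode.CLIENT, new Configuration());
    sslFactory.init();  // reads keystore/truststore settings from ssl-client.xml
    try {
      HttpsURLConnection conn =
          (HttpsURLConnection) new URL(args[0]).openConnection();
      conn.setSSLSocketFactory(sslFactory.createSSLSocketFactory());
      conn.setHostnameVerifier(sslFactory.getHostnameVerifier());
      System.out.println("HTTP " + conn.getResponseCode());
    } finally {
      sslFactory.destroy();
    }
  }
}
{code}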



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2554) Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137544#comment-14137544
 ] 

Hadoop QA commented on YARN-2554:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669439/YARN-2554.1.patch
  against trunk revision c0c7e6f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4990//console

This message is automatically generated.

 Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy
 -

 Key: YARN-2554
 URL: https://issues.apache.org/jira/browse/YARN-2554
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.6.0
Reporter: Jonathan Maron
 Attachments: YARN-2554.1.patch


 If the HTTP policy to enable HTTPS is specified, the RM and AM are 
 initialized with SSL listeners.  The RM has a web app proxy servlet that acts 
 as a proxy for incoming AM requests.  In order to forward the requests to the 
 AM the proxy servlet makes use of HttpClient.  However, the HttpClient 
 utilized is not initialized correctly with the necessary certs to allow for 
 successful one way SSL invocations to the other nodes in the cluster (it is 
 not configured to access/load the client truststore specified in 
 ssl-client.xml).   I imagine SSLFactory.createSSLSocketFactory() could be 
 utilized to create an instance that can be assigned to the HttpClient.
 The symptoms of this issue are:
 AM: Displays unknown_certificate exception
 RM:  Displays an exception such as javax.net.ssl.SSLHandshakeException: 
 sun.security.validator.ValidatorException: PKIX path building failed: 
 sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
 valid certification path to requested target



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue

2014-09-17 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137564#comment-14137564
 ] 

Sunil G commented on YARN-1963:
---

Hi [~maysamyabandeh], we are putting together a design doc for this that captures 
all the details and will publish it soon. [~vinodkv], could we discuss the doc 
offline and then publish it?

 Support priorities across applications within the same queue 
 -

 Key: YARN-1963
 URL: https://issues.apache.org/jira/browse/YARN-1963
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Reporter: Arun C Murthy
Assignee: Sunil G

 It will be very useful to support priorities among applications within the 
 same queue, particularly in production scenarios. It allows for finer-grained 
 controls without having to force admins to create a multitude of queues, plus 
 allows existing applications to continue using existing queues which are 
 usually part of institutional memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2554) Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy

2014-09-17 Thread Jonathan Maron (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Maron updated YARN-2554:
-
Attachment: YARN-2554.2.patch

 Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy
 -

 Key: YARN-2554
 URL: https://issues.apache.org/jira/browse/YARN-2554
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.6.0
Reporter: Jonathan Maron
 Attachments: YARN-2554.1.patch, YARN-2554.2.patch


 If the HTTP policy to enable HTTPS is specified, the RM and AM are 
 initialized with SSL listeners.  The RM has a web app proxy servlet that acts 
 as a proxy for incoming AM requests.  In order to forward the requests to the 
 AM the proxy servlet makes use of HttpClient.  However, the HttpClient 
 utilized is not initialized correctly with the necessary certs to allow for 
 successful one way SSL invocations to the other nodes in the cluster (it is 
 not configured to access/load the client truststore specified in 
 ssl-client.xml).   I imagine SSLFactory.createSSLSocketFactory() could be 
 utilized to create an instance that can be assigned to the HttpClient.
 The symptoms of this issue are:
 AM: Displays unknown_certificate exception
 RM:  Displays an exception such as javax.net.ssl.SSLHandshakeException: 
 sun.security.validator.ValidatorException: PKIX path building failed: 
 sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
 valid certification path to requested target



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2561) MR job client cannot reconnect to AM after NM restart.

2014-09-17 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-2561:
-
Attachment: YARN-2561-v3.patch

 MR job client cannot reconnect to AM after NM restart.
 --

 Key: YARN-2561
 URL: https://issues.apache.org/jira/browse/YARN-2561
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Tassapol Athiapinya
Assignee: Junping Du
Priority: Blocker
 Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561.patch


 Work-preserving NM restart is disabled.
 Submit a job. Restart the only NM and found that Job will hang with connect 
 retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.

2014-09-17 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137590#comment-14137590
 ] 

Junping Du commented on YARN-2561:
--

Thanks [~jlowe] for the comments! Yes, checking the running containers reported by 
the NM seems to be a better way, given that each NM's recovery configuration could 
be different (although we don't encourage such a configuration, do we?). The v3 
patch uses this approach.

 MR job client cannot reconnect to AM after NM restart.
 --

 Key: YARN-2561
 URL: https://issues.apache.org/jira/browse/YARN-2561
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Tassapol Athiapinya
Assignee: Junping Du
Priority: Blocker
 Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561.patch


 Work-preserving NM restart is disabled.
 Submit a job. Restart the only NM and found that Job will hang with connect 
 retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2565) ResourceManager fails to start when GenericHistoryService is enabled in secure mode without doing manual kinit as yarn

2014-09-17 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen reassigned YARN-2565:
-

Assignee: Zhijie Shen

 ResourceManager fails to start when GenericHistoryService is enabled in 
 secure mode without doing manual kinit as yarn
 -

 Key: YARN-2565
 URL: https://issues.apache.org/jira/browse/YARN-2565
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.6.0
 Environment: Secure cluster with ATS (timeline server enabled) and 
 yarn.resourcemanager.system-metrics-publisher.enabled=true
 so that RM can send Application history to Timeline Store
Reporter: Karam Singh
Assignee: Zhijie Shen

 Observed that RM fails to start in secure mode when GenericHistoryService is 
 enabled and ResourceManager is set to use the Timeline Store



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137620#comment-14137620
 ] 

Hadoop QA commented on YARN-2561:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669449/YARN-2561-v3.patch
  against trunk revision c0c7e6f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4993//console

This message is automatically generated.

 MR job client cannot reconnect to AM after NM restart.
 --

 Key: YARN-2561
 URL: https://issues.apache.org/jira/browse/YARN-2561
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Tassapol Athiapinya
Assignee: Junping Du
Priority: Blocker
 Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561.patch


 Work-preserving NM restart is disabled.
 Submit a job. Restart the only NM and found that Job will hang with connect 
 retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2554) Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137629#comment-14137629
 ] 

Hadoop QA commented on YARN-2554:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669448/YARN-2554.2.patch
  against trunk revision c0c7e6f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4991//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4991//console

This message is automatically generated.

 Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy
 -

 Key: YARN-2554
 URL: https://issues.apache.org/jira/browse/YARN-2554
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.6.0
Reporter: Jonathan Maron
 Attachments: YARN-2554.1.patch, YARN-2554.2.patch


 If the HTTP policy to enable HTTPS is specified, the RM and AM are 
 initialized with SSL listeners.  The RM has a web app proxy servlet that acts 
 as a proxy for incoming AM requests.  In order to forward the requests to the 
 AM the proxy servlet makes use of HttpClient.  However, the HttpClient 
 utilized is not initialized correctly with the necessary certs to allow for 
 successful one way SSL invocations to the other nodes in the cluster (it is 
 not configured to access/load the client truststore specified in 
 ssl-client.xml).   I imagine SSLFactory.createSSLSocketFactory() could be 
 utilized to create an instance that can be assigned to the HttpClient.
 The symptoms of this issue are:
 AM: Displays unknown_certificate exception
 RM:  Displays an exception such as javax.net.ssl.SSLHandshakeException: 
 sun.security.validator.ValidatorException: PKIX path building failed: 
 sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
 valid certification path to requested target



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2561) MR job client cannot reconnect to AM after NM restart.

2014-09-17 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-2561:
-
Attachment: YARN-2561-v4.patch

Fix the compile issues in v3.

 MR job client cannot reconnect to AM after NM restart.
 --

 Key: YARN-2561
 URL: https://issues.apache.org/jira/browse/YARN-2561
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Tassapol Athiapinya
Assignee: Junping Du
Priority: Blocker
 Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, 
 YARN-2561-v4.patch, YARN-2561.patch


 Work-preserving NM restart is disabled.
 Submit a job. Restart the only NM and found that Job will hang with connect 
 retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2559) ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher

2014-09-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137636#comment-14137636
 ] 

Jian He commented on YARN-2559:
---

To be consistent with the FinalApplicationStatus exposed on RM web UI and CLI,  
we may publish UNDEFINED state as well in case finalStatus is unavailable ?

 ResourceManager sometime become un-responsive due to NPE in 
 SystemMetricsPublisher
 --

 Key: YARN-2559
 URL: https://issues.apache.org/jira/browse/YARN-2559
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.6.0
 Environment: Generic History Service is enabled in the Timeline Server 
 with 
 yarn.resourcemanager.system-metrics-publisher.enabled=true
 so that the ResourceManager should use the Timeline Store for recording application 
 history information 
Reporter: Karam Singh
Assignee: Zhijie Shen
 Attachments: YARN-2559.1.patch


 ResourceManager sometime become un-responsive due to NPE in 
 SystemMetricsPublisher



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2563) On secure clusters call to timeline server fails with authentication errors when running a job via oozie

2014-09-17 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen reassigned YARN-2563:
-

Assignee: Zhijie Shen

 On secure clusters call to timeline server fails with authentication errors 
 when running a job via oozie
 

 Key: YARN-2563
 URL: https://issues.apache.org/jira/browse/YARN-2563
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Arpit Gupta
Assignee: Zhijie Shen
Priority: Blocker

 During our nightlies on a secure cluster we have seen oozie jobs fail with 
 authentication errors against the timeline server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2446) Using TimelineNamespace to shield the entities of a user

2014-09-17 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137637#comment-14137637
 ] 

Li Lu commented on YARN-2446:
-

Hi [~zjshen], yes the latest patch looks good to me. Still, similar to 
YARN-2102, maybe you want some more committers to review the patch?

 Using TimelineNamespace to shield the entities of a user
 

 Key: YARN-2446
 URL: https://issues.apache.org/jira/browse/YARN-2446
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2446.1.patch, YARN-2446.2.patch


 Given YARN-2102 adds TimelineNamespace, we can make use of it to shield the 
 entities, preventing them from being accessed or affected by other users' 
 operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-17 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137644#comment-14137644
 ] 

Vinod Kumar Vavilapalli commented on YARN-2001:
---

This looks close except for the logging - we don't have any indication 
of this wait in the RM logs.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
 YARN-2001.4.patch


 After failover, RM may require a certain threshold to determine whether it's 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does RM accept 
 new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2468) Log handling for LRS

2014-09-17 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137661#comment-14137661
 ] 

Xuan Gong commented on YARN-2468:
-

bq. If LogContext is not specified, we're running into the traditional log 
handling case, right? We will still have a combined log file identified by the 
node id? Or node id will always be the directory, and there exists only one 
file under it?

node id will always be the directory, and there exists only one file under it

bq. Let's say if work-preserving NM restarting happens, NM is going to forget 
all the uploaded logs files, and redo everything, right?

If an NM restart happens, it will re-upload all logs that were previously 
uploaded but not yet deleted. 
I think we can solve this problem in a separate ticket, because this ticket 
is the first step to solving log handling for LRS. 

bq. LogContext doesn't need to be in ApplicatonSubmissionContext, because 
ApplicatonSubmissionContext contains ContainerLaunchContext. LogContext is 
container related stuff, such that ContainerLaunchContext should be the best 
place. Concurrently, we can have one context for all containers. Maybe in the 
future we can think of setting different LogContext for each individual 
container.

DONE

bq. In getFilteredLogFiles, the logic is that if the log file matches the 
include pattern, it will be added first, and if then if it matches the exclude 
pattern, it will be removed. Shall we do the sanity check to make sure we can 
not include and exclude the same pattern, otherwise, the semantics is a bit 
weird.

Added more explanation in the javadoc.
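For concreteness, the include/exclude semantics discussed above behave like the following sketch (hypothetical helper names, not the patch's actual getFilteredLogFiles):

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public final class LogFilterSketch {
  // Hypothetical helper mirroring the described behaviour: add a file on an
  // include match first, then drop it again if it also matches the exclude.
  public static List<String> filter(List<String> logFiles,
      Pattern includePattern, Pattern excludePattern) {
    List<String> result = new ArrayList<String>();
    for (String name : logFiles) {
      if (includePattern != null && includePattern.matcher(name).matches()) {
        result.add(name);
      }
    }
    if (excludePattern != null) {
      List<String> excluded = new ArrayList<String>();
      for (String name : result) {
        if (excludePattern.matcher(name).matches()) {
          excluded.add(name);
        }
      }
      result.removeAll(excluded);
    }
    return result;
  }
}
{code}

With this two-phase logic, a pattern that appears in both include and exclude ends up excluded, which is why a sanity check (or at least a javadoc note) is worthwhile.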

Uploaded a new patch to address all comments.

 Log handling for LRS
 

 Key: YARN-2468
 URL: https://issues.apache.org/jira/browse/YARN-2468
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
 YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
 YARN-2468.4.patch, YARN-2468.5.patch


 Currently, when application is finished, NM will start to do the log 
 aggregation. But for Long running service applications, this is not ideal. 
 The problems we have are:
 1) LRS applications are expected to run for a long time (weeks, months).
 2) Currently, all the container logs (from one NM) will be written into a 
 single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2468) Log handling for LRS

2014-09-17 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2468:

Attachment: YARN-2468.5.patch

 Log handling for LRS
 

 Key: YARN-2468
 URL: https://issues.apache.org/jira/browse/YARN-2468
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
 YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
 YARN-2468.4.patch, YARN-2468.5.patch


 Currently, when application is finished, NM will start to do the log 
 aggregation. But for Long running service applications, this is not ideal. 
 The problems we have are:
 1) LRS applications are expected to run for a long time (weeks, months).
 2) Currently, all the container logs (from one NM) will be written into a 
 single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1779) Handle AMRMTokens across RM failover

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137668#comment-14137668
 ] 

Hadoop QA commented on YARN-1779:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669450/YARN-1779.3.patch
  against trunk revision c0c7e6f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4992//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4992//console

This message is automatically generated.

 Handle AMRMTokens across RM failover
 

 Key: YARN-1779
 URL: https://issues.apache.org/jira/browse/YARN-1779
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Karthik Kambatla
Assignee: Jian He
Priority: Blocker
  Labels: ha
 Attachments: YARN-1779.1.patch, YARN-1779.2.patch, YARN-1779.3.patch


 Verify if AMRMTokens continue to work against RM failover. If not, we will 
 have to do something along the lines of YARN-986. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--
Attachment: YARN-2001.5.patch

Fixed logging to add the wait msg.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
 YARN-2001.4.patch, YARN-2001.5.patch


 After failover, RM may require a certain threshold to determine whether it's 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does RM accept 
 new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1779) Handle AMRMTokens across RM failover

2014-09-17 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137685#comment-14137685
 ] 

Xuan Gong commented on YARN-1779:
-

+1 LGTM

 Handle AMRMTokens across RM failover
 

 Key: YARN-1779
 URL: https://issues.apache.org/jira/browse/YARN-1779
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Karthik Kambatla
Assignee: Jian He
Priority: Blocker
  Labels: ha
 Attachments: YARN-1779.1.patch, YARN-1779.2.patch, YARN-1779.3.patch


 Verify if AMRMTokens continue to work against RM failover. If not, we will 
 have to do something along the lines of YARN-986. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1453) [JDK8] Fix Javadoc errors caused by incorrect or illegal tags in doc comments

2014-09-17 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-1453:
---
Attachment: YARN-1453-02.patch

Rebased.  It doesn't fix all the javadoc errors, but it at least applies now.

 [JDK8] Fix Javadoc errors caused by incorrect or illegal tags in doc comments
 -

 Key: YARN-1453
 URL: https://issues.apache.org/jira/browse/YARN-1453
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.3.0
Reporter: Andrew Purtell
Assignee: Andrew Purtell
Priority: Minor
 Attachments: 1453-branch-2.patch, 1453-branch-2.patch, 
 1453-trunk.patch, 1453-trunk.patch, YARN-1453-02.patch


 Javadoc is more strict by default in JDK8 and will error out on malformed or 
 illegal tags found in doc comments. Although tagged as JDK8 all of the 
 required changes are generic Javadoc cleanups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId

2014-09-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137696#comment-14137696
 ] 

Jian He commented on YARN-2558:
---

the test looks good to me, thanks Tsuyoshi!

 Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
 --

 Key: YARN-2558
 URL: https://issues.apache.org/jira/browse/YARN-2558
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Priority: Blocker
 Attachments: YARN-2558.1.patch, YARN-2558.2.patch


 We should update ContainerTokenIdentifier#read/write to use 
 {{getContainerId}} instead of {{getId}} to pass all container information 
 correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-17 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137725#comment-14137725
 ] 

Karthik Kambatla commented on YARN-2179:


Review comments:
# Rename yarn.sharedcache.nested.level to yarn.sharedcache.nested-level? 
# Rename AppChecker#appIsActive to either isActive(AppId) or isAppActive.
# Nit (okay with not changing): Should AppChecker methods throw YarnException 
instead of IOException, since they are strictly used within SCM code?  
# CacheStructureUtil: remove empty line in class javadoc
# sharedcache-pom: my understanding of maven is pretty sparse, so please correct 
me if I am wrong. It looks like sharedcache depends on the RM. If we were to embed 
the sharedcache in the RM, wouldn't that lead to a circular dependency? How do we 
plan to solve it? 
# RemoteAppChecker: Just thinking out loud - in a non-embedded case, what 
happens if we upgrade other daemons/clients but not the SCM and add a new 
completed state? There might not be a solution here, though; the worst case 
appears to be that we wouldn't clear the cache when apps end up in that state. 
One alternative is to query the RM for active states or whether an app is active. I 
am open to adding these APIs (Private for now) to the RM.
{code}
  private static final EnumSet<YarnApplicationState> ACTIVE_STATES =
  EnumSet.complementOf(EnumSet.of(YarnApplicationState.FINISHED,
YarnApplicationState.FAILED,
YarnApplicationState.KILLED));
{code}
# RemoteAppChecker#create should use ClientRMProxy instead of YarnRPC for it to 
work in an HA-RM setting (a sketch follows below). 
# As per offline discussions, we don't need the SCMContext outside of the store 
implementations. Can we move it out?
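On item 7, a minimal sketch of the suggested direction (illustrative only, not the patch):

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.client.ClientRMProxy;

public final class RmProxySketch {
  // Illustrative only: ClientRMProxy resolves the active RM and handles HA
  // failover, unlike a proxy built directly with YarnRPC and a fixed address.
  public static ApplicationClientProtocol createClient(Configuration conf)
      throws IOException {
    return ClientRMProxy.createRMProxy(conf, ApplicationClientProtocol.class);
  }
}
{code}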

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
 YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an scm that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137738#comment-14137738
 ] 

Hadoop QA commented on YARN-2561:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669455/YARN-2561-v4.patch
  against trunk revision c0c7e6f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestSchedulerUtils

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4994//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4994//console

This message is automatically generated.

 MR job client cannot reconnect to AM after NM restart.
 --

 Key: YARN-2561
 URL: https://issues.apache.org/jira/browse/YARN-2561
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Tassapol Athiapinya
Assignee: Junping Du
Priority: Blocker
 Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, 
 YARN-2561-v4.patch, YARN-2561.patch


 Work-preserving NM restart is disabled.
 Submit a job. Restart the only NM and found that Job will hang with connect 
 retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2468) Log handling for LRS

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137740#comment-14137740
 ] 

Hadoop QA commented on YARN-2468:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669460/YARN-2468.5.patch
  against trunk revision d9a8603.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 3 
warning messages.
See 
https://builds.apache.org/job/PreCommit-YARN-Build/4995//artifact/trunk/patchprocess/diffJavadocWarnings.txt
 for details.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4995//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4995//console

This message is automatically generated.

 Log handling for LRS
 

 Key: YARN-2468
 URL: https://issues.apache.org/jira/browse/YARN-2468
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
 YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
 YARN-2468.4.patch, YARN-2468.5.patch


 Currently, when application is finished, NM will start to do the log 
 aggregation. But for Long running service applications, this is not ideal. 
 The problems we have are:
 1) LRS applications are expected to run for a long time (weeks, months).
 2) Currently, all the container logs (from one NM) will be written into a 
 single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137743#comment-14137743
 ] 

Hadoop QA commented on YARN-2001:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669463/YARN-2001.5.patch
  against trunk revision 8a7671d.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The test build failed in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4996//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4996//console

This message is automatically generated.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
 YARN-2001.4.patch, YARN-2001.5.patch


 After failover, RM may require a certain threshold to determine whether it's 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does RM accept 
 new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId

2014-09-17 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137757#comment-14137757
 ] 

Jason Lowe commented on YARN-2558:
--

bq. Jason Lowe, without changing containerTokenIdentifier, we can't support 
rolling upgrades. does the current approach look good to you?

Yes, this should be fine for the short term.  I just wanted to make it clear 
that until YARN-668 is addressed we're going to continue to break backwards 
compatibility and thus rolling upgrades with seemingly simple changes like this.

Some minor comments on the patch, none of which are must-fix:
- is it necessary to call testNMToken in testContainerManagerWithEpoch?  That 
test is already covered by testContainerManager.
- the timeout on the test seems way too large
- not sure what the point is of having the test catch an exception just to print a 
stacktrace and re-throw it.  Won't the stacktrace be printed anyway when the test 
fails due to the thrown exception?
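The pattern in question looks roughly like the following (illustrative, with hypothetical test names); the catch adds nothing because JUnit already reports the stack trace of any exception that escapes the test:

{code}
import org.junit.Test;

public class CatchRethrowSketch {
  // Anti-pattern being questioned: catching only to print and re-throw.
  @Test(timeout = 20000)
  public void testWithRedundantCatch() throws Exception {
    try {
      doSomethingThatMayThrow();
    } catch (Exception e) {
      e.printStackTrace(); // redundant: JUnit reports the failure trace anyway
      throw e;
    }
  }

  // Equivalent and simpler: just let the exception propagate.
  @Test(timeout = 20000)
  public void testWithoutCatch() throws Exception {
    doSomethingThatMayThrow();
  }

  private void doSomethingThatMayThrow() throws Exception {
    // placeholder for the real container-manager call in the actual test
  }
}
{code}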

 Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
 --

 Key: YARN-2558
 URL: https://issues.apache.org/jira/browse/YARN-2558
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Priority: Blocker
 Attachments: YARN-2558.1.patch, YARN-2558.2.patch


 We should update ContainerTokenIdentifier#read/write to use 
 {{getContainerId}} instead of {{getId}} to pass all container information 
 correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2562) ContainerId@toString() is unreadable for epoch > 0 after YARN-2182

2014-09-17 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2562:
-
Attachment: YARN-2562.1.patch

Thanks for your comment, Steve. Attaching a first patch to change the format to 
container_1410901177871_0001_01_05_epoch_17.

 ContainerId@toString() is unreadable for epoch > 0 after YARN-2182
 -

 Key: YARN-2562
 URL: https://issues.apache.org/jira/browse/YARN-2562
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-2562.1.patch


 ContainerID string format is unreadable for RMs that restarted at least once 
 (epoch > 0) after YARN-2182, e.g. 
 container_1410901177871_0001_01_05_17.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId

2014-09-17 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2558:
-
Attachment: YARN-2558.3.patch

Thanks for your review, Jian and Jason! Updated:

* Removed testNMToken in the new test case.
* Made the timeout shorter.
* Removed the needless catch and re-throw of the exception.



 Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
 --

 Key: YARN-2558
 URL: https://issues.apache.org/jira/browse/YARN-2558
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Priority: Blocker
 Attachments: YARN-2558.1.patch, YARN-2558.2.patch, YARN-2558.3.patch


 We should update ContainerTokenIdentifier#read/write to use 
 {{getContainerId}} instead of {{getId}} to pass all container information 
 correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1453) [JDK8] Fix Javadoc errors caused by incorrect or illegal tags in doc comments

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137849#comment-14137849
 ] 

Hadoop QA commented on YARN-1453:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669467/YARN-1453-02.patch
  against trunk revision ea4e2e8.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy:

  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4997//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4997//console

This message is automatically generated.

 [JDK8] Fix Javadoc errors caused by incorrect or illegal tags in doc comments
 -

 Key: YARN-1453
 URL: https://issues.apache.org/jira/browse/YARN-1453
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.3.0
Reporter: Andrew Purtell
Assignee: Andrew Purtell
Priority: Minor
 Attachments: 1453-branch-2.patch, 1453-branch-2.patch, 
 1453-trunk.patch, 1453-trunk.patch, YARN-1453-02.patch


 Javadoc is more strict by default in JDK8 and will error out on malformed or 
 illegal tags found in doc comments. Although tagged as JDK8 all of the 
 required changes are generic Javadoc cleanups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2562) ContainerId@toString() is unreadable for epoch > 0 after YARN-2182

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137854#comment-14137854
 ] 

Hadoop QA commented on YARN-2562:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669482/YARN-2562.1.patch
  against trunk revision ea4e2e8.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4998//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4998//console

This message is automatically generated.

 ContainerId@toString() is unreadable for epoch > 0 after YARN-2182
 -

 Key: YARN-2562
 URL: https://issues.apache.org/jira/browse/YARN-2562
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-2562.1.patch


 ContainerID string format is unreadable for RMs that restarted at least once 
 (epoch > 0) after YARN-2182, e.g. 
 container_1410901177871_0001_01_05_17.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.

2014-09-17 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137863#comment-14137863
 ] 

Junping Du commented on YARN-2561:
--

This test failure looks random and should be unrelated. Kicking off the test 
again manually.

 MR job client cannot reconnect to AM after NM restart.
 --

 Key: YARN-2561
 URL: https://issues.apache.org/jira/browse/YARN-2561
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Tassapol Athiapinya
Assignee: Junping Du
Priority: Blocker
 Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, 
 YARN-2561-v4.patch, YARN-2561.patch


 Work-preserving NM restart is disabled.
 Submit a job. Restart the only NM and found that Job will hang with connect 
 retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager

2014-09-17 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137881#comment-14137881
 ] 

Vinod Kumar Vavilapalli commented on YARN-2080:
---

This looks good, +1. Let's commit it to the branch..

 Admission Control: Integrate Reservation subsystem with ResourceManager
 ---

 Key: YARN-2080
 URL: https://issues.apache.org/jira/browse/YARN-2080
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Subru Krishnan
Assignee: Subru Krishnan
 Attachments: YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, 
 YARN-2080.patch, YARN-2080.patch, YARN-2080.patch


 This JIRA tracks the integration of Reservation subsystem data structures 
 introduced in YARN-1709 with the YARN RM. This is essentially end2end wiring 
 of YARN-1051.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2014-09-17 Thread Chris Trezzo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137895#comment-14137895
 ] 

Chris Trezzo commented on YARN-1492:


[~kasha], [~vinodkv] and I had a conversation around the main things needed 
before committing to trunk:
1. Complete the refactor that removes SCMContext and ensures implementation 
details from the in-memory store are not leaked through the SCMStore interface.
2. Add a configuration parameter at the yarn level that allows operators to 
disallow uploading resources to the shared cache if they are not PUBLIC 
(currently resources are allowed if they are PUBLIC or owned by the user 
requesting the upload).
3. Ability to run SCM optionally as part of the RM.

A few things that are important, but can be added post merge:
1. A levelDB store implementation.
2. Security.
3. ZK-based store implementation.

Also, the consensus was that it seemed OK to let store implementations handle 
eviction policy logic. Having eviction policy logic span store implementations 
might be difficult and could cause store implementation details to leak through 
into the policies. For example, the in-memory store has to consider when it 
started up during cache eviction, where persistent stores may not need to.

 truly shared cache for jars (jobjar/libjar)
 ---

 Key: YARN-1492
 URL: https://issues.apache.org/jira/browse/YARN-1492
 Project: Hadoop YARN
  Issue Type: New Feature
Affects Versions: 2.0.4-alpha
Reporter: Sangjin Lee
Assignee: Chris Trezzo
 Attachments: YARN-1492-all-trunk-v1.patch, 
 YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, 
 YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, 
 shared_cache_design.pdf, shared_cache_design_v2.pdf, 
 shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
 shared_cache_design_v5.pdf, shared_cache_design_v6.pdf


 Currently there is the distributed cache that enables you to cache jars and 
 files so that attempts from the same job can reuse them. However, sharing is 
 limited with the distributed cache because it is normally on a per-job basis. 
 On a large cluster, sometimes copying of jobjars and libjars becomes so 
 prevalent that it consumes a large portion of the network bandwidth, not to 
 speak of defeating the purpose of bringing compute to where data is. This 
 is wasteful because in most cases code doesn't change much across many jobs.
 I'd like to propose and discuss feasibility of introducing a truly shared 
 cache so that multiple jobs from multiple users can share and cache jars. 
 This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2559) ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher

2014-09-17 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2559:
--
Attachment: YARN-2559.2.patch

bq. To be consistent with the FinalApplicationStatus exposed on RM web UI and 
CLI, we may publish UNDEFINED state as well in case finalStatus is unavailable ?

Nice catch! Fixed the issue in the new patch.

 ResourceManager sometime become un-responsive due to NPE in 
 SystemMetricsPublisher
 --

 Key: YARN-2559
 URL: https://issues.apache.org/jira/browse/YARN-2559
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.6.0
 Environment: Generic History Service is enabled in the Timeline Server 
 with 
 yarn.resourcemanager.system-metrics-publisher.enabled=true
 so that the ResourceManager should use the Timeline Store for recording application 
 history information 
Reporter: Karam Singh
Assignee: Zhijie Shen
 Attachments: YARN-2559.1.patch, YARN-2559.2.patch


 ResourceManager sometime become un-responsive due to NPE in 
 SystemMetricsPublisher



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2190) Provide a Windows container executor that can limit memory and CPU

2014-09-17 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137903#comment-14137903
 ] 

Varun Vasudev commented on YARN-2190:
-

[~chuanliu] thanks for the patch! Some questions and comments -
1. What is the behaviour of a process that tries to exceed the allocated 
memory? Will it start swapping or will it be killed?
2. Your code assumes a 1-1 mapping of physical cores to vcores. This assumption 
is/will be problematic, especially in heterogeneous clusters. You're better off 
using the ratio of (container-vcores/node-vcores) to determine cpu limits (a 
rough sketch follows after these comments).
3.
{noformat}
Index: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java
===
--- 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java
(revision 1618292)
+++ 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java
(working copy)
@@ -38,6 +38,7 @@
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.fs.permission.FsPermission;
 import org.apache.hadoop.yarn.api.records.ContainerId;
+import org.apache.hadoop.yarn.api.records.Resource;
 import org.apache.hadoop.yarn.conf.YarnConfiguration;
 import 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container;
 import 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerDiagnosticsUpdateEvent;
@@ -257,6 +258,11 @@
   readLock.unlock();
 }
   }
+
+  protected String[] getRunCommand(String command, String groupId,
+  Configuration conf) {
+return getRunCommand(command, groupId, conf, null);
+  }

Index: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
===
--- 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
 (revision 1618292)
+++ 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
 (working copy)
@@ -185,7 +185,7 @@
 
   // Setup command to run
   String[] command = getRunCommand(sb.getWrapperScriptPath().toString(),
-containerIdStr, this.getConf());
+containerIdStr, this.getConf(), container.getResource());
 
   LOG.info(launchContainer:  + Arrays.toString(command));
{noformat}

Can you explain why you are modifying DefaultContainerExecutor? You've added a 
method for the old signature in ContainerExecutor.
4. Can you modify the comments/usage to specify the units of memory (bytes, MB, 
GB)?
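
To make the ratio suggestion in point 2 concrete, here is a minimal sketch (the class and method names are illustrative, not part of the patch, and the 1/100th-of-a-percent scaling at the end is only one possible mapping onto a job-object CPU rate):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public final class CpuLimitSketch {
  // Derive the CPU cap from the container's share of the node's configured vcores,
  // instead of assuming one physical core per vcore.
  static int cpuRateLimit(Resource containerResource, Configuration conf) {
    int containerVCores = containerResource.getVirtualCores();
    int nodeVCores = conf.getInt(YarnConfiguration.NM_VCORES,
        YarnConfiguration.DEFAULT_NM_VCORES);
    float share = (float) containerVCores / nodeVCores;  // container's fraction of the node
    // Scale to a 1..10000 rate (1/100ths of a percent), the granularity that
    // job-object CPU rate control works with.
    return Math.max(1, Math.round(share * 100 * 100));
  }
}
{code}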

 Provide a Windows container executor that can limit memory and CPU
 --

 Key: YARN-2190
 URL: https://issues.apache.org/jira/browse/YARN-2190
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Reporter: Chuan Liu
Assignee: Chuan Liu
 Attachments: YARN-2190-prototype.patch, YARN-2190.1.patch, 
 YARN-2190.2.patch, YARN-2190.3.patch, YARN-2190.4.patch, YARN-2190.5.patch


 Yarn default container executor on Windows does not set the resource limit on 
 the containers currently. The memory limit is enforced by a separate 
 monitoring thread. The container implementation on Windows uses Job Object 
 right now. The latest Windows (8 or later) API allows CPU and memory limits 
 on the job objects. We want to create a Windows container executor that sets 
 the limits on job objects, thus providing resource enforcement at the OS level.
 http://msdn.microsoft.com/en-us/library/windows/desktop/ms686216(v=vs.85).aspx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137902#comment-14137902
 ] 

Hadoop QA commented on YARN-2558:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669487/YARN-2558.3.patch
  against trunk revision ea4e2e8.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4999//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4999//console

This message is automatically generated.

 Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
 --

 Key: YARN-2558
 URL: https://issues.apache.org/jira/browse/YARN-2558
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Priority: Blocker
 Attachments: YARN-2558.1.patch, YARN-2558.2.patch, YARN-2558.3.patch


 We should update ContainerTokenIdentifier#read/write to use 
 {{getContainerId}} instead of {{getId}} to pass all container information 
 correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2190) Provide a Windows container executor that can limit memory and CPU

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137905#comment-14137905
 ] 

Hadoop QA commented on YARN-2190:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12662538/YARN-2190.5.patch
  against trunk revision e3803d0.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5001//console

This message is automatically generated.

 Provide a Windows container executor that can limit memory and CPU
 --

 Key: YARN-2190
 URL: https://issues.apache.org/jira/browse/YARN-2190
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Reporter: Chuan Liu
Assignee: Chuan Liu
 Attachments: YARN-2190-prototype.patch, YARN-2190.1.patch, 
 YARN-2190.2.patch, YARN-2190.3.patch, YARN-2190.4.patch, YARN-2190.5.patch


 Yarn default container executor on Windows does not set the resource limit on 
 the containers currently. The memory limit is enforced by a separate 
 monitoring thread. The container implementation on Windows uses Job Object 
 right now. The latest Windows (8 or later) API allows CPU and memory limits 
 on the job objects. We want to create a Windows container executor that sets 
 the limits on job objects, thus providing resource enforcement at the OS level.
 http://msdn.microsoft.com/en-us/library/windows/desktop/ms686216(v=vs.85).aspx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-668) TokenIdentifier serialization should consider Unknown fields

2014-09-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-668:
-
Priority: Blocker  (was: Major)
Target Version/s: 2.6.0

 TokenIdentifier serialization should consider Unknown fields
 

 Key: YARN-668
 URL: https://issues.apache.org/jira/browse/YARN-668
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Vinod Kumar Vavilapalli
Priority: Blocker

 This would allow changing of the TokenIdentifier between versions. The 
 current serialization is Writable. A simple way to achieve this would be to 
 have a Proto object as the payload for TokenIdentifiers, instead of 
 individual fields.
 TokenIdentifier continues to implement Writable to work with the RPC layer - 
 but the payload itself is serialized using PB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId

2014-09-17 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137907#comment-14137907
 ] 

Vinod Kumar Vavilapalli commented on YARN-2558:
---

bq. Yes, this should be fine for the short term. I just wanted to make it clear 
that until YARN-668 is addressed we're going to continue to break backwards 
compatibility and thus rolling upgrades with seemingly simple changes like this.
*Sigh* yes. I just marked YARN-668 as a blocker for 2.6. Thanks [~jianhe] for 
pointing out problems with work-preserving restart without the patch.

 Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
 --

 Key: YARN-2558
 URL: https://issues.apache.org/jira/browse/YARN-2558
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Priority: Blocker
 Attachments: YARN-2558.1.patch, YARN-2558.2.patch, YARN-2558.3.patch


 We should update ContainerTokenIdentifier#read/write to use 
 {{getContainerId}} instead of {{getId}} to pass all container information 
 correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.

2014-09-17 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137908#comment-14137908
 ] 

Junping Du commented on YARN-2561:
--

Also, tried this patch in a real cluster, and it works fine as expected.

 MR job client cannot reconnect to AM after NM restart.
 --

 Key: YARN-2561
 URL: https://issues.apache.org/jira/browse/YARN-2561
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Tassapol Athiapinya
Assignee: Junping Du
Priority: Blocker
 Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, 
 YARN-2561-v4.patch, YARN-2561.patch


 Work-preserving NM restart is disabled.
 Submit a job. Restart the only NM; the job will hang with connect 
 retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields

2014-09-17 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137912#comment-14137912
 ] 

Vinod Kumar Vavilapalli commented on YARN-668:
--

I think [~sseth]'s solution in the description is a much simpler way to address 
this:
bq. The current serialization is Writable. A simple way to achieve this would 
be to have a Proto object as the payload for TokenIdentifiers, instead of 
individual fields.
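
As a rough illustration of that idea (MyTokenIdentifierProto and its getOwner() field are hypothetical stand-ins for a generated protobuf class; this is only a sketch, not the eventual patch): the identifier stays a Writable for the RPC layer, but the bytes it writes are a single protobuf message, so fields unknown to an older reader simply ride along.
{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.TokenIdentifier;

public class ProtoPayloadTokenIdentifier extends TokenIdentifier {
  public static final Text KIND = new Text("PROTO_PAYLOAD_TOKEN"); // illustrative kind

  private MyTokenIdentifierProto proto; // hypothetical generated protobuf class

  @Override
  public void write(DataOutput out) throws IOException {
    byte[] payload = proto.toByteArray();          // the whole identifier is one PB message
    WritableUtils.writeVInt(out, payload.length);  // length-prefix so readFields knows how much to read
    out.write(payload);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    byte[] payload = new byte[WritableUtils.readVInt(in)];
    in.readFully(payload);
    // Protobuf tolerates fields it does not recognize, which is what buys
    // compatibility when the identifier evolves between versions.
    proto = MyTokenIdentifierProto.parseFrom(payload);
  }

  @Override
  public Text getKind() {
    return KIND;
  }

  @Override
  public UserGroupInformation getUser() {
    // Assumes the proto carries an "owner" field; purely illustrative.
    return UserGroupInformation.createRemoteUser(proto.getOwner());
  }
}
{code}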


 TokenIdentifier serialization should consider Unknown fields
 

 Key: YARN-668
 URL: https://issues.apache.org/jira/browse/YARN-668
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Vinod Kumar Vavilapalli
Priority: Blocker

 This would allow changing of the TokenIdentifier between versions. The 
 current serialization is Writable. A simple way to achieve this would be to 
 have a Proto object as the payload for TokenIdentifiers, instead of 
 individual fields.
 TokenIdentifier continues to implement Writable to work with the RPC layer - 
 but the payload itself is serialized using PB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--
Attachment: YARN-2001.5.patch

Test passes locally, re-submitting the same patch.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
 YARN-2001.4.patch, YARN-2001.5.patch, YARN-2001.5.patch


 After failover, RM may require a certain threshold to determine whether it’s 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does RM accept 
 new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed

2014-09-17 Thread chang li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137937#comment-14137937
 ] 

chang li commented on YARN-2308:


Thanks for the collective thoughts on this and all the suggestions. I will 
improve my solution. 

 NPE happened when RM restart after CapacityScheduler queue configuration 
 changed 
 -

 Key: YARN-2308
 URL: https://issues.apache.org/jira/browse/YARN-2308
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.6.0
Reporter: Wangda Tan
Assignee: chang li
Priority: Critical
 Attachments: jira2308.patch, jira2308.patch, jira2308.patch


 I encountered an NPE during RM restart
 {code}
 2014-07-16 07:22:46,957 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
 handling event type APP_ATTEMPT_ADDED to the scheduler
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 And RM will fail to restart.
 This is caused by the queue configuration being changed: I removed some queues and 
 added new queues. So when RM restarts, it tries to recover historical 
 applications, and when any of the queues of these applications has been removed, an NPE will 
 be raised.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-17 Thread Chris Trezzo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137944#comment-14137944
 ] 

Chris Trezzo commented on YARN-2179:


[~kasha] a couple of comments:

bq. 5. sharedcache-pom: my understanding of maven is pretty sparse, so please 
correct me if I am wrong. Looks like sharedcache depends on the RM. If we were 
to embed the sharedcache in the RM, wouldn't that lead to circular dependency? 
How do we plan to solve it?

One approach would be to move the shared cache project back into the RM 
project. This would not affect the ability to run the shared cache manager as a 
separate service, but would be more of a code organization thing. Thoughts?

bq. 6. RemoteAppChecker: Just thinking out loud - in a non-embedded case, what 
happens if we upgrade other daemons/clients but not the SCM and add a new 
completed state? There might not be a solution here though, the worst case 
appears to be that we wouldn't clear the cache when apps end up in that state. 
One alternative is to query the RM for active states or an app being active. I 
am open to adding these APIs (Private for now) to the RM. 

I took a look at the ApplicationReport interface again. Would it make more 
sense to leverage getFinalApplicationStatus() instead of 
getYarnApplicationState()? That way we can just say: if the 
FinalApplicationStatus is UNDEFINED, don't clean it up; otherwise we are safe to 
delete the appId.

I will work on the changes for the other comments and post an updated patch.
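
To make the getFinalApplicationStatus() idea above concrete, here is a minimal sketch of what the app-active check could look like (the surrounding class is illustrative; only the YarnClient/ApplicationReport/FinalApplicationStatus calls are existing API):
{code}
import java.io.IOException;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class RemoteAppCheckerSketch {
  private final YarnClient client;

  public RemoteAppCheckerSketch(YarnClient client) {
    this.client = client;
  }

  // An app whose final status is still UNDEFINED has not finished, so its cache
  // entries must be kept; anything else is safe to clean up.
  public boolean isApplicationActive(ApplicationId appId)
      throws YarnException, IOException {
    ApplicationReport report = client.getApplicationReport(appId);
    return report.getFinalApplicationStatus() == FinalApplicationStatus.UNDEFINED;
  }
}
{code}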

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
 YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2468) Log handling for LRS

2014-09-17 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2468:

Attachment: YARN-2468.5.1.patch

 Log handling for LRS
 

 Key: YARN-2468
 URL: https://issues.apache.org/jira/browse/YARN-2468
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
 YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
 YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.patch


 Currently, when an application is finished, the NM will start to do the log 
 aggregation. But for long-running service applications, this is not ideal. 
 The problems we have are:
 1) LRS applications are expected to run for a long time (weeks, months).
 2) Currently, all the container logs (from one NM) will be written into a 
 single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields

2014-09-17 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137952#comment-14137952
 ] 

Junping Du commented on YARN-668:
-

bq. I think Siddharth Seth's solution in the description is a much simpler way 
to address this.
Agreed. I am starting to work on it this way. [~vinodkv], can I take it over if you 
haven't started working on this? Thanks!

 TokenIdentifier serialization should consider Unknown fields
 

 Key: YARN-668
 URL: https://issues.apache.org/jira/browse/YARN-668
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Vinod Kumar Vavilapalli
Priority: Blocker

 This would allow changing of the TokenIdentifier between versions. The 
 current serialization is Writable. A simple way to achieve this would be to 
 have a Proto object as the payload for TokenIdentifiers, instead of 
 individual fields.
 TokenIdentifier continues to implement Writable to work with the RPC layer - 
 but the payload itself is serialized using PB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-09-17 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-913:

Attachment: YARN-913-003.patch

Assuming this patch builds (it does appear to locally), this patch

# is in sync with trunk, including the new curator import of HADOOP-10982
# adds security
# has tests that bring up a kerberized ZK cluster to verify clients can work 
with it
# has the RM in charge of setting up paths and cleaning up afterwards

I don't think security is perfect ... I need to lock down the ACLs, and get the 
design docs from google drive into the docs as .md files.

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, yarnregistry.pdf, 
 yarnregistry.tla


 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137973#comment-14137973
 ] 

Hadoop QA commented on YARN-2561:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669455/YARN-2561-v4.patch
  against trunk revision ea4e2e8.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5000//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5000//console

This message is automatically generated.

 MR job client cannot reconnect to AM after NM restart.
 --

 Key: YARN-2561
 URL: https://issues.apache.org/jira/browse/YARN-2561
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Tassapol Athiapinya
Assignee: Junping Du
Priority: Blocker
 Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, 
 YARN-2561-v4.patch, YARN-2561.patch


 Work-preserving NM restart is disabled.
 Submit a job. Restart the only NM; the job will hang with connect 
 retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1779) Handle AMRMTokens across RM failover

2014-09-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1779:
--
Attachment: YARN-1779.6.patch

Thanks Vinod for reviewing. Reverted unnecessary changes.
Also changed TestUnmanagedAMLauncher to use new YarnConfiguration instead of 
Configuration so that YarnConfiguration can be reloaded. 

Tested on a real HA cluster with work-preserving restart enabled.
Without the patch, AM will get a Token exception if it fails over from 
rm1-rm2-rm1. With the patch, AM can fail over properly.
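
For readers wondering why the switch to YarnConfiguration matters, a minimal illustration (not the test code itself; the printed key is just an example): constructing a YarnConfiguration registers yarn-default.xml and yarn-site.xml as default resources, which a bare Configuration would not load.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnConfSketch {
  public static void main(String[] args) {
    // new YarnConfiguration() pulls in yarn-default.xml and yarn-site.xml,
    // so YARN-specific settings are visible; new Configuration() would not.
    Configuration conf = new YarnConfiguration();
    System.out.println(conf.get(YarnConfiguration.RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_ADDRESS));
  }
}
{code}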

 Handle AMRMTokens across RM failover
 

 Key: YARN-1779
 URL: https://issues.apache.org/jira/browse/YARN-1779
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Karthik Kambatla
Assignee: Jian He
Priority: Blocker
  Labels: ha
 Attachments: YARN-1779.1.patch, YARN-1779.2.patch, YARN-1779.3.patch, 
 YARN-1779.6.patch


 Verify if AMRMTokens continue to work against RM failover. If not, we will 
 have to do something along the lines of YARN-986. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137997#comment-14137997
 ] 

Hadoop QA commented on YARN-913:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669504/YARN-913-003.patch
  against trunk revision f24ac42.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 32 new 
or modified test files.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5004//console

This message is automatically generated.

 Add a way to register long-lived services in a YARN cluster
 ---

 Key: YARN-913
 URL: https://issues.apache.org/jira/browse/YARN-913
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, resourcemanager
Affects Versions: 2.5.0, 2.4.1
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
 YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, yarnregistry.pdf, 
 yarnregistry.tla


 In a YARN cluster you can't predict where services will come up -or on what 
 ports. The services need to work those things out as they come up and then 
 publish them somewhere.
 Applications need to be able to find the service instance they are to bond to 
 -and not any others in the cluster.
 Some kind of service registry -in the RM, in ZK, could do this. If the RM 
 held the write access to the ZK nodes, it would be more secure than having 
 apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields

2014-09-17 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137999#comment-14137999
 ] 

Tsuyoshi OZAWA commented on YARN-668:
-

+1 (non-binding) for making TokenIdentifier serialization protobuf-based. With the 
change, we can version TokenIdentifier and have old and new 
TokenIdentifiers co-exist.

 TokenIdentifier serialization should consider Unknown fields
 

 Key: YARN-668
 URL: https://issues.apache.org/jira/browse/YARN-668
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Vinod Kumar Vavilapalli
Priority: Blocker

 This would allow changing of the TokenIdentifier between versions. The 
 current serialization is Writable. A simple way to achieve this would be to 
 have a Proto object as the payload for TokenIdentifiers, instead of 
 individual fields.
 TokenIdentifier continues to implement Writable to work with the RPC layer - 
 but the payload itself is serialized using PB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2559) ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138003#comment-14138003
 ] 

Hadoop QA commented on YARN-2559:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669496/YARN-2559.2.patch
  against trunk revision e3803d0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5002//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5002//console

This message is automatically generated.

 ResourceManager sometime become un-responsive due to NPE in 
 SystemMetricsPublisher
 --

 Key: YARN-2559
 URL: https://issues.apache.org/jira/browse/YARN-2559
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.6.0
 Environment: Generic History Service is enabled in Timelineserver 
 with 
 yarn.resourcemanager.system-metrics-publisher.enabled=true
 so that ResourceManager should use the Timeline Store for recording application 
 history information 
Reporter: Karam Singh
Assignee: Zhijie Shen
 Attachments: YARN-2559.1.patch, YARN-2559.2.patch


 ResourceManager sometime become un-responsive due to NPE in 
 SystemMetricsPublisher



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-17 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138021#comment-14138021
 ] 

Karthik Kambatla commented on YARN-2179:


bq. One approach would be to move the shared cache project back into the RM 
project.
That should work. I am okay with leaving the patch as is for now and moving 
modules when we embed SCM in the RM.

bq. Would it make more sense to leverage getFinalApplicationStatus() instead of 
getYarnApplicationState()?
Sounds reasonable. 

 Initial cache manager structure and context
 ---

 Key: YARN-2179
 URL: https://issues.apache.org/jira/browse/YARN-2179
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chris Trezzo
Assignee: Chris Trezzo
 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
 YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch


 Implement the initial shared cache manager structure and context. The 
 SCMContext will be used by a number of manager services (i.e. the backing 
 store and the cleaner service). The AppChecker is used to gather the 
 currently running applications on SCM startup (necessary for an SCM that is 
 backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2468) Log handling for LRS

2014-09-17 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2468:

Attachment: YARN-2468.5.1.patch

 Log handling for LRS
 

 Key: YARN-2468
 URL: https://issues.apache.org/jira/browse/YARN-2468
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
 YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
 YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.patch


 Currently, when an application is finished, the NM will start to do the log 
 aggregation. But for long-running service applications, this is not ideal. 
 The problems we have are:
 1) LRS applications are expected to run for a long time (weeks, months).
 2) Currently, all the container logs (from one NM) will be written into a 
 single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138060#comment-14138060
 ] 

Hadoop QA commented on YARN-2001:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669499/YARN-2001.5.patch
  against trunk revision f24ac42.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The test build failed in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5005//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5005//console

This message is automatically generated.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
 YARN-2001.4.patch, YARN-2001.5.patch, YARN-2001.5.patch


 After failover, RM may require a certain threshold to determine whether it’s 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does RM accept 
 new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1779) Handle AMRMTokens across RM failover

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138072#comment-14138072
 ] 

Hadoop QA commented on YARN-1779:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669507/YARN-1779.6.patch
  against trunk revision f24ac42.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5006//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5006//console

This message is automatically generated.

 Handle AMRMTokens across RM failover
 

 Key: YARN-1779
 URL: https://issues.apache.org/jira/browse/YARN-1779
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Karthik Kambatla
Assignee: Jian He
Priority: Blocker
  Labels: ha
 Attachments: YARN-1779.1.patch, YARN-1779.2.patch, YARN-1779.3.patch, 
 YARN-1779.6.patch


 Verify if AMRMTokens continue to work against RM failover. If not, we will 
 have to do something along the lines of YARN-986. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart

2014-09-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138083#comment-14138083
 ] 

Jian He commented on YARN-1372:
---

We can probably do this:
- Transfer both justFinishedContainers and finishedContainersSentToAM to the 
new attempt irrespective of whether work-preserving AM restart is enabled or not, so that 
the second attempt can continuously ack previously finished containers. 
- In pullJustFinishedContainers, we can check if work-preserving AM restart is 
enabled. If it is, we return all the attempts’ finished containers. If it is 
not enabled, only return the current attempt’s containers.
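
A rough sketch of the second bullet (the fields and the attempt-id keying are simplified assumptions for illustration, not the actual RMAppImpl/RMAppAttemptImpl code):
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.ContainerStatus;

public class FinishedContainersSketch {
  // Finished containers not yet acked by the AM, kept per attempt id.
  private final Map<Integer, List<ContainerStatus>> justFinishedContainers =
      new ConcurrentHashMap<Integer, List<ContainerStatus>>();
  private final boolean workPreservingAmRestartEnabled;
  private volatile int currentAttemptId;

  public FinishedContainersSketch(boolean workPreservingAmRestartEnabled) {
    this.workPreservingAmRestartEnabled = workPreservingAmRestartEnabled;
  }

  public List<ContainerStatus> pullJustFinishedContainers() {
    List<ContainerStatus> result = new ArrayList<ContainerStatus>();
    if (workPreservingAmRestartEnabled) {
      // Work-preserving AM restart: the new attempt should see every attempt's
      // finished containers, including those transferred from earlier attempts.
      for (List<ContainerStatus> perAttempt : justFinishedContainers.values()) {
        result.addAll(perAttempt);
      }
    } else {
      // Otherwise only the current attempt's finished containers go back to the AM.
      List<ContainerStatus> current = justFinishedContainers.get(currentAttemptId);
      if (current != null) {
        result.addAll(current);
      }
    }
    return result;
  }
}
{code}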

 Ensure all completed containers are reported to the AMs across RM restart
 -

 Key: YARN-1372
 URL: https://issues.apache.org/jira/browse/YARN-1372
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot
 Attachments: YARN-1372.001.patch, YARN-1372.001.patch, 
 YARN-1372.002_NMHandlesCompletedApp.patch, 
 YARN-1372.002_RMHandlesCompletedApp.patch, 
 YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, 
 YARN-1372.004.patch, YARN-1372.005.patch, YARN-1372.005.patch, 
 YARN-1372.prelim.patch, YARN-1372.prelim2.patch


 Currently the NM informs the RM about completed containers and then removes 
 those containers from the RM notification list. The RM passes on that 
 completed container information to the AM and the AM pulls this data. If the 
 RM dies before the AM pulls this data then the AM may not be able to get this 
 information again. To fix this, NM should maintain a separate list of such 
 completed container notifications sent to the RM. After the AM has pulled the 
 containers from the RM then the RM will inform the NM about it and the NM can 
 remove the completed container from the new list. Upon re-register with the 
 RM (after RM restart) the NM should send the entire list of completed 
 containers to the RM along with any other containers that completed while the 
 RM was dead. This ensures that the RM can inform the AMs about all completed 
 containers. Some container completions may be reported more than once since 
 the AM may have pulled the container but the RM may die before notifying the 
 NM about the pull.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId

2014-09-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138093#comment-14138093
 ] 

Jian He commented on YARN-2558:
---

Committing this, thanks [~jlowe], [~vinodkv] for the comments.

 Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
 --

 Key: YARN-2558
 URL: https://issues.apache.org/jira/browse/YARN-2558
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Priority: Blocker
 Attachments: YARN-2558.1.patch, YARN-2558.2.patch, YARN-2558.3.patch


 We should update ContainerTokenIdentifier#read/write to use 
 {{getContainerId}} instead of {{getId}} to pass all container information 
 correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2563) On secure clusters call to timeline server fails with authentication errors when running a job via oozie

2014-09-17 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138107#comment-14138107
 ] 

Zhijie Shen commented on YARN-2563:
---

When submitting an app in secure mode, YarnClient will automatically obtain a 
timeline DT from the timeline server. This communication needs to pass Kerberos 
authentication. It works on the client side, which has Kerberos set up. In a 
container (either the AM or a specific task), the process doesn't do a Kerberos 
login, so it is not able to pass Kerberos authentication to get the 
timeline DT. In this scenario, Oozie is starting an MR job inside the MR mapper 
container, so it fails to pass the Kerberos authentication enforced by the 
timeline server.

However, the expected behavior is that YarnClient only grabs a timeline DT when 
one is not found while submitting an app; the DT will then be put into the 
credentials of the ContainerLaunchContext and passed to the AM and the remaining MR 
tasks' containers. Hence, when Oozie wants to launch an MR job from there, it 
should already have the DT and doesn't need to invoke the getTimelineDelegationToken 
method.

It seems that YarnClientImpl.addTimelineDelegationToken has a bug. No matter 
whether the DT is already in the credentials or not, YarnClientImpl will always grab 
one, but only put it into the credentials when the DT is not there. The right 
behavior should be: when the DT is already in the credentials, we shouldn't even 
invoke getTimelineDelegationToken. I'll create a patch to fix the bug.
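
A hedged sketch of what the fix could look like (the fetcher interface is a hypothetical stand-in for the Kerberos-authenticated call; the Credentials and TimelineDelegationTokenIdentifier.KIND_NAME usage is existing API): skip the timeline-server round trip entirely when a timeline DT is already in the credentials.
{code}
import java.io.IOException;

import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;
import org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier;

public class TimelineTokenSketch {
  /** Hypothetical stand-in for the Kerberos-authenticated fetch from the timeline server. */
  interface TimelineTokenFetcher {
    Token<TimelineDelegationTokenIdentifier> fetch() throws IOException;
  }

  // Only talk to the timeline server when no timeline DT is present yet; inside a
  // container the DT passed via the launch context makes the fetch unnecessary.
  static void addTimelineDelegationToken(Credentials credentials,
      TimelineTokenFetcher fetcher) throws IOException {
    for (Token<? extends TokenIdentifier> token : credentials.getAllTokens()) {
      if (TimelineDelegationTokenIdentifier.KIND_NAME.equals(token.getKind())) {
        return; // already have one; no Kerberos-authenticated call needed
      }
    }
    Token<TimelineDelegationTokenIdentifier> timelineToken = fetcher.fetch();
    credentials.addToken(timelineToken.getService(), timelineToken);
  }
}
{code}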

 On secure clusters call to timeline server fails with authentication errors 
 when running a job via oozie
 

 Key: YARN-2563
 URL: https://issues.apache.org/jira/browse/YARN-2563
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Arpit Gupta
Assignee: Zhijie Shen
Priority: Blocker

 During our nightlies on a secure cluster we have seen oozie jobs fail with 
 authentication error to the time line server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2468) Log handling for LRS

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138127#comment-14138127
 ] 

Hadoop QA commented on YARN-2468:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669521/YARN-2468.5.1.patch
  against trunk revision f230248.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5007//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5007//console

This message is automatically generated.

 Log handling for LRS
 

 Key: YARN-2468
 URL: https://issues.apache.org/jira/browse/YARN-2468
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation, nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong
 Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
 YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
 YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.patch


 Currently, when an application is finished, the NM will start to do the log 
 aggregation. But for long-running service applications, this is not ideal. 
 The problems we have are:
 1) LRS applications are expected to run for a long time (weeks, months).
 2) Currently, all the container logs (from one NM) will be written into a 
 single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--
Attachment: YARN-2001.5.patch

Trying the same patch again; no failures actually found in the Jenkins console 
log

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
 YARN-2001.4.patch, YARN-2001.5.patch, YARN-2001.5.patch, YARN-2001.5.patch


 After failover, RM may require a certain threshold to determine whether it’s 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does RM accept 
 new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2559) ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher

2014-09-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138162#comment-14138162
 ] 

Jian He commented on YARN-2559:
---

Looks good overall. We may just call RMApp#getFinalApplicationStatus here?
{code}
(appAttempt.getFinalApplicationStatus() == null ?
  RMServerUtils.createFinalApplicationStatus(appState) :
appAttempt.getFinalApplicationStatus()
{code}

 ResourceManager sometime become un-responsive due to NPE in 
 SystemMetricsPublisher
 --

 Key: YARN-2559
 URL: https://issues.apache.org/jira/browse/YARN-2559
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.6.0
 Environment: Generic History Service is enabled in Timelineserver 
 with 
 yarn.resourcemanager.system-metrics-publisher.enabled=true
 so that ResourceManager should use the Timeline Store for recording application 
 history information 
Reporter: Karam Singh
Assignee: Zhijie Shen
 Attachments: YARN-2559.1.patch, YARN-2559.2.patch


 ResourceManager sometime become un-responsive due to NPE in 
 SystemMetricsPublisher



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2563) On secure clusters call to timeline server fails with authentication errors when running a job via oozie

2014-09-17 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2563:
--
Attachment: YARN-2563.1.patch

Created a patch to fix the aforementioned bug.

 On secure clusters call to timeline server fails with authentication errors 
 when running a job via oozie
 

 Key: YARN-2563
 URL: https://issues.apache.org/jira/browse/YARN-2563
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Arpit Gupta
Assignee: Zhijie Shen
Priority: Blocker
 Attachments: YARN-2563.1.patch


 During our nightlies on a secure cluster we have seen oozie jobs fail with 
 authentication error to the time line server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2565) ResourceManager is fails to start when GenericHistoryService is enabled in secure mode without doing manual kinit as yarn

2014-09-17 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138184#comment-14138184
 ] 

Zhijie Shen commented on YARN-2565:
---

[~karams], I think you've neglected mentioning the config: 
yarn.timeline-service.generic-application-history.enabled. It should be true, 
such that FileSystemApplicationHistoryStore is picked by 
RMApplicationHistoryWriter, which cannot access HDFS correctly in secure mode.

After YARN-2033, when you enable the generic history service, you should by default 
pick the new storage stack based on TimelineStore. The problem seems to be that 
the configurations which determine what store is chosen by 
ApplicationHistoryServer and RMApplicationHistoryWriter are not consistent. On 
the RMApplicationHistoryWriter side, we should also use 
FileSystemApplicationHistoryStore only when users have explicitly put it in the 
config file.
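
A hedged sketch of that selection rule (whether YarnConfiguration.APPLICATION_HISTORY_STORE and NullApplicationHistoryStore are exactly what the eventual patch uses is an assumption made here for illustration): only instantiate a history store when the user has explicitly configured one, otherwise stay on the no-op path.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryStore;
import org.apache.hadoop.yarn.server.applicationhistoryservice.NullApplicationHistoryStore;

public class HistoryStoreSelectionSketch {
  static ApplicationHistoryStore createStore(Configuration conf) {
    // Explicit opt-in only: if no store class is configured, use the no-op store
    // instead of silently falling back to FileSystemApplicationHistoryStore.
    if (conf.get(YarnConfiguration.APPLICATION_HISTORY_STORE) == null) {
      return new NullApplicationHistoryStore();
    }
    return ReflectionUtils.newInstance(
        conf.getClass(YarnConfiguration.APPLICATION_HISTORY_STORE,
            NullApplicationHistoryStore.class, ApplicationHistoryStore.class),
        conf);
  }
}
{code}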

 ResourceManager is fails to start when GenericHistoryService is enabled in 
 secure mode without doing manual kinit as yarn
 -

 Key: YARN-2565
 URL: https://issues.apache.org/jira/browse/YARN-2565
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.6.0
 Environment: Secure cluster with ATS (timeline server enabled) and 
 yarn.resourcemanager.system-metrics-publisher.enabled=true
 so that RM can send Application history to Timeline Store
Reporter: Karam Singh
Assignee: Zhijie Shen

 Observed that RM fails to start in Secure mode when the Generic History Service is 
 enabled and ResourceManager is set to use the Timeline Store



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2565) RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore

2014-09-17 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2565:
--
Summary: RM shouldn't use the old RMApplicationHistoryWriter unless 
explicitly setting FileSystemApplicationHistoryStore  (was: ResourceManager is 
fails to start when GenericHistoryService is enabled in secure mode without 
doing manual kinit as yarn)

 RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting 
 FileSystemApplicationHistoryStore
 ---

 Key: YARN-2565
 URL: https://issues.apache.org/jira/browse/YARN-2565
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.6.0
 Environment: Secure cluster with ATS (timeline server enabled) and 
 yarn.resourcemanager.system-metrics-publisher.enabled=true
 so that RM can send Application history to Timeline Store
Reporter: Karam Singh
Assignee: Zhijie Shen

 Observed that RM fails to start in Secure mode when the Generic History Service is 
 enabled and ResourceManager is set to use the Timeline Store



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2559) ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher

2014-09-17 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2559:
--
Attachment: YARN-2559.3.patch

Updated the patch accordingly.

 ResourceManager sometime become un-responsive due to NPE in 
 SystemMetricsPublisher
 --

 Key: YARN-2559
 URL: https://issues.apache.org/jira/browse/YARN-2559
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.6.0
 Environment: Generic History Service is enabled in Timelineserver 
 with 
 yarn.resourcemanager.system-metrics-publisher.enabled=true
 so that ResourceManager should use the Timeline Store for recording application 
 history information 
Reporter: Karam Singh
Assignee: Zhijie Shen
 Attachments: YARN-2559.1.patch, YARN-2559.2.patch, YARN-2559.3.patch


 ResourceManager sometime become un-responsive due to NPE in 
 SystemMetricsPublisher



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138233#comment-14138233
 ] 

Hadoop QA commented on YARN-2001:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669564/YARN-2565.1.patch
  against trunk revision 123f20d.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5011//console

This message is automatically generated.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2001.1.patch, YARN-2001.2.patch, YARN-2001.3.patch, 
 YARN-2001.4.patch, YARN-2001.5.patch, YARN-2001.5.patch, YARN-2001.5.patch


 After failover, RM may require a certain threshold to determine whether it’s 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does RM accept 
 new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2565) RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138232#comment-14138232
 ] 

Hadoop QA commented on YARN-2565:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669564/YARN-2565.1.patch
  against trunk revision 123f20d.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5010//console

This message is automatically generated.

 RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting 
 FileSystemApplicationHistoryStore
 ---

 Key: YARN-2565
 URL: https://issues.apache.org/jira/browse/YARN-2565
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.6.0
 Environment: Secure cluster with ATS (timeline server enabled) and 
 yarn.resourcemanager.system-metrics-publisher.enabled=true
 so that RM can send Application history to Timeline Store
Reporter: Karam Singh
Assignee: Zhijie Shen
 Attachments: YARN-2565.1.patch


 Observed that the RM fails to start in secure mode when the Generic History Service is 
 enabled and the ResourceManager is set to use the Timeline Store
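 For context, explicitly opting into the old writer path would look roughly like the sketch below; the property name and store class string are my assumptions about the 2.6.0 configuration, so please verify them against the patch:
{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ExplicitHistoryStore {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Assumption: only when this store class is explicitly configured should
    // the RM keep using the old RMApplicationHistoryWriter code path.
    conf.set("yarn.timeline-service.generic-application-history.store-class",
        "org.apache.hadoop.yarn.server.applicationhistoryservice."
            + "FileSystemApplicationHistoryStore");
    System.out.println(
        conf.get("yarn.timeline-service.generic-application-history.store-class"));
  }
}
{code}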



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager

2014-09-17 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138234#comment-14138234
 ] 

Karthik Kambatla commented on YARN-2080:


Looks mostly good. Nits: There are some unused imports and javadoc errors in 
the files. Also, a couple of class javadocs have empty lines at the end.
Comments:
# It would be nice to not have default values for configs for ReservationSystem 
and PlanFollower. We could pick these defaults based on the scheduler. 
# I am not convinced using UTCClock is the best way, particularly when client 
time is not UTC. But, I guess we can go ahead with this for now and revisit it 
when we run into problems. 
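To make the first comment concrete, the scheduler-dependent defaults could be picked along these lines; a hypothetical sketch only, and the PlanFollower class names returned here are assumed for illustration:
{code:java}
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler;

// Hypothetical helper: derive the PlanFollower default from the configured
// scheduler instead of hard-coding a single default value.
public class PlanFollowerDefaults {
  public static String defaultPlanFollowerFor(ResourceScheduler scheduler) {
    if (scheduler instanceof CapacityScheduler) {
      return "CapacitySchedulerPlanFollower"; // assumed name, for illustration
    } else if (scheduler instanceof FairScheduler) {
      return "FairSchedulerPlanFollower";     // assumed name, for illustration
    }
    throw new IllegalArgumentException(
        "No reservation PlanFollower default for " + scheduler.getClass());
  }
}
{code}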

 Admission Control: Integrate Reservation subsystem with ResourceManager
 ---

 Key: YARN-2080
 URL: https://issues.apache.org/jira/browse/YARN-2080
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Subru Krishnan
Assignee: Subru Krishnan
 Attachments: YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, 
 YARN-2080.patch, YARN-2080.patch, YARN-2080.patch


 This JIRA tracks the integration of Reservation subsystem data structures 
 introduced in YARN-1709 with the YARN RM. This is essentially end2end wiring 
 of YARN-1051.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2139) Add support for disk IO isolation/scheduling for containers

2014-09-17 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2139:
--
Attachment: Disk_IO_Scheduling_Design_2.pdf

Uploaded a new design doc that includes spindle-locality information. Comments are 
very welcome.
I'll create the sub-tasks to upload preliminary code for review soon.

 Add support for disk IO isolation/scheduling for containers
 ---

 Key: YARN-2139
 URL: https://issues.apache.org/jira/browse/YARN-2139
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Wei Yan
Assignee: Wei Yan
 Attachments: Disk_IO_Scheduling_Design_1.pdf, 
 Disk_IO_Scheduling_Design_2.pdf






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2563) On secure clusters call to timeline server fails with authentication errors when running a job via oozie

2014-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138262#comment-14138262
 ] 

Hadoop QA commented on YARN-2563:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12669557/YARN-2563.1.patch
  against trunk revision 123f20d.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5009//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5009//console

This message is automatically generated.

 On secure clusters call to timeline server fails with authentication errors 
 when running a job via oozie
 

 Key: YARN-2563
 URL: https://issues.apache.org/jira/browse/YARN-2563
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Arpit Gupta
Assignee: Zhijie Shen
Priority: Blocker
 Attachments: YARN-2563.1.patch


 During our nightlies on a secure cluster we have seen oozie jobs fail with 
 authentication errors against the timeline server.
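 For anyone reproducing this outside of oozie, the secure-mode step that has to succeed is roughly the delegation-token fetch sketched below, assuming the 2.6.0 TimelineClient API, a valid Kerberos login, and "rm" as a purely illustrative renewer value:
{code:java}
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier;

public class TimelineAuthCheck {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setBoolean("yarn.timeline-service.enabled", true);

    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(conf);
    client.start();
    try {
      // In secure mode this call authenticates to the timeline server; the
      // failures reported above surface here as authentication errors.
      Token<TimelineDelegationTokenIdentifier> token =
          client.getDelegationToken("rm");
      System.out.println("Fetched timeline delegation token: " + token.getKind());
    } finally {
      client.stop();
    }
  }
}
{code}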



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId

2014-09-17 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138272#comment-14138272
 ] 

Tsuyoshi OZAWA commented on YARN-2558:
--

Thanks Jason, Vinod, and Jian for the comments and review.

 Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
 --

 Key: YARN-2558
 URL: https://issues.apache.org/jira/browse/YARN-2558
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Priority: Blocker
 Fix For: 2.6.0

 Attachments: YARN-2558.1.patch, YARN-2558.2.patch, YARN-2558.3.patch


 We should update ContainerTokenIdentifier#read/write to use 
 {{getContainerId}} instead of {{getId}} to pass all container information 
 correctly.
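 For readers following along, the gist of the change is to serialize the full 64-bit container id rather than the truncated 32-bit one; a minimal sketch of the idea (not the actual patch) follows, assuming ContainerId#getContainerId returns the 64-bit value that encodes the RM epoch:
{code:java}
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ContainerId;

// Minimal sketch of the serialization change, not the actual patch: write the
// 64-bit container id (epoch included) instead of the deprecated 32-bit id,
// so no container information is lost on the wire.
public class ContainerIdSerialization {
  static void writeOld(DataOutput out, ContainerId id) throws IOException {
    out.writeInt(id.getId());            // 32-bit id: drops the epoch bits
  }

  static void writeNew(DataOutput out, ContainerId id) throws IOException {
    out.writeLong(id.getContainerId());  // full 64-bit id, epoch included
  }
}
{code}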



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

