[jira] [Updated] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anubhav Dhoot updated YARN-1365:

Attachment: YARN-1365.006.patch

Addressed comments. Also includes changes so that no duplicate events are sent for both addApplication and addApplicationAttempt.

ApplicationMasterService to allow Register and Unregister of an app that was running before restart
Key: YARN-1365
URL: https://issues.apache.org/jira/browse/YARN-1365
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot
Attachments: YARN-1365.001.patch, YARN-1365.002.patch, YARN-1365.003.patch, YARN-1365.004.patch, YARN-1365.005.patch, YARN-1365.005.patch, YARN-1365.006.patch, YARN-1365.initial.patch

For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These calls should succeed, and the RMApp state machine should transition to completed as normal. Unregistration should succeed even for an app that the RM considers complete, since the RM may have died after saving the completion in the store but before notifying the AM that it is free to exit.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037045#comment-14037045 ]

Vinod Kumar Vavilapalli commented on YARN-2142:

bq. Because of the critical computing environment, we must test every node's TRUST status in the cluster (we can get the TRUST status via the API of the OAT server), so I added this feature into Hadoop's scheduler.

Can you add more details on what this really means? What is the definition of TRUST? What is the OAT server here? Will the Kerberos-based authentication mechanism not already be enough? If not, why?

Add one service to check the nodes' TRUST status
Key: YARN-2142
URL: https://issues.apache.org/jira/browse/YARN-2142
Project: Hadoop YARN
Issue Type: New Feature
Components: nodemanager, resourcemanager, scheduler
Affects Versions: 2.2.0
Environment: OS: Ubuntu 13.04; Java: OpenJDK 7u51-2.4.4-0
Reporter: anders
Priority: Minor
Labels: patch
Fix For: 2.2.0
Attachments: trust.patch
Original Estimate: 1m
Remaining Estimate: 1m

Because of the critical computing environment, we must test every node's TRUST status in the cluster (we can get the TRUST status via the API of the OAT server), so I added this feature into Hadoop's scheduler. Through the TRUST check service, a node can get its own TRUST status and then send it to the resource manager via the heartbeat for scheduling. In the scheduling step, if a node's TRUST status is 'false', the node will be abandoned until its TRUST status turns to 'true'. The logic of this feature is similar to the node's health check service.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037055#comment-14037055 ]

Vinod Kumar Vavilapalli commented on YARN-2175:

bq. there is no way to kill a task if it is stuck in these states.

YARN-1619/YARN-445 should let you do this manually, if not automatically.

Container localization has no timeouts and tasks can be stuck there for a long time
Key: YARN-2175
URL: https://issues.apache.org/jira/browse/YARN-2175
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.4.0
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot

There are no timeouts that can be used to limit the time taken by various container startup operations. Localization, for example, could take a long time, and there is no way to kill a task if it is stuck in these states. These delays may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for the various states within the NodeManager. The RM does not care about most of these; they concern only the AM and the NM. We can start by making these globally configurable defaults, and in the future we can make this fancier by letting the AM override them in the start-container request. This JIRA will be used to limit localization time; we can open others if we need to limit other operations.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037068#comment-14037068 ]

anders commented on YARN-2142:

In the cluster, every node has been registered with one machine, which is the OAT server. If the node's info (BIOS info, system info, etc.) was edited and the current info is not equal to the registered info, we call this node not TRUSTed.

Add one service to check the nodes' TRUST status
Key: YARN-2142
URL: https://issues.apache.org/jira/browse/YARN-2142
Project: Hadoop YARN
Issue Type: New Feature
Components: nodemanager, resourcemanager, scheduler
Affects Versions: 2.2.0
Environment: OS: Ubuntu 13.04; Java: OpenJDK 7u51-2.4.4-0
Reporter: anders
Priority: Minor
Labels: patch
Fix For: 2.2.0
Attachments: trust.patch
Original Estimate: 1m
Remaining Estimate: 1m

Because of the critical computing environment, we must test every node's TRUST status in the cluster (we can get the TRUST status via the API of the OAT server), so I added this feature into Hadoop's scheduler. Through the TRUST check service, a node can get its own TRUST status and then send it to the resource manager via the heartbeat for scheduling. In the scheduling step, if a node's TRUST status is 'false', the node will be abandoned until its TRUST status turns to 'true'. The logic of this feature is similar to the node's health check service.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2144) Add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037115#comment-14037115 ]

Wangda Tan commented on YARN-2144:

Attached a patch that contains only the preemption-log-related changes. [~jianhe], [~tassapola], do you have any comments? Thanks.

Add logs when preemption occurs
Key: YARN-2144
URL: https://issues.apache.org/jira/browse/YARN-2144
Project: Hadoop YARN
Issue Type: Improvement
Components: capacityscheduler
Affects Versions: 2.5.0
Reporter: Tassapol Athiapinya
Assignee: Wangda Tan
Attachments: AM-page-preemption-info.png, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch

There should be easy-to-read logs when preemption does occur. RM logs should have the following properties:
* Logs are retrievable while an application is still running, and are flushed often.
* AM container preemption can be distinguished from task container preemption, with the container ID shown.
* Logs should be at INFO level.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anders updated YARN-2142:

Attachment: trust.patch

Fix the WebUI.

Add one service to check the nodes' TRUST status
Key: YARN-2142
URL: https://issues.apache.org/jira/browse/YARN-2142
Project: Hadoop YARN
Issue Type: New Feature
Components: nodemanager, resourcemanager, scheduler
Affects Versions: 2.2.0
Environment: OS: Ubuntu 13.04; Java: OpenJDK 7u51-2.4.4-0
Reporter: anders
Priority: Minor
Labels: patch
Fix For: 2.2.0
Attachments: trust.patch, trust.patch
Original Estimate: 1m
Remaining Estimate: 1m

Because of the critical computing environment, we must test every node's TRUST status in the cluster (we can get the TRUST status via the API of the OAT server), so I added this feature into Hadoop's scheduler. Through the TRUST check service, a node can get its own TRUST status and then send it to the resource manager via the heartbeat for scheduling. In the scheduling step, if a node's TRUST status is 'false', the node will be abandoned until its TRUST status turns to 'true'. The logic of this feature is similar to the node's health check service.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037130#comment-14037130 ]

Hadoop QA commented on YARN-2142:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12651387/trust.patch
against trunk revision.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4025//console

This message is automatically generated.

Add one service to check the nodes' TRUST status
Key: YARN-2142
URL: https://issues.apache.org/jira/browse/YARN-2142
Project: Hadoop YARN
Issue Type: New Feature
Components: nodemanager, resourcemanager, scheduler
Affects Versions: 2.2.0
Environment: OS: Ubuntu 13.04; Java: OpenJDK 7u51-2.4.4-0
Reporter: anders
Priority: Minor
Labels: patch
Fix For: 2.2.0
Attachments: trust.patch, trust.patch
Original Estimate: 1m
Remaining Estimate: 1m

Because of the critical computing environment, we must test every node's TRUST status in the cluster (we can get the TRUST status via the API of the OAT server), so I added this feature into Hadoop's scheduler. Through the TRUST check service, a node can get its own TRUST status and then send it to the resource manager via the heartbeat for scheduling. In the scheduling step, if a node's TRUST status is 'false', the node will be abandoned until its TRUST status turns to 'true'. The logic of this feature is similar to the node's health check service.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anders updated YARN-2142:

Description:
Because of the critical computing environment, we must test every node's TRUST status in the cluster (we can get the TRUST status via the API of the OAT server), so I added this feature into Hadoop's scheduler. Through the TRUST check service, a node can get its own TRUST status and then send it to the resource manager via the heartbeat for scheduling. In the scheduling step, if a node's TRUST status is 'false', the node will be abandoned until its TRUST status turns to 'true'. The logic of this feature is similar to the node's health check service.

was: the same text, except that the final sentence read "similar to node's healthcheckservice".

Add one service to check the nodes' TRUST status
Key: YARN-2142
URL: https://issues.apache.org/jira/browse/YARN-2142
Project: Hadoop YARN
Issue Type: New Feature
Components: nodemanager, resourcemanager, scheduler
Affects Versions: 2.2.0
Environment: OS: Ubuntu 13.04; Java: OpenJDK 7u51-2.4.4-0
Reporter: anders
Priority: Minor
Labels: patch
Fix For: 2.2.0
Attachments: trust.patch, trust.patch
Original Estimate: 1m
Remaining Estimate: 1m

Because of the critical computing environment, we must test every node's TRUST status in the cluster (we can get the TRUST status via the API of the OAT server), so I added this feature into Hadoop's scheduler. Through the TRUST check service, a node can get its own TRUST status and then send it to the resource manager via the heartbeat for scheduling. In the scheduling step, if a node's TRUST status is 'false', the node will be abandoned until its TRUST status turns to 'true'. The logic of this feature is similar to the node's health check service.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037337#comment-14037337 ]

Jason Lowe commented on YARN-2175:

I also wonder if there's been a regression, since at least in 0.23 containers that are localizing can be killed by the ApplicationMaster. The MR AM does this when mapreduce.task.timeout triggers a kill of a task due to lack of progress. The MR AM kills the container, and that in turn causes the localizer to die, because the NM tells the localizer to DIE during its next heartbeat. However, if the localizer gets stuck and stops heartbeating, and the NM has lost track of it due to the container kill, then it seems we could leak a hung localizer process.

Container localization has no timeouts and tasks can be stuck there for a long time
Key: YARN-2175
URL: https://issues.apache.org/jira/browse/YARN-2175
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.4.0
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot

There are no timeouts that can be used to limit the time taken by various container startup operations. Localization, for example, could take a long time, and there is no way to kill a task if it is stuck in these states. These delays may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for the various states within the NodeManager. The RM does not care about most of these; they concern only the AM and the NM. We can start by making these globally configurable defaults, and in the future we can make this fancier by letting the AM override them in the start-container request. This JIRA will be used to limit localization time; we can open others if we need to limit other operations.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2178) TestApplicationMasterService sometimes fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037399#comment-14037399 ]

Mit Desai commented on YARN-2178:

Hi Ted, how did you reproduce this? I tried {{mvn clean test -Dtest=TestApplicationMasterService}} and I could not reproduce it, even after running it a couple of times.

TestApplicationMasterService sometimes fails in trunk
Key: YARN-2178
URL: https://issues.apache.org/jira/browse/YARN-2178
Project: Hadoop YARN
Issue Type: Test
Reporter: Ted Yu
Priority: Minor
Labels: test

From https://builds.apache.org/job/Hadoop-Yarn-trunk/587/ :
{code}
Running org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 55.763 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService
testInvalidContainerReleaseRequest(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService)  Time elapsed: 41.336 sec  <<< FAILURE!
java.lang.AssertionError: AppAttempt state is not correct (timedout) expected:<ALLOCATED> but was:<SCHEDULED>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82)
	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:401)
	at org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService.testInvalidContainerReleaseRequest(TestApplicationMasterService.java:143)
{code}

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi OZAWA updated YARN-2052:

Attachment: YARN-2052.4.patch

ContainerId creation after work preserving restart is broken
Key: YARN-2052
URL: https://issues.apache.org/jira/browse/YARN-2052
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch, YARN-2052.4.patch

Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after restart it does not know what the new sequence number should be for new allocations.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2178) TestApplicationMasterService sometimes fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037451#comment-14037451 ]

Ted Yu commented on YARN-2178:

I ran TestApplicationMasterService on Mac and it passed. Let me loop this test on Linux.

TestApplicationMasterService sometimes fails in trunk
Key: YARN-2178
URL: https://issues.apache.org/jira/browse/YARN-2178
Project: Hadoop YARN
Issue Type: Test
Reporter: Ted Yu
Priority: Minor
Labels: test

From https://builds.apache.org/job/Hadoop-Yarn-trunk/587/ :
{code}
Running org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 55.763 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService
testInvalidContainerReleaseRequest(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService)  Time elapsed: 41.336 sec  <<< FAILURE!
java.lang.AssertionError: AppAttempt state is not correct (timedout) expected:<ALLOCATED> but was:<SCHEDULED>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82)
	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:401)
	at org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService.testInvalidContainerReleaseRequest(TestApplicationMasterService.java:143)
{code}

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037463#comment-14037463 ]

Tsuyoshi OZAWA commented on YARN-2052:

Updated the patch to address the comments by Bikas, Jian, and Vinod. We agreed that this JIRA doesn't include changes to the {{toString()}} format or to the container id length (int to long). Therefore, the latest patch includes the following changes:
* Added getEpoch()/setEpoch() APIs to ContainerId.
* Changed setContainerId() to ignore the upper 8 bits reserved for the number of RM restarts.
* Updated ContainerIdProto to include an epoch (int32) value for future changes.

[~jianhe], [~bikassaha], [~vinodkv], could you take a look?

ContainerId creation after work preserving restart is broken
Key: YARN-2052
URL: https://issues.apache.org/jira/browse/YARN-2052
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch, YARN-2052.4.patch

Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after restart it does not know what the new sequence number should be for new allocations.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037481#comment-14037481 ]

Hadoop QA commented on YARN-2052:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12651430/YARN-2052.4.patch
against trunk revision.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4026//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4026//console

This message is automatically generated.

ContainerId creation after work preserving restart is broken
Key: YARN-2052
URL: https://issues.apache.org/jira/browse/YARN-2052
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch, YARN-2052.4.patch

Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after restart it does not know what the new sequence number should be for new allocations.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2182) Update ContainerId#toString() to avoid conflicts before and after RM restart
Tsuyoshi OZAWA created YARN-2182:

Summary: Update ContainerId#toString() to avoid conflicts before and after RM restart
Key: YARN-2182
URL: https://issues.apache.org/jira/browse/YARN-2182
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA

ContainerId#toString() doesn't include any information about the current cluster id. This leads to conflicts between container ids. We can avoid the conflicts without breaking backward compatibility by using the epoch introduced in YARN-2052.

--
This message was sent by Atlassian JIRA (v6.2#6252)
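One way the epoch could be folded into the string form, sketched here as a guess (the actual format is exactly what this JIRA must decide; getEpoch() comes from YARN-2052, while the remaining field names are assumptions):
{code}
// Hypothetical epoch-aware ContainerId#toString(): ids minted before any
// restart (epoch == 0) keep their existing form, preserving compatibility.
public String toString() {
  StringBuilder sb = new StringBuilder("container_");
  if (getEpoch() != 0) {
    sb.append("e").append(getEpoch()).append("_"); // e.g. container_e17_...
  }
  sb.append(clusterTimestamp).append("_")
    .append(appId).append("_")
    .append(attemptId).append("_")
    .append(containerId);
  return sb.toString();
}
{code}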
[jira] [Updated] (YARN-2182) Update ContainerId#toString() to avoid conflicts before and after RM restart
[ https://issues.apache.org/jira/browse/YARN-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi OZAWA updated YARN-2182:

Issue Type: Sub-task (was: Improvement)
Parent: YARN-556

Update ContainerId#toString() to avoid conflicts before and after RM restart
Key: YARN-2182
URL: https://issues.apache.org/jira/browse/YARN-2182
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA

ContainerId#toString() doesn't include any information about the current cluster id. This leads to conflicts between container ids. We can avoid the conflicts without breaking backward compatibility by using the epoch introduced in YARN-2052.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037544#comment-14037544 ]

Xuan Gong commented on YARN-611:

I am working on this and will give a proposal soon.

Add an AM retry count reset window to YARN RM
Key: YARN-611
URL: https://issues.apache.org/jira/browse/YARN-611
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Chris Riccomini
Assignee: Xuan Gong

YARN currently has the following config: yarn.resourcemanager.am.max-retries. This config defaults to 2 and defines how many times to retry a failed AM before failing the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the NM will time out, which counts as a failure for the AM), or if the AM dies.

This configuration is insufficient for long-running (or infinitely running) YARN jobs, since the machine (or NM) that the AM is running on will eventually need to be restarted (or the machine/NM will fail). In such an event, the AM has not done anything wrong, but this is counted as a failure by the RM. Since the retry count for the AM is never reset, eventually the number of machine/NM failures will push the AM failure count above the configured value of yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the job as failed and shut it down. This behavior is not ideal.

I propose that we add a second configuration: yarn.resourcemanager.am.retry-count-window-ms. This configuration would define a window of time used to decide when an AM is well behaved and it's safe to reset its failure count back to zero. Every time an AM fails, the RMAppImpl would check the last time that the AM failed. If the last failure was less than retry-count-window-ms ago and the new failure count is > max-retries, then the job should fail. If the AM has never failed, the retry count is < max-retries, or the last failure was OUTSIDE the retry-count-window-ms, then the job should be restarted. Additionally, if the last failure was outside the retry-count-window-ms, then the failure count should be set back to 0. This would give developers a way to have well-behaved AMs run forever, while still failing misbehaving AMs after a short period of time. The window check is sketched after this description.

I think the work to be done here is to change the RMAppImpl to actually look at app.attempts and see if there have been more than max-retries failures in the last retry-count-window-ms milliseconds. If there have, then the job should fail; if not, then the job should go forward. Additionally, we might also need to add an endTime to either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that the RMAppImpl can check the time of the failure. Thoughts?

--
This message was sent by Atlassian JIRA (v6.2#6252)
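The window check described in the proposal above could be sketched as follows (a reading of the proposal, not RM code; shouldFailApp, attemptFailureTimes, failureWindowMs, and maxRetries are all illustrative names):
{code}
// Count only failures that happened inside the window; anything older is
// treated as if the count had been reset to zero.
boolean shouldFailApp(List<Long> attemptFailureTimes, long now,
    long failureWindowMs, int maxRetries) {
  int recentFailures = 0;
  for (long failTime : attemptFailureTimes) {
    if (now - failTime <= failureWindowMs) {
      recentFailures++;
    }
  }
  return recentFailures >= maxRetries;
}
{code}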
[jira] [Assigned] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuan Gong reassigned YARN-611:

Assignee: Xuan Gong

Add an AM retry count reset window to YARN RM
Key: YARN-611
URL: https://issues.apache.org/jira/browse/YARN-611
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Chris Riccomini
Assignee: Xuan Gong

YARN currently has the following config: yarn.resourcemanager.am.max-retries. This config defaults to 2 and defines how many times to retry a failed AM before failing the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the NM will time out, which counts as a failure for the AM), or if the AM dies.

This configuration is insufficient for long-running (or infinitely running) YARN jobs, since the machine (or NM) that the AM is running on will eventually need to be restarted (or the machine/NM will fail). In such an event, the AM has not done anything wrong, but this is counted as a failure by the RM. Since the retry count for the AM is never reset, eventually the number of machine/NM failures will push the AM failure count above the configured value of yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the job as failed and shut it down. This behavior is not ideal.

I propose that we add a second configuration: yarn.resourcemanager.am.retry-count-window-ms. This configuration would define a window of time used to decide when an AM is well behaved and it's safe to reset its failure count back to zero. Every time an AM fails, the RMAppImpl would check the last time that the AM failed. If the last failure was less than retry-count-window-ms ago and the new failure count is > max-retries, then the job should fail. If the AM has never failed, the retry count is < max-retries, or the last failure was OUTSIDE the retry-count-window-ms, then the job should be restarted. Additionally, if the last failure was outside the retry-count-window-ms, then the failure count should be set back to 0. This would give developers a way to have well-behaved AMs run forever, while still failing misbehaving AMs after a short period of time.

I think the work to be done here is to change the RMAppImpl to actually look at app.attempts and see if there have been more than max-retries failures in the last retry-count-window-ms milliseconds. If there have, then the job should fail; if not, then the job should go forward. Additionally, we might also need to add an endTime to either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that the RMAppImpl can check the time of the failure. Thoughts?

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037550#comment-14037550 ]

Allen Wittenauer commented on YARN-1964:

I took a look at the patch. After some discussion/clarification from Abin and Dinesh, here are some issues we've identified with the current patch:
* yarn-site.xml names are not in the yarn.* hierarchy.
* The environment variables being imported are likely incomplete (HADOOP_PREFIX, at a minimum, should be added to the list).
* Application Masters cannot currently be run from DockerContainerExecutor. The current patch launches them with the default launcher.
* There isn't any special distributed cache handling, which could/will expose private data to everyone, as well as risk confusing the node manager if the distributed cache contents are changed out from underneath it.
* Same thing with logging.
* This hasn't been tested with any level of security (either 'run as the user' or fully kerberized). There are likely more holes/problems that will be discovered after those tests.
* There is a question as to how to handle importing Java and the Hadoop setup into the container. There is a risk that just adding HADOOP_PREFIX and marking it as RO may break things in certain configurations (specifically the logs, tmp, and pids dirs).

Create Docker analog of the LinuxContainerExecutor in YARN
Key: YARN-1964
URL: https://issues.apache.org/jira/browse/YARN-1964
Project: Hadoop YARN
Issue Type: New Feature
Affects Versions: 2.2.0
Reporter: Arun C Murthy
Assignee: Abin Shahab
Attachments: yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch

Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In the context of YARN, support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (an entire Linux file system, incl. custom versions of perl, python, etc.) and use it as a blueprint to launch all their YARN containers with the requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine).

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service
[ https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-1713:

Attachment: apache-yarn-1713.6.patch

New patch with code and documentation for submitting an app.

Implement getnewapplication and submitapp as part of RM web service
Key: YARN-1713
URL: https://issues.apache.org/jira/browse/YARN-1713
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Attachments: apache-yarn-1713.3.patch, apache-yarn-1713.4.patch, apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, apache-yarn-1713.patch

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037647#comment-14037647 ]

Ashwin Shankar commented on YARN-2026:

[~sandyr], sure. Just to be on the same page, I'd like to iron out the design details upfront to avoid rework and minimize boilerplate code. Here is my design proposal; feel free to change anything:
1. FairShareBase and DominantResourceFairnessBase would contain all the common code. Their parent class would be the existing SchedulingPolicy class.
2. FairShareBase would have two subclasses: FairSharePolicy (the existing one) and FairShareActiveQueuesPolicy. The difference between the subclasses is that they would use different ComputeShare classes.
3. Similarly, DominantResourceFairnessBase would have two subclasses: DominantResourceFairnessPolicy (the existing one) and DominantResourceActiveQueuesPolicy.
4. A new ComputeShareActiveQueues class, which computes fair share for active queues and is used by FairShareActiveQueuesPolicy and DominantResourceActiveQueuesPolicy.
Thoughts?

Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
Key: YARN-2026
URL: https://issues.apache.org/jira/browse/YARN-2026
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
Labels: scheduler
Attachments: YARN-2026-v1.txt

Problem 1: While using hierarchical queues in the fair scheduler, there are a few scenarios where we have seen a leaf queue with the least fair share take the majority of the cluster and starve a sibling parent queue which has a greater weight/fair share, while preemption doesn't kick in to reclaim resources. The root cause seems to be that the fair share of a parent queue is distributed to all its children, irrespective of whether each is an active or an inactive (no apps running) queue. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and it has demand greater than that. When there are many queues under a parent queue (with a high fair share), the child queues' fair shares become really low. As a result, when only a few of these child queues have apps running, they reach their *tiny* fair share quickly, and preemption doesn't happen even if other leaf queues (non-siblings) are hogging the cluster. This can be solved by dividing the fair share of a parent queue only among its active child queues.

Here is an example describing the problem and the proposed solution:
root.lowPriorityQueue is a leaf queue with weight 2.
root.HighPriorityQueue is a parent queue with weight 8.
root.HighPriorityQueue has 10 child leaf queues: root.HighPriorityQueue.childQ(1..10).

The above config results in root.HighPriorityQueue having an 80% fair share, and each of its ten child queues would have an 8% fair share. Preemption would happen only if a child queue's usage is below 4% (0.5*8=4). Let's say at the moment no apps are running in any of root.HighPriorityQueue.childQ(1..10), and a few apps are running in root.lowPriorityQueue, which is taking up 95% of the cluster. Up to this point, the behavior of FS is correct. Now let's say root.HighPriorityQueue.childQ1 gets a big job which requires 30% of the cluster. It would get only the available 5% of the cluster, and preemption wouldn't kick in since it is above 4% (half its fair share). This is bad considering childQ1 is under a high-priority parent queue which has an *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers, we would see the following allocation on the scheduler page:
*root.lowPriorityQueue = 95%*
*root.HighPriorityQueue.childQ1 = 5%*

This can be solved by distributing a parent's fair share only among active queues. So in the example above, since childQ1 is the only active queue under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds.

Problem 2: Also note that a similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 hogs the cluster. childQ2 can take up 95% of the cluster, and childQ1 would be stuck at 5% until childQ2 starts relinquishing containers. We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. 40%, which would ensure childQ1 gets up to 40% of the resources if needed, through preemption.

--
This message was sent by Atlassian JIRA (v6.2#6252)
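The arithmetic in the example above can be made concrete with a small sketch of share computation restricted to active queues (the Queue type and every method name here are hypothetical, for illustration only):
{code}
// With parentShare = 0.80 and ten equally weighted children, this returns
// 0.08 per child when all are active, but 0.80 for childQ1 when it is the
// only active child -- which is the proposed behavior.
double childFairShare(double parentShare, List<Queue> children, Queue child) {
  double activeWeight = 0;
  for (Queue q : children) {
    if (q.isActive()) {        // active = has at least one runnable app
      activeWeight += q.getWeight();
    }
  }
  return child.isActive()
      ? parentShare * child.getWeight() / activeWeight
      : 0.0;                   // inactive queues receive no fair share
}
{code}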
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037656#comment-14037656 ]

Junping Du commented on YARN-1341:

bq. I'm not sure I understand what you're requesting. Recovering the NM tokens is one line of code (3 if we count the if canRecover part), and recovering the container tokens in YARN-1342 will add one more line for that (inside the same if canRecover block). I went ahead and factored this into a separate method; however, I'm not sure it matches what you were expecting, as I don't see where we're saving duplicated code. If what's in the updated patch isn't what you expected, please provide some sample pseudo-code to demonstrate how we can avoid duplication of code.

I think it is fine for now. However, I would like to refactor NodeManager#serviceInit() a bit once we finish all this recovery work, to avoid some duplicated work in code like createNMContext(), where we set some handlers twice. Anyway, we can do this later.

bq. The problem with throwing an exception is what to do with the exception – do we take down the NM? That seems like a drastic answer since the NM will likely chug along just fine without the key stored. It only becomes a problem when the NM restarts and restores an old key. However, if we rollback the old key here, then we take that only-breaks-if-we-happened-to-restart case and make it an always-breaks scenario. Eventually the old key will no longer be valid to the RM, and none of the AMs will be able to authenticate to the NM. Therefore I thought it would be better to log the error, press onward, and hope we don't restart before we store a valid key again (maybe the store error was transient), rather than either take down the NM or have things start failing even without a restart.

We already have a similar tradeoff on the RM side: if any exception happens in the RMStateStore, it will bring down the RM. In the NM case, if leveldb stops working, I think we should bring the NM down to get rid of any inconsistency after NM restart. Although I am not sure what weird things could happen in case of inconsistency here, considering it is cheaper to bring down the NM, we should play it safer here than in the RM case. Actually, I brought up some thoughts on being more risky on the RM side in YARN-2019, which targets reducing RM service downtime. But here, I prefer to be safer. Jason, what do you think?

Recover NMTokens upon nodemanager restart
Key: YARN-1341
URL: https://issues.apache.org/jira/browse/YARN-1341
Project: Hadoop YARN
Issue Type: Sub-task
Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037664#comment-14037664 ]

Junping Du commented on YARN-2019:

[~kasha], sorry that I missed your comments, as my email/company changed during that time. My thought on the right behavior is: if there is any issue on the ZK cluster side (although it is distributed and should be more robust, it could go down due to a bug or bad configuration), we can let the active RM continue to run as in the non-HA case. In addition, we should report to the admin that HA is not working well, and let the admin decide the proper time to bring down the RM and reconfigure the HA setup. Does that make sense?

Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
Key: YARN-2019
URL: https://issues.apache.org/jira/browse/YARN-2019
Project: Hadoop YARN
Issue Type: Bug
Reporter: Junping Du
Priority: Critical
Labels: ha
Attachments: YARN-2019.1-wip.patch

Currently, if anything abnormal happens in ZKRMStateStore, it will throw a fatal exception and crash the RM. As shown in YARN-1924, this could be due to an internal bug in RM HA itself rather than a truly fatal condition. We should revisit some of the decisions here, as the HA feature is designed to protect a key component, not to disturb it.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service
[ https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037670#comment-14037670 ]

Hadoop QA commented on YARN-1713:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12651461/apache-yarn-1713.6.patch
against trunk revision.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4027//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4027//console

This message is automatically generated.

Implement getnewapplication and submitapp as part of RM web service
Key: YARN-1713
URL: https://issues.apache.org/jira/browse/YARN-1713
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Attachments: apache-yarn-1713.3.patch, apache-yarn-1713.4.patch, apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, apache-yarn-1713.patch

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1039) Add parameter for YARN resource requests to indicate long lived
[ https://issues.apache.org/jira/browse/YARN-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuan Gong updated YARN-1039:

Assignee: (was: Xuan Gong)

Add parameter for YARN resource requests to indicate long lived
Key: YARN-1039
URL: https://issues.apache.org/jira/browse/YARN-1039
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Affects Versions: 3.0.0, 2.1.1-beta
Reporter: Steve Loughran
Priority: Minor

A container request could support a new parameter, long-lived. This could be used by a scheduler that would know not to host the service on a transient (cloud: spot-priced) node. Schedulers could also decide whether or not to allocate multiple long-lived containers on the same node.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1039) Add parameter for YARN resource requests to indicate long lived
[ https://issues.apache.org/jira/browse/YARN-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Welch reassigned YARN-1039:

Assignee: Craig Welch

Add parameter for YARN resource requests to indicate long lived
Key: YARN-1039
URL: https://issues.apache.org/jira/browse/YARN-1039
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Affects Versions: 3.0.0, 2.1.1-beta
Reporter: Steve Loughran
Assignee: Craig Welch
Priority: Minor

A container request could support a new parameter, long-lived. This could be used by a scheduler that would know not to host the service on a transient (cloud: spot-priced) node. Schedulers could also decide whether or not to allocate multiple long-lived containers on the same node.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037791#comment-14037791 ]

Jian He commented on YARN-1365:

Thanks for updating the patch.

1. How about renaming UnregisteredApplicationMasterException to ApplicationMasterNotRegisteredException? Please also add comments saying that this exception can happen even if the application has registered before, because the RM may have restarted, and that the expected way to handle this exception is to re-register.

2. This newly added constructor is not used anywhere? We can just use "app.handler.handle" to send the scheduler event in RMAppRecoveredTransition instead of refactoring the transition.
{code}
public void transition(RMAppImpl app, RMAppEvent event,
    boolean shouldSchedulerNotifyAppAdded) {
  transitionImplementation(app, event, shouldSchedulerNotifyAppAdded);
}
{code}

3. The following code format in FifoScheduler can be consolidated into 2 lines:
{code}
public synchronized void addApplication(ApplicationId applicationId,
    String queue, String user, boolean shouldNotifyAppAccepted) {
{code}

4. Some minor comments on testRMRestartWorkPreservingAppReregister:

This conf.set is not needed; it's already enabled globally.
{code}
conf.setBoolean(YarnConfiguration.RM_WORK_PRESERVING_RECOVERY_ENABLED, true);
{code}

We can use MockRM.launchAndRegisterAM instead of changing TestRMRestart.launchAM to be static.
{code}
MockAM am0 = TestRMRestart.launchAM(app0, rm1, nm1);
{code}

If using the global variables rm1 and rm2, the following two statements are not needed.
{code}
rm1.stop();
rm2.stop();
{code}

ApplicationMasterService to allow Register and Unregister of an app that was running before restart
Key: YARN-1365
URL: https://issues.apache.org/jira/browse/YARN-1365
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot
Attachments: YARN-1365.001.patch, YARN-1365.002.patch, YARN-1365.003.patch, YARN-1365.004.patch, YARN-1365.005.patch, YARN-1365.005.patch, YARN-1365.006.patch, YARN-1365.initial.patch

For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These calls should succeed, and the RMApp state machine should transition to completed as normal. Unregistration should succeed even for an app that the RM considers complete, since the RM may have died after saving the completion in the store but before notifying the AM that it is free to exit.

--
This message was sent by Atlassian JIRA (v6.2#6252)
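The re-register expectation in comment 1 would look roughly like this on the AM side (a sketch only: the exception name follows the rename proposed above, and the wrapper method itself is illustrative):
{code}
// On a restarted RM, allocate() fails because the registration was lost:
// catch the exception, register again, then retry the allocate call.
AllocateResponse allocateWithReregister(ApplicationMasterProtocol rm,
    AllocateRequest req, RegisterApplicationMasterRequest registration)
    throws YarnException, IOException {
  try {
    return rm.allocate(req);
  } catch (ApplicationMasterNotRegisteredException e) {
    rm.registerApplicationMaster(registration);
    return rm.allocate(req);
  }
}
{code}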
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037828#comment-14037828 ]

Steve Loughran commented on YARN-941:

[~vanzin], the issue here is that the AMRM token is only valid for 48h or so, after which an AM can't talk to the RM. This feature allows the RM to push out a new token to the AM. An attacker who gets the old token would only be able to impersonate the AM for the remaining life of that token. Without this feature we can't have long-lived YARN services.

Even with this, there's still the challenge of updating HDFS tokens. YARN is leaving that to the application, either through client-initiated updates (the client gets a token and pushes it to the AM somehow) or preinstalled keytabs a la HBase.

RM Should have a way to update the tokens it has for a running application
Key: YARN-941
URL: https://issues.apache.org/jira/browse/YARN-941
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Xuan Gong
Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, YARN-941.preview.4.patch, YARN-941.preview.patch

When an application is submitted to the RM, it includes a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and that will be used when launching the application to access HDFS to download files on behalf of the application. For long-lived applications/services these tokens can expire; then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using Kerberos, and then the AM can inform the RM of the new set of tokens and quickly update its own tokens internally to use these new ones.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037833#comment-14037833 ]

Carlo Curino commented on YARN-1051:

We created a branch named YARN-1051 where we are going to develop/commit this feature. Once it all looks good, we will merge back to trunk.

YARN Admission Control/Planner: enhancing the resource allocation model with time.
Key: YARN-1051
URL: https://issues.apache.org/jira/browse/YARN-1051
Project: Hadoop YARN
Issue Type: Improvement
Components: capacityscheduler, resourcemanager, scheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
Attachments: YARN-1051-design.pdf, curino_MSR-TR-2013-108.pdf, techreport.pdf

In this umbrella JIRA we propose to extend the YARN RM to handle time explicitly, allowing users to reserve capacity over time. This is an important step towards SLAs, long-running services, and workflows, and it helps with gang scheduling.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2130) Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext
[ https://issues.apache.org/jira/browse/YARN-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037838#comment-14037838 ]

Karthik Kambatla commented on YARN-2130:

Looking even better. A few more comments:
# I like how ClientRMService's constructor is minimal. We should probably do the same for RMAppManager and ResourceTrackerService.
# What do you think of leaving the fields in ClientRMService and initializing them in the constructor? That way, the changes would be limited to constructors.

Cleanup: Adding getRMAppManager, getQueueACLsManager, getApplicationACLsManager to RMContext
Key: YARN-2130
URL: https://issues.apache.org/jira/browse/YARN-2130
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Attachments: YARN-2130.1.patch, YARN-2130.2.patch, YARN-2130.3.patch, YARN-2130.4.patch

--
This message was sent by Atlassian JIRA (v6.2#6252)
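A minimal sketch of what comment 2 suggests, assuming the getters this JIRA adds to RMContext (the exact field set and constructor arguments shown are illustrative):
{code}
// Keep the fields on the service; pull them from RMContext in the
// constructor so the rest of the class (and its callers) stay unchanged.
public ClientRMService(RMContext rmContext /* , other args */) {
  this.rmContext = rmContext;
  this.rmAppManager = rmContext.getRMAppManager();
  this.applicationsACLsManager = rmContext.getApplicationACLsManager();
  this.queueACLsManager = rmContext.getQueueACLsManager();
}
{code}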
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037850#comment-14037850 ]

Marcelo Vanzin commented on YARN-941:

[~ste...@apache.org], thanks for the comments, but I understand the part about renewing the token. My question was more along the lines of: what prevents the attacker from getting the new token and using it? That's why I called it an attack mitigation feature. If an attacker gets a token, that particular token is only usable for a period of time. But it doesn't seem like there's anything that prevents the attack in the first place, so if an attacker is able to get the first token, he is able to get any future new tokens using exactly the same approach.

I understand that renewing tokens is needed for long-running processes. I'm just trying to understand whether this is the right approach from a security perspective, and if it's not, whether it would be good to spend some time thinking about a more secure way of exchanging these tokens.

RM Should have a way to update the tokens it has for a running application
Key: YARN-941
URL: https://issues.apache.org/jira/browse/YARN-941
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Robert Joseph Evans
Assignee: Xuan Gong
Attachments: YARN-941.preview.2.patch, YARN-941.preview.3.patch, YARN-941.preview.4.patch, YARN-941.preview.patch

When an application is submitted to the RM, it includes a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and that will be used when launching the application to access HDFS to download files on behalf of the application. For long-lived applications/services these tokens can expire; then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using Kerberos, and then the AM can inform the RM of the new set of tokens and quickly update its own tokens internally to use these new ones.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037868#comment-14037868 ]

Junping Du commented on YARN-1341:

bq. Restarts should be rare, and I'd rather not force a loss of work by taking the NM down instantly when the state store hiccups.

Yes, but considering the rolling-upgrade case, restarts should be much more frequent than state store failures (correct me here if I am wrong, as I am not a leveldb expert). In that case we always anticipate some work loss anyway: even if we don't bring the NM down now, we will suffer after the NM restarts during the upgrade.

bq. If the state store is missing some things, we might not be able to recover a localized resource, a token, a container, or possibly anything at all.

I am not worried about losing them all, but if we can only partially recover this state, would it become a problem and break some assumptions we have? I don't know, but this seems to make things more complicated.

bq. in the worst-case, the state store is so corrupted on startup that we don't even survive the NM restart and the NM crashes, which would have an end result just like if we took it down when the state store failed.

I am not sure this is the worst case. The worst case seems to me to be: the NM restarts with only partial state recovered, and this inconsistent state is not noticed by the running containers, which could cause some weird bugs. I am not sure how likely that is to happen here; please correct me if I am wrong.

bq. Therefore I'd rather not guarantee that we'll lose work by crashing the NM on any store error and instead try to preserve the work we have. The NM could theoretically recover (e.g.: if the error is transient then the next RM key store could succeed). If we take the NM down immediately then we're guaranteeing the work is lost. Is that really better?

I think it is better to guarantee the work is lost, as the expectation presented to the user is then consistent. We don't know when a new token from the RM will arrive to refresh the stale one, so work preservation would only succeed by luck. Users shouldn't expect work to still be preserved after an NM restart if a state store operation failed at some point.

bq. Maybe a better approach is to have errors like this trigger an unhealthy state for the NM when we have the ability to do a graceful decommission.

I agree; this could be a better approach. Overall, I agree that we can just log the error here without bringing the NM down (otherwise we would have to change the existing code that updates localizedResources/deletionServices), for the reasons you gave above. However, to avoid loading inconsistent state and to manage users' expectations, I think we shouldn't allow the state to be loaded again after a store failure. Maybe we add a stale tag on the NMStateStore, mark it when a store failure happens, and never load a stale store. [~jlowe], what do you think?

Recover NMTokens upon nodemanager restart
Key: YARN-1341
URL: https://issues.apache.org/jira/browse/YARN-1341
Project: Hadoop YARN
Issue Type: Sub-task
Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14037875#comment-14037875 ] Jian He commented on YARN-2052: --- I think the conclusion was to not add any new fields to ContainerId. Instead, we persist the epoch number. Each time a restart happens, the initial value of AppSchedulingInfo#containerIdCounter is bumped to (epoch * 2^22), assuming we reserve 10 bits for the number of RM restarts. Later on, if we change the int to long, we will have 2^32 possible epoch values, which should be more than enough. This patch should include the state-store change as well as the containerIdCounter change. ContainerId creation after work preserving restart is broken Key: YARN-2052 URL: https://issues.apache.org/jira/browse/YARN-2052 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch, YARN-2052.4.patch Container ids are made unique by taking the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app. So after a restart it does not know what the new sequence number should be for new allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
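A minimal sketch of the epoch-seeded counter described above, assuming 10 bits of epoch and 22 bits of per-app sequence in a 32-bit id; the class and method names are illustrative, not the actual patch.
{code}
import java.util.concurrent.atomic.AtomicInteger;

public class EpochContainerIdCounterSketch {
  private static final int SEQUENCE_BITS = 22; // 32 - 10 bits reserved for RM restarts

  private final AtomicInteger containerIdCounter = new AtomicInteger();

  /** Re-seed from the persisted epoch after an RM restart: epoch * 2^22. */
  public void seed(int epoch) {
    containerIdCounter.set(epoch << SEQUENCE_BITS);
  }

  /** Ids allocated in different epochs can then never collide. */
  public int nextContainerId() {
    return containerIdCounter.incrementAndGet();
  }
}
{code}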
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038029#comment-14038029 ] Sandy Ryza commented on YARN-2026: -- I think it might be simpler to just: * Create FairShareActiveOnlyPolicy and DominantResourceFairnessActiveOnlyPolicy that extend FairSharePolicy and DominantResourceFairnessPolicy * In those, override SchedulingPolicy.computeShares to set the fair shares of inactive apps to 0 and call ComputeFairShares.computeShares on only the active schedulables (see the sketch after this message). Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2026-v1.txt Problem 1 - While using hierarchical queues in the fair scheduler, there are a few scenarios where we have seen a leaf queue with the least fair share take the majority of the cluster and starve a sibling parent queue that has a greater weight/fair share, with preemption never kicking in to reclaim resources. The root cause seems to be that the fair share of a parent queue is distributed to all its children irrespective of whether they are active or inactive (no apps running) queues. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and it has demand greater than that. When there are many queues under a parent queue (with a high fair share), each child queue's fair share becomes really low. As a result, when only a few of these child queues have apps running, they reach their *tiny* fair share quickly, and preemption doesn't happen even if other leaf queues (non-siblings) are hogging the cluster. This can be solved by dividing the fair share of a parent queue among only its active child queues. Here is an example describing the problem and the proposed solution: root.lowPriorityQueue is a leaf queue with weight 2; root.HighPriorityQueue is a parent queue with weight 8; root.HighPriorityQueue has 10 child leaf queues: root.HighPriorityQueue.childQ(1..10). The above config results in root.HighPriorityQueue having an 80% fair share, and each of its ten child queues having an 8% fair share. Preemption would kick in for a child queue only if its usage fell below 4% (0.5*8=4). Let's say that at the moment no apps are running in any of root.HighPriorityQueue.childQ(1..10), and a few apps are running in root.lowPriorityQueue, which is taking up 95% of the cluster. Up to this point, the behavior of FS is correct. Now let's say root.HighPriorityQueue.childQ1 gets a big job which requires 30% of the cluster. It would get only the available 5% of the cluster, and preemption wouldn't kick in since it is above 4% (half its fair share). This is bad considering childQ1 is under a high-priority parent queue which has an *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers, we would see the following allocation on the scheduler page: *root.lowPriorityQueue = 95%* *root.HighPriorityQueue.childQ1=5%* This can be solved by distributing a parent's fair share only to active queues. So in the example above, since childQ1 is the only active queue under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds.
Problem 2 - Note that a similar situation can arise between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 hogs the cluster. childQ2 can take up 95% of the cluster, and childQ1 would be stuck at 5% until childQ2 starts relinquishing containers. We would like childQ1 and childQ2 to each get half of root.HighPriorityQueue's fair share, i.e. 40%, which would ensure childQ1 can get up to 40% of resources through preemption if needed. -- This message was sent by Atlassian JIRA (v6.2#6252)
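Sandy's suggested override might look roughly like this sketch. The Schedulable interface here is a simplified stand-in, and the even split stands in for the weighted computation ComputeFairShares actually performs.
{code}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class FairShareActiveOnlySketch {

  // Simplified stand-in for the fair scheduler's Schedulable.
  interface Schedulable {
    boolean isActive();              // has running apps / live demand
    void setFairShare(double share);
  }

  /** Zero out inactive schedulables, then divide the parent's share among the active ones. */
  static void computeShares(Collection<? extends Schedulable> schedulables, double totalShare) {
    List<Schedulable> active = new ArrayList<Schedulable>();
    for (Schedulable s : schedulables) {
      if (s.isActive()) {
        active.add(s);
      } else {
        s.setFairShare(0); // inactive queues get no fair share
      }
    }
    double each = active.isEmpty() ? 0 : totalShare / active.size();
    for (Schedulable s : active) {
      s.setFairShare(each); // even split; the real code would call ComputeFairShares here
    }
  }
}
{code}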
[jira] [Commented] (YARN-1039) Add parameter for YARN resource requests to indicate long lived
[ https://issues.apache.org/jira/browse/YARN-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038041#comment-14038041 ] Steve Loughran commented on YARN-1039: -- Marking as depended on by YARN-896. I would keep the affinity logic separate, as discussed in YARN-1042. Add parameter for YARN resource requests to indicate long lived - Key: YARN-1039 URL: https://issues.apache.org/jira/browse/YARN-1039 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 3.0.0, 2.1.1-beta Reporter: Steve Loughran Assignee: Craig Welch Priority: Minor A container request could support a new parameter, long-lived. This could be used by a scheduler that would then know not to host the service on a transient (cloud: spot-priced) node. Schedulers could also decide whether or not to allocate multiple long-lived containers on the same node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2176) CapacityScheduler loops over all running applications rather than actively requesting apps
[ https://issues.apache.org/jira/browse/YARN-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038045#comment-14038045 ] Sandy Ryza commented on YARN-2176: -- Can we merge the ActiveUsersManager stuff into an abstract SchedulerLeafQueue class that FSLeafQueue and LeafQueue extend from? AppSchedulingInfo is private/unstable, so we can modify its constructor to take a SchedulerLeafQueue instead of a Queue. CapacityScheduler loops over all running applications rather than actively requesting apps -- Key: YARN-2176 URL: https://issues.apache.org/jira/browse/YARN-2176 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.4.0 Reporter: Jason Lowe The capacity scheduler's performance is primarily dominated by LeafQueue.assignContainers, which currently loops over all applications running in the queue. It would be more efficient to loop over just the applications that are actively asking for resources, as there could be thousands of applications running but only a few hundred currently asking for resources. -- This message was sent by Atlassian JIRA (v6.2#6252)
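One way to realize the "loop only over asking apps" idea is to maintain a separate set of applications with outstanding requests, sketched below with hypothetical names; this is not CapacityScheduler code.
{code}
import java.util.LinkedHashSet;
import java.util.Set;

final class ActiveRequestTrackerSketch<A> {
  private final Set<A> appsWithOutstandingRequests = new LinkedHashSet<A>();

  void onRequestAdded(A app) {
    appsWithOutstandingRequests.add(app);    // app just asked for resources
  }

  void onRequestsDrained(A app) {
    appsWithOutstandingRequests.remove(app); // nothing left to allocate for this app
  }

  /** assignContainers would iterate only these, not every running app. */
  Iterable<A> appsToSchedule() {
    return appsWithOutstandingRequests;
  }
}
{code}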
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038047#comment-14038047 ] Ashwin Shankar commented on YARN-2026: -- [~sandyr], makes sense. I'll post a patch soon. Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2026-v1.txt -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2144) Add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2144: - Attachment: YARN-2144.patch Sorry, I forgot to attach the patch yesterday; it is attached now. Add logs when preemption occurs --- Key: YARN-2144 URL: https://issues.apache.org/jira/browse/YARN-2144 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.5.0 Reporter: Tassapol Athiapinya Assignee: Wangda Tan Attachments: AM-page-preemption-info.png, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch There should be easy-to-read logs when preemption does occur. RM logs should have the following properties: * Logs are retrievable while an application is still running, and are flushed often. * They distinguish AM container preemption from task container preemption, with the container ID shown. * They are INFO-level logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
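For illustration, an INFO line meeting the listed properties might look like the sketch below; the exact message format in the attached patch may differ.
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

final class PreemptionLoggerSketch {
  private static final Log LOG = LogFactory.getLog(PreemptionLoggerSketch.class);

  /** One INFO line per preempted container, distinguishing AM from task containers. */
  static void logPreemption(String containerId, boolean isAmContainer) {
    LOG.info("Preempting " + (isAmContainer ? "AM" : "task") + " container " + containerId);
  }
}
{code}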
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038169#comment-14038169 ] Vinod Kumar Vavilapalli commented on YARN-2074: --- Also, appAttempt.isLastAttempt doesn't sound like the right name. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, YARN-2074.7.patch, YARN-2074.7.patch One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications. We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2179: --- Attachment: YARN-2179-trunk-v2.patch Attached is a v2 patch addressing the javac warning about use of a deprecated method. The two findbugs warnings are non-issues (a null-pointer warning that can never occur, and a local variable that is unused now but will be used by follow-on patches). Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an SCM that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2183) Cleaner service for cache manager
Chris Trezzo created YARN-2183: -- Summary: Cleaner service for cache manager Key: YARN-2183 URL: https://issues.apache.org/jira/browse/YARN-2183 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Implement the cleaner service for the cache manager along with metrics for the service. This service is responsible for cleaning up old resource references in the manager and removing stale entries from the cache. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2144) Add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038185#comment-14038185 ] Hadoop QA commented on YARN-2144: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12651559/YARN-2144.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4028//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4028//console This message is automatically generated. Add logs when preemption occurs --- Key: YARN-2144 URL: https://issues.apache.org/jira/browse/YARN-2144 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.5.0 Reporter: Tassapol Athiapinya Assignee: Wangda Tan Attachments: AM-page-preemption-info.png, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2183) Cleaner service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2183: --- Attachment: YARN-2183-trunk-v1.patch Attached is a v1 patch based on trunk+YARN-2179+YARN-2180. Cleaner service for cache manager - Key: YARN-2183 URL: https://issues.apache.org/jira/browse/YARN-2183 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2183-trunk-v1.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038208#comment-14038208 ] Hadoop QA commented on YARN-2179: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12651571/YARN-2179-trunk-v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4029//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4029//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-sharedcachemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4029//console This message is automatically generated. Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1039) Add parameter for YARN resource requests to indicate long lived
[ https://issues.apache.org/jira/browse/YARN-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038248#comment-14038248 ] Vinod Kumar Vavilapalli commented on YARN-1039: --- For now, we can start with a parameter on the ApplicationSubmissionContext; we are still figuring out long-running services before delving into enabling a smaller subset of long-lived containers within a larger application. Add parameter for YARN resource requests to indicate long lived - Key: YARN-1039 URL: https://issues.apache.org/jira/browse/YARN-1039 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 3.0.0, 2.1.1-beta Reporter: Steve Loughran Assignee: Craig Welch Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
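Purely as illustration, the submission-time flag might look like the sketch below; ApplicationSubmissionContext has no such field today, so every name here is hypothetical.
{code}
public class LongLivedSubmissionContextSketch {
  private boolean longLived = false; // proposed flag; default: not long-lived

  /** Mark the whole application as long-lived at submission time. */
  public void setLongLived(boolean longLived) {
    this.longLived = longLived;
  }

  /** Schedulers could read this to avoid placing the app on transient (e.g. spot-priced) nodes. */
  public boolean isLongLived() {
    return longLived;
  }
}
{code}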
[jira] [Updated] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2074: -- Attachment: YARN-2074.8.patch Thanks Vinod for the review! Uploaded a new patch. bq. Not related to the patch, but I think I found a bug - the following doesn't take into account whether the finished container is an AM or not. Let's file a ticket. Checked more; this may be fine, because in the case of work-preserving AM restart the container-finished event is sent to the previous failed attempt, which captures all the finished containers. bq. Why are we making this change? Comment in code as well as here as to the why. May be add a test too? Added a comment in the code; a test covering this is already included. bq. Need to think about how this will work when clusters get upgraded. Added a test case to check that the default container exit status in protobuf is indeed -1000. Fixed the other comments as well. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, YARN-2074.7.patch, YARN-2074.7.patch, YARN-2074.8.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1709) Admission Control: Reservation subsystem
[ https://issues.apache.org/jira/browse/YARN-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Venkatraman Krishnan updated YARN-1709: --- Attachment: YARN-1709.patch Admission Control: Reservation subsystem Key: YARN-1709 URL: https://issues.apache.org/jira/browse/YARN-1709 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Subramaniam Venkatraman Krishnan Attachments: YARN-1709.patch, YARN-1709.patch This JIRA is about the key data structure used to track resources over time to enable YARN-1051. The Reservation subsystem is conceptually a plan of how the scheduler will allocate resources over time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038264#comment-14038264 ] Jian He commented on YARN-2074: --- I seem to have found a bug in ResourceManager#RMContainerPreemptEventDispatcher#handle(): the ApplicationAttemptId there is the id of the attempt that created the container. With work-preserving AM restart, the ApplicationAttemptId should be the currently active attemptId, so that the preemption is charged against the current attempt instead of the previous one. I can file a jira for this. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, YARN-2074.7.patch, YARN-2074.7.patch, YARN-2074.8.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1709) Admission Control: Reservation subsystem
[ https://issues.apache.org/jira/browse/YARN-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038271#comment-14038271 ] Hadoop QA commented on YARN-1709: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12651594/YARN-1709.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4030//console This message is automatically generated. Admission Control: Reservation subsystem Key: YARN-1709 URL: https://issues.apache.org/jira/browse/YARN-1709 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Subramaniam Venkatraman Krishnan Attachments: YARN-1709.patch, YARN-1709.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038297#comment-14038297 ] Xuan Gong commented on YARN-614: [~criccomini] Hey, Chris. Do you have any updates for this ticket? Do you mind if I take it over? Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1 -- Key: YARN-614 URL: https://issues.apache.org/jira/browse/YARN-614 Project: Hadoop YARN Issue Type: Improvement Reporter: Bikas Saha Assignee: Chris Riccomini Fix For: 2.5.0 Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch, YARN-614-3.patch, YARN-614-4.patch, YARN-614-5.patch, YARN-614-6.patch Attempts can fail due to a large number of user errors, and they should not be retried unnecessarily. The only reasons YARN should retry an attempt are hardware failures or YARN errors. A failing NM, a lost NM, and NM disk errors are the hardware errors that come to mind. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038300#comment-14038300 ] Hadoop QA commented on YARN-2074: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12651593/YARN-2074.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4031//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4031//console This message is automatically generated. Preemption of AM containers shouldn't count towards AM failures --- Key: YARN-2074 URL: https://issues.apache.org/jira/browse/YARN-2074 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Jian He Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, YARN-2074.7.patch, YARN-2074.7.patch, YARN-2074.8.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2184) ResourceManager may fail due to name node in safe mode
Jeff Zhang created YARN-2184: Summary: ResourceManager may fail due to name node in safe mode Key: YARN-2184 URL: https://issues.apache.org/jira/browse/YARN-2184 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Jeff Zhang Assignee: Jeff Zhang If the history service is enabled in the ResourceManager, it will try to mkdir when the service is initialized. At that time the NameNode may still be in safe mode, which can cause the history service to fail and, in turn, the ResourceManager to fail. This is quite likely when the cluster is restarted, since the NameNode can stay in safe mode for a long time. Here are the error logs: {code} Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /Users/jzhang/Java/lib/hadoop-2.4.0/logs/yarn/system/history/ApplicationHistoryDataRoot. Name node is in safe mode. The reported blocks 85 has reached the threshold 0.9990 of total blocks 85. The number of live datanodes 1 has reached the minimum number 0. In safe mode extension. Safe mode will be turned off automatically in 19 seconds. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1195) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3564) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3540) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:754) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:558) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at org.apache.hadoop.ipc.Client.call(Client.java:1410) at org.apache.hadoop.ipc.Client.call(Client.java:1363) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy14.mkdirs(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) at com.sun.proxy.$Proxy14.mkdirs(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:500) at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2553) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2524) at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:827) at 
org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:823) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:823) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:816) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1815) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.serviceInit(FileSystemApplicationHistoryStore.java:120) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) ... 10 more 2014-06-20 11:06:25,220 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down ResourceManager at jzhangMBPr.local/192.168.100.152 {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
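One possible mitigation, sketched under assumptions: retry the mkdir during service init instead of failing outright while the NameNode is in safe mode. The retry count and sleep interval are made-up values, not from any patch.
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class SafeModeRetrySketch {

  /** Retry mkdirs until it succeeds or we give up, rather than failing service init on the first error. */
  static void mkdirsWithRetry(FileSystem fs, Path dir) throws IOException, InterruptedException {
    final int maxAttempts = 30;
    final long sleepMs = 10000L; // give the NameNode time to leave safe mode
    for (int attempt = 1; ; attempt++) {
      try {
        fs.mkdirs(dir);
        return;
      } catch (IOException e) { // e.g. SafeModeException wrapped in a RemoteException
        if (attempt >= maxAttempts) {
          throw e;
        }
        Thread.sleep(sleepMs);
      }
    }
  }
}
{code}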