[jira] [Commented] (YARN-2003) Support for Application priority : Changes in RM and Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630042#comment-14630042 ] Wangda Tan commented on YARN-2003: -- Thanks [~sunilg] for the update. A few more comments regarding the latest patch:
- I suggest deferring the consideration of queue checking. We are currently changing how queue mapping is done; ideally it should happen before submission to the scheduler (maybe before assigning the application priority), see YARN-3635.
- The assumption that the queue will exist before submission to the scheduler is not always valid: with queue mapping, the scheduler can create the queue when accepting the application. I suggest removing the check for the queue's existence. Instead, you can have a private method that gets the priority by queue name; if the queue does not exist, you can assign the default priority to the application.
- Priority comparison should use Priority.compareTo.
Support for Application priority : Changes in RM and Capacity Scheduler --- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch, 0009-YARN-2003.patch, 0011-YARN-2003.patch, 0012-YARN-2003.patch, 0013-YARN-2003.patch, 0014-YARN-2003.patch, 0015-YARN-2003.patch, 0016-YARN-2003.patch, 0017-YARN-2003.patch, 0018-YARN-2003.patch, 0019-YARN-2003.patch, 0020-YARN-2003.patch, 0021-YARN-2003.patch, 0022-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from the Submission Context and store it. Later this can be used by the Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
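The lookup-with-default-fallback Wangda suggests, plus Priority-based comparison, might look roughly like this. This is a minimal sketch: the Priority class below is a stand-in (the real org.apache.hadoop.yarn.api.records.Priority differs), and the map, method name, and default value are hypothetical, not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.hadoop.yarn.api.records.Priority (an assumption;
// the real class differs). Here a higher integer means higher priority.
class Priority implements Comparable<Priority> {
    final int value;
    Priority(int value) { this.value = value; }
    @Override
    public int compareTo(Priority other) { return Integer.compare(value, other.value); }
}

public class PriorityLookupSketch {
    static final Priority DEFAULT_PRIORITY = new Priority(0);

    // Hypothetical per-queue priority table; in the RM this would come from
    // the CapacityScheduler configuration.
    static final Map<String, Priority> queuePriorities = new HashMap<>();

    // Suggested shape: look up the priority by queue name and fall back to
    // the default when the queue does not (yet) exist.
    static Priority getPriorityByQueueName(String queueName) {
        return queuePriorities.getOrDefault(queueName, DEFAULT_PRIORITY);
    }
}
```

Comparisons then go through compareTo rather than touching the underlying integers, so callers stay correct if the priority ordering convention changes.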
[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630090#comment-14630090 ] Bibin A Chundatt commented on YARN-3932: Hi [~leftnoteasy], I think we should iterate over {{liveContainers}} and get the sum of the resources used. Any thoughts? SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel --- Key: YARN-3932 URL: https://issues.apache.org/jira/browse/YARN-3932 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: ApplicationReport.jpg The Application Resource Report is shown wrong when a node label is used.
1. Submit an application with a NodeLabel
2. Check the RM UI for resources used: Allocated CPU VCores and Allocated Memory MB are always {{zero}}
{code}
public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
  AggregateAppResourceUsage runningResourceUsage =
      getRunningAggregateAppResourceUsage();
  Resource usedResourceClone =
      Resources.clone(attemptResourceUsage.getUsed());
  Resource reservedResourceClone =
      Resources.clone(attemptResourceUsage.getReserved());
  return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
      reservedContainers.size(), usedResourceClone, reservedResourceClone,
      Resources.add(usedResourceClone, reservedResourceClone),
      runningResourceUsage.getMemorySeconds(),
      runningResourceUsage.getVcoreSeconds());
}
{code}
This should be {{attemptResourceUsage.getUsed(label)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
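Bibin's suggestion (summing usage over {{liveContainers}} instead of relying on the unlabeled total) can be sketched with stand-in types; the Resource class and method below are illustrative, not the actual YARN classes.

```java
import java.util.List;

// Minimal stand-ins (assumptions, not the real YARN classes) to illustrate
// summing usage over the live containers as suggested above.
class Resource {
    final long memory;   // MB
    final int vcores;
    Resource(long memory, int vcores) { this.memory = memory; this.vcores = vcores; }
    static Resource add(Resource a, Resource b) {
        return new Resource(a.memory + b.memory, a.vcores + b.vcores);
    }
}

public class UsageReportSketch {
    // Sum the allocated resources of every live container, independent of
    // which node-label partition each container was placed on.
    static Resource usedFromLiveContainers(List<Resource> liveContainerResources) {
        Resource total = new Resource(0, 0);
        for (Resource r : liveContainerResources) {
            total = Resource.add(total, r);
        }
        return total;
    }
}
```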
[jira] [Commented] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
[ https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630472#comment-14630472 ] Hadoop QA commented on YARN-3905: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 17m 14s | Pre-patch trunk has 6 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 8m 29s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 23s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 37s | The applied patch generated 1 new checkstyle issues (total was 39, now 40). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 23s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 9s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-server-common. 
| | | | 40m 39s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745708/YARN-3905.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 0bda84f | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8562/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8562/artifact/patchprocess/diffcheckstylehadoop-yarn-server-common.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8562/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8562/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8562/console | This message was automatically generated. Application History Server UI NPEs when accessing apps run after RM restart --- Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.0, 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3905.001.patch From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. 
{noformat}
The stack trace is as follows:
{code}
2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001
2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01.
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206)
        at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199)
        at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205)
        at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272)
        at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630593#comment-14630593 ] Sangjin Lee commented on YARN-3906: --- The bulk of the work is done, but I'd like to wait until YARN-3908 is committed and then update the changes. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Per discussions on YARN-3815, we need to split the application entities from the main entity table into their own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629300#comment-14629300 ] Hadoop QA commented on YARN-3535: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 14s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | javac | 7m 44s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 41s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 46s | The applied patch generated 5 new checkstyle issues (total was 338, now 343). | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 22s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 51m 30s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 89m 45s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745572/0005-YARN-3535.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 3ec0a04 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8554/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8554/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8554/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8554/console | This message was automatically generated. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3852) Add docker container support to container-executor
[ https://issues.apache.org/jira/browse/YARN-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629276#comment-14629276 ] Varun Vasudev commented on YARN-3852: - Thanks for the patch [~ashahab]. The patch isn't working for me. There are two issues -
# No default value for docker.binary. I think we should assume this to be docker and allow it to be overridden.
# The docker launch fails due to
{code}
if (change_effective_user(user_uid, user_gid) != 0)
{code}
in launch_docker_container_as_user. For docker run to work, the effective user needs to be root (something like change_effective_user(0, user_gid) is probably the right way).
Some other issues -
# {code}
-static const char* DEFAULT_BANNED_USERS[] = {"yarn", "mapred", "hdfs", "bin", 0};
+static const char* DEFAULT_BANNED_USERS[] = {"mapred", "hdfs", "bin", 0};
{code}
Why are you removing the yarn user from the banned users? I'm guessing this is due to a branch-2/trunk issue. The yarn user is banned in trunk but not in branch-2.
# A couple of formatting fixes
{code}
+ fprintf(LOGFILE, "done opening pid\n");
+fflush(LOGFILE);
{code}
and
{code}
+fprintf(LOGFILE, "done writing pid to tmp\n");
+ fflush(LOGFILE);
{code}
# Can we change the error message below to a more descriptive one?
{code}
+ fprintf(ERRORFILE, "Error reading\n");
+ fflush(ERRORFILE);
{code}
# In parse_docker_command_file
{code}
+ int read;
{code}
should we use ssize_t instead of int?
# In parse_docker_command_file, we have some exit(1) calls - can we change these to use the error codes in container-executor.h?
# In run_docker
{code}
+ free(docker_binary);
+ free(args);
+ free(docker_command_with_binary);
+ free(docker_command);
+ exit_code = DOCKER_RUN_FAILED;
+ }
+ exit_code = 0;
+ return exit_code;
{code}
The exit code from the function will always be 0.
# Formatting
{code}
+int create_script_paths(const char *work_dir,
+ const char *script_name, const char *cred_file,
+ char** script_file_dest, char** cred_file_dest,
+ int* container_file_source, int* cred_file_source) {
{code}
# In create_script_paths, we use a bunch of gotos, but the goto target doesn't have any special logic or handling. Can we avoid using the gotos?
# {code}
+//kill me now.
{code}
No need for the commentary :)
# In main.c
{code}
+char * resources = argv[optind++];// key,value pair describing resources
+char * resources_key = malloc(strlen(resources));
+char * resources_value = malloc(strlen(resources));
{code}
Can we move the declarations of resources, resources_key and resources_value out of the case block (since the same variables are used in two case blocks)?
Add docker container support to container-executor --- Key: YARN-3852 URL: https://issues.apache.org/jira/browse/YARN-3852 Project: Hadoop YARN Issue Type: Sub-task Components: yarn Reporter: Sidharta Seethana Assignee: Abin Shahab Attachments: YARN-3852.patch For security reasons, we need to ensure that access to the docker daemon and the ability to run docker containers is restricted to privileged users (i.e. users running applications should not have direct access to docker). In order to ensure the node manager can run docker commands, we need to add docker support to the container-executor binary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
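The exit-code issue Varun flags in run_docker (the failure branch sets exit_code, then a later assignment unconditionally resets it to 0) is easy to see in a minimal sketch. This illustration is in Java rather than the patch's C, and the error constant's value is made up:

```java
public class ExitCodeSketch {
    static final int DOCKER_RUN_FAILED = 29; // hypothetical error constant

    // Shape of the reviewed code: the failure branch sets the error code,
    // but the following assignment unconditionally clobbers it with 0.
    static int buggy(boolean runFailed) {
        int exitCode = 0;
        if (runFailed) {
            exitCode = DOCKER_RUN_FAILED;
        }
        exitCode = 0; // always executed, so the error code is lost
        return exitCode;
    }

    // One possible fix: return early from the failure branch.
    static int fixed(boolean runFailed) {
        if (runFailed) {
            return DOCKER_RUN_FAILED;
        }
        return 0;
    }
}
```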
[jira] [Updated] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3174: - Summary: Consolidate the NodeManager and NodeManagerRestart documentation into one (was: Consolidate the NodeManager documentation into one) Consolidate the NodeManager and NodeManagerRestart documentation into one - Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629268#comment-14629268 ] Sunil G commented on YARN-2005: --- Thanks [~adhoot], and sorry for the delayed response. bq. The nodes are removed from blacklist once the launch of the AM happens to limit this issue. Yes, I feel this will be fine. Blacklisting support for scheduling AMs --- Key: YARN-2005 URL: https://issues.apache.org/jira/browse/YARN-2005 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Anubhav Dhoot Attachments: YARN-2005.001.patch, YARN-2005.002.patch, YARN-2005.003.patch, YARN-2005.004.patch It would be nice if the RM supported blacklisting a node for an AM launch after the same node fails a configurable number of AM attempts. This would be similar to the blacklisting support for scheduling task attempts in the MapReduce AM, but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3174: - Affects Version/s: 2.7.1 Consolidate the NodeManager and NodeManagerRestart documentation into one - Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.1 Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Fix For: 2.8.0 Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3174: - Component/s: documentation Consolidate the NodeManager and NodeManagerRestart documentation into one - Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.1 Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Fix For: 2.8.0 Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629308#comment-14629308 ] Hudson commented on YARN-3174: -- FAILURE: Integrated in Hadoop-trunk-Commit #8171 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8171/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md Consolidate the NodeManager and NodeManagerRestart documentation into one - Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.1 Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Fix For: 2.8.0 Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629253#comment-14629253 ] Arun Suresh commented on YARN-3535: --- The patch looks good! Thanks for working on this, [~peng.zhang] and [~rohithsharma]. +1, pending a successful Jenkins run with the latest patch. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3926) Extend the YARN resource model for easier resource-type management and profiles
[ https://issues.apache.org/jira/browse/YARN-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629288#comment-14629288 ] Karthik Kambatla commented on YARN-3926: Thanks a bunch for putting this proposal together, Varun. We are in dire need of improvements to our resource model, and the proposal goes a long way in addressing some of these issues. Huge +1 to this effort. Comments on the proposal itself:
# There is significant overlap between resource-types.xml and node-resources.xml. It would be nice to consolidate at least these parts.
# Can we avoid the mismatch between the resource types on the RM and NM altogether?
# Can we avoid different restart paths for adding and removing resources?
# I really like the concise configs proposed at the end of the document.
What do you think of the following modifications to the proposal to address the above wishes? I have clearly not thought about this as much before making these suggestions, so please feel free to shoot them down.
# How about calling them yarn.resource-types, yarn.resource-types.memory.*, yarn.resource-types.cpu.*? Further, memory/cpu-specific configs could be made simpler per the suggestions later in the document.
# yarn.scheduler.resource-types is a subset of yarn.resource-types, and captures the resource-types the scheduler supports. This could be in yarn-site on the RM.
# yarn.nodemanager.resource-types.monitored and yarn.nodemanager.resource-types.enforced are also subsets of yarn.resource-types and could define the resources the NM monitors and enforces respectively. These could be in yarn-site on the NM. I understand isolation is out of scope here, but it would be nice to have configs that lend themselves to future work.
# yarn.nodemanager.[resources|resource-types].available could be a map where each key should be an entry in yarn.resource-types. You mention capturing node-labels etc. similarly. Could you elaborate on your thoughts, at least informally? It would be super nice to have a path in mind even if we were to do it as follow-up work.
Extend the YARN resource model for easier resource-type management and profiles --- Key: YARN-3926 URL: https://issues.apache.org/jira/browse/YARN-3926 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Proposal for modifying resource model and profiles.pdf Currently, there are efforts to add support for various resource-types such as disk (YARN-2139), network (YARN-2140), and HDFS bandwidth (YARN-2681). These efforts all aim to add support for a new resource type and are fairly involved. In addition, once support is added, it becomes harder for users to specify the resources they need: all existing jobs have to be modified, or have to use the minimum allocation. This ticket is a proposal to extend the YARN resource model to a more flexible model which makes it easier to support additional resource-types. It also considers the related aspect of “resource profiles” which allow users to easily specify the various resources they need for any given container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
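The subset relationships Karthik describes (e.g. yarn.scheduler.resource-types being a subset of yarn.resource-types) imply a simple validation step when loading configuration. A rough sketch, with hypothetical parsing and method names not tied to any actual YARN code:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class ResourceTypesConfigSketch {
    // Parse a comma-separated resource-type list, e.g. "memory,cpu,disk".
    static Set<String> parseTypes(String csv) {
        Set<String> types = new LinkedHashSet<>();
        for (String t : csv.split(",")) {
            if (!t.trim().isEmpty()) {
                types.add(t.trim());
            }
        }
        return types;
    }

    // The proposal above says e.g. yarn.scheduler.resource-types should be a
    // subset of yarn.resource-types; this checks that invariant.
    static boolean isValidSubset(String clusterTypesCsv, String subsetCsv) {
        return parseTypes(clusterTypesCsv).containsAll(parseTypes(subsetCsv));
    }
}
```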
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629302#comment-14629302 ] Tsuyoshi Ozawa commented on YARN-3805: -- [~iwasakims] could you rebase it? Update the documentation of Disk Checker based on YARN-90 - Key: YARN-3805 URL: https://issues.apache.org/jira/browse/YARN-3805 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3805.001.patch NodeManager is able to recover status of the disk once broken and fixed without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629249#comment-14629249 ] Arun Suresh commented on YARN-3535: --- I meant for the FairScheduler... but looks like your new patch has it... thanks ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629250#comment-14629250 ] Akira AJISAKA commented on YARN-2578: - Thanks [~iwasakims] for creating the patch. One comment and one question from me. bq. The default value is 0 in order to keep current behaviour. 1. We would like to fix this bug, so defaulting to 1 min is good for me. 2. Would you tell me why {{Client.getRpcTimeout}} returns 0 if {{ipc.client.ping}} is false? NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Attachments: YARN-2578.002.patch, YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged, or when the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected, as expected. The NM should then re-register with the new active RM. This re-registration takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment:
- create a cluster with 3 nodes
  node 1: ZK, NN, JN, ZKFC, DN, RM, NM
  node 2: ZK, NN, JN, ZKFC, DN, RM, NM
  node 3: ZK, JN, DN, NM
- start all services and make sure they are in good health
- kill the network connection of the active RM using one of the network kills from above
- observe the NN and RM failover
- the DNs fail over to the new active NN
- the NM does not recover for a long time
- the logs show a long delay and traces show no change at all
The stack traces of the NM all show the same set of threads.
The main thread which should be used in the re-register is the Node Status Updater. This thread is stuck in:
{code}
"Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.hadoop.ipc.Client.call(Client.java:1395)
        - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
        at org.apache.hadoop.ipc.Client.call(Client.java:1362)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
{code}
The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out, and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
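The fix direction described above (bounding the blocking heartbeat call with the configured RPC timeout instead of waiting indefinitely) can be illustrated generically. This is not the Hadoop IPC API; it is a plain java.util.concurrent sketch of the same idea:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TimeoutCallSketch {
    // Generic illustration (not the Hadoop IPC API): bound a blocking call
    // with a timeout instead of waiting indefinitely, which is what the NM
    // heartbeat proxy should do with the configured RPC timeout. A
    // TimeoutException here would let the caller retry against the new
    // active RM instead of hanging for 15+ minutes.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            return exec.submit(call).get(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            exec.shutdownNow();
        }
    }
}
```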
[jira] [Commented] (YARN-3174) Consolidate the NodeManager documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629263#comment-14629263 ] Tsuyoshi Ozawa commented on YARN-3174: -- +1 Consolidate the NodeManager documentation into one -- Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629283#comment-14629283 ] Masatake Iwasaki commented on YARN-3174: Thanks, [~ozawa]! Consolidate the NodeManager and NodeManagerRestart documentation into one - Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.1 Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Fix For: 2.8.0 Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629296#comment-14629296 ] zhihai xu commented on YARN-3535: - Sorry for coming late into this issue. The latest patch looks good to me except one nit: can we make {{ContainerRescheduledTransition}} a child class of {{FinishedTransition}}, similar to {{KillTransition}}? Then we can call {{super.transition(container, event);}} instead of {{new FinishedTransition().transition(container, event);}}. I think this will make the code more readable and match the other transition class implementations. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
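The subclassing shape zhihai suggests can be sketched with stand-in types. The container/event classes below are assumptions for illustration, not the real rmcontainer classes:

```java
// Stand-in types (assumptions); the real classes live in the
// org.apache.hadoop.yarn.server.resourcemanager.rmcontainer package.
class RMContainerEvent {}

class RMContainerStub {
    final StringBuilder trace = new StringBuilder();
}

class FinishedTransition {
    public void transition(RMContainerStub container, RMContainerEvent event) {
        container.trace.append("finished;"); // common finished handling
    }
}

// zhihai's suggested shape: subclass FinishedTransition and call
// super.transition(...) instead of instantiating FinishedTransition inline.
public class ContainerRescheduledTransition extends FinishedTransition {
    @Override
    public void transition(RMContainerStub container, RMContainerEvent event) {
        container.trace.append("restore-request;"); // give the ResourceRequest back
        super.transition(container, event);          // then run the shared logic
    }
}
```

The subclass runs its extra restore step first and then reuses the parent's cleanup, matching how the other transition classes are structured.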
[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup
[ https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629410#comment-14629410 ] wangfeng commented on YARN-2809: Patching this onto hadoop-2.6.0 failed; console output:
patch -u -p0 < YARN-2809-v3.patch
patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
Hunk #1 succeeded at 984 (offset -16 lines).
patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
Hunk #1 FAILED at 22. Hunk #2 succeeded at 33 (offset -4 lines). Hunk #3 succeeded at 71 (offset -5 lines). Hunk #4 succeeded at 105 (offset -5 lines). Hunk #5 succeeded at 266 (offset -10 lines). Hunk #6 succeeded at 338 (offset -10 lines).
1 out of 6 hunks FAILED -- saving rejects to file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java.rej
patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
Implement workaround for linux kernel panic when removing cgroup Key: YARN-2809 URL: https://issues.apache.org/jira/browse/YARN-2809 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Environment: RHEL 6.4 Reporter: Nathan Roberts Assignee: Nathan Roberts Fix For: 2.7.0 Attachments: YARN-2809-v2.patch, YARN-2809-v3.patch, YARN-2809.patch Some older versions of linux have a bug that can cause a kernel panic when the LCE attempts to remove a cgroup. It is a race condition so it's a bit rare but on a few thousand node cluster it can result in a couple of panics per day. 
This is the commit that likely (haven't verified) fixes the problem in linux: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-2.6.39.y&id=068c5cc5ac7414a8e9eb7856b4bf3cc4d4744267 Details will be added in comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3929) Uncleaning option for local app log files with log-aggregation feature
Dongwook Kwon created YARN-3929: --- Summary: Uncleaning option for local app log files with log-aggregation feature Key: YARN-3929 URL: https://issues.apache.org/jira/browse/YARN-3929 Project: Hadoop YARN Issue Type: New Feature Components: log-aggregation Affects Versions: 2.6.0, 2.4.0 Reporter: Dongwook Kwon Priority: Minor Although it makes sense to delete local app log files once AppLogAggregator has copied all files to the remote location (HDFS), I have some use cases that need the local app log files left in place after they are copied to HDFS, mostly for our own backup purposes. I would like to use the log-aggregation feature of YARN and back up the app log files too. Without this option, the files have to be copied from HDFS back to local storage again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629429#comment-14629429 ] Tsuyoshi Ozawa commented on YARN-3805: -- Checking this in. Update the documentation of Disk Checker based on YARN-90 - Key: YARN-3805 URL: https://issues.apache.org/jira/browse/YARN-3805 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3805.001.patch, YARN-3805.002.patch NodeManager is able to recover status of the disk once broken and fixed without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3929) Uncleaning option for local app log files with log-aggregation feature
[ https://issues.apache.org/jira/browse/YARN-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongwook Kwon updated YARN-3929: Attachment: YARN-3929.01.patch Could you review this patch? Thanks. Uncleaning option for local app log files with log-aggregation feature -- Key: YARN-3929 URL: https://issues.apache.org/jira/browse/YARN-3929 Project: Hadoop YARN Issue Type: New Feature Components: log-aggregation Affects Versions: 2.4.0, 2.6.0 Reporter: Dongwook Kwon Priority: Minor Attachments: YARN-3929.01.patch Although it makes sense to delete local app log files once AppLogAggregator has copied all files to the remote location (HDFS), I have some use cases that need the local app log files left in place after they are copied to HDFS, mostly for our own backup purposes. I would like to use the log-aggregation feature of YARN and back up the app log files too. Without this option, the files have to be copied from HDFS back to local storage again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629491#comment-14629491 ] Naganarasimha G R commented on YARN-3931: - Hi [~kyungwan nam], thanks for raising the issue. I have assigned this JIRA to myself, but if you are interested in looking into it further and solving it, please reassign. default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: Naganarasimha G R * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the rest api without "app-node-label-expression", "am-container-node-label-expression" * RM doesn’t allocate containers to the hosts associated with large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629489#comment-14629489 ] kyungwan nam commented on YARN-3931: node-label-expression is initialized to an empty string
{code}
...
public ApplicationSubmissionContextInfo() {
  applicationId = "";
  applicationName = "";
  containerInfo = new ContainerLaunchContextInfo();
  resource = new ResourceInfo();
  priority = Priority.UNDEFINED.getPriority();
  isUnmanagedAM = false;
  cancelTokensWhenComplete = true;
  keepContainers = false;
  applicationType = "";
  tags = new HashSet<String>();
  appNodeLabelExpression = "";
  amContainerNodeLabelExpression = "";
}
{code}
but the check is only for whether node-label-expression is null:
{code}
// check labels in the resource request.
String labelExp = resReq.getNodeLabelExpression();
// if queue has default label expression, and RR doesn't have, use the
// default label expression of queue
if (labelExp == null && queueInfo != null) {
  labelExp = queueInfo.getDefaultNodeLabelExpression();
  resReq.setNodeLabelExpression(labelExp);
}
{code}
default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: Naganarasimha G R * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the rest api without "app-node-label-expression", "am-container-node-label-expression" * RM doesn’t allocate containers to the hosts associated with large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
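The null-vs-empty mismatch can be reproduced in isolation. The {{resolveLabel}} helper below is hypothetical scaffolding (not a YARN API), and treating the empty string the same as null is one possible fix, not necessarily the committed one:

```java
public class LabelFallbackDemo {
    // Hypothetical helper: pick the effective node-label expression for a
    // request, falling back to the queue default. A REST-submitted request
    // carries "" (not null), so a null-only check silently skips the fallback.
    static String resolveLabel(String requestLabel, String queueDefault) {
        if (requestLabel == null || requestLabel.isEmpty()) {
            return queueDefault;  // apply the queue's default-node-label-expression
        }
        return requestLabel;
    }

    public static void main(String[] args) {
        System.out.println(resolveLabel(null, "large_disk"));  // large_disk
        System.out.println(resolveLabel("", "large_disk"));    // large_disk (a null-only check would keep "")
        System.out.println(resolveLabel("gpu", "large_disk")); // gpu
    }
}
```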
[jira] [Assigned] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-3931: --- Assignee: Naganarasimha G R default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: Naganarasimha G R * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the rest api without "app-node-label-expression", "am-container-node-label-expression" * RM doesn’t allocate containers to the hosts associated with large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629555#comment-14629555 ] Ajith S commented on YARN-3885: --- The test failure is not because of the patch. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch when the preemption policy is {{ProportionalCapacityPreemptionPolicy}}, the piece of code in {{cloneQueues}} that calculates {{untouchable}} doesn't consider all the children; it considers only the immediate children -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
kyungwan nam created YARN-3931: -- Summary: default-node-label-expression doesn’t apply when an application is submitted by RM rest api Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the rest api without "app-node-label-expression", "am-container-node-label-expression" * RM doesn’t allocate containers to the hosts associated with large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3928) launch application master on specific host
[ https://issues.apache.org/jira/browse/YARN-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629412#comment-14629412 ] Varun Saxena commented on YARN-3928: Duplicate of MAPREDUCE-6402 launch application master on specific host -- Key: YARN-3928 URL: https://issues.apache.org/jira/browse/YARN-3928 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 2.6.0 Environment: Ubuntu 12.04 Reporter: Wenrui Hi, is there a way to launch the application master on a specific host? If we cannot do this with a managed AM launcher, is it possible to achieve it with an unmanaged AM launcher? I find it quite necessary to place the application master on a specific host in some scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629411#comment-14629411 ] Sunil G commented on YARN-3535: --- Thank you [~peng.zhang] and [~asuresh] for correcting. bq.that notification will happen only AFTER the recoverResourceRequest has completed.. since it will be handled by the same dispatcher Yes. I missed this. Ordering will be corrected here. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629464#comment-14629464 ] Hudson commented on YARN-3805: -- FAILURE: Integrated in Hadoop-trunk-Commit #8173 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8173/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md Update the documentation of Disk Checker based on YARN-90 - Key: YARN-3805 URL: https://issues.apache.org/jira/browse/YARN-3805 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.8.0 Attachments: YARN-3805.001.patch, YARN-3805.002.patch NodeManager is able to recover status of the disk once broken and fixed without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629465#comment-14629465 ] Hudson commented on YARN-90: FAILURE: Integrated in Hadoop-trunk-Commit #8173 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8173/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md NodeManager should identify failed disks becoming good again Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs restart. This JIRA is to improve NodeManager to reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
Dian Fu created YARN-3930: - Summary: FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Reporter: Dian Fu Assignee: Dian Fu When I test the node label feature in my local environment, I encountered the following exception: {code} at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) {code} The reason is that HDFS throws an exception during {{ensureAppendEditlogFile}} for some reason, which leaves the edit log output stream unclosed. As a result, the next time we call {{ensureAppendEditlogFile}}, lease recovery fails because we ourselves are still the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
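The general fix shape — close the stream on the failure path so the lease is released — can be sketched independently of the actual patch. {{StreamFactory}} and {{openForAppend}} below are hypothetical scaffolding, not YARN or HDFS APIs:

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseOnFailureDemo {
    // Hypothetical scaffolding standing in for the HDFS append path.
    interface StreamFactory {
        Closeable open() throws IOException;
        void prepare(Closeable out) throws IOException;  // may throw, like the failing append
    }

    // If the setup step throws, close the stream before propagating, so a
    // later reopen doesn't have to recover a lease we ourselves still hold.
    static Closeable openForAppend(StreamFactory factory) throws IOException {
        Closeable out = factory.open();
        try {
            factory.prepare(out);
            return out;
        } catch (IOException e) {
            out.close();  // don't leak the stream (or the lease) on error
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        final boolean[] closed = {false};
        StreamFactory failing = new StreamFactory() {
            public Closeable open() { return () -> closed[0] = true; }
            public void prepare(Closeable out) throws IOException {
                throw new IOException("simulated append failure");
            }
        };
        try {
            openForAppend(failing);
        } catch (IOException expected) {
            System.out.println("stream closed after failure: " + closed[0]); // true
        }
    }
}
```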
[jira] [Updated] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu updated YARN-3930: -- Attachment: YARN-3930.001.patch A simple patch attached. FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown - Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Dian Fu Assignee: Dian Fu Attachments: YARN-3930.001.patch When I test the node label feature in my local environment, I encountered the following exception: {code} at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) {code} The reason is that HDFS throws an exception during {{ensureAppendEditlogFile}} for some reason, which leaves the edit log output stream unclosed. As a result, the next time we call {{ensureAppendEditlogFile}}, lease recovery fails because we ourselves are still the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S updated YARN-3885: -- Attachment: YARN-3885.08.patch ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch when the preemption policy is {{ProportionalCapacityPreemptionPolicy}}, the piece of code in {{cloneQueues}} that calculates {{untouchable}} doesn't consider all the children; it considers only the immediate children -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629452#comment-14629452 ] zhihai xu commented on YARN-3535: - Also, because {{containerCompleted}} and {{pullNewlyAllocatedContainersAndNMTokens}} are synchronized, it is guaranteed that if the AM gets the container, {{ContainerRescheduledEvent}} ({{recoverResourceRequestForContainer}}) won't be called. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629418#comment-14629418 ] Tsuyoshi Ozawa commented on YARN-3805: -- +1, pending for Jenkins. Update the documentation of Disk Checker based on YARN-90 - Key: YARN-3805 URL: https://issues.apache.org/jira/browse/YARN-3805 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3805.001.patch, YARN-3805.002.patch NodeManager is able to recover status of the disk once broken and fixed without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629446#comment-14629446 ] Hadoop QA commented on YARN-3885: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 12s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 46s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 37s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 50s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 18s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 23s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 61m 19s | Tests failed in hadoop-yarn-server-resourcemanager. 
| | | | 99m 23s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745584/YARN-3885.08.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 90bda9c | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8555/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8555/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8555/console | This message was automatically generated. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch when the preemption policy is {{ProportionalCapacityPreemptionPolicy}}, the piece of code in {{cloneQueues}} that calculates {{untouchable}} doesn't consider all the children; it considers only the immediate children -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629348#comment-14629348 ] Sunil G commented on YARN-3535: --- Hi [~rohithsharma] and [~peng.zhang], after seeing this patch, I feel there may be a synchronization problem. Please correct me if I am wrong. In the ContainerRescheduledTransition code, it's used like
{code}
+ container.eventHandler.handle(new ContainerRescheduledEvent(container));
+ new FinishedTransition().transition(container, event);
{code}
Hence ContainerRescheduledEvent is fired to the Scheduler dispatcher, which will process {{recoverResourceRequestForContainer}} in a separate thread. Meanwhile, in RMAppImpl, {{FinishedTransition().transition}} will be invoked and the closure of this container will be processed. If the Scheduler dispatcher is slower in processing due to pending event queue length, there are chances that recoverResourceRequest may not be correct. I feel we can introduce a new Event in {{RMContainerImpl}} from ALLOCATED to WAIT_FOR_REQUEST_RECOVERY, and the scheduler can fire back an event to {{RMContainerImpl}} to indicate that recovery of the resource request is complete. This can move the state forward to KILLED in {{RMContainerImpl}}. Please share your thoughts. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629394#comment-14629394 ] Arun Suresh commented on YARN-3535: --- bq. I think recoverResourceRequest will not be affected by whether container finished event is processed faster. Cause recoverResourceRequest only process the ResourceRequest in container and not care its status. I agree with [~peng.zhang] here. IIUC, The {{recoverResourceRequest}} only affects state of the Scheduler and the SchedulerApp. In any case, the fact that the container is killed (the outcome of the {{RMAppAttemptContainerFinishedEvent}} fired by {{FinishedTransition#transition}}) will be notified to the Scheduler.. and that notification will happen only AFTER the recoverResourceRequest has completed.. since it will be handled by the same dispatcher. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
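The ordering argument rests on {{AsyncDispatcher}} draining a single FIFO queue on one thread. A stripped-down model of just that property (not the real {{AsyncDispatcher}}; names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

public class DispatcherOrderDemo {
    // Single-threaded FIFO dispatcher: events are handled strictly in the
    // order they were enqueued, so a "recover" enqueued before a "finished"
    // is always processed first.
    public static List<String> dispatchInOrder(List<String> events) throws InterruptedException {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(events);
        List<String> handled = new ArrayList<>();
        Thread dispatcher = new Thread(() -> {
            String e;
            while ((e = queue.poll()) != null) {
                handled.add(e);  // the single handler thread preserves order
            }
        });
        dispatcher.start();
        dispatcher.join();
        return handled;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(dispatchInOrder(List.of("recoverResourceRequest", "containerFinished")));
        // [recoverResourceRequest, containerFinished]
    }
}
```

A multi-threaded or multi-queue dispatcher would not give this guarantee, which is why the "same dispatcher" observation settles the concern.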
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629369#comment-14629369 ] Peng Zhang commented on YARN-3535: -- bq. there are chances that recoverResourceRequest may not be correct. Sorry, I didn't catch this; maybe I missed something. I think {{recoverResourceRequest}} will not be affected by whether the container-finished event is processed faster, because {{recoverResourceRequest}} only processes the ResourceRequest in the container and doesn't care about its status. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-3805: --- Attachment: YARN-3805.002.patch I rebased the patch. Thanks for pinging me, [~ozawa]. Update the documentation of Disk Checker based on YARN-90 - Key: YARN-3805 URL: https://issues.apache.org/jira/browse/YARN-3805 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3805.001.patch, YARN-3805.002.patch NodeManager is able to recover status of the disk once broken and fixed without restarting by YARN-90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629423#comment-14629423 ] Hadoop QA commented on YARN-3805:
| (/) *{color:green}+1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 3m 42s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | site | 2m 59s | Site still builds. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| | | 7m 5s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745590/YARN-3805.002.patch |
| Optional Tests | site |
| git revision | trunk / 90bda9c |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8556/console |
This message was automatically generated.
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629615#comment-14629615 ] Hudson commented on YARN-3805: -- SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #258 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/258/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629646#comment-14629646 ] Hadoop QA commented on YARN-3930:
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 8s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 39s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 34s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 0m 52s | The applied patch generated 2 new checkstyle issues (total was 14, now 15). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 19s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 34s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 1m 56s | Tests passed in hadoop-yarn-common. |
| | | | 40m 1s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745596/YARN-3930.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 1ba2986 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8557/artifact/patchprocess/diffcheckstylehadoop-yarn-common.txt |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8557/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8557/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8557/console |
This message was automatically generated.
FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown - Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Dian Fu Assignee: Dian Fu Attachments: YARN-3930.001.patch When I test the node label feature in my local environment, I encountered the following exception:
{code}
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196)
at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168)
at
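For reference, the shape of fix the issue title calls for is a close in a finally block (or try-with-resources), so that a failed append cannot leave the edit-log file open and trigger the lease-recovery error above. This is only a hedged sketch — the method name, return convention, and stream parameter are illustrative, not the actual FileSystemNodeLabelsStore API:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class EditLogWriter {
    // Writes one record and guarantees the stream is closed, even when
    // write() throws. Returns whether the write itself succeeded.
    static boolean writeRecord(OutputStream editLog, byte[] record) {
        try {
            editLog.write(record);
            return true;
        } catch (IOException e) {
            return false;              // caller decides how to surface this
        } finally {
            try {
                editLog.close();       // runs even when write() threw
            } catch (IOException ignored) {
                // close failures are ignored here for brevity
            }
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // prints true
        System.out.println(writeRecord(out, new byte[]{1, 2, 3}));
    }
}
```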
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629722#comment-14629722 ] Hudson commented on YARN-3174: -- ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #246 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/246/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-yarn-project/CHANGES.txt * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md Consolidate the NodeManager and NodeManagerRestart documentation into one - Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.1 Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Fix For: 2.8.0 Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629728#comment-14629728 ] Hudson commented on YARN-90: ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #246 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/246/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt NodeManager should identify failed disks becoming good again Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs restart. This JIRA is to improve NodeManager to reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629716#comment-14629716 ] Hudson commented on YARN-3805: -- ABORTED: Integrated in Hadoop-Mapreduce-trunk #2204 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2204/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629715#comment-14629715 ] Hudson commented on YARN-3805: -- ABORTED: Integrated in Hadoop-Hdfs-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2185/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629719#comment-14629719 ] Hudson commented on YARN-3174: -- ABORTED: Integrated in Hadoop-Mapreduce-trunk #2204 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2204/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629724#comment-14629724 ] Hudson commented on YARN-90: ABORTED: Integrated in Hadoop-Mapreduce-trunk #2204 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2204/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629618#comment-14629618 ] Hudson commented on YARN-90: SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #258 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/258/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629616#comment-14629616 ] Hudson commented on YARN-3174: -- SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #258 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/258/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-project/src/site/site.xml
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629717#comment-14629717 ] Hudson commented on YARN-3174: -- ABORTED: Integrated in Hadoop-Hdfs-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2185/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md
[jira] [Commented] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629660#comment-14629660 ] Varun Saxena commented on YARN-3877: [~chris.douglas], thanks for the review. Yes, you are correct that this config is not required for the test; will remove it, and will move the relevant test code into a separate test. YarnClientImpl.submitApplication swallows exceptions Key: YARN-3877 URL: https://issues.apache.org/jira/browse/YARN-3877 Project: Hadoop YARN Issue Type: Improvement Components: client Affects Versions: 2.7.2 Reporter: Steve Loughran Assignee: Varun Saxena Priority: Minor Attachments: YARN-3877.01.patch When {{YarnClientImpl.submitApplication}} spins waiting for the application to be accepted, any interruption during its sleep() calls is logged and swallowed. This makes it hard to interrupt the thread during shutdown. Really it should throw some form of exception and let the caller deal with it.
[jira] [Updated] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3877: --- Attachment: YARN-3877.02.patch
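The change YARN-3877 asks for can be sketched like this — names such as {{waitForAcceptance}} and {{pollIntervalMs}} are illustrative, not the actual YarnClientImpl code. Rather than logging and swallowing the InterruptedException raised during the sleep, restore the interrupt flag and rethrow so the caller can shut down promptly:

```java
import java.io.IOException;

public class SubmitWait {
    // Poll loop sketch: on interrupt, re-set the thread's interrupt status
    // and surface the failure instead of swallowing it.
    static void waitForAcceptance(long pollIntervalMs, int maxPolls)
            throws IOException {
        for (int i = 0; i < maxPolls; i++) {
            // ... check the application report here ...
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                throw new IOException(
                    "Interrupted waiting for app acceptance", e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Thread.currentThread().interrupt();  // simulate a shutdown interrupt
        try {
            waitForAcceptance(10, 1);
            System.out.println("not interrupted");
        } catch (IOException e) {
            // prints interrupted: Interrupted waiting for app acceptance
            System.out.println("interrupted: " + e.getMessage());
        }
    }
}
```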
[jira] [Commented] (YARN-3928) launch application master on specific host
[ https://issues.apache.org/jira/browse/YARN-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629750#comment-14629750 ] Lei Guo commented on YARN-3928: --- [~varun_saxena], I read this JIRA as a host-preference requirement during container allocation; it's not a duplicate of MAPREDUCE-6402. [~wenrui], can you confirm? launch application master on specific host -- Key: YARN-3928 URL: https://issues.apache.org/jira/browse/YARN-3928 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 2.6.0 Environment: Ubuntu 12.04 Reporter: Wenrui Hi, is there a way to launch the application master on a specific host? If we cannot do this with a managed AM launcher, is it possible to achieve with an unmanaged AM launcher? I find it quite necessary to place the application master on a specific host in some scenarios.
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629622#comment-14629622 ] Hudson commented on YARN-3805: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #988 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/988/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629623#comment-14629623 ] Hudson commented on YARN-3174: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #988 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/988/]) YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation into one. Contributed by Masatake Iwasaki. (ozawa: rev f02dd146f58bcfa0595eec7f2433bafdd857630f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md * hadoop-project/src/site/site.xml * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629625#comment-14629625 ] Hudson commented on YARN-90: SUCCESS: Integrated in Hadoop-Yarn-trunk #988 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/988/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90
[ https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629718#comment-14629718 ] Hudson commented on YARN-3805: -- ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #246 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/246/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629723#comment-14629723 ] Hudson commented on YARN-90: ABORTED: Integrated in Hadoop-Hdfs-trunk #2185 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2185/]) YARN-3805. Update the documentation of Disk Checker based on YARN-90. Contributed by Masatake Iwasaki. (ozawa: rev 1ba2986dee4bbb64d67ada005f8f132e69575274) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md * hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3928) launch application master on specific host
[ https://issues.apache.org/jira/browse/YARN-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629753#comment-14629753 ] Varun Saxena commented on YARN-3928: Oh, then it is not. Misread the JIRA title. Apologies.
[jira] [Commented] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629811#comment-14629811 ] Hadoop QA commented on YARN-3877:
| (/) *{color:green}+1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 15m 34s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 41s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 28s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 20s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 0m 53s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 6m 55s | Tests passed in hadoop-yarn-client. |
| | | | 43m 31s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745625/YARN-3877.02.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 1ba2986 |
| hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/8558/artifact/patchprocess/testrun_hadoop-yarn-client.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8558/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8558/console |
This message was automatically generated.
[jira] [Updated] (YARN-3784) Indicate preemption timeout along with the list of containers to AM (preemption message)
[ https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3784: -- Attachment: 0002-YARN-3784.patch Uploading a new version of the patch. Initially the RM sent only a list of container IDs in the preemption message. This patch improves that to also include a timeout along with each container ID. The new timeout is an optional param in the proto. [~chris.douglas] Could you please take a look. Indicate preemption timeout along with the list of containers to AM (preemption message) --- Key: YARN-3784 URL: https://issues.apache.org/jira/browse/YARN-3784 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch Currently during preemption, the AM is notified with a list of containers which are marked for preemption. This introduces a timeout duration along with that container list so that the AM knows how much time it will get to do a graceful shutdown of its containers (assuming a preemption policy is loaded in the AM). This will help in NM decommissioning scenarios, where the NM will be decommissioned after a timeout (also killing the containers on it). The timeout indicates to the AM that those containers can be killed forcefully by the RM after it expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629790#comment-14629790 ] Naganarasimha G R commented on YARN-3931: - [~kyungwan nam], Good that you are trying to contribute :). We need to request a committer to add you to the list of contributors, but in the meantime you can upload the patch with a test case and I can help you review it. [~wangda tan], Can you please add [~kyungwan nam] to the contributor list and assign him this JIRA? default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: Naganarasimha G R * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the REST API without "app-node-label-expression" or "am-container-node-label-expression" * the RM doesn't allocate containers to the hosts associated with the large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3929) Uncleaning option for local app log files with log-aggregation feature
[ https://issues.apache.org/jira/browse/YARN-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629998#comment-14629998 ] Xuan Gong commented on YARN-3929: - [~dongwook] Does this configuration: yarn.nodemanager.delete.debug-delay-sec satisfy your requirement? Uncleaning option for local app log files with log-aggregation feature -- Key: YARN-3929 URL: https://issues.apache.org/jira/browse/YARN-3929 Project: Hadoop YARN Issue Type: New Feature Components: log-aggregation Affects Versions: 2.4.0, 2.6.0 Reporter: Dongwook Kwon Priority: Minor Attachments: YARN-3929.01.patch Although it makes sense to delete local app log files once the AppLogAggregator has copied all files to the remote location (HDFS), I have some use cases that need to leave the local app log files in place after they are copied to HDFS, mostly for backup purposes. I would like to use the log-aggregation feature of YARN and also back up the app log files. Without this option, the files have to be copied from HDFS back to local storage again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
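For reference, the setting Xuan Gong points to delays (rather than disables) the NM's deletion of local files, including local app log directories, after an application finishes. A minimal yarn-site.xml fragment; the one-day value below is only an illustrative choice:

```xml
<!-- Keep NM local files (including local app log dirs) for a day after the
     application finishes, instead of deleting them as soon as log aggregation
     completes. 86400 is an illustrative value; the default of 0 deletes
     immediately. -->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>86400</value>
</property>
```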
[jira] [Updated] (YARN-3893) Both RMs in active state when Admin#transitionToActive fails in refreshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3893: Issue Type: Sub-task (was: Bug) Parent: YARN-149 Both RMs in active state when Admin#transitionToActive fails in refreshAll() -- Key: YARN-3893 URL: https://issues.apache.org/jira/browse/YARN-3893 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, 0003-YARN-3893.patch, 0004-YARN-3893.patch, yarn-site.xml Cases that can cause this: # The capacity scheduler XML is wrongly configured during the switch # Refresh ACL failure due to configuration # Refresh user group failure due to configuration Both RMs will continuously try to become active {code} dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin ./yarn rmadmin -getServiceState rm1 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable active dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin ./yarn rmadmin -getServiceState rm2 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable active {code} # Both web UIs show active # Status is shown as active for both RMs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3931: - Assignee: kyungwan nam (was: Naganarasimha G R) default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: kyungwan nam * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the REST API without "app-node-label-expression" or "am-container-node-label-expression" * the RM doesn't allocate containers to the hosts associated with the large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630010#comment-14630010 ] Wangda Tan commented on YARN-3931: -- Thanks for raising the issue [~kyungwan nam]. I just added you to the contributor list and assigned the JIRA to you. default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: kyungwan nam * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the REST API without "app-node-label-expression" or "am-container-node-label-expression" * the RM doesn't allocate containers to the hosts associated with the large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown
[ https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630017#comment-14630017 ] Wangda Tan commented on YARN-3930: -- [~dian.fu], Thanks for working on the JIRA. Patch looks good, will commit soon. FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown - Key: YARN-3930 URL: https://issues.apache.org/jira/browse/YARN-3930 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Dian Fu Assignee: Dian Fu Attachments: YARN-3930.001.patch When I test the node label feature in my local environment, I encountered the following exception: {code} at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at 
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) {code} The reason is that HDFS throws an exception when {{ensureAppendEditlogFile}} is called, which leaves the edit log output stream unclosed. As a result, the next time we call {{ensureAppendEditlogFile}}, lease recovery fails because we ourselves are still the lease holder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
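The fix direction implied above is a close-on-failure pattern around the stream setup. A minimal sketch under stated assumptions: the class, method, and stream type below are illustrative stand-ins, not the real FileSystemNodeLabelsStore code.

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch of the close-on-failure pattern: if setting up the append stream
// fails partway, close whatever was opened before rethrowing, so the next
// attempt does not find the stream (and its HDFS lease) still held.
// All names here are illustrative stand-ins, not the real YARN classes.
public class SafeAppendSketch {
    static class EditLogStream implements Closeable {
        boolean closed = false;
        @Override
        public void close() { closed = true; }
    }

    static EditLogStream lastOpened;  // exposed so the demo can inspect it

    static EditLogStream open() {
        lastOpened = new EditLogStream();
        return lastOpened;
    }

    // Stand-in for ensureAppendEditlogFile(): open the stream, and if anything
    // after the open fails, close it before propagating the exception.
    public static EditLogStream ensureAppendEditlogFile(boolean failAfterOpen)
            throws IOException {
        EditLogStream out = open();
        try {
            if (failAfterOpen) {
                throw new IOException("simulated append-setup failure");
            }
            return out;
        } catch (IOException e) {
            out.close();  // the missing step in the original bug
            throw e;
        }
    }

    public static void main(String[] args) {
        try {
            ensureAppendEditlogFile(true);
        } catch (IOException expected) {
            System.out.println("stream closed after failure: " + lastOpened.closed);
        }
    }
}
```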
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630022#comment-14630022 ] Wangda Tan commented on YARN-3885: -- Patch LGTM, +1, will commit soon. Thanks [~ajithshetty]. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch In the preemption policy, {{ProportionalCapacityPreemptionPolicy.cloneQueues}} contains a piece of code to calculate {{untouchable}} that doesn't consider all the children; it considers only the immediate children. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
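The bug described is the classic difference between summing over immediate children and recursing over the whole subtree. A toy sketch for illustration, with plain ints standing in for Resources; none of these names are the real cloneQueues code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrates the difference the comment describes: aggregating a quantity
// over immediate children only vs. over all descendants of a queue.
public class SubtreeSumSketch {
    static class Queue {
        int untouchable;                       // amount that must not be preempted here
        List<Queue> children = new ArrayList<>();
        Queue(int untouchable) { this.untouchable = untouchable; }
    }

    // Buggy variant: looks one level deep only
    static int immediateChildrenOnly(Queue q) {
        int sum = 0;
        for (Queue c : q.children) sum += c.untouchable;
        return sum;
    }

    // Fixed variant: recurses over the whole subtree
    static int allDescendants(Queue q) {
        int sum = 0;
        for (Queue c : q.children) sum += c.untouchable + allDescendants(c);
        return sum;
    }

    public static void main(String[] args) {
        Queue root = new Queue(0);
        Queue child = new Queue(10);
        Queue grandchild = new Queue(5);       // a queue more than 2 levels deep
        root.children.add(child);
        child.children.add(grandchild);
        System.out.println("immediate only: " + immediateChildrenOnly(root)); // 10
        System.out.println("whole subtree:  " + allDescendants(root));        // 15
    }
}
```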
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629701#comment-14629701 ] kyungwan nam commented on YARN-3931: Hi, I couldn't reassign it to myself. I think I don't have the privilege to assign issues. default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: Naganarasimha G R * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the REST API without "app-node-label-expression" or "am-container-node-label-expression" * the RM doesn't allocate containers to the hosts associated with the large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629714#comment-14629714 ] Varun Saxena commented on YARN-3877: [~chris.douglas], I have uploaded a new patch. Kindly review. To avoid timing issues in the test, I added code to wait for the thread to enter sleep (the TIMED_WAITING state) before calling interrupt. YarnClientImpl.submitApplication swallows exceptions Key: YARN-3877 URL: https://issues.apache.org/jira/browse/YARN-3877 Project: Hadoop YARN Issue Type: Improvement Components: client Affects Versions: 2.7.2 Reporter: Steve Loughran Assignee: Varun Saxena Priority: Minor Attachments: YARN-3877.01.patch, YARN-3877.02.patch When {{YarnClientImpl.submitApplication}} spins waiting for the application to be accepted, any interruption during its sleep() calls is logged and swallowed. This makes it hard to interrupt the thread during shutdown. Really it should throw some form of exception and let the caller deal with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
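The stabilization trick described in the comment (interrupt only once the thread is verifiably inside sleep()) can be sketched in isolation. This is a stand-alone demo under stated assumptions, not the actual YARN-3877 test code:

```java
// Start a worker that sleeps (standing in for YarnClientImpl's submission
// poll loop), wait until it is really inside sleep() by polling its thread
// state for TIMED_WAITING, then interrupt it. Without the state check, the
// interrupt could land before the sleep and be missed by the test.
public class InterruptDuringSleepDemo {
    static volatile boolean sawInterrupt = false;

    public static boolean run() throws InterruptedException {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000L);
            } catch (InterruptedException e) {
                sawInterrupt = true;  // the interrupt landed inside sleep()
            }
        });
        worker.start();
        // Poll until the worker is actually parked in sleep()
        while (worker.getState() != Thread.State.TIMED_WAITING) {
            Thread.yield();
        }
        worker.interrupt();
        worker.join();
        return sawInterrupt;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("interrupted inside sleep(): " + run());
    }
}
```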
[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3932: --- Attachment: ApplicationReport.jpg SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel --- Key: YARN-3932 URL: https://issues.apache.org/jira/browse/YARN-3932 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: ApplicationReport.jpg The Application Resource Report is shown wrong when a node label is used. 1. Submit an application with a NodeLabel. 2. Check the RM UI for resources used: Allocated CPU VCores and Allocated Memory MB are always {{zero}} {code} public synchronized ApplicationResourceUsageReport getResourceUsageReport() { AggregateAppResourceUsage runningResourceUsage = getRunningAggregateAppResourceUsage(); Resource usedResourceClone = Resources.clone(attemptResourceUsage.getUsed()); Resource reservedResourceClone = Resources.clone(attemptResourceUsage.getReserved()); return ApplicationResourceUsageReport.newInstance(liveContainers.size(), reservedContainers.size(), usedResourceClone, reservedResourceClone, Resources.add(usedResourceClone, reservedResourceClone), runningResourceUsage.getMemorySeconds(), runningResourceUsage.getVcoreSeconds()); } {code} should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
Bibin A Chundatt created YARN-3932: -- Summary: SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel Key: YARN-3932 URL: https://issues.apache.org/jira/browse/YARN-3932 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt The Application Resource Report is shown wrong when a node label is used. 1. Submit an application with a NodeLabel. 2. Check the RM UI for resources used: Allocated CPU VCores and Allocated Memory MB are always {{zero}} {code} public synchronized ApplicationResourceUsageReport getResourceUsageReport() { AggregateAppResourceUsage runningResourceUsage = getRunningAggregateAppResourceUsage(); Resource usedResourceClone = Resources.clone(attemptResourceUsage.getUsed()); Resource reservedResourceClone = Resources.clone(attemptResourceUsage.getReserved()); return ApplicationResourceUsageReport.newInstance(liveContainers.size(), reservedContainers.size(), usedResourceClone, reservedResourceClone, Resources.add(usedResourceClone, reservedResourceClone), runningResourceUsage.getMemorySeconds(), runningResourceUsage.getVcoreSeconds()); } {code} should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1644) RM-NM protocol changes and NodeStatusUpdater implementation to support container resizing
[ https://issues.apache.org/jira/browse/YARN-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-1644: Attachment: YARN-1644-YARN-1197.4.patch Updated this patch as dependent patch has been updated. RM-NM protocol changes and NodeStatusUpdater implementation to support container resizing - Key: YARN-1644 URL: https://issues.apache.org/jira/browse/YARN-1644 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1644-YARN-1197.4.patch, YARN-1644.1.patch, YARN-1644.2.patch, YARN-1644.3.patch, yarn-1644.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629940#comment-14629940 ] Ming Ma commented on YARN-2578: --- Thanks [~iwasakims]. Is it similar to HADOOP-11252? Given your latest patch is in hadoop-common, it might be better to fix it as a HADOOP jira instead. NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Attachments: YARN-2578.002.patch, YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. The NM should then re-register with the new active RM. This re-register takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment: - create a cluster with 3 nodes node 1: ZK, NN, JN, ZKFC, DN, RM, NM node 2: ZK, NN, JN, ZKFC, DN, RM, NM node 3: ZK, JN, DN, NM - start all services and make sure they are in good health - kill the network connection of the RM that is active using one of the network kills from above - observe the NN and RM failover - the DN's fail over to the new active NN - the NM does not recover for a long time - the logs show a long delay and traces show no change at all The stack traces of the NM all show the same set of threads.
The main thread which should be used in the re-register is the Node Status Updater. This thread is stuck in: {code} Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.ipc.Client.call(Client.java:1395) - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1362) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) {code} The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630151#comment-14630151 ] Hadoop QA commented on YARN-433: \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12740222/YARN-433.2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 1ba2986 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8559/console | This message was automatically generated. When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch, YARN-433.2.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3868) ContainerManager recovery for container resizing
[ https://issues.apache.org/jira/browse/YARN-3868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-3868: Attachment: YARN-3868-YARN-1197.3.patch Update patch as dependent patches have been updated. ContainerManager recovery for container resizing Key: YARN-3868 URL: https://issues.apache.org/jira/browse/YARN-3868 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: MENG DING Assignee: MENG DING Attachments: YARN-3868-YARN-1197.3.patch, YARN-3868.1.patch, YARN-3868.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630251#comment-14630251 ] Subru Krishnan commented on YARN-3656: -- Thanks [~asuresh] for reviewing the patch. We did consider allowing declarative plugging of planners during the early stages of development but decided against it to keep the code base simpler and easier to grok, as the current algorithms themselves are non-trivial. We are open to doing this in the future as and when the need arises. LowCost: A Cost-Based Placement Agent for YARN Reservations --- Key: YARN-3656 URL: https://issues.apache.org/jira/browse/YARN-3656 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Ishai Menache Assignee: Jonathan Yaniv Labels: capacity-scheduler, resourcemanager Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, YARN-3656-v1.2.patch, YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf YARN-1051 enables SLA support by allowing users to reserve cluster capacity ahead of time. YARN-1710 introduced a greedy agent for placing user reservations. The greedy agent makes fast placement decisions but at the cost of ignoring the cluster committed resources, which might result in blocking the cluster resources for certain periods of time, and in turn rejecting some arriving jobs. We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” the demand of the job throughout the allowed time-window according to a global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3784) Indicate preemption timeout along with the list of containers to AM (preemption message)
[ https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630167#comment-14630167 ] Wangda Tan commented on YARN-3784: -- Beyond the timeout, another thing we may need to consider is: after a container is removed from the to-be-preempted list, should we notify the scheduler/AM about that? This could happen if other applications release containers, or other queues/applications cancel resource requests. Currently ProportionalCPP can notify the scheduler many times for the same container; if we had to-preempt/remove-from-to-preempt events, we could also reduce the number of messages sent to the scheduler (the flood of messages could cause YARN-3508 to happen). Indicate preemption timeout along with the list of containers to AM (preemption message) --- Key: YARN-3784 URL: https://issues.apache.org/jira/browse/YARN-3784 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch Currently during preemption, the AM is notified with a list of containers which are marked for preemption. This introduces a timeout duration along with that container list so that the AM knows how much time it will get to do a graceful shutdown of its containers (assuming a preemption policy is loaded in the AM). This will help in NM decommissioning scenarios, where the NM will be decommissioned after a timeout (also killing the containers on it). The timeout indicates to the AM that those containers can be killed forcefully by the RM after it expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3900) Protobuf layout of yarn_security_token causes errors in other protos that include it
[ https://issues.apache.org/jira/browse/YARN-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630209#comment-14630209 ] Anubhav Dhoot commented on YARN-3900: - This is needed for YARN-3736. Without this the leveldb state store implementation of YARN-3736 actually causes a dump Protobuf layout of yarn_security_token causes errors in other protos that include it - Key: YARN-3900 URL: https://issues.apache.org/jira/browse/YARN-3900 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3900.001.patch, YARN-3900.001.patch Because of the subdirectory server used in {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}} there are errors in other protos that include them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630275#comment-14630275 ] Joep Rottinghuis commented on YARN-3908: bq. In fact, I'm wondering if we should but info and events into a separate column family like what we did for configs/metrics? In principle we should keep everything in the same column family (fewer store files) unless: a) The items that we store require a different TTL, compression, etc. This is the case for metrics where we need a separate TTL. b) The columns are rather significant in size, and in many queries they'll be skipped (and specifically not used in push-down predicate ie. column value filters etc). This is the case for configuration. If we have many queries to just retrieve info fields and we skip configs in these, then iterating over just the rows in the info column family will have a benefit of not needing to access the config store files. Otherwise a separate column family just results in more store files and doesn't really gain us anything. Given the current code setup, switching column family is almost trivial, so given that there are no functionality differences, I'd say let's not even try to further optimize this until we have way more code in place. Then we can run large batches of historical job history files and other inputs (perhaps porting data from ATS v1) and then we can see the potential benefit or downside. The other reason to not do premature optimization is that I'm still thinking of adding a few more perf tweaks. Those would also just be performance optimizations, and not any functionality different, so also not a priority now. We should look at tuning all those things much later and together in a coherent way. Additional settings that we need to test are RPC compression, encoding of the store files and/or compression of the same. In short, let's focus on completing functionality and then tinker with these settings later. 
Bugs in HBaseTimelineWriterImpl --- Key: YARN-3908 URL: https://issues.apache.org/jira/browse/YARN-3908 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Attachments: YARN-3908-YARN-2928.001.patch, YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch 1. In HBaseTimelineWriterImpl, the info column family contains the basic fields of a timeline entity plus events. However, entity#info map is not stored at all. 2 event#timestamp is also not persisted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-433: --- Attachment: YARN-433.3.patch rebase the patch When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel
[ https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630283#comment-14630283 ] Wangda Tan commented on YARN-3932: -- [~bibinchundatt], I think we can add a method such as getTotalUsed in the ResourceUsage class, which will be more efficient than iterating over all liveContainers. This can be done in the near term. To make it fully correct, I think we need to return a usage-by-partition object to the application, which requires API changes. SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel --- Key: YARN-3932 URL: https://issues.apache.org/jira/browse/YARN-3932 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: ApplicationReport.jpg The application resource report shows wrong values when a node label is used. 1. Submit an application with a NodeLabel. 2. Check the RM UI for resources used. Allocated CPU VCores and Allocated Memory MB are always {{zero}}.
{code}
public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
  AggregateAppResourceUsage runningResourceUsage =
      getRunningAggregateAppResourceUsage();
  Resource usedResourceClone =
      Resources.clone(attemptResourceUsage.getUsed());
  Resource reservedResourceClone =
      Resources.clone(attemptResourceUsage.getReserved());
  return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
      reservedContainers.size(), usedResourceClone, reservedResourceClone,
      Resources.add(usedResourceClone, reservedResourceClone),
      runningResourceUsage.getMemorySeconds(),
      runningResourceUsage.getVcoreSeconds());
}
{code}
should be {{attemptResourceUsage.getUsed(label)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
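A minimal sketch of the getTotalUsed idea suggested above, using simplified stand-in classes rather than the real ResourceUsage/Resource APIs: usage is tracked per node-label partition, and the total sums across partitions instead of reading only the default partition.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for YARN's Resource/ResourceUsage; names and fields
// here are illustrative, not the actual org.apache.hadoop.yarn API.
class SimpleResource {
    long memoryMB;
    int vcores;
    SimpleResource(long memoryMB, int vcores) { this.memoryMB = memoryMB; this.vcores = vcores; }
    void add(SimpleResource other) { memoryMB += other.memoryMB; vcores += other.vcores; }
}

class PartitionedUsage {
    private final Map<String, SimpleResource> usedByPartition = new HashMap<>();

    void incUsed(String partition, SimpleResource r) {
        usedByPartition.computeIfAbsent(partition, p -> new SimpleResource(0, 0)).add(r);
    }

    // Reading only the default partition ("") under-reports when containers
    // run on labeled nodes -- the bug described in YARN-3932.
    SimpleResource getUsed(String partition) {
        return usedByPartition.getOrDefault(partition, new SimpleResource(0, 0));
    }

    // The proposed getTotalUsed: sum usage across all partitions, avoiding
    // an iteration over every live container.
    SimpleResource getTotalUsed() {
        SimpleResource total = new SimpleResource(0, 0);
        for (SimpleResource r : usedByPartition.values()) {
            total.add(r);
        }
        return total;
    }
}
```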
[jira] [Commented] (YARN-3914) Entity created time should be part of the row key of entity table
[ https://issues.apache.org/jira/browse/YARN-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630125#comment-14630125 ] Sangjin Lee commented on YARN-3914: --- [~zjshen], we have been discussing this. While adding entity creation time to the row key may solve this problem, the concern is that it may introduce others. If the row key is (user/cluster/flow/run/app_id/entity_type/created_time/entity_id), then even the most basic query for (entity_type + entity_id) will get much more complicated, right? We cannot expect readers to provide the creation time every time they query for an entity by id. Also, as you said, we cannot always accommodate different query vectors by adding more to the row key, or we would risk blowing up the row key size or breaking other queries. We should be really judicious about what goes into the row key... I think it's reasonable to expect that the entity id order would be either completely or nearly identical to the chronological order (e.g. app id, or container id). So perhaps we could rely on the entity id order to help mitigate this problem. Thoughts? Entity created time should be part of the row key of entity table - Key: YARN-3914 URL: https://issues.apache.org/jira/browse/YARN-3914 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Entity created time should be part of the row key of the entity table, between entity type and entity id. The reason to have it is to index the entities. Though we cannot index the entities for all kinds of information, indexing them by created time is very necessary. Without it, every query for the latest entities that belong to an application and a type will scan through all the entities that belong to them. For example, if we want to list the 100 latest started containers in a YARN app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
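The assumption above, that entity-id order tracks chronological order, holds for IDs like container IDs as long as the numeric suffix is encoded fixed-width, so that lexicographic (byte) order in the row key matches numeric order. A toy illustration, not the actual ATS row-key encoding:

```java
class EntityIdOrder {
    // Zero-pad the numeric suffix so that lexicographic order of the id
    // string matches the numeric (and hence roughly chronological) order.
    static String containerId(long clusterTs, int attempt, int container) {
        return String.format("container_%d_%02d_%06d", clusterTs, attempt, container);
    }
}
```

Without the padding, "container_..._10" would sort before "container_..._2" and break a scan that relies on id order.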
[jira] [Commented] (YARN-3635) Get-queue-mapping should be a common interface of YarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630139#comment-14630139 ] Wangda Tan commented on YARN-3635: -- Hi [~sandyr], Thanks for your comments. Actually, I read QueuePlacementPolicy/QueuePlacementRule from FS before working on this patch. The generic design of this patch is based on FS's queue placement policy structure, but with some changes. To your comments: bq. Is a common way of configuration proposed? No common configuration; it only defines a set of common interfaces. Since FS/CS have very different styles of configuration, rules are now created by the individual schedulers; see CapacityScheduler#updatePlacementRules as an example. bq. What steps are required for the Fair Scheduler to integrate with this? 1) Port the existing rules to the new APIs defined in the patch; this should be simple. 2) Change the configuration implementation to instantiate the newly defined PlacementRule; you may not need to change the existing configuration items themselves. 3) Change the FS workflow: with this patch, queue mapping happens before submission to the scheduler, so remove the queue-mapping logic from FS and create the queue if needed. bq. Each placement rule gets the chance to assign the app to a queue, reject the app, or pass. If it passes, the next rule gets a chance. The new APIs are very similar: non-null means determined; null means not determined; an exception is thrown when the app is rejected. You can take a look at {{org.apache.hadoop.yarn.server.resourcemanager.placement.PlacementRule}}. bq. A placement rule can base its decision on: Yes, you can do all of them with the new API except "The set of queues given in the Fair Scheduler configuration". I was thinking about the necessity of passing the set of queues in the interface. In existing implementations of QueuePlacementPolicy, FS queues are only used to check the mapped queue's existence. I would prefer to delay that check until submission to the scheduler.
See my next comment about the create flag for more details. Another reason for not passing the queue name set via the interface is that queues are very dynamic. For example, if a user wants to submit an application to the queue with the lowest utilization, a set of queue names may not be enough. I would prefer to let the rule fetch what it needs from the scheduler. bq. Rules are marked as terminal if they will never pass. This helps to avoid misconfigurations where users place rules after terminal rules. I'm not sure if it is necessary; I think whether a rule is terminal should be determined at runtime, but I'm OK with it if you think it's a must-have. bq. Rules have a create attribute which determines whether they can create a new queue or whether they must assign to existing queues. I think whether a queue is creatable should be determined by the scheduler; it should be part of the scheduler configuration instead of the rule itself. You can put create in your implemented rules without any issue, but I prefer not to expose it in the public interface. bq. Currently the set of placement rules is limited to what's implemented in YARN. I.e. there's no public pluggable rule support. Agree, this is something we need to do in the future. For now, we can make queue mapping happen in a central place first. bq. Are there places where my summary would not describe what's going on in this patch? I think it covers most of my patch; you can also take a look at the patch to see if anything is unexpected :). Get-queue-mapping should be a common interface of YarnScheduler --- Key: YARN-3635 URL: https://issues.apache.org/jira/browse/YARN-3635 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Wangda Tan Assignee: Tan, Wangda Attachments: YARN-3635.1.patch, YARN-3635.2.patch, YARN-3635.3.patch, YARN-3635.4.patch, YARN-3635.5.patch, YARN-3635.6.patch Currently, both the fair and capacity schedulers support queue mapping, which lets the scheduler change the queue of an application after it is submitted to the scheduler.
One issue with doing this in a specific scheduler is: if the queue after mapping has a different maximum_allocation/default-node-label-expression from the original queue, {{validateAndCreateResourceRequest}} in RMAppManager checks the wrong queue. I propose to make queue mapping a common interface of the scheduler, and have RMAppManager set the queue after mapping before doing validations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
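The rule-chain semantics described in the comment above (non-null result = queue determined, null = rule passes to the next one, exception = application rejected) can be sketched with simplified types. This mirrors the shape of the proposed org.apache.hadoop.yarn.server.resourcemanager.placement.PlacementRule but is not the actual interface; all names here are illustrative.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the placement-rule chain; the real interface takes
// an ApplicationSubmissionContext and user rather than plain strings.
interface Rule {
    // Non-null: queue determined. Null: not determined, try the next rule.
    // An unchecked exception models "application rejected".
    String getQueue(String user, String requestedQueue);
}

class RuleChain {
    static String place(List<Rule> rules, String user, String requested) {
        for (Rule r : rules) {
            String q = r.getQueue(user, requested);
            if (q != null) {
                return q; // first rule that decides wins
            }
        }
        return requested; // no rule decided; keep the requested queue
    }
}
```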
[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630289#comment-14630289 ] Joep Rottinghuis commented on YARN-3908: Patch looks good with one comment. I completely overlooked the event info map, because it isn't part of the javadoc on the EntityTable. I should have double-checked but didn't. Thanks for catching this. [~sjlee0] I think it would be good to update the javadoc that describes the EntityTable in the EntityTable.java file. The same is probably missing from the doc Timeline service schema for native HBase tables (not sure which jira the PDF for that doc is attached to), because I copied it from the code. I don't think that the application table has been copied yet, so it won't be missing from there. Bugs in HBaseTimelineWriterImpl --- Key: YARN-3908 URL: https://issues.apache.org/jira/browse/YARN-3908 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Attachments: YARN-3908-YARN-2928.001.patch, YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch 1. In HBaseTimelineWriterImpl, the info column family contains the basic fields of a timeline entity plus events. However, entity#info map is not stored at all. 2 event#timestamp is also not persisted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api
[ https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630659#comment-14630659 ] Xianyin Xin commented on YARN-3931: --- This reminds me of an earlier problem I ran into. Hi [~Naganarasimha], can we consider removing the empty node label expression in the code? It doesn't seem to make sense to set a node label to an empty value; a node label expression should be either some_label or null. Just a rough thought, what do you think? default-node-label-expression doesn’t apply when an application is submitted by RM rest api --- Key: YARN-3931 URL: https://issues.apache.org/jira/browse/YARN-3931 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: hadoop-2.6.0 Reporter: kyungwan nam Assignee: kyungwan nam Attachments: YARN-3931.001.patch * yarn.scheduler.capacity.queue-path.default-node-label-expression=large_disk * submit an application using the REST API without "app-node-label-expression" and "am-container-node-label-expression" * RM doesn't allocate containers to the hosts associated with the large_disk node label -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level
[ https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630669#comment-14630669 ] Ajith S commented on YARN-3885: --- Thanks [~leftnoteasy], [~xinxianyin] and [~sunilg] :) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level -- Key: YARN-3885 URL: https://issues.apache.org/jira/browse/YARN-3885 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.8.0 Reporter: Ajith S Assignee: Ajith S Priority: Blocker Fix For: 2.8.0 Attachments: YARN-3885.02.patch, YARN-3885.03.patch, YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch In {{ProportionalCapacityPreemptionPolicy.cloneQueues}}, the code that calculates {{untouchable}} doesn't consider all the children; it considers only the immediate children. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630694#comment-14630694 ] Hong Zhiguo commented on YARN-2306: --- Hi [~rchiang], do you mean running the unit test in the patch against trunk? leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2306-2.patch, YARN-2306.patch This only applies to the fair scheduler; the capacity scheduler is OK. When an appAttempt or node is removed, the metrics for reservations (reservedContainers, reservedMB, reservedVCores) are not reduced back. These are important metrics for administrators, and the wrong values may confuse them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-1974) add args for DistributedShell to specify a set of nodes on which the tasks run
[ https://issues.apache.org/jira/browse/YARN-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo resolved YARN-1974. --- Resolution: Not A Problem add args for DistributedShell to specify a set of nodes on which the tasks run -- Key: YARN-1974 URL: https://issues.apache.org/jira/browse/YARN-1974 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.7.0 Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-1974.patch It's very useful to execute a script on a specific set of machines for both testing and maintenance purposes. The args --nodes and --relax_locality are added to DistributedShell, together with a unit test using miniCluster. It has also been tested on our real cluster with the Fair scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630742#comment-14630742 ] Hong Zhiguo commented on YARN-2306: --- Updated the patch. I ran testReservationMetrics several times and see no failures now. leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2306-2.patch, YARN-2306-3.patch, YARN-2306.patch This only applies to the fair scheduler; the capacity scheduler is OK. When an appAttempt or node is removed, the metrics for reservations (reservedContainers, reservedMB, reservedVCores) are not reduced back. These are important metrics for administrators, and the wrong values may confuse them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
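The leak pattern reported here is a missing decrement: reservation metrics are incremented when a container is reserved but never released when the app attempt or node goes away. A minimal sketch of the symmetric bookkeeping, with simplified names rather than the real QueueMetrics API:

```java
class ReservationMetrics {
    private int reservedContainers;
    private long reservedMB;

    void reserve(long mb) {
        reservedContainers++;
        reservedMB += mb;
    }

    // The fix amounts to making every removal path (container completion,
    // app-attempt removal, node removal) release outstanding reservations,
    // so the metrics return to zero instead of leaking.
    void unreserve(long mb) {
        reservedContainers--;
        reservedMB -= mb;
    }

    int getReservedContainers() { return reservedContainers; }
    long getReservedMB() { return reservedMB; }
}
```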
[jira] [Updated] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3049: -- Attachment: YARN-3049-WIP.2.patch [~sjlee0] and [~gtCarrera9], thanks for reviewing the patch. I'm currently targeting an E2E reader POC, and I'll try to address your comments a bit later. I uploaded a new WIP patch, which basically makes the reader work E2E, though there are a couple of bugs. I'll spend some more time fixing them. [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630688#comment-14630688 ] Hong Zhiguo commented on YARN-2768: --- [~kasha], could you please review the patch? optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of the profiling result. The clone of the Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) of the CPU time of FairScheduler.update(). The code of FSAppAttempt.updateDemand:
{code}
public void updateDemand() {
  demand = Resources.createResource(0);
  // Demand is current consumption plus outstanding requests
  Resources.addTo(demand, app.getCurrentConsumption());
  // Add up outstanding resource requests
  synchronized (app) {
    for (Priority p : app.getPriorities()) {
      for (ResourceRequest r : app.getResourceRequests(p).values()) {
        Resource total = Resources.multiply(r.getCapability(), r.getNumContainers());
        Resources.addTo(demand, total);
      }
    }
  }
}
{code}
The code of Resources.multiply:
{code}
public static Resource multiply(Resource lhs, double by) {
  return multiplyTo(clone(lhs), by);
}
{code}
The clone could be skipped by directly updating the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
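The proposed optimization amounts to a multiply-and-accumulate directly into the existing demand object instead of cloning a Resource per request. A simplified sketch with a stand-in resource type, not the actual org.apache.hadoop.yarn.util.resource.Resources helpers:

```java
// Stand-in for YARN's Resource; fields are illustrative.
class Res {
    long memory;
    int vcores;
    Res(long memory, int vcores) { this.memory = memory; this.vcores = vcores; }
}

class DemandCalc {
    // Original pattern: Resources.multiply() clones the capability, then
    // addTo() adds the clone into demand, allocating one object per request.
    // Clone-free version: scale each request and add straight into demand.
    static void multiplyAndAddTo(Res demand, Res capability, int numContainers) {
        demand.memory += capability.memory * numContainers;
        demand.vcores += capability.vcores * numContainers;
    }
}
```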
[jira] [Updated] (YARN-3845) [YARN] YARN status in web ui does not show correctly in IE 11
[ https://issues.apache.org/jira/browse/YARN-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Shahid Khan updated YARN-3845: --- Attachment: YARN-3845.patch [YARN] YARN status in web ui does not show correctly in IE 11 - Key: YARN-3845 URL: https://issues.apache.org/jira/browse/YARN-3845 Project: Hadoop YARN Issue Type: Bug Reporter: Jagadesh Kiran N Assignee: Mohammad Shahid Khan Priority: Trivial Attachments: IE11_yarn.gif, YARN-3845.patch In IE 11, the colors for the scheduler do not display properly. Other browsers show it correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3784) Indicate preemption timout along with the list of containers to AM (preemption message)
[ https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630794#comment-14630794 ] Sunil G commented on YARN-3784: --- Yes [~leftnoteasy], thank you for sharing your thoughts. If I understood you correctly, there are chances that a to-be-preempted container will remain in FicaSchedulerApp until the allocate call comes from the AM. Within this duration, some more containers may have been freed or had their resource requests cancelled, in which case we should remove the container from the to-be-preempted list. I feel we can have a remove-from-to-preempt call in the scheduler, and ProportionalCPP can notify the app when such a scenario occurs. This can also be added as a new argument to the AM response. I will separate this improvement into another ticket. On your second point, I feel we can keep a synchronized getter API for the to-be-preempted containers present in FicaSchedulerApp (scheduler level). With this API, ProportionalCPP can check whether a container newly identified for preemption has already been reported as to-be-preempted at the app level. If so, ProportionalCPP need not raise another event to the scheduler. I'll separate this out as well if that's OK. Indicate preemption timout along with the list of containers to AM (preemption message) --- Key: YARN-3784 URL: https://issues.apache.org/jira/browse/YARN-3784 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch Currently during preemption, the AM is notified with a list of containers which are marked for preemption. Introducing a timeout duration along with this container list lets the AM know how much time it has to do a graceful shutdown of its containers (assuming a preemption policy is loaded in the AM).
This will help in NM-decommissioning scenarios, where the NM will be decommissioned after a timeout (also killing the containers on it). The timeout indicates to the AM that those containers can be killed by the RM forcefully after it expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3736) Persist the Plan information, ie. accepted reservations to the RMStateStore for failover
[ https://issues.apache.org/jira/browse/YARN-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630793#comment-14630793 ] Hadoop QA commented on YARN-3736: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 18m 17s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. | | {color:green}+1{color} | javac | 7m 41s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 35s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 7s | The applied patch generated 1 new checkstyle issues (total was 104, now 104). | | {color:green}+1{color} | whitespace | 1m 53s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 20s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 4m 1s | The patch appears to introduce 3 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 55s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 3m 6s | Tests passed in hadoop-yarn-server-applicationhistoryservice. | | {color:green}+1{color} | yarn tests | 51m 3s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 101m 56s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-resourcemanager | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12745739/YARN-3736.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / ee36f4f | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8566/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8566/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8566/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-applicationhistoryservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8566/artifact/patchprocess/testrun_hadoop-yarn-server-applicationhistoryservice.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8566/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8566/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8566/console | This message was automatically generated. Persist the Plan information, ie. accepted reservations to the RMStateStore for failover Key: YARN-3736 URL: https://issues.apache.org/jira/browse/YARN-3736 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler, resourcemanager Reporter: Subru Krishnan Assignee: Anubhav Dhoot Attachments: YARN-3736.001.patch, YARN-3736.001.patch We need to persist the current state of the plan, i.e. 
the accepted ReservationAllocations and the corresponding RLESparseResourceAllocations to the RMStateStore so that we can recover them on RM failover. This involves making all the reservation system data structures protobuf friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630733#comment-14630733 ] Ray Chiang commented on YARN-2306: -- Heh. That was two months ago. I believe I was referring to the unit test. leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2306-2.patch, YARN-2306.patch This only applies to the fair scheduler; the capacity scheduler is OK. When an appAttempt or node is removed, the metrics for reservations (reservedContainers, reservedMB, reservedVCores) are not reduced back. These are important metrics for administrators, and the wrong values may confuse them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)