[jira] [Updated] (YARN-4862) Handle duplicated completed containers in RMNodeImpl
[ https://issues.apache.org/jira/browse/YARN-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4862: Description: As per [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689] from [~sharadag], there should be a safeguard against duplicated container statuses in RMNodeImpl before creating UpdatedContainerInfo. Otherwise, in a heavily loaded cluster where event processing gradually slows down, if duplicated containers are sent to the RM (possibly due to a bug in the NM as well), RMNodeImpl keeps creating UpdatedContainerInfo objects for the duplicated containers, with significant impact. This increases heap memory usage and causes problems like YARN-4852. This is an optimization for issues of the kind seen in YARN-4852 was: As per [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689] from [~sharadag], there should be a safeguard against duplicated container statuses in RMNodeImpl before creating UpdatedContainerInfo. In a heavily loaded cluster, if duplicated containers are sent to the RM (possibly due to a bug in the NM as well), the RM should not create UpdatedContainerInfo for the duplicated containers. This is an optimization for issues of the kind seen in YARN-4852 > Handle duplicated completed containers in RMNodeImpl > > > Key: YARN-4862 > URL: https://issues.apache.org/jira/browse/YARN-4862 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S > > As per > [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689] > from [~sharadag], there should be a safeguard against duplicated container statuses > in RMNodeImpl before creating UpdatedContainerInfo. 
> Otherwise, in a heavily loaded cluster where event processing gradually slows down, > if duplicated containers are sent to the RM (possibly due to a bug in the NM as well), > RMNodeImpl keeps creating UpdatedContainerInfo objects for the duplicated containers, > with significant impact. This increases heap memory usage and causes > problems like YARN-4852. > This is an optimization for issues of the kind seen in YARN-4852 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
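The safeguard proposed above could, in spirit, amount to remembering which completed containers a node has already reported and filtering duplicates out before any UpdatedContainerInfo is created. A minimal, hypothetical sketch of that idea (plain strings stand in for YARN's ContainerId type; none of these names are the actual RMNodeImpl code):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in for per-node state in RMNodeImpl; not actual YARN code.
public class CompletedContainerDedup {

    // Completed containers already turned into UpdatedContainerInfo.
    private final Set<String> reportedCompleted = new HashSet<>();

    // Returns only the statuses not seen before, so duplicated heartbeat
    // reports no longer queue extra UpdatedContainerInfo objects on the heap.
    public List<String> filterNewlyCompleted(List<String> heartbeatStatuses) {
        List<String> fresh = new ArrayList<>();
        for (String containerId : heartbeatStatuses) {
            if (reportedCompleted.add(containerId)) { // add() returns false for duplicates
                fresh.add(containerId);
            }
        }
        return fresh;
    }
}
```

With a guard like this, a heartbeat that re-sends an already-reported container yields nothing new; a real fix would also need to prune the set once the NM is told to forget the container.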
[jira] [Commented] (YARN-4676) Automatic and Asynchronous Decommissioning Nodes Status Tracking
[ https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209809#comment-15209809 ] Hadoop QA commented on YARN-4676: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 6 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 18s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 41s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 5s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 10s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 12s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 7m 44s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 7s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | 
{color:green}+1{color} | {color:green} javadoc {color} | {color:green} 6m 14s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 17s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 9m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 7s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 8m 7s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 15s {color} | {color:green} root: patch generated 0 new + 498 unchanged - 4 fixed = 498 total (was 502) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 9s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 2m 1s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s {color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 9m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 0s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 6m 10s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 10m 2s {color} | {color:red} hadoop-common in the patch failed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 32s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 18s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 10m 3s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_74. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 78m 27s {color} | {color:red} hadoop
[jira] [Commented] (YARN-4820) ResourceManager web redirects in HA mode drops query parameters
[ https://issues.apache.org/jira/browse/YARN-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209783#comment-15209783 ] Hudson commented on YARN-4820: -- FAILURE: Integrated in Hadoop-trunk-Commit #9494 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9494/]) YARN-4820. ResourceManager web redirects in HA mode drops query (junping_du: rev 19b645c93801a53d4486f9a7639186525e51f723) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebAppFilter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java > ResourceManager web redirects in HA mode drops query parameters > --- > > Key: YARN-4820 > URL: https://issues.apache.org/jira/browse/YARN-4820 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.8.0 > > Attachments: YARN-4820.001.patch, YARN-4820.002.patch, > YARN-4820.003.patch > > > The RMWebAppFilter redirects http requests from the standby to the active. > However it drops all the query parameters when it does the redirect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
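The bug described above comes down to carrying the request's query string over to the redirect URL when the standby forwards to the active RM. A hypothetical sketch of the idea (invented names, not the actual RMWebAppFilter code):

```java
// Sketch of an HA redirect that keeps query parameters instead of dropping them.
public class RedirectSketch {

    // Builds the redirect target on the active RM from the path and query
    // string of the request that hit the standby RM.
    public static String redirectTarget(String activeRmBase, String path, String query) {
        String target = activeRmBase + path;
        if (query != null && !query.isEmpty()) {
            target += "?" + query; // preserve the query string on redirect
        }
        return target;
    }
}
```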
[jira] [Commented] (YARN-4822) Refactor existing Preemption Policy of CS for easier adding new approach to select preemption candidates
[ https://issues.apache.org/jira/browse/YARN-4822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209733#comment-15209733 ] Hadoop QA commented on YARN-4822: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 55s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 19s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | 
{color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 35s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 18s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 65 new + 47 unchanged - 52 fixed = 112 total (was 99) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 44s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 41s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 75m 20s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_74. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 75m 46s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 170m 39s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_74 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209726#comment-15209726 ] Rohith Sharma K S commented on YARN-4852: - Raised JIRA YARN-4862 to handle the duplicated container status check. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut > itself down. > Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% > of memory. Digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn > contain around 1.7 million objects of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, each of > which retains around 1 GB of heap. > Back-to-back full GCs kept happening; GC wasn't able to recover any heap, > and the RM went OOM. The JVM dumped the heap before quitting, and we analyzed it. > The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB within 20 > minutes and went OOM. > There was no spike in job submissions or container numbers at the time the > issue occurred.
[jira] [Created] (YARN-4862) Handle duplicated completed containers in RMNodeImpl
Rohith Sharma K S created YARN-4862: --- Summary: Handle duplicated completed containers in RMNodeImpl Key: YARN-4862 URL: https://issues.apache.org/jira/browse/YARN-4862 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S As per [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689] from [~sharadag], there should be a safeguard against duplicated container statuses in RMNodeImpl before creating UpdatedContainerInfo. In a heavily loaded cluster, if duplicated containers are sent to the RM (possibly due to a bug in the NM as well), the RM should not create UpdatedContainerInfo for the duplicated containers. This is an optimization for issues of the kind seen in YARN-4852
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209715#comment-15209715 ] Rohith Sharma K S commented on YARN-4852: - I will raise a new ticket for this. Thanks :-)
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209689#comment-15209689 ] Sharad Agarwal commented on YARN-4852: -- Thanks Rohith. Should we consider adding a duplicate check on the RM side for completed containers as well, as we do for launched ones? This would make it more foolproof and eliminate scenarios like resync where the NM might still send duplicates. We can open a new ticket for this.
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209609#comment-15209609 ] Rohith Sharma K S commented on YARN-4852: - Adding to the above point: since NM->RM communication is a push design, already-sent containers are not supposed to be sent again unless there is a RESYNC command from the RM. So it should be a bug in the NodeManager.
[jira] [Commented] (YARN-4436) DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled
[ https://issues.apache.org/jira/browse/YARN-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209606#comment-15209606 ] Hadoop QA commented on YARN-4436: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 47s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 12s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 15s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 18s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 27s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 11s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 0m 13s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 10s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 12s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 11s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: patch generated 1 new + 49 unchanged - 2 fixed = 50 total (was 51) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 9s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 10s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 9s {color} | {color:green} hadoop-yarn-applications-distributedshell in the patch passed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 25s {color} | {color:green} hadoop-yarn-applications-distributedshell in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 26m 59s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:fbe3e86 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12795084/YARN-4436.002.patch | | JIRA Issue | YARN-4436 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 90e15ca288ff 3
[jira] [Updated] (YARN-2883) Queuing of container requests in the NM
[ https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantinos Karanasos updated YARN-2883: - Attachment: YARN-2883-trunk.005.patch Adding updated patch, after addressing [~chris.douglas]'s comments. Also addressed [~kasha]'s first comments. I added a new JIRA (YARN-4861), so that we address the comment related to the ExitStatus of a killed OPPORTUNISTIC container. Moreover, I did not address the comment about bounding the queue size, as this should be done in a new JIRA too. > Queuing of container requests in the NM > --- > > Key: YARN-2883 > URL: https://issues.apache.org/jira/browse/YARN-2883 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Attachments: YARN-2883-trunk.004.patch, YARN-2883-trunk.005.patch, > YARN-2883-yarn-2877.001.patch, YARN-2883-yarn-2877.002.patch, > YARN-2883-yarn-2877.003.patch, YARN-2883-yarn-2877.004.patch > > > We propose to add a queue in each NM, where queueable container requests can > be held. > Based on the available resources in the node and the containers in the queue, > the NM will decide when to allow the execution of a queued container. > In order to ensure the instantaneous start of a guaranteed-start container, > the NM may decide to pre-empt/kill running queueable containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
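The queuing behavior described in the quoted issue can be sketched in a few lines. Assuming a single scalar resource and invented names (this is an illustration of the admission decision, not the actual YARN-2883 patch):

```java
import java.util.ArrayDeque;

// Illustrative sketch of an NM-side container queue: start queued containers
// only while the node has spare capacity for them.
public class ContainerQueueSketch {

    private int freeVcores;
    // Resource demand (in vcores) of each queued container, in arrival order.
    private final ArrayDeque<Integer> queue = new ArrayDeque<>();

    public ContainerQueueSketch(int freeVcores) { this.freeVcores = freeVcores; }

    public void enqueue(int vcores) { queue.add(vcores); }

    // Starts queued containers from the head while they fit; returns how many
    // were started. A GUARANTEED container would instead be started at once,
    // killing queueable containers if needed to reclaim capacity.
    public int startQueuedIfPossible() {
        int started = 0;
        while (!queue.isEmpty() && queue.peek() <= freeVcores) {
            freeVcores -= queue.poll();
            started++;
        }
        return started;
    }

    public int freeVcores() { return freeVcores; }
}
```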
[jira] [Commented] (YARN-4861) Define ContainerExitStatus for OPPORTUNISTIC containers that get killed
[ https://issues.apache.org/jira/browse/YARN-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209596#comment-15209596 ] Konstantinos Karanasos commented on YARN-4861: -- An OPPORTUNISTIC container might be killed in one of the following cases: * by the AM while running; * by the AM while queued; * by the NM, while running, in order to free up resources for a GUARANTEED container to start its execution; * by the NM, while queued, in order to reduce the length of the queue. In all these cases, we need to define the proper Exit Status for the container. Then, we need to make sure that the AM reacts properly to the defined Exit Statuses (e.g., by rescheduling killed OPPORTUNISTIC containers). Currently, in YARN-2883, OPPORTUNISTIC containers that get killed by the NM while running get a KILLED_BY_APPMASTER ExitStatus. In YARN-4738, OPPORTUNISTIC containers that get killed while queued get an ABORTED ExitStatus. > Define ContainerExitStatus for OPPORTUNISTIC containers that get killed > --- > > Key: YARN-4861 > URL: https://issues.apache.org/jira/browse/YARN-4861 > Project: Hadoop YARN > Issue Type: Task >Reporter: Konstantinos Karanasos > > When we kill an OPPORTUNISTIC container, which is either running or queued, > we need to define its Exit Status.
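The four kill cases above can be pictured as a mapping from (who killed, container state) to an exit status. A hypothetical sketch using the current behavior the comment reports for YARN-2883 and YARN-4738 (the dedicated statuses this JIRA proposes do not exist yet; the AM-kill branch and all names here are assumptions, not actual YARN code):

```java
// Illustrative mapping of kill cases to exit statuses for OPPORTUNISTIC containers.
public class OpportunisticExitStatus {

    enum Killer { AM, NM }
    enum State { RUNNING, QUEUED }

    static String exitStatusFor(Killer killer, State state) {
        if (killer == Killer.AM) {
            return "KILLED_BY_APPMASTER";            // assumed: AM-initiated kills
        }
        return state == State.RUNNING
            ? "KILLED_BY_APPMASTER"                  // current YARN-2883 behavior
            : "ABORTED";                             // current YARN-4738 behavior
    }
}
```

Defining distinct statuses for the NM-initiated cases would let the AM distinguish "make room for GUARANTEED" kills (worth rescheduling) from genuine AM kills.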
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209583#comment-15209583 ] Rohith Sharma K S commented on YARN-4852: - Thanks for pointing out the duplicated container statuses stored in UpdatedContainerInfo. This brings to mind YARN-2997, which is already resolved. The scenario: the NM keeps completed containers in NMContext until the RM, in its response, notifies the NM to remove them. Every heartbeat, these container statuses (pendingCompletedContainers) are sent to the RM, which can result in duplicates! But on the RM side, while creating UpdatedContainerInfo, no validation is done for duplicated entries. These keep accumulating when scheduler event processing is slow.
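The NM-side behavior described in this comment, sketched with invented names (not the actual NodeStatusUpdater code): completed containers stay in a pending set and are re-sent on every heartbeat until the RM's response acknowledges them, which is why the RM can see the same completed container more than once.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of pendingCompletedContainers-style NM state.
public class PendingCompletedContainers {

    private final Set<String> pending = new HashSet<>();

    public void containerCompleted(String containerId) { pending.add(containerId); }

    // Included in every heartbeat; the same ids repeat until acknowledged.
    public Set<String> heartbeatPayload() { return new HashSet<>(pending); }

    // The RM's heartbeat response tells the NM which containers it may forget.
    public void ackFromRM(Set<String> acked) { pending.removeAll(acked); }
}
```

Under this model, the dedup check really does belong on the RM side: re-sending until ack is by design, so the RM must tolerate repeats.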
[jira] [Created] (YARN-4861) Define ContainerExitStatus for OPPORTUNISTIC containers that get killed
Konstantinos Karanasos created YARN-4861: Summary: Define ContainerExitStatus for OPPORTUNISTIC containers that get killed Key: YARN-4861 URL: https://issues.apache.org/jira/browse/YARN-4861 Project: Hadoop YARN Issue Type: Task Reporter: Konstantinos Karanasos When we kill an OPPORTUNISTIC container, which is either running or queued, we need to define its Exit Status.
[jira] [Commented] (YARN-4826) Document configuration of ReservationSystem for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209571#comment-15209571 ] Hadoop QA commented on YARN-4826: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 29s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 7m 38s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:fbe3e86 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12795078/YARN-4826.v1.patch | | JIRA Issue | YARN-4826 | | Optional Tests | asflicense mvnsite | | uname | Linux a5c487fd9567 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 938222b | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/10861/console | | Powered by | Apache Yetus 0.2.0 http://yetus.apache.org | This message was automatically generated. > Document configuration of ReservationSystem for CapacityScheduler > - > > Key: YARN-4826 > URL: https://issues.apache.org/jira/browse/YARN-4826 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Subru Krishnan >Assignee: Subru Krishnan >Priority: Minor > Attachments: YARN-4826.v1.patch > > > This JIRA tracks the effort to add documentation on how to configure > ReservationSystem for CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4860) Created node labels disappear after restarting the Resource Manager
Yi Zhou created YARN-4860: - Summary: Created node labels disappear after restarting the Resource Manager Key: YARN-4860 URL: https://issues.apache.org/jira/browse/YARN-4860 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Yi Zhou In 2.6, restarting the RM causes created node labels to disappear, and the RM fails to start up:
{code}
Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException: java.io.IOException: NodeLabelManager doesn't include label = y, please check.
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1000)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:262)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1221)
Caused by: java.io.IOException: NodeLabelManager doesn't include label = y, please check.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.checkIfLabelInClusterNodeLabels(SchedulerUtils.java:287)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.<init>(AbstractCSQueue.java:106)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.<init>(LeafQueue.java:120)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:569)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:589)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:464)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:296)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:326)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 7 more
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209535#comment-15209535 ] Sharad Agarwal commented on YARN-4852: -- Further analysis shows an exceptionally high volume of "Null container completed..." log lines, somewhere between 100k and 200k every minute. This could be related to a large number of duplicate UpdatedContainerInfo objects for completed containers. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut > down itself. > Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% > of memory. When digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn > contains around 1.7 million objects of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of > which retain around 1 GB heap. > Back to Back Full GC kept on happening. GC wasn't able to recover any heap > and went OOM. JVM dumped the heap before quitting. We analyzed the heap. > RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 > mins time and went OOM. > There are no spike in job submissions, container numbers at the time of issue > occurrence. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209520#comment-15209520 ] Sharad Agarwal commented on YARN-4852: -- [~rohithsharma] the slowness in the schedulers still does not explain the build-up of 0.5 million UpdatedContainerInfo objects in such a short span. UpdatedContainerInfo should only be created for newly launched/completed containers. Looking at the code in RMNodeImpl.StatusUpdateWhenHealthyTransition (branch 2.6.0):
{code}
  // Process running containers
  if (remoteContainer.getState() == ContainerState.RUNNING) {
    if (!rmNode.launchedContainers.contains(containerId)) {
      // Just launched container. RM knows about it the first time.
      rmNode.launchedContainers.add(containerId);
      newlyLaunchedContainers.add(remoteContainer);
    }
  } else {
    // A finished container
    rmNode.launchedContainers.remove(containerId);
    completedContainers.add(remoteContainer);
  }
}

if (newlyLaunchedContainers.size() != 0 || completedContainers.size() != 0) {
  rmNode.nodeUpdateQueue.add(new UpdatedContainerInfo(newlyLaunchedContainers, completedContainers));
}
{code}
A new UpdatedContainerInfo appears to be created each time there is a completed container in the container status (there is no check whether it was already created from a previous update). Wouldn't this lead to a lot of duplicate UpdatedContainerInfo objects, putting unnecessary stress on the scheduler? > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut > down itself. > Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% > of memory.
When digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn > contains around 1.7 million objects of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of > which retain around 1 GB heap. > Back to Back Full GC kept on happening. GC wasn't able to recover any heap > and went OOM. JVM dumped the heap before quitting. We analyzed the heap. > RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 > mins time and went OOM. > There are no spike in job submissions, container numbers at the time of issue > occurrence. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
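The effect Sharad describes — each heartbeat that repeats a completed-container status enqueuing one more UpdatedContainerInfo — can be modeled in a few lines. This is a toy model for illustration, not RMNodeImpl itself:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Toy model of the nodeUpdateQueue behavior quoted above: every heartbeat
// carrying a completed-container status enqueues a fresh entry, with no
// check against statuses already enqueued on earlier heartbeats.
public class NodeUpdateQueueModel {
    private final Queue<List<String>> nodeUpdateQueue = new ArrayDeque<>();

    public void onHeartbeat(List<String> completedContainerStatuses) {
        if (!completedContainerStatuses.isEmpty()) {
            nodeUpdateQueue.add(completedContainerStatuses); // no duplicate check
        }
    }

    public int queueSize() {
        return nodeUpdateQueue.size();
    }
}
```

Feeding the same completed container id on N consecutive heartbeats yields N queue entries, which is how the queue can reach hundreds of thousands of objects when scheduler event processing lags behind heartbeats.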
[jira] [Commented] (YARN-4822) Refactor existing Preemption Policy of CS for easier adding new approach to select preemption candidates
[ https://issues.apache.org/jira/browse/YARN-4822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209469#comment-15209469 ] Wangda Tan commented on YARN-4822: -- [~eepayne], [~sunilg], [~jianhe], I'd appreciate it if you could take a look at the latest patch; it contains a couple of refactorings: - PCPP is split into 2 parts: 1) basic code, such as cloning queues, recording what to preempt, and sending the kill event when max-wait is reached; 2) the candidates-selection policy, which includes calculating the ideal allocation and selecting preemption candidates - The original ideal-allocation calculation and preemption-candidate selection go into two classes: 1) FifoPreemptableAmountCalculator for the ideal-allocation calculation; 2) FifoCandidatesSelectionPolicy for how to select containers - The CandidatesSelectionPolicy and the calculator need to read some fields from PCPP, so I added an interface for them to use, implemented by PCPP: CapacitySchedulerPreemptionContext - Moved all configuration keys from PCPP to CapacitySchedulerConfiguration, so admins can set configurations in either yarn-site.xml or capacity-scheduler.xml. (Ideally they should be set in capacity-scheduler.xml; however, existing users set configs in yarn-site.xml. Since CapacitySchedulerConfiguration reads yarn-site.xml as well, it is a backward-compatible change.) Thanks, > Refactor existing Preemption Policy of CS for easier adding new approach to > select preemption candidates > > > Key: YARN-4822 > URL: https://issues.apache.org/jira/browse/YARN-4822 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4822.1.patch, YARN-4822.2.patch, YARN-4822.3.patch, > YARN-4822.4.patch > > > Currently, ProportionalCapacityPreemptionPolicy has hard coded logic to > select candidates to be preempted (based on FIFO order of > applications/containers).
It's not simple to add new candidate-selection > logics, such as preemption for large container, intra-queue fairness/policy, > etc. > In this JIRA, I propose to do following changes: > 1) Cleanup code bases, consolidate current logic into 3 stages: > - Compute ideal sharing of queues > - Select to-be-preempt candidates > - Send preemption/kill events to scheduler > 2) Add a new interface: {{PreemptionCandidatesSelectionPolicy}} for above > "select to-be-preempt candidates" part. Move existing how to select > candidates logics to {{FifoPreemptionCandidatesSelectionPolicy}}. > 3) Allow multiple PreemptionCandidatesSelectionPolicies work together in a > chain. Preceding PreemptionCandidatesSelectionPolicy has higher priority to > select candidates, and later PreemptionCandidatesSelectionPolicy can make > decisions according to already selected candidates and pre-computed queue > ideal shares of resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4285) Display resource usage as percentage of queue and cluster in the RM UI
[ https://issues.apache.org/jira/browse/YARN-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209449#comment-15209449 ] Wangda Tan commented on YARN-4285: -- [~jianhe], bq. However, the queue's used resource in the UI does include reserved resource too. IIUC, the queue's used resource should include reserved resources. [~vvasudev], bq. it makes sense to remove reserved resources from the used resources, Actually I think we should include reserved resources in used resources, unless we can show them together on the UI. See my [#2 comment|https://issues.apache.org/jira/browse/YARN-4678?focusedCommentId=15209365&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209365]. bq. ...but do we know why we counted reserved resources as part of used resources in the first place? The reason is, if we create a reserved container under a queue, we need to make sure it doesn't go beyond the queue's max capacity. In other words, if a resource is reserved by someone, nobody else can use that part of the resources. From YARN's perspective, a queue with 99G allocated (not reserved) + 1G reserved is the same as one with 1G allocated + 99G reserved. To be more transparent to users and avoid answering questions like "why is my total allocated resource always less than total resources?", used resource should be allocated + reserved. > Display resource usage as percentage of queue and cluster in the RM UI > -- > > Key: YARN-4285 > URL: https://issues.apache.org/jira/browse/YARN-4285 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.8.0 > > Attachments: YARN-4285.001.patch, YARN-4285.002.patch, > YARN-4285.003.patch, YARN-4285.004.patch > > > Currently, we display the memory and vcores allocated to an app in the RM UI.
> It would be useful to display the resources consumed as a % of the queue and > the cluster to identify apps that are using a lot of resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
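Wangda's convention above — report a queue's used resource as allocated plus reserved — reduces to a simple sum. The helper below is a hypothetical sketch for illustration, not actual scheduler code:

```java
// Hypothetical helper illustrating the convention argued for above:
// used = allocated + reserved, so 99G allocated + 1G reserved and
// 1G allocated + 99G reserved both display the same used amount.
public class QueueUsedResource {
    public static long usedMb(long allocatedMb, long reservedMb) {
        return allocatedMb + reservedMb;
    }
}
```

Under this convention the two extreme cases Wangda mentions are indistinguishable in the UI, which is exactly the point: reserved capacity is unusable by anyone else, so it counts as used.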
[jira] [Created] (YARN-4859) [Bug] Unable to submit a job to a reservation when using FairScheduler
Subru Krishnan created YARN-4859: Summary: [Bug] Unable to submit a job to a reservation when using FairScheduler Key: YARN-4859 URL: https://issues.apache.org/jira/browse/YARN-4859 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Subru Krishnan Assignee: Arun Suresh Jobs submitted to a reservation get stuck at scheduled stage when using FairScheduler. I came across this when working on YARN-4827 (documentation for configuring ReservationSystem for FairScheduler) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4751) In 2.7, Labeled queue usage not shown properly in capacity scheduler UI
[ https://issues.apache.org/jira/browse/YARN-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209373#comment-15209373 ] Wangda Tan commented on YARN-4751: -- Hi [~eepayne], [~sunilg]. I quickly read the discussions and looked at the patch. Several questions / comments: 1) The ultimate solution seems to be YARN-3362. Have you evaluated how hard it would be to backport it? 2) If you don't want to backport YARN-3362: IIUC, the computation of total-used-capacity-considering-all-labels seems wrong. In your patch it is Σ(queue.label.used_capacity); actually it should be Σ(queue.label.used_resource) / Σ(root.label.total_resource). Thoughts? > In 2.7, Labeled queue usage not shown properly in capacity scheduler UI > --- > > Key: YARN-4751 > URL: https://issues.apache.org/jira/browse/YARN-4751 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.3 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: 2.7 CS UI No BarGraph.jpg, > YARH-4752-branch-2.7.001.patch, YARH-4752-branch-2.7.002.patch > > > In 2.6 and 2.7, the capacity scheduler UI does not have the queue graphs > separated by partition. When applications are running on a labeled queue, no > color is shown in the bar graph, and several of the "Used" metrics are zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
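The corrected computation Wangda suggests — summing used resources and total resources across all labels before dividing, rather than summing per-label used-capacity ratios — could be sketched like this (a hypothetical helper with illustrative label-to-memory inputs, not patch code):

```java
import java.util.Map;

// Sketch of Σ(queue.label.used_resource) / Σ(root.label.total_resource),
// as opposed to summing per-label used_capacity ratios. The maps are
// hypothetical label -> memory (MB) inputs.
public class LabelCapacity {
    public static double totalUsedCapacity(Map<String, Long> usedByLabel,
                                           Map<String, Long> totalByLabel) {
        long used = usedByLabel.values().stream().mapToLong(Long::longValue).sum();
        long total = totalByLabel.values().stream().mapToLong(Long::longValue).sum();
        return total == 0 ? 0.0 : (double) used / total;
    }
}
```

Summing the ratios instead would weight a small partition the same as a large one, which is why the per-label fractions cannot simply be added.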
[jira] [Commented] (YARN-4678) Cluster used capacity is > 100 when container reserved
[ https://issues.apache.org/jira/browse/YARN-4678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209365#comment-15209365 ] Wangda Tan commented on YARN-4678: -- Actually I think we may need to treat this as 3 separate tasks: 1) Understand why reserved resource + allocated resource could exceed the queue's max capacity; maybe we can add a test to make sure it won't happen. 2) If we simply deduct reserved resources from used and show that on the UI, users could find cluster utilization is < 100% most of the time, and it's going to be hard to explain why it cannot reach 100%. The ideal solution is to show reserved and allocated resources on the same bar in different colors. 3) Record reserved resources in ResourceUsage and QueueCapacities separately. Thoughts? > Cluster used capacity is > 100 when container reserved > --- > > Key: YARN-4678 > URL: https://issues.apache.org/jira/browse/YARN-4678 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Sunil G > Attachments: 0001-YARN-4678.patch, 0002-YARN-4678.patch, > 0003-YARN-4678.patch > > > *Scenario:* > * Start cluster with Three NM's each having 8GB (cluster memory:24GB). > * Configure queues with elasticity and userlimitfactor=10. > * disable pre-emption. > * run two jobs with different priorities in different queues at the same time > ** yarn jar hadoop-mapreduce-examples-2.7.2.jar pi -Dyarn.app.priority=LOW > -Dmapreduce.job.queuename=QueueA -Dmapreduce.map.memory.mb=4096 > -Dyarn.app.mapreduce.am.resource.mb=1536 > -Dmapreduce.job.reduce.slowstart.completedmaps=1.0 10 1 > ** yarn jar hadoop-mapreduce-examples-2.7.2.jar pi -Dyarn.app.priority=HIGH > -Dmapreduce.job.queuename=QueueB -Dmapreduce.map.memory.mb=4096 > -Dyarn.app.mapreduce.am.resource.mb=1536 3 1 > * observe the cluster capacity which was used in RM web UI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4436) DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled
[ https://issues.apache.org/jira/browse/YARN-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt LaMantia updated YARN-4436: Attachment: YARN-4436.002.patch > DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled > - > > Key: YARN-4436 > URL: https://issues.apache.org/jira/browse/YARN-4436 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications/distributed-shell >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Matt LaMantia >Priority: Trivial > Attachments: YARN-4436.001.patch, YARN-4436.002.patch > > > It should be ExecBatScriptStringPath. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4678) Cluster used capacity is > 100 when container reserved
[ https://issues.apache.org/jira/browse/YARN-4678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209298#comment-15209298 ] Wangda Tan commented on YARN-4678: -- Hi [~sunilg], Thanks for working on this JIRA; it is useful to record reserved resources separately. However, I'm wondering how this could happen: the ParentQueue's capacity is checked when we reserve a container, and we should make sure that the allocation of a reserved container doesn't violate the parent queue's max capacity. > Cluster used capacity is > 100 when container reserved > --- > > Key: YARN-4678 > URL: https://issues.apache.org/jira/browse/YARN-4678 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Sunil G > Attachments: 0001-YARN-4678.patch, 0002-YARN-4678.patch, > 0003-YARN-4678.patch > > > *Scenario:* > * Start cluster with Three NM's each having 8GB (cluster memory:24GB). > * Configure queues with elasticity and userlimitfactor=10. > * disable pre-emption. > * run two jobs with different priorities in different queues at the same time > ** yarn jar hadoop-mapreduce-examples-2.7.2.jar pi -Dyarn.app.priority=LOW > -Dmapreduce.job.queuename=QueueA -Dmapreduce.map.memory.mb=4096 > -Dyarn.app.mapreduce.am.resource.mb=1536 > -Dmapreduce.job.reduce.slowstart.completedmaps=1.0 10 1 > ** yarn jar hadoop-mapreduce-examples-2.7.2.jar pi -Dyarn.app.priority=HIGH > -Dmapreduce.job.queuename=QueueB -Dmapreduce.map.memory.mb=4096 > -Dyarn.app.mapreduce.am.resource.mb=1536 3 1 > * observe the cluster capacity which was used in RM web UI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
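The invariant Wangda mentions — a reserved container must not push the queue past its max capacity — can be sketched as a simple admission check. This is a hypothetical guard for illustration, not the actual CapacityScheduler logic:

```java
// Hypothetical guard for the invariant discussed above: only allow a new
// reservation if allocated + already-reserved + requested stays within the
// queue's max capacity.
public class ReservationCheck {
    public static boolean canReserve(long allocatedMb, long reservedMb,
                                     long requestMb, long queueMaxMb) {
        return allocatedMb + reservedMb + requestMb <= queueMaxMb;
    }
}
```

A test along these lines could verify that reservations never let allocated + reserved exceed the max, which is task 1) from Wangda's earlier comment.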
[jira] [Updated] (YARN-4822) Refactor existing Preemption Policy of CS for easier adding new approach to select preemption candidates
[ https://issues.apache.org/jira/browse/YARN-4822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4822: - Attachment: YARN-4822.4.patch Attached ver.4 patch; fixed unit test failures and javac warnings. > Refactor existing Preemption Policy of CS for easier adding new approach to > select preemption candidates > > > Key: YARN-4822 > URL: https://issues.apache.org/jira/browse/YARN-4822 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4822.1.patch, YARN-4822.2.patch, YARN-4822.3.patch, > YARN-4822.4.patch > > > Currently, ProportionalCapacityPreemptionPolicy has hard coded logic to > select candidates to be preempted (based on FIFO order of > applications/containers). It's not simple to add new candidate-selection > logics, such as preemption for large container, intra-queue fairness/policy, > etc. > In this JIRA, I propose to do following changes: > 1) Cleanup code bases, consolidate current logic into 3 stages: > - Compute ideal sharing of queues > - Select to-be-preempt candidates > - Send preemption/kill events to scheduler > 2) Add a new interface: {{PreemptionCandidatesSelectionPolicy}} for above > "select to-be-preempt candidates" part. Move existing how to select > candidates logics to {{FifoPreemptionCandidatesSelectionPolicy}}. > 3) Allow multiple PreemptionCandidatesSelectionPolicies work together in a > chain. Preceding PreemptionCandidatesSelectionPolicy has higher priority to > select candidates, and later PreemptionCandidatesSelectionPolicy can make > decisions according to already selected candidates and pre-computed queue > ideal shares of resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4826) Document configuration of ReservationSystem for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-4826: - Attachment: YARN-4826.v1.patch > Document configuration of ReservationSystem for CapacityScheduler > - > > Key: YARN-4826 > URL: https://issues.apache.org/jira/browse/YARN-4826 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Subru Krishnan >Assignee: Subru Krishnan >Priority: Minor > Attachments: YARN-4826.v1.patch > > > This JIRA tracks the effort to add documentation on how to configure > ReservationSystem for CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209288#comment-15209288 ] Jonathan Maron commented on YARN-4757: -- {quote} Are there situations where you would just return the IP Address of the node the container is running on? {quote} One situation I can readily think of is YARN Linux containers, etc., that are not assigned an IP. The appropriate way to manage those should be considered (I can add this to Open Issues in the next revision). {quote} Does that mean that we will return records for any service API no matter how the IP Addresses are assigned, or there is no way for the IP Address to not be available? {quote} Application records are generally associated with the AM and the host on which it resides (at least that's true of the Slider use cases, which are the only ones currently making use of the YARN registry and service records). So most of the CNAME/TXT records mapping to an API will leverage that host IP. {quote} How is authentication with zookeeper handled? Is it always SASL+kerberos? {quote} Probably best to just point you to this writeup: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/registry/registry-security.html {quote} Would we be exposing SRV records for both of these combinations? If so how would they be named? {quote} Yes. The current design calls for the creation and registration of SRV records that carry both the application name and the API names. {quote} I am not an expert on DNS so if I say something silly after you stop laughing please let me know {quote} I have been working with dnsjava and BIND trying to learn the internals for the last few months, so I'm by no means an expert. And I'm not going to laugh - if anything I'm going to thank you profusely for the help! {quote} What about limits on the number of IP addresses that can be returned for a given name.
I could not find anything specific but I have to assume that in practice most systems don't support a huge number of these, and large clusters on YARN can easily launch hundreds or even thousands of containers for a given service. {quote} I'd have to look into the relevant RFCs and other literature to see if there is a length limit. Generally documentation points to the host name RFC (1123?). I think limits on the length of a name would also be dictated by other software products (DBs, etc.), so we'd have to consider any "shortening" that may be required. You can have multiple addresses mapped to a single name, e.g.
{code}
HW10386:hadoop jmaron$ nslookup www.google.com
Server:     192.168.1.1
Address:    192.168.1.1#53

Non-authoritative answer:
Name: www.google.com
Address: 63.117.14.150
Name: www.google.com
Address: 63.117.14.151
Name: www.google.com
Address: 63.117.14.155
Name: www.google.com
Address: 63.117.14.154
Name: www.google.com
Address: 63.117.14.153
Name: www.google.com
Address: 63.117.14.148
Name: www.google.com
Address: 63.117.14.149
Name: www.google.com
Address: 63.117.14.152
{code}
So, some of the naming conventions (e.g. component name) may point to multiple container IPs. Addressing that through component name uniqueness (there is a Slider JIRA for that) may be one possibility. {quote} In addition to Allen's concerns the document does not seem to address/call out my initial concerns about requiring mutual authentication, or handling of port availability in scheduling. {quote} I'm going to need a little more help in understanding these concerns. The approach we provide is targeted at supporting standard DNS clients, and DNS does not provide for mutual authentication - the concept of restricting who can query the DNS for records is considered outside the scope of the protocol.
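The multi-address behavior shown in the nslookup output can also be observed programmatically with the JDK resolver. A minimal sketch — "localhost" is used here only so the example works without external DNS; any multi-homed name would do:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Resolves every address bound to a name, mirroring the multi-record
// nslookup output above. Returns an empty array on lookup failure.
public class MultiAddressLookup {
    public static String[] resolveAll(String host) {
        try {
            InetAddress[] addrs = InetAddress.getAllByName(host);
            String[] out = new String[addrs.length];
            for (int i = 0; i < addrs.length; i++) {
                out[i] = addrs[i].getHostAddress();
            }
            return out;
        } catch (UnknownHostException e) {
            return new String[0];
        }
    }
}
```

A client that round-robins over the returned array gets crude load distribution for free, which is one reason multiple A records per name are common for services.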
As for port availability - currently the DNS implementation is targeted at relaying the port assignments as designated by YARN scheduler, rather than actively participating in the scheduling itself. So I assume I'm misunderstanding > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of serviceÂ-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry specific (java) API and a REST interface. In practice,
[jira] [Commented] (YARN-4436) DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled
[ https://issues.apache.org/jira/browse/YARN-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209209#comment-15209209 ] Hadoop QA commented on YARN-4436: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 7s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 13s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 15s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 20s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 12s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 0m 13s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 11s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 12s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 11s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: patch generated 1 new + 50 unchanged - 1 fixed = 51 total (was 51) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 38s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 10s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 12s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 16s {color} | {color:green} hadoop-yarn-applications-distributedshell in the patch passed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 29s {color} | {color:green} hadoop-yarn-applications-distributedshell in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 27m 42s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:fbe3e86 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12795048/YARN-4436.001.patch | | JIRA Issue | YARN-4436 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 04b6529d8b2a
[jira] [Updated] (YARN-4676) Automatic and Asynchronous Decommissioning Nodes Status Tracking
[ https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Zhi updated YARN-4676: - Attachment: YARN-4676.008.patch rebased to latest trunk code, merged and resolved conflict with the recently-added DECOMMISSIONING node resource update logic. > Automatic and Asynchronous Decommissioning Nodes Status Tracking > > > Key: YARN-4676 > URL: https://issues.apache.org/jira/browse/YARN-4676 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Daniel Zhi >Assignee: Daniel Zhi > Labels: features > Attachments: GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, > YARN-4676.005.patch, YARN-4676.006.patch, YARN-4676.007.patch, > YARN-4676.008.patch > > > DecommissioningNodeWatcher inside ResourceTrackingService tracks > DECOMMISSIONING nodes' status automatically and asynchronously after the > client/admin makes the graceful decommission request. It tracks > DECOMMISSIONING nodes' status to decide when, after all running containers on > the node have completed, the node will be transitioned into the DECOMMISSIONED > state. NodesListManager detects and handles include and exclude list changes > to kick off decommission or recommission as necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209067#comment-15209067 ] Robert Joseph Evans commented on YARN-4757: --- I also did a quick pass through the document and I wanted to clarify a few things. In some places in the document, like with names that map to containers and names that map to components, it says something like "If Available", indicating that if an IP address is not assigned to the individual container, no mapping will be made. Am I interpreting that correctly? Are there situations where you would just return the IP address of the node the container is running on? Or am I mistaken in my interpretation, and there are different situations where we could launch a container that would have no IP address available? However, for the per-application records there is no such conditional. Does that mean that we will return records for any service API no matter how the IP addresses are assigned, or is there no way for the IP address to not be available? Also, I am not super familiar with the Slider registry, so perhaps you could clarify a few things there too. How is authentication with ZooKeeper handled? Is it always SASL+Kerberos? I ask because the doc mentions that the RM has to set up the base user directory with permissions. Would any secure Slider app that wants to use the registry then be required to ship a keytab with its application? Also, I am not super familiar with the existing registry API; the example in the doc shows a few different types of services that an Application Master can register, both Host/Port and URI. Would we be exposing SRV records for both of these combinations? If so, how would they be named? I am also curious about limits to various DNS fields, both in the protocol and in practice with common implementations. I am not an expert on DNS, so if I say something silly, after you stop laughing please let me know.
The document talks a lot about doing character remapping and having to have unique application names, but it does not talk about limits to the lengths of those names (I have seen some DNS servers that don't support names longer than 254 characters). What about limits on the number of IP addresses that can be returned for a given name? I could not find anything specific, but I have to assume that in practice most systems don't support a huge number of these, and large clusters on YARN can easily launch hundreds or even thousands of containers for a given service. In addition to Allen's concerns, the document does not seem to address/call out my initial concerns about requiring mutual authentication, or handling of port availability in scheduling. > [Umbrella] Simplified discovery of services via DNS mechanisms > -- > > Key: YARN-4757 > URL: https://issues.apache.org/jira/browse/YARN-4757 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Jonathan Maron > Attachments: YARN-4757- Simplified discovery of services via DNS > mechanisms.pdf > > > [See overview doc at YARN-4692, copying the sub-section (3.2.10.2) to track > all related efforts.] > In addition to completing the present story of service-registry (YARN-913), > we also need to simplify the access to the registry entries. The existing > read mechanisms of the YARN Service Registry are currently limited to a > registry-specific (java) API and a REST interface. In practice, this makes it > very difficult to wire up existing clients and services. For e.g., dynamic > configuration of dependent endpoints of a service is not easy to implement > using the present registry-read mechanisms, *without* code changes to > existing services. > A good solution to this is to expose the registry information through a more > generic and widely used discovery mechanism: DNS. Service Discovery via DNS > uses the well-known DNS interfaces to browse the network for services.
> YARN-913 in fact talked about such a DNS-based mechanism but left it as a > future task. Having the registry information exposed via DNS > simplifies the life of services.
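As a side note on the name-length question above: RFC 1035 caps a full domain name at 255 octets (253 characters in the usual dotted text form) and each dot-separated label at 63 octets. A minimal validator illustrating those limits; the class and method names here are invented for illustration and are not part of the proposal:

```java
// Hypothetical helper showing the RFC 1035 limits discussed above:
// each dot-separated label is at most 63 octets, and the whole name
// at most 253 characters in textual form.
public class DnsNameValidator {
    public static boolean isValidDnsName(String name) {
        if (name.isEmpty() || name.length() > 253) {
            return false;
        }
        for (String label : name.split("\\.", -1)) {
            if (label.isEmpty() || label.length() > 63) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidDnsName("regionserver.hbase.user.example.com")); // true
        // A single 64-character label already exceeds the per-label limit.
        System.out.println(isValidDnsName("a".repeat(64) + ".example.com"));        // false
    }
}
```

Any generated application/container names would have to fit within these bounds after the character remapping the document describes.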
[jira] [Commented] (YARN-4436) DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled
[ https://issues.apache.org/jira/browse/YARN-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209022#comment-15209022 ] Daniel Templeton commented on YARN-4436: LGTM. +1 (non-binding). [~rkanter], wanna do the honors after Jenkins reports back? > DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled > - > > Key: YARN-4436 > URL: https://issues.apache.org/jira/browse/YARN-4436 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications/distributed-shell >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Matt LaMantia >Priority: Trivial > Attachments: YARN-4436.001.patch > > > It should be ExecBatScriptStringPath. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4436) DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled
[ https://issues.apache.org/jira/browse/YARN-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt LaMantia updated YARN-4436: Attachment: YARN-4436.001.patch > DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled > - > > Key: YARN-4436 > URL: https://issues.apache.org/jira/browse/YARN-4436 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications/distributed-shell >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Matt LaMantia >Priority: Trivial > Attachments: YARN-4436.001.patch > > > It should be ExecBatScriptStringPath. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209001#comment-15209001 ] Varun Saxena commented on YARN-3863: [~sjlee0], kindly review. I had replaced the patch after rebasing it. > Support complex filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-YARN-2928.v2.01.patch, > YARN-3863-YARN-2928.v2.02.patch, YARN-3863-YARN-2928.v2.03.patch, > YARN-3863-YARN-2928.v2.04.patch, YARN-3863-YARN-2928.v2.05.patch, > YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true i.e. only AND operation is supported. We can > support OR operation for the filters as well. Additionally as primary backend > implementation is HBase, we can design our filters in a manner, where they > closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
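For readers following the AND vs. OR discussion above: the semantics can be sketched with plain java.util.function predicates, where HBase's FilterList with MUST_PASS_ALL / MUST_PASS_ONE plays the analogous role. This is an illustrative stand-in, not the actual TimelineReader or TimelineEntity API:

```java
import java.util.function.Predicate;

public class FilterDemo {
    // Stand-in for a timeline entity; not the real TimelineEntity class.
    static class Entity {
        final String user;
        final long createdTime;
        Entity(String user, long createdTime) {
            this.user = user;
            this.createdTime = createdTime;
        }
    }

    static final Predicate<Entity> BY_USER = e -> e.user.equals("varun");
    static final Predicate<Entity> RECENT  = e -> e.createdTime > 1000L;

    public static void main(String[] args) {
        Entity e = new Entity("varun", 500L);
        // Current semantics: all conditions must hold (like MUST_PASS_ALL).
        System.out.println(BY_USER.and(RECENT).test(e)); // false
        // Proposed OR semantics: any condition may hold (like MUST_PASS_ONE).
        System.out.println(BY_USER.or(RECENT).test(e));  // true
    }
}
```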
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208950#comment-15208950 ] Jonathan Maron commented on YARN-4757: -- a) A records are more usable from an existing client interaction perspective. For example, you can use a tool such as nslookup to map from a known name to its IP. You could potentially leverage an SRV record in that instance, but you'd have to go into the interactive mode of nslookup, set the type, and then perform the query - a less intuitive and less well-known approach. b) It's not a matter of managing a named.conf file as much as setting up BIND to support the dynamic update protocol (YARN containers will come up and go down, and those record updates may be relatively frequent). In addition, the stateful complaint has more to do with the need to sync state in multiple processes rather than rely on one source of truth. Finally, the security needs for an internal zone server are finite enough that, if security were the primary driver, the BIND selection would be overkill. c) I'm not familiar with Manta (even initial web searches didn't seem to bring anything up). If there is an open source, available solution I'd be more than happy to evaluate its potential use. d) I'm not sure the problem is necessarily solved. DNS is well understood, obviously. But the use case here - mirroring the details of an existing ZK-based registry or, more accurately, the state of the YARN cluster - presents some requirements that can perhaps best be addressed by a tailored solution. Given the availability of APIs such as dnsjava etc., the approach is not necessarily daunting from a development perspective. As such, testing can be performed to address security and performance concerns, though I'm not naive - I understand some issues will not manifest till actual deployment.
[jira] [Commented] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208925#comment-15208925 ] Allen Wittenauer commented on YARN-4757: I did a quick pass, so I'll need to read more in-depth, but I have some concerns: a) I'm still not sure what value registering A records are here when you can point a SRV record in the fake DNS zone to an existing host in an existing zone using the existing DNS services. This eliminates a ton of corner cases (split zones, NAT, multi-nics, etc) that will need to be covered when registering As. b) The BIND cons are very... odd: * I'm not particularly sure what you find complex about BIND? Most named.conf's aren't complex and rarely change after initial install in my experience. Managing the zone files isn't particularly hard and lots of tools exist in this space for large scale deployments. * You're effectively trading multiple instances of BIND for multiple instances of ZK. * I don't understand the 'stateful' complaint given that, again, you're trading state of BIND for the state stored in ZK. * Better security requirements sounds like a good thing to me... c) Where are the comparisons with other open source DNS solutions? Doesn't Manta already have something exactly like this already? d) The NIH DNS server solution: * "No operational dependencies on elements external to the Hadoop cluster"... Nothing says "thrown over the fence" like "no operational dependencies" when stated by a developer. * it's unknown how well it's going to perform at scale. * no idea how secure it's actually going to be--spoofing, MITM, etc. * admins have zero experience with it vs. pre-existing solutions so will be a knowledge gap. (Never mind the "the software doesn't exist yet so how can someone have experience with it?" problem...) 
* increases the source footprint for what is effectively a solved problem
[jira] [Commented] (YARN-1040) De-link container life cycle from an Allocation
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208906#comment-15208906 ] Bikas Saha commented on YARN-1040: -- It would be great if existing apps can use the changes in YARN-1040 to be able to run more than a single process (sequentially or concurrently). If we use YARN-1040 to build the primitives here then those primitives could be used for the broader work designed for services (which seems to be indicated in the design doc). Without YARN-1040, existing java based apps cannot use features like increasing container memory because the JVM has to be restarted before it can grow to a larger size. I can see the argument of asking users to use new APIs for new features but requiring existing apps to change their AM/RM implementations (that have been stabilized with much effort) just to be able to launch multiple processes does not seem empathetic. Separately from this, I have not been actively involved in the project for a while. Hence my understanding of the scope and semantic changes proposed in it may be stale and I may be inaccurate in thinking that these are fundamental enough to be done in a special jira for that purpose for a wider discussion. You guys can make a call on that. > De-link container life cycle from an Allocation > --- > > Key: YARN-1040 > URL: https://issues.apache.org/jira/browse/YARN-1040 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran > Attachments: YARN-1040-rough-design.pdf > > > The AM should be able to exec >1 process in a container, rather than have the > NM automatically release the container when the single process exits. > This would let an AM restart a process on the same container repeatedly, > which for HBase would offer locality on a restarted region server. 
> We may also want the ability to exec multiple processes in parallel, so that > something could be run in the container while a long-lived process was > already running. This can be useful in monitoring and reconfiguring the > long-lived process, as well as shutting it down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2883) Queuing of container requests in the NM
[ https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208893#comment-15208893 ] Konstantinos Karanasos commented on YARN-2883: -- Thanks for the feedback, [~chris.douglas] and [~kasha]! I am in the process of addressing Chris' comments -- will upload a new patch soon. Regarding Karthik's comments: bq. Any reason we use a map instead of a queue to store the queued containers? I am using a Map only to track the allocated containers; for the queued containers, I am using a queue, as you suggest. bq. I like that QueuingContainerManagerImpl extends ContainerManagerImpl - while we harden the queuing side of things, it will help keep the code clean. In the longer run, we might want to default to Queuing implementation and play with the queue length, but we can cross that bridge when we get there. Agreed, that was exactly our intention too. bq. IIUC, the intent is to use queueing for all opportunistic containers. The ContainerManagerImpl implementation seems to depend on whether queuing is enabled - wouldn't that affect all containers and not just opportunistic containers? In most cases (including distributed scheduling and resource over-commitment), queues will indeed only be used for opportunistic containers. However, as long as queuing is enabled, guaranteed containers might need to be queued momentarily until the opportunistic containers that block their execution get killed. That's the reason you see guaranteed containers going through the same code-path too. But again, this will not break any semantics of the guaranteed containers. bq. The patch has the author's name left against a TODO. Also, we don't want to leave orphaned TODOs - let us go ahead and file a JIRA True, I will make sure I remove any TODOs and author names. bq. The ResourceUtilization changes are not strictly related to this patch, do they? This is correct. 
I put them in this JIRA because they are just a couple of methods. Do you think I should create a separate JIRA for this? bq. TestQueuingContainerMgr: We typically don't wrap imports at 80 chars. Yep, will fix that. > Queuing of container requests in the NM > --- > > Key: YARN-2883 > URL: https://issues.apache.org/jira/browse/YARN-2883 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Attachments: YARN-2883-trunk.004.patch, > YARN-2883-yarn-2877.001.patch, YARN-2883-yarn-2877.002.patch, > YARN-2883-yarn-2877.003.patch, YARN-2883-yarn-2877.004.patch > > > We propose to add a queue in each NM, where queueable container requests can > be held. > Based on the available resources in the node and the containers in the queue, > the NM will decide when to allow the execution of a queued container. > In order to ensure the instantaneous start of a guaranteed-start container, > the NM may decide to pre-empt/kill running queueable containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
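The allocated-map plus queued-queue split described above might look roughly like the following. All names are illustrative; the real QueuingContainerManagerImpl fields and container types differ:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch only: queued container requests wait in FIFO order,
// while allocated (running) containers are tracked by id for lookup/kill.
public class NmQueueSketch {
    private final Queue<String> queuedContainers = new ArrayDeque<>();
    private final Map<String, String> allocatedContainers = new HashMap<>();

    public void enqueue(String containerId) {
        queuedContainers.add(containerId);
    }

    // Start the oldest queued container once the node has room; returns the
    // started container id, or null if nothing could be started.
    public String startNextIfRoomAvailable(boolean roomAvailable) {
        if (!roomAvailable || queuedContainers.isEmpty()) {
            return null;
        }
        String id = queuedContainers.poll();
        allocatedContainers.put(id, "RUNNING");
        return id;
    }

    public int queuedCount() {
        return queuedContainers.size();
    }
}
```

The FIFO queue keeps queued requests ordered, while the map gives O(1) access to a running container when a guaranteed container needs an opportunistic one killed.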
[jira] [Updated] (YARN-4757) [Umbrella] Simplified discovery of services via DNS mechanisms
[ https://issues.apache.org/jira/browse/YARN-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Maron updated YARN-4757: - Attachment: YARN-4757- Simplified discovery of services via DNS mechanisms.pdf I've posted a document providing greater detail concerning this effort. It is intended as a description of the background, a proposed architectural approach, implementation details, and some open issues. I've already had some initial reviews that were of great help in both describing existing points and identifying additional ones. /cc [~vvasudev], [~vinodkv], [~sidharta-s], [~ste...@apache.org], [~elserj]
[jira] [Updated] (YARN-4436) DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled
[ https://issues.apache.org/jira/browse/YARN-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Templeton updated YARN-4436: --- Assignee: Matt LaMantia (was: Devon Michaels) > DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled > - > > Key: YARN-4436 > URL: https://issues.apache.org/jira/browse/YARN-4436 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications/distributed-shell >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Matt LaMantia >Priority: Trivial > > It should be ExecBatScriptStringPath. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4813) TestRMWebServicesDelegationTokenAuthentication.testDoAs fails intermittently
[ https://issues.apache.org/jira/browse/YARN-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208734#comment-15208734 ] Daniel Templeton commented on YARN-4813: Nope. Putting this one on the back burner for now. > TestRMWebServicesDelegationTokenAuthentication.testDoAs fails intermittently > > > Key: YARN-4813 > URL: https://issues.apache.org/jira/browse/YARN-4813 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.0 >Reporter: Daniel Templeton > > {noformat}
> -------------------------------------------------------
>  T E S T S
> -------------------------------------------------------
> Running org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication
> Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 11.627 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication
> testDoAs[0](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication) Time elapsed: 0.208 sec <<< ERROR!
> java.io.IOException: Server returned HTTP response code: 403 for URL: http://localhost:8088/ws/v1/cluster/delegation-token
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626)
> at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication$3.call(TestRMWebServicesDelegationTokenAuthentication.java:407)
> at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication$3.call(TestRMWebServicesDelegationTokenAuthentication.java:398)
> at org.apache.hadoop.security.authentication.KerberosTestUtils$1.run(KerberosTestUtils.java:120)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.authentication.KerberosTestUtils.doAs(KerberosTestUtils.java:117)
> at org.apache.hadoop.security.authentication.KerberosTestUtils.doAsClient(KerberosTestUtils.java:133)
> at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication.getDelegationToken(TestRMWebServicesDelegationTokenAuthentication.java:398)
> at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication.testDoAs(TestRMWebServicesDelegationTokenAuthentication.java:357)
> Results :
> Tests in error:
> TestRMWebServicesDelegationTokenAuthentication.testDoAs:357->getDelegationToken:398 » IO
> Tests run: 8, Failures: 0, Errors: 1, Skipped: 0
> {noformat}
[jira] [Commented] (YARN-2883) Queuing of container requests in the NM
[ https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208731#comment-15208731 ] Karthik Kambatla commented on YARN-2883: Just skimmed through the patch. Will take a more thorough look once these and Chris' comments are addressed: # Any reason we use a map instead of a queue to store the queued containers? # I like that QueuingContainerManagerImpl extends ContainerManagerImpl - while we harden the queuing side of things, it will help keep the code clean. In the longer run, we might want to default to Queuing implementation and play with the queue length, but we can cross that bridge when we get there. # IIUC, the intent is to use queueing for all opportunistic containers. The ContainerManagerImpl implementation seems to depend on whether queuing is enabled - wouldn't that affect all containers and not just opportunistic containers? # The patch has the author's name left against a TODO. Also, we don't want to leave orphaned TODOs - let us go ahead and file a JIRA # The ResourceUtilization changes are not strictly related to this patch, do they? # If ContainerExecutionEvent is only used by the Queuing implementation, should the class name reflect that? # TestQueuingContainerMgr: We typically don't wrap imports at 80 chars. > Queuing of container requests in the NM > --- > > Key: YARN-2883 > URL: https://issues.apache.org/jira/browse/YARN-2883 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Attachments: YARN-2883-trunk.004.patch, > YARN-2883-yarn-2877.001.patch, YARN-2883-yarn-2877.002.patch, > YARN-2883-yarn-2877.003.patch, YARN-2883-yarn-2877.004.patch > > > We propose to add a queue in each NM, where queueable container requests can > be held. 
> Based on the available resources in the node and the containers in the queue, > the NM will decide when to allow the execution of a queued container. > In order to ensure the instantaneous start of a guaranteed-start container, > the NM may decide to pre-empt/kill running queueable containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4660) o.a.h.yarn.event.TestAsyncDispatcher.testDispatcherOnCloseIfQueueEmpty() swallows YarnExceptions
[ https://issues.apache.org/jira/browse/YARN-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Templeton resolved YARN-4660. Resolution: Invalid > o.a.h.yarn.event.TestAsyncDispatcher.testDispatcherOnCloseIfQueueEmpty() > swallows YarnExceptions > > > Key: YARN-4660 > URL: https://issues.apache.org/jira/browse/YARN-4660 > Project: Hadoop YARN > Issue Type: Improvement > Components: test >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Minor > > Either we expect the exception, or we don't. Quietly swallowing it is the > wrong thing to do in any case. Introduced in YARN-3878. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4820) ResourceManager web redirects in HA mode drops query parameters
[ https://issues.apache.org/jira/browse/YARN-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208674#comment-15208674 ] Junping Du commented on YARN-4820: -- +1. Will commit it shortly if no further comments. > ResourceManager web redirects in HA mode drops query parameters > --- > > Key: YARN-4820 > URL: https://issues.apache.org/jira/browse/YARN-4820 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: YARN-4820.001.patch, YARN-4820.002.patch, > YARN-4820.003.patch > > > The RMWebAppFilter redirects http requests from the standby to the active. > However it drops all the query parameters when it does the redirect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
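The essence of the fix is to carry the original query string over when building the redirect target. A minimal sketch under assumed names (this is not the actual RMWebAppFilter code, just the bug class it addresses):

```java
import java.net.URI;

// Illustrative only: building the redirect from the path alone drops the
// query string; appending getQuery() when present preserves it.
public class RedirectSketch {
    public static String buildRedirect(String activeRmBase, URI requested) {
        String target = activeRmBase + requested.getPath();
        if (requested.getQuery() != null) {
            target += "?" + requested.getQuery();
        }
        return target;
    }

    public static void main(String[] args) {
        URI req = URI.create("http://standby:8088/ws/v1/cluster/apps?states=RUNNING&limit=10");
        System.out.println(buildRedirect("http://active:8088", req));
        // -> http://active:8088/ws/v1/cluster/apps?states=RUNNING&limit=10
    }
}
```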
[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208653#comment-15208653 ] Eric Payne commented on YARN-4686: -- {quote} Hi Eric Badger and Eric Payne, TestMRJobs#testJobWithChangePriority is failing after this issue. Would you fix the test failure? I've filed MAPREDUCE-6658 for fixing the failure. {quote} Thanks, [~ajisakaa] for reporting this. [~ebadger] is looking into this. > MiniYARNCluster.start() returns before cluster is completely started > > > Key: YARN-4686 > URL: https://issues.apache.org/jira/browse/YARN-4686 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith Sharma K S >Assignee: Eric Badger > Fix For: 2.7.3 > > Attachments: MAPREDUCE-6507.001.patch, > YARN-4686-branch-2.7.006.patch, YARN-4686.001.patch, YARN-4686.002.patch, > YARN-4686.003.patch, YARN-4686.004.patch, YARN-4686.005.patch, > YARN-4686.006.patch > > > TestRMNMInfo fails intermittently. Below is trace for the failure > {noformat}
> testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 sec <<< FAILURE!
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but was:<3>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111)
> {noformat}
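Until start() blocks properly, tests typically work around the race by polling for readiness (e.g. the RM's live-node count, which is exactly what the TestRMNMInfo assertion above races against). A generic sketch of such a wait helper, with the readiness check left to the caller:

```java
import java.util.function.BooleanSupplier;

// Generic poll-until-ready helper of the kind tests use to work around
// MiniYARNCluster.start() returning early. The readiness check itself
// (e.g. "are all 4 NMs registered?") is supplied by the caller.
public class WaitUtil {
    public static boolean waitFor(BooleanSupplier ready, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (ready.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A test would call something like `waitFor(() -> cluster.liveNodeCount() == 4, 30000, 100)` before making assertions (method names hypothetical).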
[jira] [Commented] (YARN-4814) ATS 1.5 timelineclient impl call flush after every event write
[ https://issues.apache.org/jira/browse/YARN-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208649#comment-15208649 ] Hudson commented on YARN-4814: -- FAILURE: Integrated in Hadoop-trunk-Commit #9489 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9489/]) YARN-4814. ATS 1.5 timelineclient impl call flush after every event (junping_du: rev af1d125f9ce35ec69a610674a1c5c60cc17141a7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/FileSystemTimelineWriter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java > ATS 1.5 timelineclient impl call flush after every event write > -- > > Key: YARN-4814 > URL: https://issues.apache.org/jira/browse/YARN-4814 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Xuan Gong >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-4814.1.patch, YARN-4814.2.patch > > > ATS 1.5 timelineclient impl call flush after every event write. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4183) Enabling generic application history forces every job to get a timeline service delegation token
[ https://issues.apache.org/jira/browse/YARN-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208618#comment-15208618 ] Naganarasimha G R commented on YARN-4183: - Hi [~sjlee0] & [~jeagles], shall we conclude on this? Otherwise we may miss it eventually. > Enabling generic application history forces every job to get a timeline > service delegation token > > > Key: YARN-4183 > URL: https://issues.apache.org/jira/browse/YARN-4183 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Mit Desai >Assignee: Naganarasimha G R > Attachments: YARN-4183.1.patch, YARN-4183.v1.001.patch, > YARN-4183.v1.002.patch > > > When enabling just the Generic History Server and not the timeline server, > the system metrics publisher will not publish the events to the timeline > store as it checks if the timeline server and system metrics publisher are > enabled before creating a timeline client. > To make it work, if the timeline service flag is turned on, it will force > every yarn application to get a delegation token. > Instead of checking if timeline service is enabled, we should be checking if > application history server is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4759) Revisit signalContainer() for docker containers
[ https://issues.apache.org/jira/browse/YARN-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208622#comment-15208622 ] Shane Kumpf commented on YARN-4759: --- Also of note, we should propagate the exit code that killed the container up to the end user, so they can verify that exotic signal handling worked appropriately. This can be achieved by retrieving the exit code from the container and subtracting 128 to get the actual signal sent. {code} docker inspect -f '{{.State.ExitCode}}' {code} > Revisit signalContainer() for docker containers > --- > > Key: YARN-4759 > URL: https://issues.apache.org/jira/browse/YARN-4759 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Sidharta Seethana >Assignee: Shane Kumpf > > The current signal handling (in the DockerContainerRuntime) needs to be > revisited for docker containers. For example, container reacquisition on NM > restart might not work, depending on which user the process in the container > runs as. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
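The 128+N arithmetic described above can be sketched as a small helper (a minimal sketch; the function name is hypothetical — the exit code itself would come from `docker inspect -f '{{.State.ExitCode}}'`):

```python
def signal_from_exit_code(exit_code):
    """Recover the terminating signal from a Docker container exit code,
    using the 128+N convention the comment above relies on.

    Returns None when the process exited normally (code <= 128),
    i.e. it was not killed by a signal."""
    return exit_code - 128 if exit_code > 128 else None
```

For example, an exit code of 137 maps to signal 9 (SIGKILL) and 143 maps to signal 15 (SIGTERM).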
[jira] [Commented] (YARN-4759) Revisit signalContainer() for docker containers
[ https://issues.apache.org/jira/browse/YARN-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208603#comment-15208603 ] Shane Kumpf commented on YARN-4759: --- We need to use docker client commands to signal processes in containers, rather than the OS kill command. docker stop sends a SIGTERM to PID 1 and waits 10 seconds for the process to stop (by default; the timeout is configurable); if the container hasn't stopped at the end of the timeout, SIGKILL is sent. docker kill, on the other hand, has no delay and simply sends SIGKILL to PID 1 of the container (by default; the signal is configurable). Signals that invoke graceful shutdown vary between processes. For instance, to gracefully shut down nginx (allowing outstanding requests to finish) SIGQUIT should be sent. For Apache HTTPD, SIGWINCH is used for graceful shutdown. To complicate matters, the docker client sends signals to PID 1 in the container, so depending on whether exec form is used for CMD in the Dockerfile, the process we want to signal may be a subprocess of the shell running as PID 1. Users that require specific signals will need to understand this limitation. We should allow for user-configurable signals and timeouts. There are a couple of approaches to achieve this: 1) Only use docker kill and sleep in Java code. docker kill accepts the --signal argument but does not support a wait timeout. The flow would be: send the signal, then sleep for 10 seconds by default, or for the user-supplied sleep value. 2) Use docker stop if the user has not specified a signal, with the default 10 second timeout or the user-supplied timeout; use docker kill if the user supplies a signal. The default behavior should be to send a SIGTERM, sleep 10 seconds, and, if the container is still running, send SIGKILL. Signals and timeouts should be configurable. How the above impacts NM reacquisition is yet to be determined, but it may make sense to make this an umbrella to split out the required changes. 
/cc [~sidharta-s] - thoughts on the above? > Revisit signalContainer() for docker containers > --- > > Key: YARN-4759 > URL: https://issues.apache.org/jira/browse/YARN-4759 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Sidharta Seethana >Assignee: Shane Kumpf > > The current signal handling (in the DockerContainerRuntime) needs to be > revisited for docker containers. For example, container reacquisition on NM > restart might not work, depending on which user the process in the container > runs as. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
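The signal-then-force-kill flow from approach 1 above could be sketched as follows (a minimal sketch, not the actual NM implementation; the function name, defaults, and injectable `run` parameter are hypothetical and exist only to make the flow explicit):

```python
import subprocess
import time


def stop_container(name, signal="SIGTERM", timeout=10, run=subprocess.run):
    """Sketch of approach 1 above: send the requested signal via
    `docker kill --signal`, wait for the (configurable) timeout, then
    send SIGKILL only if the container is still running.

    `run` is injectable so the flow can be exercised without a Docker
    daemon; in real use it defaults to subprocess.run."""
    # Send the user-supplied (or default) signal to PID 1 of the container.
    run(["docker", "kill", "--signal", signal, name])
    # Give the process time to shut down gracefully.
    time.sleep(timeout)
    # Check whether the container is still running before escalating.
    out = run(["docker", "inspect", "-f", "{{.State.Running}}", name],
              capture_output=True, text=True)
    if getattr(out, "stdout", "").strip() == "true":
        # Still alive after the grace period: force-kill (SIGKILL).
        run(["docker", "kill", name])
```

Note the caveat from the comment above still applies: the signal reaches PID 1, which may be a shell rather than the target process if the Dockerfile's CMD is not in exec form.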
[jira] [Commented] (YARN-1040) De-link container life cycle from an Allocation
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208594#comment-15208594 ] Varun Vasudev commented on YARN-1040: - Thanks for putting up the proposal [~asuresh]! bq. "ContainerId" becomes "AllocationId" Is AllocationId a new class that we will introduce or a rename of the existing ContainerId class? In either case we have some issues to sort out - the first one won't be backward compatible and in the second case, will the NM generate container ids for the individual containers? bq. An AM can receive only a single allocation on a Node, The Scheduler will "bundle" all Allocations on a Node for an app into a single Large Allocation. Can you explain why we need this restriction? bq. Each Container is tagged with a "ContainerId" which is known only to the AM. Are you referring to the current ContainerId class? If yes, why is it known only to the AM? I actually agree with both Vinod and Bikas. The current approach is a little disruptive and not very useful for existing apps. I think we should separate out the allocations work into its own classes on the RM and the NM, with new APIs added for the RM and the NM. I don't think we can get away with modifying the existing APIs, the one exception being the allocate call, where we can add an additional flag to indicate whether an allocation or a container is desired. Internally, we can change the implementation to have the container model use allocations, but I think allocations will have to have their own state machine with slightly different semantics than containers (on both the RM and NM). 
> De-link container life cycle from an Allocation > --- > > Key: YARN-1040 > URL: https://issues.apache.org/jira/browse/YARN-1040 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran > Attachments: YARN-1040-rough-design.pdf > > > The AM should be able to exec >1 process in a container, rather than have the > NM automatically release the container when the single process exits. > This would let an AM restart a process on the same container repeatedly, > which for HBase would offer locality on a restarted region server. > We may also want the ability to exec multiple processes in parallel, so that > something could be run in the container while a long-lived process was > already running. This can be useful in monitoring and reconfiguring the > long-lived process, as well as shutting it down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Umesh Prasad updated YARN-4852: --- Description: Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut down itself. Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% of memory. When digging deeper, there are around 0.5 million objects of UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn contains around 1.7 million objects of YarnProtos$ContainerIdProto, ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of which retain around 1 GB heap. Back to Back Full GC kept on happening. GC wasn't able to recover any heap and went OOM. JVM dumped the heap before quitting. We analyzed the heap. RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 mins time and went OOM. There are no spike in job submissions, container numbers at the time of issue occurrence. was: Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut down itself. GC related settings Settings : -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSParallelRemarkEnabled -XX:InitialTenuringThreshold=1 -XX:+ManagementServer -XX:InitialHeapSize=611042752 -XX:MaxHeapSize=8589934592 -XX:MaxNewSize=348966912 -XX:MaxTenuringThreshold=1 -XX:OldPLABSize=16 -XX:ParallelGCThreads=4 -XX:SurvivorRatio=8 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseConcMarkSweepGC -XX:+UseParNewGC Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% of memory. When digging deeper, there are around 0.5 million objects of UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn contains around 1.7 million objects of YarnProtos$ContainerIdProto, ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of which retain around 1 GB heap. Back to Back Full GC kept on happening. GC wasn't able to recover any heap and went OOM. JVM dumped the heap before quitting. 
We analyzed the heap. RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 mins time and went OOM. There are no spike in job submissions, container numbers at the time of issue occurrence. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut > down itself. > Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% > of memory. When digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn > contains around 1.7 million objects of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of > which retain around 1 GB heap. > Back to Back Full GC kept on happening. GC wasn't able to recover any heap > and went OOM. JVM dumped the heap before quitting. We analyzed the heap. > RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 > mins time and went OOM. > There are no spike in job submissions, container numbers at the time of issue > occurrence. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
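The OOM above is driven by duplicated completed-container statuses each producing a fresh UpdatedContainerInfo on the nodeUpdateQueue; the safeguard proposed in YARN-4862 is to drop statuses already seen before queuing them. A minimal sketch of that guard (names and dict-based statuses are hypothetical — the real code operates on ContainerStatus objects inside RMNodeImpl):

```python
def dedup_completed_containers(statuses, seen_container_ids):
    """Sketch of the duplicate-status guard discussed above: keep only
    completed-container statuses whose containerId has not been reported
    before, so duplicates never become new UpdatedContainerInfo entries.

    `seen_container_ids` is a mutable set carried across heartbeats."""
    fresh = []
    for status in statuses:
        cid = status["containerId"]
        if cid not in seen_container_ids:
            seen_container_ids.add(cid)
            fresh.append(status)
    return fresh
```

In a real implementation the `seen` set would also need to be pruned (e.g. once the AM acknowledges the completed containers), or it would itself grow without bound.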
[jira] [Commented] (YARN-4858) start-yarn and stop-yarn scripts to support timeline and sharedcachemanager
[ https://issues.apache.org/jira/browse/YARN-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208537#comment-15208537 ] Hadoop QA commented on YARN-4858: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 21s {color} | {color:blue} Docker mode activated. {color} | | {color:blue}0{color} | {color:blue} shelldocs {color} | {color:blue} 0m 6s {color} | {color:blue} Shelldocs was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 2s {color} | {color:green} branch-2 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 37s {color} | {color:green} branch-2 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 28s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} shellcheck {color} | {color:red} 0m 7s {color} | {color:red} The applied patch generated 2 new + 498 unchanged - 0 fixed = 500 total (was 498) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 43s {color} | {color:green} hadoop-yarn in the patch passed with JDK v1.8.0_74. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 3s {color} | {color:green} hadoop-yarn in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 19m 0s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:babe025 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12795001/YARN-4858-branch-2.001.patch | | JIRA Issue | YARN-4858 | | Optional Tests | asflicense mvnsite unit shellcheck shelldocs | | uname | Linux dc7d12adebc6 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | branch-2 / 7a3fd1b | | shellcheck | v0.4.3 | | shellcheck | https://builds.apache.org/job/PreCommit-YARN-Build/10857/artifact/patchprocess/diff-patch-shellcheck.txt | | JDK v1.7.0_95 Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/10857/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn U: hadoop-yarn-project/hadoop-yarn | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/10857/console | | Powered by | Apache Yetus 0.2.0 http://yetus.apache.org | This message was automatically generated. 
> start-yarn and stop-yarn scripts to support timeline and sharedcachemanager > --- > > Key: YARN-4858 > URL: https://issues.apache.org/jira/browse/YARN-4858 > Project: Hadoop YARN > Issue Type: Improvement > Components: scripts >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Attachments: YARN-4858-001.patch, YARN-4858-branch-2.001.patch > > > The start-yarn and stop-yarn scripts don't have any (even commented out) > support for the timeline and sharedcachemanager > Proposed: > * bash and cmd start-yarn scripts have commented out start actions > * stop-yarn scripts stop the servers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Umesh Prasad updated YARN-4852: --- Description: Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut down itself. GC related settings Settings : -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSParallelRemarkEnabled -XX:InitialTenuringThreshold=1 -XX:+ManagementServer -XX:InitialHeapSize=611042752 -XX:MaxHeapSize=8589934592 -XX:MaxNewSize=348966912 -XX:MaxTenuringThreshold=1 -XX:OldPLABSize=16 -XX:ParallelGCThreads=4 -XX:SurvivorRatio=8 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseConcMarkSweepGC -XX:+UseParNewGC Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% of memory. When digging deeper, there are around 0.5 million objects of UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn contains around 1.7 million objects of YarnProtos$ContainerIdProto, ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of which retain around 1 GB heap. Back to Back Full GC kept on happening. GC wasn't able to recover any heap and went OOM. JVM dumped the heap before quitting. We analyzed the heap. RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 mins time and went OOM. There are no spike in job submissions, container numbers at the time of issue occurrence. was: Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut down itself. Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% of memory. When digged deep, there are around 0.5 million objects of UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn contains around 1.7 million objects of YarnProtos$ContainerIdProto, ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of which retain around 1 GB heap. Full GC was triggered multiple times when RM went OOM and only 300 MB of heap was released. 
So all these objects look like live objects. RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 mins time and went OOM. There are no spike in job submissions, container numbers at the time of issue occurrence. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut > down itself. > GC related settings Settings : > > -XX:CMSInitiatingOccupancyFraction=75 > -XX:+CMSParallelRemarkEnabled > -XX:InitialTenuringThreshold=1 > -XX:+ManagementServer > -XX:InitialHeapSize=611042752 > -XX:MaxHeapSize=8589934592 > -XX:MaxNewSize=348966912 > -XX:MaxTenuringThreshold=1 > -XX:OldPLABSize=16 > -XX:ParallelGCThreads=4 > -XX:SurvivorRatio=8 > -XX:+UseCMSInitiatingOccupancyOnly > -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC > Heap dump analysis reveals that 1200 instances of RMNodeImpl class hold 86% > of memory. When digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). This in turn > contains around 1.7 million objects of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto each of > which retain around 1 GB heap. > Back to Back Full GC kept on happening. GC wasn't able to recover any heap > and went OOM. JVM dumped the heap before quitting. We analyzed the heap. > RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB in 20 > mins time and went OOM. > There are no spike in job submissions, container numbers at the time of issue > occurrence. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4858) start-yarn and stop-yarn scripts to support timeline and sharedcachemanager
[ https://issues.apache.org/jira/browse/YARN-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-4858: - Attachment: YARN-4858-branch-2.001.patch This patch is for branch-2; trunk will need the same change, reworked for the new bash scripts. The Windows scripts will remain the same. > start-yarn and stop-yarn scripts to support timeline and sharedcachemanager > --- > > Key: YARN-4858 > URL: https://issues.apache.org/jira/browse/YARN-4858 > Project: Hadoop YARN > Issue Type: Improvement > Components: scripts >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Attachments: YARN-4858-001.patch, YARN-4858-branch-2.001.patch > > > The start-yarn and stop-yarn scripts don't have any (even commented out) > support for the timeline and sharedcachemanager > Proposed: > * bash and cmd start-yarn scripts have commented out start actions > * stop-yarn scripts stop the servers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4858) start-yarn and stop-yarn scripts to support timeline and sharedcachemanager
[ https://issues.apache.org/jira/browse/YARN-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-4858: - Attachment: YARN-4858-001.patch Adds the extra services, ready to be uncommented by anyone who wants them. > start-yarn and stop-yarn scripts to support timeline and sharedcachemanager > --- > > Key: YARN-4858 > URL: https://issues.apache.org/jira/browse/YARN-4858 > Project: Hadoop YARN > Issue Type: Improvement > Components: scripts >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Attachments: YARN-4858-001.patch > > > The start-yarn and stop-yarn scripts don't have any (even commented out) > support for the timeline and sharedcachemanager > Proposed: > * bash and cmd start-yarn scripts have commented out start actions > * stop-yarn scripts stop the servers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4847) Add documentation for the Node Label features supported in 2.6
[ https://issues.apache.org/jira/browse/YARN-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208499#comment-15208499 ] Yi Zhou commented on YARN-4847: --- I have simulated the negative case successfully. Thanks [~Naganarasimha] for your patience :) ! > Add documentation for the Node Label features supported in 2.6 > --- > > Key: YARN-4847 > URL: https://issues.apache.org/jira/browse/YARN-4847 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Affects Versions: 2.6.4 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > We constantly face issue with what are the node label supported features in > 2.6 and general commands to use it. So it would be better to have > documentation capturing what all is supported as part of 2.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4858) start-yarn and stop-yarn scripts to support timeline and sharedcachemanager
Steve Loughran created YARN-4858: Summary: start-yarn and stop-yarn scripts to support timeline and sharedcachemanager Key: YARN-4858 URL: https://issues.apache.org/jira/browse/YARN-4858 Project: Hadoop YARN Issue Type: Improvement Components: scripts Affects Versions: 2.8.0 Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor The start-yarn and stop-yarn scripts don't have any (even commented out) support for the timeline and sharedcachemanager Proposed: * bash and cmd start-yarn scripts have commented out start actions * stop-yarn scripts stop the servers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics
[ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3816: - Assignee: Li Lu (was: Junping Du) > [Aggregation] App-level aggregation and accumulation for YARN system metrics > > > Key: YARN-3816 > URL: https://issues.apache.org/jira/browse/YARN-3816 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Li Lu > Labels: yarn-2928-1st-milestone > Attachments: Application Level Aggregation of Timeline Data.pdf, > YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, > YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, > YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, > YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, > YARN-3816-feature-YARN-2928.v4.1.patch, YARN-3816-poc-v1.patch, > YARN-3816-poc-v2.patch > > > We need application level aggregation of Timeline data: > - To present end user aggregated states for each application, include: > resource (CPU, Memory) consumption across all containers, number of > containers launched/completed/failed, etc. We need this for apps while they > are running as well as when they are done. > - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be > aggregated to show details of states in framework level. > - Other level (Flow/User/Queue) aggregation can be more efficient to be based > on Application-level aggregations rather than raw entity-level data as much > less raws need to scan (with filter out non-aggregated entities, like: > events, configurations, etc.). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics
[ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208443#comment-15208443 ] Junping Du commented on YARN-3816: -- Sorry guys. I was planning to finish this a few months ago, but we rebased the code several times and my bandwidth has been quite limited recently. Assigning to Li to follow up on the patch work, as his YARN-3817 depends on this JIRA. > [Aggregation] App-level aggregation and accumulation for YARN system metrics > > > Key: YARN-3816 > URL: https://issues.apache.org/jira/browse/YARN-3816 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du > Labels: yarn-2928-1st-milestone > Attachments: Application Level Aggregation of Timeline Data.pdf, > YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, > YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, > YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, > YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, > YARN-3816-feature-YARN-2928.v4.1.patch, YARN-3816-poc-v1.patch, > YARN-3816-poc-v2.patch > > > We need application level aggregation of Timeline data: > - To present end user aggregated states for each application, include: > resource (CPU, Memory) consumption across all containers, number of > containers launched/completed/failed, etc. We need this for apps while they > are running as well as when they are done. > - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be > aggregated to show details of states in framework level. > - Other level (Flow/User/Queue) aggregation can be more efficient to be based > on Application-level aggregations rather than raw entity-level data as much > less raws need to scan (with filter out non-aggregated entities, like: > events, configurations, etc.). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3959) Store application related configurations in Timeline Service v2
[ https://issues.apache.org/jira/browse/YARN-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3959: - Assignee: Varun Saxena (was: Junping Du) > Store application related configurations in Timeline Service v2 > --- > > Key: YARN-3959 > URL: https://issues.apache.org/jira/browse/YARN-3959 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > > We already have configuration field in HBase schema for application entity. > We need to make sure AM write it out when it get launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4856) RM /ws/v1/cluster/scheduler JSON format Error
[ https://issues.apache.org/jira/browse/YARN-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208434#comment-15208434 ] Daniel Templeton commented on YARN-4856: I'll look into it. > RM /ws/v1/cluster/scheduler JSON format Error > -- > > Key: YARN-4856 > URL: https://issues.apache.org/jira/browse/YARN-4856 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 > Environment: Hadoop-2.7.1 >Reporter: zhangyubiao >Assignee: Daniel Templeton > Labels: patch > > Hadoop-2.7.1 RM /ws/v1/cluster/scheduler JSON format Error > Root Queue's ChildQueue is > {"memory":3717120,"vCores":1848},"queueName":"root","schedulingPolicy":"fair","childQueues":{color:red}[{"type":"fairSchedulerLeafQueueInfo", > {color}"maxApps":400,"queueMaxMapsForEachJob":2147483647,"queueMaxReducesForEachJob":2147483647,"minResources":{"memory":0,"vCores":0},"maxResources":{"memory":0,"vCores":0}, > But Other's ChildQueue is > {"maxApps":300,"queueMaxMapsForEachJob":2147483647,"queueMaxReducesForEachJob":2147483647,"minResources":{"memory":2867200,"vCores":1400},"maxResources":{"memory":2867200,"vCores":1400},"usedResources":{"memory":0,"vCores":0},"steadyFairResources":{"memory":2867200,"vCores":0},"fairResources":{"memory":0,"vCores":0},"clusterResources":{"memory":3717120,"vCores":1848},"queueName":"root.bdp_jmart_ad","schedulingPolicy":"fair","childQueues":{"type":{color:red} > ["fairSchedulerLeafQueueInfo"], > 
{color}"maxApps":300,"queueMaxMapsForEachJob":2147483647,"queueMaxReducesForEachJob":2147483647,"minResources":{"memory":2867200,"vCores":1400},"maxResources":{"memory":2867200,"vCores":1400},"usedResources":{"memory":0,"vCores":0},"steadyFairResources":{"memory":2867200,"vCores":0},"fairResources":{"memory":0,"vCores":0},"clusterResources":{"memory":3717120,"vCores":1848},"queueName":"root.bdp_jmart_ad.jd_ad_anti","schedulingPolicy":"fair","numPendingApps":0,"numActiveApps":0},"childQueues":{"type":"fairSchedulerLeafQueueInfo","maxApps":300,"queueMaxMapsForEachJob":2147483647,"queueMaxReducesForEachJob":2147483647,"minResources":{"memory":2867200,"vCores":1400},"maxResources":{"memory":2867200,"vCores":1400},"usedResources":{"memory":0,"vCores":0},"steadyFairResources":{"memory":2867200,"vCores":0},"fairResources":{"memory":0,"vCores":0},"clusterResources":{"memory":3717120,"vCores":1848},"queueName":"root.bdp_jmart_ad.jd_ad_formal_1","schedulingPolicy":"fair","numPendingApps":0,"numActiveApps":0},"childQueues":{"type":"fairSchedulerLeafQueueInfo","maxApps":300,"queueMaxMapsForEachJob":2147483647,"queueMaxReducesForEachJob":2147483647,"minResources":{"memory":2867200,"vCores":1400},"maxResources":{"memory":2867200,"vCores":1400},"usedResources":{"memory":0,"vCores":0},"steadyFairResources":{"memory":2867200,"vCores":0},"fairResources":{"memory":0,"vCores":0},"clusterResources":{"memory":3717120,"vCores":1848},"queueName":"root.bdp_jmart_ad.jd_ad_oozie","schedulingPolicy":"fair","numPendingApps":0,"numActiveApps":0}},{"maxApps":300,"queueMaxMapsForEachJob":2147483647,"queueMaxReducesForEachJob":2147483647,"minResources":{"memory":0,"vCores":0},"maxResources":{"memory":0,"vCores":0},"usedResources":{"memory":0,"vCores":0},"steadyFairResources":{"memory":0,"vCores":0},"fairResources":{"memory":0,"vCores":0},"clusterResources":{"memory":3717120,"vCores":1848}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
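The bug reported above is that the RM's `/ws/v1/cluster/scheduler` endpoint serializes `childQueues` inconsistently: sometimes as a JSON array of queue objects, sometimes as a single bare object. Until that is fixed, clients can normalize defensively; a minimal sketch (the function name is hypothetical):

```python
def child_queues(queue):
    """Normalize the inconsistent childQueues field described above:
    the RM sometimes emits a JSON array and sometimes a single object.
    Always return a list of child-queue dicts (empty for leaf queues)."""
    cq = queue.get("childQueues", [])
    # A lone object (dict) is wrapped; an array (list) passes through.
    return cq if isinstance(cq, list) else [cq]
```

This lets a client walk the queue tree uniformly regardless of which shape the server produced.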
[jira] [Commented] (YARN-3959) Store application related configurations in Timeline Service v2
[ https://issues.apache.org/jira/browse/YARN-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208436#comment-15208436 ] Junping Du commented on YARN-3959: -- Sure. [~varun_saxena], please go ahead. > Store application related configurations in Timeline Service v2 > --- > > Key: YARN-3959 > URL: https://issues.apache.org/jira/browse/YARN-3959 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du > Labels: yarn-2928-1st-milestone > > We already have a configuration field in the HBase schema for the application entity. > We need to make sure the AM writes it out when it gets launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-4856) RM /ws/v1/cluster/scheduler JSON format Error
[ https://issues.apache.org/jira/browse/YARN-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Templeton reassigned YARN-4856: -- Assignee: Daniel Templeton > RM /ws/v1/cluster/scheduler JSON format Error > -- > > Key: YARN-4856 > URL: https://issues.apache.org/jira/browse/YARN-4856 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 > Environment: Hadoop-2.7.1 >Reporter: zhangyubiao >Assignee: Daniel Templeton > Labels: patch > > Hadoop-2.7.1 RM /ws/v1/cluster/scheduler JSON format Error -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4857) Missing default configuration regarding preemption of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208385#comment-15208385 ] Hadoop QA commented on YARN-4857: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 36s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | 
{color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 50s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_74. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 8s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 17m 10s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:fbe3e86 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12794989/YARN-4857.01.patch | | JIRA Issue | YARN-4857 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit xml | | uname | Linux fef21a1116a6 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / a107cee | | Default Java | 1.7.0_95 | | Multi-JDK versions | /usr/lib/jvm/java-8-oracle:1.8.0_74 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 | | JDK v1.7.0_95 Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/10856/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common U: hadoop-yarn-project/hadoop-yarn/h
[jira] [Commented] (YARN-4849) [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add licenses.
[ https://issues.apache.org/jira/browse/YARN-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208368#comment-15208368 ] Hadoop QA commented on YARN-4849: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 50s {color} | {color:blue} Docker mode activated. {color} | | {color:blue}0{color} | {color:blue} shelldocs {color} | {color:blue} 0m 4s {color} | {color:blue} Shelldocs was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. 
{color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 2m 55s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 12s {color} | {color:green} YARN-3368 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 9s {color} | {color:green} YARN-3368 passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 49s {color} | {color:green} YARN-3368 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 9m 14s {color} | {color:green} YARN-3368 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 22s {color} | {color:green} YARN-3368 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 33s {color} | {color:green} YARN-3368 passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 9m 27s {color} | {color:green} YARN-3368 passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 5m 51s {color} | {color:red} hadoop-yarn in the patch failed. {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 25s {color} | {color:red} hadoop-yarn-ui in the patch failed. {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 6m 58s {color} | {color:red} root in the patch failed. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 2m 26s {color} | {color:red} root in the patch failed with JDK v1.8.0_74. 
{color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 2m 26s {color} | {color:red} root in the patch failed with JDK v1.8.0_74. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 2m 37s {color} | {color:red} root in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 2m 37s {color} | {color:red} root in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 32s {color} | {color:red} root in the patch failed. {color} | | {color:red}-1{color} | {color:red} mvneclipse {color} | {color:red} 0m 39s {color} | {color:red} root in the patch failed. {color} | | {color:red}-1{color} | {color:red} shellcheck {color} | {color:red} 0m 13s {color} | {color:red} The applied patch generated 552 new + 98 unchanged - 0 fixed = 650 total (was 98) {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 50 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 1s {color} | {color:red} The patch has 235 line(s) with tabs. {color} | | {color:red}-1{color} | {color:red} xml {color} | {color:red} 0m 2s {color} | {color:red} The patch has 1 ill-formed XML file(s). {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 2m 43s {color} | {color:red} root in the patch failed with JDK v1.8.0_74. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 3m 41s {color} | {color:red} root in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 21m 15s {color} | {color:red} root in the patch failed with JDK v1.8.0_74. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 21m 47s {color} | {color:red} root in the patch failed with JDK v1.7.0_95. 
{color} | | {color:red}-1{color} | {color:red} asflicense {color} | {
[jira] [Updated] (YARN-4857) Missing default configuration regarding preemption of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Sasaki updated YARN-4857: - Attachment: YARN-4857.01.patch > Missing default configuration regarding preemption of CapacityScheduler > --- > > Key: YARN-4857 > URL: https://issues.apache.org/jira/browse/YARN-4857 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, documentation >Reporter: Kai Sasaki >Assignee: Kai Sasaki >Priority: Minor > Labels: documentation > Attachments: YARN-4857.01.patch > > > The {{yarn.resourcemanager.monitor.*}} configurations are missing from > yarn-default.xml. Since they were documented explicitly by YARN-4492, > yarn-default.xml can be updated to match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4857) Missing default configuration regarding preemption of CapacityScheduler
Kai Sasaki created YARN-4857: Summary: Missing default configuration regarding preemption of CapacityScheduler Key: YARN-4857 URL: https://issues.apache.org/jira/browse/YARN-4857 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler, documentation Reporter: Kai Sasaki Assignee: Kai Sasaki Priority: Minor The {{yarn.resourcemanager.monitor.*}} configurations are missing from yarn-default.xml. Since they were documented explicitly by YARN-4492, yarn-default.xml can be updated to match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
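For illustration, the entries such a patch would add to yarn-default.xml might look like the following sketch. The property names and defaults below are taken from the CapacityScheduler preemption documentation, not from the attached patch, so verify them against the target Hadoop release:

```xml
<!-- Sketch of the missing yarn-default.xml entries; defaults are the
     documented ones and should be checked against the target release. -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>false</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval</name>
  <value>3000</value>
</property>
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill</name>
  <value>15000</value>
</property>
```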
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208314#comment-15208314 ] Rohith Sharma K S commented on YARN-4852: - bq. By the way what is the heartbeat interval from AM to RM in which it will acquire the CS lock. The MRAppMaster heartbeat interval is 1 second by default, and the CS lock is acquired only when the heartbeat carries a resource ask. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut > down itself. > Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% > of memory. Digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn > contain around 1.7 million objects of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, each of > which retains around 1 GB of heap. > Full GC was triggered multiple times when the RM went OOM and only 300 MB of heap > was released, so all these objects look like live objects. > The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB within 20 > minutes and went OOM. > There was no spike in job submissions or container numbers at the time the issue > occurred. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
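For reference, the 1-second default mentioned above corresponds to the MR AM's allocator heartbeat setting. A sketch of overriding it in mapred-site.xml (the property name is from mapred-default.xml in Hadoop 2.x; confirm it for your version before tuning):

```xml
<!-- MRAppMaster -> RM allocator heartbeat interval. Default is 1000 ms;
     raising it reduces how often each AM contends for the CS lock. -->
<property>
  <name>yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms</name>
  <value>1000</value>
</property>
```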
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208301#comment-15208301 ] Gokul commented on YARN-4852: - Thanks [~rohithsharma], this gives some perspective on the starvation of the scheduler event processor thread. Maybe YARN-3487 would bring down the probability of this issue. It took more than 30 minutes for the heap to double and go OOM, so the scheduler event processor must have gotten to process at least some nodeUpdate events; yet the heap kept growing and never came down. That's why I am not fully convinced that YARN-3487 would solve the issue. By the way, what is the heartbeat interval from the AM to the RM in which it acquires the CS lock? > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208283#comment-15208283 ] Rohith Sharma K S commented on YARN-4852: - I was explaining how the absence of YARN-3487 might cause the OOM. There could be other reasons causing the nodeUpdate queue to pile up, which need to be analysed. To rule YARN-3487 out as a suspect, apply the patch in the cluster; if the issue occurs again, it will be easier to focus on a particular area. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208246#comment-15208246 ] Rohith Sharma K S commented on YARN-4852: - To be clearer: *Flow-1*: Each AM heartbeat or application submission tries to acquire the CS lock. In your cluster, 93 concurrently running apps send resource requests to the RM in their AM heartbeats, so that many AM heartbeats race to obtain the CS lock. *Flow-2*: On the other hand, the scheduler event processing thread dispatches events one by one, so at any point in time only one nodeUpdate event is being processed. This nodeUpdate event also tries to acquire the CS lock and joins the same race (from your thread dump, nodeUpdate has acquired the CS lock, as I mentioned in my previous comment). Consider the worst case where the AM heartbeats always win the CS lock: nodeUpdate calls are delayed, and since the scheduler event processor handles events one by one, the other node update events pile up. Note that the scheduler node status event is triggered from RMNodeImpl, and a delay in scheduler event processing does not block NodeManager heartbeats, so the NodeManagers keep sending node heartbeats and appending to RMNodeImpl#nodeUpdateQueue. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log -- This message was sent by Atlassian JIRA (v6.3.4#6332)
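The two flows described above amount to a classic unbounded producer/consumer imbalance. A toy Python sketch (names mirror the RM code, but the rates are purely illustrative, not measured from this cluster) of why the queue retains so many live objects when the consumer is lock-starved:

```python
from collections import deque

# Toy model of RMNodeImpl#nodeUpdateQueue: NM heartbeats enqueue container
# updates unconditionally, while the single scheduler event thread - starved
# by AM heartbeats competing for the CapacityScheduler lock - drains fewer.
node_update_queue = deque()   # unbounded, like the queue inside RMNodeImpl

ENQUEUES_PER_TICK = 1200      # e.g. one update per heartbeat across 1200 nodes
DEQUEUES_PER_TICK = 300       # starved consumer keeps up with only a fraction

for tick in range(100):
    for _ in range(ENQUEUES_PER_TICK):
        node_update_queue.append(object())   # stands in for UpdatedContainerInfo
    for _ in range(min(DEQUEUES_PER_TICK, len(node_update_queue))):
        node_update_queue.popleft()

# Backlog grows by 900 per tick, and every queued object is reachable,
# which matches the "live objects" observation from the heap dump.
print(len(node_update_queue))  # 90000
```

Because every queued UpdatedContainerInfo stays reachable from RMNodeImpl, full GC cannot reclaim the backlog; only draining (or bounding/deduplicating, as YARN-4862 proposes) shrinks it.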
[jira] [Commented] (YARN-4849) [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add licenses.
[ https://issues.apache.org/jira/browse/YARN-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208224#comment-15208224 ] Hadoop QA commented on YARN-4849: - (!) A patch to the testing environment has been detected. Re-executing against the patched versions to perform further tests. The console is at https://builds.apache.org/job/PreCommit-YARN-Build/10855/console in case of problems. > [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add > licenses. > --- > > Key: YARN-4849 > URL: https://issues.apache.org/jira/browse/YARN-4849 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4849-YARN-3368.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4856) RM /ws/v1/cluster/scheduler JSON format Error
[ https://issues.apache.org/jira/browse/YARN-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangyubiao updated YARN-4856: -- Summary: RM /ws/v1/cluster/scheduler JSON format Error (was: RM /ws/v1/cluster/scheduler JSON format err ) > RM /ws/v1/cluster/scheduler JSON format Error > -- > > Key: YARN-4856 > URL: https://issues.apache.org/jira/browse/YARN-4856 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 > Environment: Hadoop-2.7.1 >Reporter: zhangyubiao > Labels: patch > > Hadoop-2.7.1 RM /ws/v1/cluster/scheduler JSON format Error -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4856) RM /ws/v1/cluster/scheduler JSON format err
zhangyubiao created YARN-4856: - Summary: RM /ws/v1/cluster/scheduler JSON format err Key: YARN-4856 URL: https://issues.apache.org/jira/browse/YARN-4856 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.1 Environment: Hadoop-2.7.1 Reporter: zhangyubiao Hadoop-2.7.1 RM /ws/v1/cluster/scheduler JSON format Error -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208216#comment-15208216 ] Hadoop QA commented on YARN-3863: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 6 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 11m 23s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 19s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 42s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} YARN-2928 passed with 
JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 15s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 18s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 15s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice: patch generated 5 new + 4 unchanged - 1 fixed = 9 total (was 5) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 51s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 14s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 45s {color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed with JDK v1.8.0_74. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 36s {color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 28m 54s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12794930/YARN-3863-YARN-2928.v2.05.patch | | JIRA Issue | YARN-3863 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux ccbcd85a77e6 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | |
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208204#comment-15208204 ] Gokul commented on YARN-4852: - Agreed, 7 threads are waiting to lock CapacityScheduler.getQueueInfo. What is the impact if this many threads are waiting on this lock during the application submission phase? Could that be the cause of RMNodeImpl.nodeUpdateQueue piling up? If yes, then YARN-3487 will fix the issue. Otherwise there should be some other reason, such as the consumer of the queue (RMNodeImpl.nodeUpdateQueue), the ResourceManager event processor thread, being stuck on something so that it does not drain the queue. Also, the thread doing nodeUpdate (the ResourceManager event processor) is not in a blocked state; it is still runnable. There are around 1200 NMs in the cluster. 93 apps were running when the issue occurred. The number of containers allocated was 17803 and pending was 63422. The job submission rate was roughly 6 per minute. > Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > The Resource Manager ran out of memory (max heap size: 8 GB, CMS GC) and shut > itself down. > Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% > of the memory. Digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn > contain around 1.7 million objects each of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, each of > which retains around 1 GB of heap. > Full GC was triggered multiple times when the RM went OOM and only 300 MB of heap > was released, so all these objects look like live objects. > The RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB within 20 > minutes and went OOM.
> There was no spike in job submissions or container numbers at the time the issue > occurred. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
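The pile-up described above is what YARN-4862 proposes to guard against. A minimal sketch of such a safeguard (the class and method names here are illustrative, not the actual RMNodeImpl code) would remember already-reported completed container IDs and drop duplicates before any UpdatedContainerInfo is created:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: deduplicate completed-container reports before
// queueing them, so repeated NM status reports cannot pile up on the heap.
public class CompletedContainerDedup {
  private final Set<String> reportedCompleted = new HashSet<>();

  /** Returns only the container ids not already reported as completed. */
  public List<String> filterDuplicates(List<String> completedIds) {
    List<String> fresh = new ArrayList<>();
    for (String id : completedIds) {
      // Set.add returns false when the id was already present.
      if (reportedCompleted.add(id)) {
        fresh.add(id);
      }
    }
    return fresh;
  }
}
```

In the real RMNodeImpl the remembered set would also need pruning (for example, once the completion has been acknowledged) so that it does not itself grow without bound.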
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208172#comment-15208172 ] Rohith Sharma K S commented on YARN-4852: - Looking at your attached thread dump, I feel the root cause of your issue is YARN-3487. Maybe you can check whether it recurs regularly. From the thread dump, I see that 8 threads are waiting for the CS lock, out of which 7 are in {{CapacityScheduler.getQueueInfo}}, which is called when validating a resource request, either during application submission (for the AM resource request) or on an AM heartbeat request. At that time, nodeUpdate is holding the CS lock. This can take a few milliseconds to process the container statuses if many are present. {code} at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:1190) - locked <0x0005d4cfe5c8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:951) - locked <0x0005d4cfe5c8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler) {code} What can happen in a larger cluster is that if many ApplicationMasters are running concurrently and the application submission rate is very high, nodeUpdate will be blocked for a significant time trying to obtain the CS lock. The reason for the blocking is YARN-3487. So the more NodeManagers there are, the longer it takes to process each node update, which internally piles up container statuses and might be causing the OOM. Just for info: how many NodeManagers are in the cluster? How many AMs run concurrently, and how many tasks per job? What is the job submission rate?
> Resource Manager Ran Out of Memory > -- > > Key: YARN-4852 > URL: https://issues.apache.org/jira/browse/YARN-4852 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Gokul > Attachments: threadDump.log > > > The Resource Manager ran out of memory (max heap size: 8 GB, CMS GC) and shut > itself down. > Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% > of the memory. Digging deeper, there are around 0.5 million objects of > UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn > contain around 1.7 million objects each of YarnProtos$ContainerIdProto, > ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, each of > which retains around 1 GB of heap. > Full GC was triggered multiple times when the RM went OOM and only 300 MB of heap > was released, so all these objects look like live objects. > The RM's usual heap usage is around 4 GB but it suddenly spiked to 8 GB within 20 > minutes and went OOM. > There was no spike in job submissions or container numbers at the time the issue > occurred. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
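The contention pattern in the thread dump can be shown in miniature. In this sketch (illustrative only, not the actual CapacityScheduler code) both methods synchronize on the same scheduler monitor, so every getQueueInfo caller stalls for as long as a nodeUpdate holds the lock:

```java
// Illustrative sketch only: one coarse monitor shared by nodeUpdate and
// getQueueInfo, mirroring the CS-lock contention seen in the thread dump.
public class CoarseLockScheduler {
  public synchronized void nodeUpdate(int containerStatuses)
      throws InterruptedException {
    // Pretend each container status costs ~1 ms to process under the lock.
    Thread.sleep(containerStatuses);
  }

  public synchronized String getQueueInfo(String queue) {
    // Blocked until any in-flight nodeUpdate releases the monitor.
    return "info:" + queue;
  }
}
```

With many NMs heartbeating and many AMs submitting, the time spent inside nodeUpdate under the lock directly becomes queueing delay for every submission validating its resource request; the comment above points to YARN-3487 as the fix for this contention.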
[jira] [Created] (YARN-4855) Should check if node exists when replace nodelabels
Tao Jie created YARN-4855: - Summary: Should check if node exists when replace nodelabels Key: YARN-4855 URL: https://issues.apache.org/jira/browse/YARN-4855 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Tao Jie Priority: Minor Today when we add node labels to nodes, the operation succeeds without any message even if the nodes are not existing NodeManagers in the cluster. It could be like this: when we use *yarn rmadmin -replaceLabelsOnNode "node1=label1"*, the request would be denied if the node does not exist; when we use *yarn rmadmin -replaceLabelsOnNode -force "node1=label1"*, the node labels would be added no matter whether the node exists. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
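The proposed behaviour could be sketched as follows (the class and method names are hypothetical, for illustration only): a replacement request for a node that is not a known NodeManager is rejected unless the force flag is given:

```java
import java.util.Set;

// Illustrative sketch only: reject replaceLabelsOnNode for nodes that are
// not known NodeManagers unless the caller passed -force.
public class NodeLabelReplaceCheck {
  private final Set<String> activeNodes;

  public NodeLabelReplaceCheck(Set<String> activeNodes) {
    this.activeNodes = activeNodes;
  }

  /** Returns true if the label replacement should be allowed to proceed. */
  public boolean canReplace(String node, boolean force) {
    return force || activeNodes.contains(node);
  }
}
```

The force path preserves today's behaviour (useful for pre-labelling nodes before they register), while the default path gives the operator the missing error message.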
[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208103#comment-15208103 ] Akira AJISAKA commented on YARN-4686: - Hi [~ebadger] and [~eepayne], TestMRJobs#testJobWithChangePriority is failing after this issue. Would you fix the test failure? I've filed MAPREDUCE-6658 for fixing the failure. > MiniYARNCluster.start() returns before cluster is completely started > > > Key: YARN-4686 > URL: https://issues.apache.org/jira/browse/YARN-4686 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith Sharma K S >Assignee: Eric Badger > Fix For: 2.7.3 > > Attachments: MAPREDUCE-6507.001.patch, > YARN-4686-branch-2.7.006.patch, YARN-4686.001.patch, YARN-4686.002.patch, > YARN-4686.003.patch, YARN-4686.004.patch, YARN-4686.005.patch, > YARN-4686.006.patch > > > TestRMNMInfo fails intermittently. Below is trace for the failure > {noformat} > testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 > sec <<< FAILURE! > java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but > was:<3> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4847) Add documentation for the Node Label features supported in 2.6
[ https://issues.apache.org/jira/browse/YARN-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208093#comment-15208093 ] Yi Zhou commented on YARN-4847: --- Thanks! I will double check this. > Add documentation for the Node Label features supported in 2.6 > --- > > Key: YARN-4847 > URL: https://issues.apache.org/jira/browse/YARN-4847 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Affects Versions: 2.6.4 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > We constantly face issue with what are the node label supported features in > 2.6 and general commands to use it. So it would be better to have > documentation capturing what all is supported as part of 2.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4847) Add documentation for the Node Label features supported in 2.6
[ https://issues.apache.org/jira/browse/YARN-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208090#comment-15208090 ] Naganarasimha G R commented on YARN-4847: - I tested in 2.6.4 > Add documentation for the Node Label features supported in 2.6 > --- > > Key: YARN-4847 > URL: https://issues.apache.org/jira/browse/YARN-4847 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Affects Versions: 2.6.4 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > We constantly face issue with what are the node label supported features in > 2.6 and general commands to use it. So it would be better to have > documentation capturing what all is supported as part of 2.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4847) Add documentation for the Node Label features supported in 2.6
[ https://issues.apache.org/jira/browse/YARN-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208084#comment-15208084 ] Naganarasimha G R commented on YARN-4847: - Hi [~jameszhouyi], I tried testing with your configuration and I was able to see the exception being thrown: {code} 16/03/23 14:20:14 FATAL distributedshell.Client: Error running Client org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, queue=m doesn't have permission to access all labels in resource request. labelExpression of resource request=y. Queue labels=* at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:289) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225) {code} > Add documentation for the Node Label features supported in 2.6 > --- > > Key: YARN-4847 > URL: https://issues.apache.org/jira/browse/YARN-4847 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Affects Versions: 2.6.4 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > We constantly face issue with what are the node label supported features in > 2.6 and general commands to use it. So it would be better to have > documentation capturing what all is supported as part of 2.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4847) Add documentation for the Node Label features supported in 2.6
[ https://issues.apache.org/jira/browse/YARN-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208021#comment-15208021 ] Yi Zhou commented on YARN-4847: --- Hi [~Naganarasimha] Thank you for your great help! OK, if it is related to the doc I will raise it here; I will post my issues to the mailing list. > Add documentation for the Node Label features supported in 2.6 > --- > > Key: YARN-4847 > URL: https://issues.apache.org/jira/browse/YARN-4847 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Affects Versions: 2.6.4 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > We constantly face issue with what are the node label supported features in > 2.6 and general commands to use it. So it would be better to have > documentation capturing what all is supported as part of 2.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3959) Store application related configurations in Timeline Service v2
[ https://issues.apache.org/jira/browse/YARN-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207998#comment-15207998 ] Varun Saxena commented on YARN-3959: [~djp], I can work on this issue if you are not planning to work on it in the short term, as it is marked for the 1st milestone. Do let me know. > Store application related configurations in Timeline Service v2 > --- > > Key: YARN-3959 > URL: https://issues.apache.org/jira/browse/YARN-3959 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du > Labels: yarn-2928-1st-milestone > > We already have a configuration field in the HBase schema for the application entity. > We need to make sure the AM writes it out when it gets launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4847) Add documentation for the Node Label features supported in 2.6
[ https://issues.apache.org/jira/browse/YARN-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207994#comment-15207994 ] Naganarasimha G R commented on YARN-4847: - Let me check your issue. Also, as [~wangda] and [~sunilg] were mentioning, it would be better to capture usability issues in the forums rather than here. The main intention of this jira is to capture the documentation, and anything required for that we can discuss in this jira. As part of this documentation I would focus more on what is supported in 2.6.x node labels than on what is not supported in 2.6.x compared with 2.7.x or later. > Add documentation for the Node Label features supported in 2.6 > --- > > Key: YARN-4847 > URL: https://issues.apache.org/jira/browse/YARN-4847 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Affects Versions: 2.6.4 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > We constantly face issue with what are the node label supported features in > 2.6 and general commands to use it. So it would be better to have > documentation capturing what all is supported as part of 2.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics
[ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207995#comment-15207995 ] Varun Saxena commented on YARN-3816: [~sjlee0], Maybe in Thursday's meeting we can revisit the open 1st milestone JIRAs and check whether assignees have the bandwidth. If Junping does not have bandwidth, I can pitch in on a couple of his open JIRAs too. > [Aggregation] App-level aggregation and accumulation for YARN system metrics > > > Key: YARN-3816 > URL: https://issues.apache.org/jira/browse/YARN-3816 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Junping Du >Assignee: Junping Du > Labels: yarn-2928-1st-milestone > Attachments: Application Level Aggregation of Timeline Data.pdf, > YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, > YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, > YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, > YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, > YARN-3816-feature-YARN-2928.v4.1.patch, YARN-3816-poc-v1.patch, > YARN-3816-poc-v2.patch > > > We need application level aggregation of Timeline data: > - To present end user aggregated states for each application, include: > resource (CPU, Memory) consumption across all containers, number of > containers launched/completed/failed, etc. We need this for apps while they > are running as well as when they are done. > - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be > aggregated to show details of states in framework level. > - Other level (Flow/User/Queue) aggregation can be more efficient to be based > on Application-level aggregations rather than raw entity-level data as much > less raws need to scan (with filter out non-aggregated entities, like: > events, configurations, etc.). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4820) ResourceManager web redirects in HA mode drops query parameters
[ https://issues.apache.org/jira/browse/YARN-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207993#comment-15207993 ] Varun Vasudev commented on YARN-4820: - The test failures are unrelated to the patch. > ResourceManager web redirects in HA mode drops query parameters > --- > > Key: YARN-4820 > URL: https://issues.apache.org/jira/browse/YARN-4820 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: YARN-4820.001.patch, YARN-4820.002.patch, > YARN-4820.003.patch > > > The RMWebAppFilter redirects http requests from the standby to the active. > However it drops all the query parameters when it does the redirect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4849) [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add licenses.
[ https://issues.apache.org/jira/browse/YARN-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4849: - Attachment: YARN-4849-YARN-3368.1.patch > [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add > licenses. > --- > > Key: YARN-4849 > URL: https://issues.apache.org/jira/browse/YARN-4849 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4849-YARN-3368.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4849) [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add licenses.
[ https://issues.apache.org/jira/browse/YARN-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4849: - Attachment: (was: YARN-4849.1.patch) > [YARN-3368] cleanup code base, integrate web UI related build to mvn, and add > licenses. > --- > > Key: YARN-4849 > URL: https://issues.apache.org/jira/browse/YARN-4849 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4285) Display resource usage as percentage of queue and cluster in the RM UI
[ https://issues.apache.org/jira/browse/YARN-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207991#comment-15207991 ] Varun Vasudev commented on YARN-4285: - [~jianhe] - it makes sense to remove reserved resources from the used resources, but do we know why we counted reserved resources as part of used resources in the first place? > Display resource usage as percentage of queue and cluster in the RM UI > -- > > Key: YARN-4285 > URL: https://issues.apache.org/jira/browse/YARN-4285 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.8.0 > > Attachments: YARN-4285.001.patch, YARN-4285.002.patch, > YARN-4285.003.patch, YARN-4285.004.patch > > > Currently, we display the memory and vcores allocated to an app in the RM UI. > It would be useful to display the resources consumed as a %of the queue and > the cluster to identify apps that are using a lot of resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)