[jira] [Updated] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level

2015-07-16 Thread Ajith S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S updated YARN-3885:
--
Attachment: YARN-3885.08.patch

> ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 
> level
> --
>
> Key: YARN-3885
> URL: https://issues.apache.org/jira/browse/YARN-3885
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Blocker
> Attachments: YARN-3885.02.patch, YARN-3885.03.patch, 
> YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, 
> YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch
>
>
> In {{ProportionalCapacityPreemptionPolicy.cloneQueues}}, the piece of code 
> that calculates {{untoucable}} doesn't consider all the children; it only 
> considers the immediate children.
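
For illustration, here is a minimal sketch of the gap described above, using a hypothetical queue type rather than the actual cloneQueues code: summing a per-queue value over the immediate children misses everything below the second level, while a recursive walk covers the whole subtree.

{code}
// Hypothetical, simplified queue tree; not the actual cloneQueues code.
class QueueNode {
  final String name;
  final int untouchable;                       // per-queue value being aggregated
  final java.util.List<QueueNode> children = new java.util.ArrayList<>();

  QueueNode(String name, int untouchable) {
    this.name = name;
    this.untouchable = untouchable;
  }

  // The gap described above: only immediate children are considered.
  int sumOverImmediateChildren() {
    int sum = 0;
    for (QueueNode child : children) {
      sum += child.untouchable;
    }
    return sum;
  }

  // Recursive version: queues deeper than one level are also included.
  int sumOverAllDescendants() {
    int sum = 0;
    for (QueueNode child : children) {
      sum += child.untouchable + child.sumOverAllDescendants();
    }
    return sum;
  }
}
{code}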



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED

2015-07-16 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629348#comment-14629348
 ] 

Sunil G commented on YARN-3535:
---

Hi [~rohithsharma] and [~peng.zhang]
After seeing this patch, I feel there may be a synchronization problem. Please 
correct me if I am wrong.
In the ContainerRescheduledTransition code, it is used like this:
{code}
+  container.eventHandler.handle(new ContainerRescheduledEvent(container));
+  new FinishedTransition().transition(container, event);
{code}
Hence a ContainerRescheduledEvent is fired to the scheduler dispatcher, which 
will process {{recoverResourceRequestForContainer}} in a separate thread. 
Meanwhile, in RMAppImpl, {{FinishedTransition().transition}} will be invoked 
and the closure of this container will be processed. If the scheduler 
dispatcher is slower in processing due to the length of its pending event 
queue, there is a chance that recoverResourceRequest may not be correct.

I feel we can introduce a new event in {{RMContainerImpl}} from ALLOCATED to 
WAIT_FOR_REQUEST_RECOVERY, and the scheduler can fire an event back to 
{{RMContainerImpl}} to indicate that recovery of the resource request is 
complete. This can then move the state forward to KILLED in 
{{RMContainerImpl}}.
Please share your thoughts.

>  ResourceRequest should be restored back to scheduler when RMContainer is 
> killed at ALLOCATED
> -
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 
> 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, 
> yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the 
> NM, and the job then hung there.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED

2015-07-16 Thread Peng Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629369#comment-14629369
 ] 

Peng Zhang commented on YARN-3535:
--

bq. there is a chance that recoverResourceRequest may not be correct.

Sorry, I didn't catch this; maybe I missed something?

I think {{recoverResourceRequest}} will not be affected by whether the 
container-finished event is processed faster, because 
{{recoverResourceRequest}} only processes the ResourceRequest in the container 
and does not care about its status.

>  ResourceRequest should be restored back to scheduler when RMContainer is 
> killed at ALLOCATED
> -
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 
> 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, 
> yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the 
> NM, and the job then hung there.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3805) Update the documentation of Disk Checker based on YARN-90

2015-07-16 Thread Masatake Iwasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated YARN-3805:
---
Attachment: YARN-3805.002.patch

I rebased the patch. Thanks for pinging me, [~ozawa].

> Update the documentation of Disk Checker based on YARN-90
> -
>
> Key: YARN-3805
> URL: https://issues.apache.org/jira/browse/YARN-3805
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Masatake Iwasaki
>Assignee: Masatake Iwasaki
>Priority: Minor
> Attachments: YARN-3805.001.patch, YARN-3805.002.patch
>
>
> With YARN-90, the NodeManager is able to recover the status of a disk that 
> was broken and then fixed, without a restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED

2015-07-16 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629394#comment-14629394
 ] 

Arun Suresh commented on YARN-3535:
---

bq. I think recoverResourceRequest will not be affected by whether the 
container-finished event is processed faster, because recoverResourceRequest 
only processes the ResourceRequest in the container and does not care about 
its status.
I agree with [~peng.zhang] here. IIUC, {{recoverResourceRequest}} only affects 
the state of the Scheduler and the SchedulerApp. In any case, the fact that 
the container is killed (the outcome of the 
{{RMAppAttemptContainerFinishedEvent}} fired by 
{{FinishedTransition#transition}}) will be notified to the Scheduler, and that 
notification will happen only AFTER recoverResourceRequest has completed, 
since it will be handled by the same dispatcher.
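
To make the ordering argument concrete, here is a minimal sketch of a single-threaded FIFO dispatcher. It uses hypothetical types (TinyDispatcher, and the method names in the usage comment), not the actual AsyncDispatcher API; the point is only that an event enqueued first is fully handled before an event enqueued later.

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical single-threaded dispatcher: events are handled strictly in
// enqueue order, one at a time.
class TinyDispatcher {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final Thread worker = new Thread(() -> {
    try {
      while (true) {
        queue.take().run();        // FIFO: finish this event before the next
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  });

  void start() { worker.start(); }

  void dispatch(Runnable event) { queue.add(event); }
}

// Usage (method names illustrative): the second event cannot start before the
// first has completed, which is the ordering relied on above.
//   dispatcher.dispatch(() -> scheduler.recoverResourceRequestForContainer(c));
//   dispatcher.dispatch(() -> scheduler.notifyContainerKilled(c));
{code}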

>  ResourceRequest should be restored back to scheduler when RMContainer is 
> killed at ALLOCATED
> -
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 
> 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, 
> yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the 
> NM, and the job then hung there.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2809) Implement workaround for linux kernel panic when removing cgroup

2015-07-16 Thread wangfeng (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629410#comment-14629410
 ] 

wangfeng commented on YARN-2809:


Patching this onto hadoop-2.6.0 failed. Console output:
 patch -u -p0 < YARN-2809-v3.patch

patching file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
Hunk #1 succeeded at 984 (offset -16 lines).
patching file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
Hunk #1 FAILED at 22.
Hunk #2 succeeded at 33 (offset -4 lines).
Hunk #3 succeeded at 71 (offset -5 lines).
Hunk #4 succeeded at 105 (offset -5 lines).
Hunk #5 succeeded at 266 (offset -10 lines).
Hunk #6 succeeded at 338 (offset -10 lines).
1 out of 6 hunks FAILED -- saving rejects to file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java.rej
patching file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java

> Implement workaround for linux kernel panic when removing cgroup
> 
>
> Key: YARN-2809
> URL: https://issues.apache.org/jira/browse/YARN-2809
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
> Environment:  RHEL 6.4
>Reporter: Nathan Roberts
>Assignee: Nathan Roberts
> Fix For: 2.7.0
>
> Attachments: YARN-2809-v2.patch, YARN-2809-v3.patch, YARN-2809.patch
>
>
> Some older versions of linux have a bug that can cause a kernel panic when 
> the LCE attempts to remove a cgroup. It is a race condition so it's a bit 
> rare but on a few thousand node cluster it can result in a couple of panics 
> per day.
> This is the commit that likely (haven't verified) fixes the problem in linux: 
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-2.6.39.y&id=068c5cc5ac7414a8e9eb7856b4bf3cc4d4744267
> Details will be added in comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED

2015-07-16 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629411#comment-14629411
 ] 

Sunil G commented on YARN-3535:
---

Thank you [~peng.zhang] and [~asuresh] for the correction.
bq. that notification will happen only AFTER recoverResourceRequest has 
completed, since it will be handled by the same dispatcher
Yes, I missed this. The ordering will be correct here.

>  ResourceRequest should be restored back to scheduler when RMContainer is 
> killed at ALLOCATED
> -
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 
> 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, 
> yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the 
> NM, and the job then hung there.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3928) launch application master on specific host

2015-07-16 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629412#comment-14629412
 ] 

Varun Saxena commented on YARN-3928:


Duplicate of MAPREDUCE-6402

> launch application master on specific host
> --
>
> Key: YARN-3928
> URL: https://issues.apache.org/jira/browse/YARN-3928
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.6.0
> Environment: Ubuntu 12.04
>Reporter: Wenrui
>
> Hi,
> Is there a way to launch the application master on a specific host?
> If we cannot do this with the managed AM launcher, is it possible to achieve 
> it with the unmanaged AM launcher?
> I find it quite necessary to place the application master on a specific host 
> in some scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90

2015-07-16 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629418#comment-14629418
 ] 

Tsuyoshi Ozawa commented on YARN-3805:
--

+1, pending Jenkins.

> Update the documentation of Disk Checker based on YARN-90
> -
>
> Key: YARN-3805
> URL: https://issues.apache.org/jira/browse/YARN-3805
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Masatake Iwasaki
>Assignee: Masatake Iwasaki
>Priority: Minor
> Attachments: YARN-3805.001.patch, YARN-3805.002.patch
>
>
> With YARN-90, the NodeManager is able to recover the status of a disk that 
> was broken and then fixed, without a restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90

2015-07-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629423#comment-14629423
 ] 

Hadoop QA commented on YARN-3805:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |   3m 42s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | release audit |   0m 21s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 59s | Site still builds. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| | |   7m  5s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12745590/YARN-3805.002.patch |
| Optional Tests | site |
| git revision | trunk / 90bda9c |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8556/console |


This message was automatically generated.

> Update the documentation of Disk Checker based on YARN-90
> -
>
> Key: YARN-3805
> URL: https://issues.apache.org/jira/browse/YARN-3805
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Masatake Iwasaki
>Assignee: Masatake Iwasaki
>Priority: Minor
> Attachments: YARN-3805.001.patch, YARN-3805.002.patch
>
>
> With YARN-90, the NodeManager is able to recover the status of a disk that 
> was broken and then fixed, without a restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90

2015-07-16 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629429#comment-14629429
 ] 

Tsuyoshi Ozawa commented on YARN-3805:
--

Checking this in.

> Update the documentation of Disk Checker based on YARN-90
> -
>
> Key: YARN-3805
> URL: https://issues.apache.org/jira/browse/YARN-3805
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Masatake Iwasaki
>Assignee: Masatake Iwasaki
>Priority: Minor
> Attachments: YARN-3805.001.patch, YARN-3805.002.patch
>
>
> With YARN-90, the NodeManager is able to recover the status of a disk that 
> was broken and then fixed, without a restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3929) Uncleaning option for local app log files with log-aggregation feature

2015-07-16 Thread Dongwook Kwon (JIRA)
Dongwook Kwon created YARN-3929:
---

 Summary: Uncleaning option for local app log files with 
log-aggregation feature
 Key: YARN-3929
 URL: https://issues.apache.org/jira/browse/YARN-3929
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: log-aggregation
Affects Versions: 2.6.0, 2.4.0
Reporter: Dongwook Kwon
Priority: Minor


Although it makes sense to delete local app log files once the 
AppLogAggregator has copied all of them to the remote location (HDFS), I have 
some use cases that need the local app log files to be kept after they are 
copied to HDFS, mostly for backup purposes. I would like to use the 
log-aggregation feature of YARN and also back up the app log files. Without 
this option, the files have to be copied from HDFS back to local storage 
again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level

2015-07-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629446#comment-14629446
 ] 

Hadoop QA commented on YARN-3885:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 12s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 46s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 37s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 50s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 18s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 23s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |  61m 19s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  99m 23s | |
\\
\\
|| Reason || Tests ||
| Timed out tests | 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12745584/YARN-3885.08.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 90bda9c |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8555/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8555/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8555/console |


This message was automatically generated.

> ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 
> level
> --
>
> Key: YARN-3885
> URL: https://issues.apache.org/jira/browse/YARN-3885
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Blocker
> Attachments: YARN-3885.02.patch, YARN-3885.03.patch, 
> YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, 
> YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch
>
>
> In {{ProportionalCapacityPreemptionPolicy.cloneQueues}}, the piece of code 
> that calculates {{untoucable}} doesn't consider all the children; it only 
> considers the immediate children.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED

2015-07-16 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629452#comment-14629452
 ] 

zhihai xu commented on YARN-3535:
-

Also, because {{containerCompleted}} and 
{{pullNewlyAllocatedContainersAndNMTokens}} are synchronized, it is guaranteed 
that if the AM gets the container, 
{{ContainerRescheduledEvent}} ({{recoverResourceRequestForContainer}}) won't 
be called.
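
As a minimal sketch of that guarantee, assuming a hypothetical holder class rather than the real scheduler code: two synchronized methods on the same object cannot interleave, so whichever of the "pull" and "complete" paths runs first decides the outcome and the other path sees it.

{code}
// Hypothetical holder, for illustration only; both methods lock "this",
// so pull() and complete() are mutually exclusive.
class ContainerHolder {
  private boolean pulledByAM = false;
  private boolean completed = false;

  synchronized boolean pull() {
    if (completed) {
      return false;              // already completed: the AM never receives it
    }
    pulledByAM = true;
    return true;
  }

  synchronized boolean complete() {
    if (pulledByAM) {
      return false;              // the AM already got it: skip the reschedule path
    }
    completed = true;
    return true;                 // safe to recover the ResourceRequest
  }
}
{code}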


>  ResourceRequest should be restored back to scheduler when RMContainer is 
> killed at ALLOCATED
> -
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 
> 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, 
> yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the 
> NM, and the job then hung there.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown

2015-07-16 Thread Dian Fu (JIRA)
Dian Fu created YARN-3930:
-

 Summary: FileSystemNodeLabelsStore should make sure edit log file 
closed when exception is thrown 
 Key: YARN-3930
 URL: https://issues.apache.org/jira/browse/YARN-3930
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Dian Fu
Assignee: Dian Fu


When I test the node label feature in my local environment, I encountered the 
following exception:
{code}
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)

at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196)
at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168)
at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
{code}
The reason is that HDFS throws an exception when {{ensureAppendEditlogFile}} 
is called, which leaves the edit log output stream unclosed. As a result, the 
next time we call {{ensureAppendEditlogFile}}, lease recovery fails because we 
ourselves are still the lease holder.
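
For illustration, a minimal sketch of the close-on-failure pattern this issue asks for, with hypothetical class and method names (the attached patch is the actual change): if the append fails, close the half-open stream before returning so the next append does not trip over our own lease.

{code}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical helper, not the actual FileSystemNodeLabelsStore code.
class EditLogAppender {
  private final FileSystem fs;
  private final Path editLogPath;

  EditLogAppender(FileSystem fs, Path editLogPath) {
    this.fs = fs;
    this.editLogPath = editLogPath;
  }

  void appendRecord(byte[] record) throws IOException {
    FSDataOutputStream out = null;
    try {
      out = fs.append(editLogPath);   // may throw, e.g. during lease recovery
      out.write(record);
      out.close();                    // normal path: close and clear
      out = null;
    } finally {
      // If anything above threw, make sure the half-open stream is closed so
      // a later append does not fail lease recovery against our own lease.
      IOUtils.cleanup(null, out);
    }
  }
}
{code}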



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629464#comment-14629464
 ] 

Hudson commented on YARN-3805:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8173 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8173/])
YARN-3805. Update the documentation of Disk Checker based on YARN-90. 
Contributed by Masatake Iwasaki. (ozawa: rev 
1ba2986dee4bbb64d67ada005f8f132e69575274)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md


> Update the documentation of Disk Checker based on YARN-90
> -
>
> Key: YARN-3805
> URL: https://issues.apache.org/jira/browse/YARN-3805
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Masatake Iwasaki
>Assignee: Masatake Iwasaki
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-3805.001.patch, YARN-3805.002.patch
>
>
> With YARN-90, the NodeManager is able to recover the status of a disk that 
> was broken and then fixed, without a restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629465#comment-14629465
 ] 

Hudson commented on YARN-90:


FAILURE: Integrated in Hadoop-trunk-Commit #8173 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8173/])
YARN-3805. Update the documentation of Disk Checker based on YARN-90. 
Contributed by Masatake Iwasaki. (ozawa: rev 
1ba2986dee4bbb64d67ada005f8f132e69575274)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md


> NodeManager should identify failed disks becoming good again
> 
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Fix For: 2.6.0
>
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
> apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, 
> apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown

2015-07-16 Thread Dian Fu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dian Fu updated YARN-3930:
--
Attachment: YARN-3930.001.patch

A simple patch is attached.

> FileSystemNodeLabelsStore should make sure edit log file closed when 
> exception is thrown 
> -
>
> Key: YARN-3930
> URL: https://issues.apache.org/jira/browse/YARN-3930
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Dian Fu
>Assignee: Dian Fu
> Attachments: YARN-3930.001.patch
>
>
> When I test the node label feature in my local environment, I encountered the 
> following exception:
> {code}
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The reason is that HDFS throws an exception when {{ensureAppendEditlogFile}} 
> is called, which leaves the edit log output stream unclosed. As a result, 
> the next time we call {{ensureAppendEditlogFile}}, lease recovery fails 
> because we ourselves are still the lease holder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3929) Uncleaning option for local app log files with log-aggregation feature

2015-07-16 Thread Dongwook Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongwook Kwon updated YARN-3929:

Attachment: YARN-3929.01.patch

Could you review this patch? Thanks.

> Uncleaning option for local app log files with log-aggregation feature
> --
>
> Key: YARN-3929
> URL: https://issues.apache.org/jira/browse/YARN-3929
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: log-aggregation
>Affects Versions: 2.4.0, 2.6.0
>Reporter: Dongwook Kwon
>Priority: Minor
> Attachments: YARN-3929.01.patch
>
>
> Although it makes sense to delete local app log files once the 
> AppLogAggregator has copied all of them to the remote location (HDFS), I 
> have some use cases that need the local app log files to be kept after they 
> are copied to HDFS, mostly for backup purposes. I would like to use the 
> log-aggregation feature of YARN and also back up the app log files. Without 
> this option, the files have to be copied from HDFS back to local storage 
> again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api

2015-07-16 Thread kyungwan nam (JIRA)
kyungwan nam created YARN-3931:
--

 Summary: default-node-label-expression doesn’t apply when an 
application is submitted by RM rest api
 Key: YARN-3931
 URL: https://issues.apache.org/jira/browse/YARN-3931
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: hadoop-2.6.0
Reporter: kyungwan nam


* yarn.scheduler.capacity..default-node-label-expression=large_disk
* submit an application using the REST API without "app-node-label-expression" 
or "am-container-node-label-expression"
* RM doesn't allocate containers to the hosts associated with the large_disk 
node label




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api

2015-07-16 Thread kyungwan nam (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629489#comment-14629489
 ] 

kyungwan nam commented on YARN-3931:


The node-label-expression is initialized to an empty string:
{code}
...
public ApplicationSubmissionContextInfo() {
  applicationId = "";
  applicationName = "";
  containerInfo = new ContainerLaunchContextInfo();
  resource = new ResourceInfo();
  priority = Priority.UNDEFINED.getPriority();
  isUnmanagedAM = false;
  cancelTokensWhenComplete = true;
  keepContainers = false;
  applicationType = "";
  tags = new HashSet();
  appNodeLabelExpression = "";
  amContainerNodeLabelExpression = "";
}
{code}

but the check only tests whether the node-label-expression is null, not 
whether it is empty:
{code}
// check labels in the resource request.
String labelExp = resReq.getNodeLabelExpression();

// if queue has default label expression, and RR doesn't have, use the
// default label expression of queue
if (labelExp == null && queueInfo != null) {
  labelExp = queueInfo.getDefaultNodeLabelExpression();
  resReq.setNodeLabelExpression(labelExp);
}
{code}
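
One possible direction, sketched here purely for illustration and not as the agreed fix: treat an empty expression the same as an unset one when falling back to the queue default. The identifiers mirror the snippet above; everything else is assumption.

{code}
// Illustrative only: fall back to the queue's default label expression when
// the request's expression is null OR empty, since the REST path initializes
// it to "" rather than null.
String labelExp = resReq.getNodeLabelExpression();
if ((labelExp == null || labelExp.trim().isEmpty()) && queueInfo != null) {
  labelExp = queueInfo.getDefaultNodeLabelExpression();
  resReq.setNodeLabelExpression(labelExp);
}
{code}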

> default-node-label-expression doesn’t apply when an application is submitted 
> by RM rest api
> ---
>
> Key: YARN-3931
> URL: https://issues.apache.org/jira/browse/YARN-3931
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
> Environment: hadoop-2.6.0
>Reporter: kyungwan nam
>Assignee: Naganarasimha G R
>
> * 
> yarn.scheduler.capacity..default-node-label-expression=large_disk
> * submit an application using the REST API without 
> "app-node-label-expression" or "am-container-node-label-expression"
> * RM doesn't allocate containers to the hosts associated with the large_disk 
> node label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api

2015-07-16 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R reassigned YARN-3931:
---

Assignee: Naganarasimha G R

> default-node-label-expression doesn’t apply when an application is submitted 
> by RM rest api
> ---
>
> Key: YARN-3931
> URL: https://issues.apache.org/jira/browse/YARN-3931
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
> Environment: hadoop-2.6.0
>Reporter: kyungwan nam
>Assignee: Naganarasimha G R
>
> * 
> yarn.scheduler.capacity..default-node-label-expression=large_disk
> * submit an application using the REST API without 
> "app-node-label-expression" or "am-container-node-label-expression"
> * RM doesn't allocate containers to the hosts associated with the large_disk 
> node label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api

2015-07-16 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629491#comment-14629491
 ] 

Naganarasimha G R commented on YARN-3931:
-

Hi [~kyungwan nam], thanks for raising the issue. I have assigned this JIRA to 
myself, but if you are interested in looking into it further and solving it, 
please reassign it to yourself.

> default-node-label-expression doesn’t apply when an application is submitted 
> by RM rest api
> ---
>
> Key: YARN-3931
> URL: https://issues.apache.org/jira/browse/YARN-3931
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
> Environment: hadoop-2.6.0
>Reporter: kyungwan nam
>Assignee: Naganarasimha G R
>
> * 
> yarn.scheduler.capacity..default-node-label-expression=large_disk
> * submit an application using the REST API without 
> "app-node-label-expression" or "am-container-node-label-expression"
> * RM doesn't allocate containers to the hosts associated with the large_disk 
> node label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level

2015-07-16 Thread Ajith S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629555#comment-14629555
 ] 

Ajith S commented on YARN-3885:
---

The test failure is not because of the patch.

> ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 
> level
> --
>
> Key: YARN-3885
> URL: https://issues.apache.org/jira/browse/YARN-3885
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Blocker
> Attachments: YARN-3885.02.patch, YARN-3885.03.patch, 
> YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, 
> YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch
>
>
> In {{ProportionalCapacityPreemptionPolicy.cloneQueues}}, the piece of code 
> that calculates {{untoucable}} doesn't consider all the children; it only 
> considers the immediate children.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629616#comment-14629616
 ] 

Hudson commented on YARN-3174:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #258 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/258/])
YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation 
into one. Contributed by Masatake Iwasaki. (ozawa: rev 
f02dd146f58bcfa0595eec7f2433bafdd857630f)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
* hadoop-project/src/site/site.xml


> Consolidate the NodeManager and NodeManagerRestart documentation into one
> -
>
> Key: YARN-3174
> URL: https://issues.apache.org/jira/browse/YARN-3174
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.7.1
>Reporter: Allen Wittenauer
>Assignee: Masatake Iwasaki
> Fix For: 2.8.0
>
> Attachments: YARN-3174.001.patch
>
>
> We really don't need a different document for every individual nodemanager 
> feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629618#comment-14629618
 ] 

Hudson commented on YARN-90:


SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #258 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/258/])
YARN-3805. Update the documentation of Disk Checker based on YARN-90. 
Contributed by Masatake Iwasaki. (ozawa: rev 
1ba2986dee4bbb64d67ada005f8f132e69575274)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md


> NodeManager should identify failed disks becoming good again
> 
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Fix For: 2.6.0
>
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
> apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, 
> apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629615#comment-14629615
 ] 

Hudson commented on YARN-3805:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #258 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/258/])
YARN-3805. Update the documentation of Disk Checker based on YARN-90. 
Contributed by Masatake Iwasaki. (ozawa: rev 
1ba2986dee4bbb64d67ada005f8f132e69575274)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md


> Update the documentation of Disk Checker based on YARN-90
> -
>
> Key: YARN-3805
> URL: https://issues.apache.org/jira/browse/YARN-3805
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Masatake Iwasaki
>Assignee: Masatake Iwasaki
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-3805.001.patch, YARN-3805.002.patch
>
>
> With YARN-90, the NodeManager is able to recover the status of a disk that 
> was broken and then fixed, without a restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629623#comment-14629623
 ] 

Hudson commented on YARN-3174:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #988 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/988/])
YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation 
into one. Contributed by Masatake Iwasaki. (ozawa: rev 
f02dd146f58bcfa0595eec7f2433bafdd857630f)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md
* hadoop-project/src/site/site.xml
* hadoop-yarn-project/CHANGES.txt


> Consolidate the NodeManager and NodeManagerRestart documentation into one
> -
>
> Key: YARN-3174
> URL: https://issues.apache.org/jira/browse/YARN-3174
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.7.1
>Reporter: Allen Wittenauer
>Assignee: Masatake Iwasaki
> Fix For: 2.8.0
>
> Attachments: YARN-3174.001.patch
>
>
> We really don't need a different document for every individual nodemanager 
> feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629625#comment-14629625
 ] 

Hudson commented on YARN-90:


SUCCESS: Integrated in Hadoop-Yarn-trunk #988 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/988/])
YARN-3805. Update the documentation of Disk Checker based on YARN-90. 
Contributed by Masatake Iwasaki. (ozawa: rev 
1ba2986dee4bbb64d67ada005f8f132e69575274)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
* hadoop-yarn-project/CHANGES.txt


> NodeManager should identify failed disks becoming good again
> 
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Fix For: 2.6.0
>
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
> apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, 
> apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629622#comment-14629622
 ] 

Hudson commented on YARN-3805:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #988 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/988/])
YARN-3805. Update the documentation of Disk Checker based on YARN-90. 
Contributed by Masatake Iwasaki. (ozawa: rev 
1ba2986dee4bbb64d67ada005f8f132e69575274)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
* hadoop-yarn-project/CHANGES.txt


> Update the documentation of Disk Checker based on YARN-90
> -
>
> Key: YARN-3805
> URL: https://issues.apache.org/jira/browse/YARN-3805
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Masatake Iwasaki
>Assignee: Masatake Iwasaki
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-3805.001.patch, YARN-3805.002.patch
>
>
> With YARN-90, the NodeManager is able to recover the status of a disk that 
> was broken and then fixed, without a restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown

2015-07-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629646#comment-14629646
 ] 

Hadoop QA commented on YARN-3930:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m  8s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 39s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 34s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 52s | The applied patch generated  2 
new checkstyle issues (total was 14, now 15). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 19s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 34s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   1m 56s | Tests passed in 
hadoop-yarn-common. |
| | |  40m  1s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12745596/YARN-3930.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 1ba2986 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8557/artifact/patchprocess/diffcheckstylehadoop-yarn-common.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8557/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8557/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8557/console |


This message was automatically generated.

> FileSystemNodeLabelsStore should make sure edit log file closed when 
> exception is thrown 
> -
>
> Key: YARN-3930
> URL: https://issues.apache.org/jira/browse/YARN-3930
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Dian Fu
>Assignee: Dian Fu
> Attachments: YARN-3930.001.patch
>
>
> When I test the node label feature in my local environment, I encountered the 
> following exception:
> {code}
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabels

[jira] [Commented] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions

2015-07-16 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629660#comment-14629660
 ] 

Varun Saxena commented on YARN-3877:


[~chris.douglas], thanks for the review.

Yes, you are correct that this config is not required for the test; I will 
remove it. I will also move the relevant test code to a separate test.

> YarnClientImpl.submitApplication swallows exceptions
> 
>
> Key: YARN-3877
> URL: https://issues.apache.org/jira/browse/YARN-3877
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Steve Loughran
>Assignee: Varun Saxena
>Priority: Minor
> Attachments: YARN-3877.01.patch
>
>
> When {{YarnClientImpl.submitApplication}} spins waiting for the application 
> to be accepted, any interruption during its Sleep() calls is logged and 
> swallowed.
> This makes it hard to interrupt the thread during shutdown. Really, it 
> should throw some form of exception and let the caller deal with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api

2015-07-16 Thread kyungwan nam (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629701#comment-14629701
 ] 

kyungwan nam commented on YARN-3931:


Hi, I couldn't reassign it to myself.
I think I don't have the privilege to assign issues.

> default-node-label-expression doesn’t apply when an application is submitted 
> by RM rest api
> ---
>
> Key: YARN-3931
> URL: https://issues.apache.org/jira/browse/YARN-3931
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
> Environment: hadoop-2.6.0
>Reporter: kyungwan nam
>Assignee: Naganarasimha G R
>
> * 
> yarn.scheduler.capacity..default-node-label-expression=large_disk
> * submit an application using the REST API without 
> "app-node-label-expression" or "am-container-node-label-expression"
> * RM doesn't allocate containers to the hosts associated with the large_disk 
> node label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions

2015-07-16 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3877:
---
Attachment: YARN-3877.02.patch

> YarnClientImpl.submitApplication swallows exceptions
> 
>
> Key: YARN-3877
> URL: https://issues.apache.org/jira/browse/YARN-3877
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Steve Loughran
>Assignee: Varun Saxena
>Priority: Minor
> Attachments: YARN-3877.01.patch, YARN-3877.02.patch
>
>
> When {{YarnClientImpl.submitApplication}} spins waiting for the application 
> to be accepted, any interruption during its Sleep() calls is logged and 
> swallowed.
> This makes it hard to interrupt the thread during shutdown. Really, it 
> should throw some form of exception and let the caller deal with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions

2015-07-16 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629714#comment-14629714
 ] 

Varun Saxena commented on YARN-3877:


[~chris.douglas], I have uploaded a new patch.
Kindly review.

To avoid timing issues in the test, I added code that waits for the thread to 
enter sleep (the TIMED_WAITING state) before calling interrupt.
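
A minimal sketch of the technique described above, with a hypothetical helper name rather than the code in the patch: poll the target thread's state until it reaches TIMED_WAITING, then interrupt, so the interrupt reliably lands inside the sleep.

{code}
// Illustrative helper: interrupt a thread only once it is parked in sleep(),
// so the InterruptedException path is exercised deterministically.
static void interruptOnceSleeping(Thread target, long timeoutMs)
    throws InterruptedException {
  long deadline = System.currentTimeMillis() + timeoutMs;
  while (target.getState() != Thread.State.TIMED_WAITING) {
    if (System.currentTimeMillis() > deadline) {
      throw new IllegalStateException("thread never reached TIMED_WAITING");
    }
    Thread.sleep(10);              // poll until the target enters its sleep()
  }
  target.interrupt();
}
{code}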

> YarnClientImpl.submitApplication swallows exceptions
> 
>
> Key: YARN-3877
> URL: https://issues.apache.org/jira/browse/YARN-3877
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Steve Loughran
>Assignee: Varun Saxena
>Priority: Minor
> Attachments: YARN-3877.01.patch, YARN-3877.02.patch
>
>
> When {{YarnClientImpl.submitApplication}} spins waiting for the application 
> to be accepted, any interruption during its Sleep() calls is logged and 
> swallowed.
> This makes it hard to interrupt the thread during shutdown. Really, it 
> should throw some form of exception and let the caller deal with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629722#comment-14629722
 ] 

Hudson commented on YARN-3174:
--

ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #246 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/246/])
YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation 
into one. Contributed by Masatake Iwasaki. (ozawa: rev 
f02dd146f58bcfa0595eec7f2433bafdd857630f)
* hadoop-yarn-project/CHANGES.txt
* hadoop-project/src/site/site.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md


> Consolidate the NodeManager and NodeManagerRestart documentation into one
> -
>
> Key: YARN-3174
> URL: https://issues.apache.org/jira/browse/YARN-3174
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.7.1
>Reporter: Allen Wittenauer
>Assignee: Masatake Iwasaki
> Fix For: 2.8.0
>
> Attachments: YARN-3174.001.patch
>
>
> We really don't need a different document for every individual nodemanager 
> feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629728#comment-14629728
 ] 

Hudson commented on YARN-90:


ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #246 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/246/])
YARN-3805. Update the documentation of Disk Checker based on YARN-90. 
Contributed by Masatake Iwasaki. (ozawa: rev 
1ba2986dee4bbb64d67ada005f8f132e69575274)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
* hadoop-yarn-project/CHANGES.txt


> NodeManager should identify failed disks becoming good again
> 
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Fix For: 2.6.0
>
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
> apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, 
> apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629724#comment-14629724
 ] 

Hudson commented on YARN-90:


ABORTED: Integrated in Hadoop-Mapreduce-trunk #2204 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2204/])
YARN-3805. Update the documentation of Disk Checker based on YARN-90. 
Contributed by Masatake Iwasaki. (ozawa: rev 
1ba2986dee4bbb64d67ada005f8f132e69575274)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
* hadoop-yarn-project/CHANGES.txt


> NodeManager should identify failed disks becoming good again
> 
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Fix For: 2.6.0
>
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
> apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, 
> apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629716#comment-14629716
 ] 

Hudson commented on YARN-3805:
--

ABORTED: Integrated in Hadoop-Mapreduce-trunk #2204 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2204/])
YARN-3805. Update the documentation of Disk Checker based on YARN-90. 
Contributed by Masatake Iwasaki. (ozawa: rev 
1ba2986dee4bbb64d67ada005f8f132e69575274)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
* hadoop-yarn-project/CHANGES.txt


> Update the documentation of Disk Checker based on YARN-90
> -
>
> Key: YARN-3805
> URL: https://issues.apache.org/jira/browse/YARN-3805
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Masatake Iwasaki
>Assignee: Masatake Iwasaki
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-3805.001.patch, YARN-3805.002.patch
>
>
> NodeManager is able to recover status of the disk once broken and fixed 
> without restarting by YARN-90.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629715#comment-14629715
 ] 

Hudson commented on YARN-3805:
--

ABORTED: Integrated in Hadoop-Hdfs-trunk #2185 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2185/])
YARN-3805. Update the documentation of Disk Checker based on YARN-90. 
Contributed by Masatake Iwasaki. (ozawa: rev 
1ba2986dee4bbb64d67ada005f8f132e69575274)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
* hadoop-yarn-project/CHANGES.txt


> Update the documentation of Disk Checker based on YARN-90
> -
>
> Key: YARN-3805
> URL: https://issues.apache.org/jira/browse/YARN-3805
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Masatake Iwasaki
>Assignee: Masatake Iwasaki
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-3805.001.patch, YARN-3805.002.patch
>
>
> NodeManager is able to recover status of the disk once broken and fixed 
> without restarting by YARN-90.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629719#comment-14629719
 ] 

Hudson commented on YARN-3174:
--

ABORTED: Integrated in Hadoop-Mapreduce-trunk #2204 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2204/])
YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation 
into one. Contributed by Masatake Iwasaki. (ozawa: rev 
f02dd146f58bcfa0595eec7f2433bafdd857630f)
* hadoop-project/src/site/site.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md


> Consolidate the NodeManager and NodeManagerRestart documentation into one
> -
>
> Key: YARN-3174
> URL: https://issues.apache.org/jira/browse/YARN-3174
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.7.1
>Reporter: Allen Wittenauer
>Assignee: Masatake Iwasaki
> Fix For: 2.8.0
>
> Attachments: YARN-3174.001.patch
>
>
> We really don't need a different document for every individual nodemanager 
> feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3805) Update the documentation of Disk Checker based on YARN-90

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629718#comment-14629718
 ] 

Hudson commented on YARN-3805:
--

ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #246 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/246/])
YARN-3805. Update the documentation of Disk Checker based on YARN-90. 
Contributed by Masatake Iwasaki. (ozawa: rev 
1ba2986dee4bbb64d67ada005f8f132e69575274)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
* hadoop-yarn-project/CHANGES.txt


> Update the documentation of Disk Checker based on YARN-90
> -
>
> Key: YARN-3805
> URL: https://issues.apache.org/jira/browse/YARN-3805
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Masatake Iwasaki
>Assignee: Masatake Iwasaki
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: YARN-3805.001.patch, YARN-3805.002.patch
>
>
> NodeManager is able to recover status of the disk once broken and fixed 
> without restarting by YARN-90.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629723#comment-14629723
 ] 

Hudson commented on YARN-90:


ABORTED: Integrated in Hadoop-Hdfs-trunk #2185 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2185/])
YARN-3805. Update the documentation of Disk Checker based on YARN-90. 
Contributed by Masatake Iwasaki. (ozawa: rev 
1ba2986dee4bbb64d67ada005f8f132e69575274)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
* hadoop-yarn-project/CHANGES.txt


> NodeManager should identify failed disks becoming good again
> 
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Fix For: 2.6.0
>
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
> apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, 
> apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3174) Consolidate the NodeManager and NodeManagerRestart documentation into one

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629717#comment-14629717
 ] 

Hudson commented on YARN-3174:
--

ABORTED: Integrated in Hadoop-Hdfs-trunk #2185 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2185/])
YARN-3174. Consolidate the NodeManager and NodeManagerRestart documentation 
into one. Contributed by Masatake Iwasaki. (ozawa: rev 
f02dd146f58bcfa0595eec7f2433bafdd857630f)
* hadoop-project/src/site/site.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManager.md
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/NodeManagerRestart.md


> Consolidate the NodeManager and NodeManagerRestart documentation into one
> -
>
> Key: YARN-3174
> URL: https://issues.apache.org/jira/browse/YARN-3174
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.7.1
>Reporter: Allen Wittenauer
>Assignee: Masatake Iwasaki
> Fix For: 2.8.0
>
> Attachments: YARN-3174.001.patch
>
>
> We really don't need a different document for every individual nodemanager 
> feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3928) launch application master on specific host

2015-07-16 Thread Lei Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629750#comment-14629750
 ] 

Lei Guo commented on YARN-3928:
---

[~varun_saxena], I read this JIRA as a host-preference requirement during 
container allocation; it's not a duplicate of MAPREDUCE-6402. [~wenrui], can 
you confirm?

> launch application master on specific host
> --
>
> Key: YARN-3928
> URL: https://issues.apache.org/jira/browse/YARN-3928
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.6.0
> Environment: Ubuntu 12.04
>Reporter: Wenrui
>
> Hi, 
> Is there a way to launch application master on a specific host ?
> If we can not do this in a managed-AM-launcher? 
> then is it possible to achieve this in unmanaged-AM-launcher?
> I just find it's quite necessary to set application master on a specific host 
> in some  scenes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3928) launch application master on specific host

2015-07-16 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629753#comment-14629753
 ] 

Varun Saxena commented on YARN-3928:


Oh, then it is not. Misread the JIRA title.
Apologies.


> launch application master on specific host
> --
>
> Key: YARN-3928
> URL: https://issues.apache.org/jira/browse/YARN-3928
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.6.0
> Environment: Ubuntu 12.04
>Reporter: Wenrui
>
> Hi, 
> Is there a way to launch application master on a specific host ?
> If we can not do this in a managed-AM-launcher? 
> then is it possible to achieve this in unmanaged-AM-launcher?
> I just find it's quite necessary to set application master on a specific host 
> in some  scenes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api

2015-07-16 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629790#comment-14629790
 ] 

Naganarasimha G R commented on YARN-3931:
-

[~kyungwan nam], good that you are trying to contribute :). We need to request 
a committer to add you to the list of contributors, but in the meantime you can 
upload the patch with a test case and I can help review it.
[~wangda tan], 
Can you please add [~kyungwan nam] to the contributor list and assign this jira 
to him?

> default-node-label-expression doesn’t apply when an application is submitted 
> by RM rest api
> ---
>
> Key: YARN-3931
> URL: https://issues.apache.org/jira/browse/YARN-3931
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
> Environment: hadoop-2.6.0
>Reporter: kyungwan nam
>Assignee: Naganarasimha G R
>
> * 
> yarn.scheduler.capacity..default-node-label-expression=large_disk
> * submit an application using rest api without "app-node-label-expression”, 
> "am-container-node-label-expression”
> * RM doesn’t allocate containers to the hosts associated with large_disk node 
> label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions

2015-07-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629811#comment-14629811
 ] 

Hadoop QA commented on YARN-3877:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 34s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 41s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 42s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 28s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 20s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   0m 53s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   6m 55s | Tests passed in 
hadoop-yarn-client. |
| | |  43m 31s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12745625/YARN-3877.02.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 1ba2986 |
| hadoop-yarn-client test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8558/artifact/patchprocess/testrun_hadoop-yarn-client.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8558/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8558/console |


This message was automatically generated.

> YarnClientImpl.submitApplication swallows exceptions
> 
>
> Key: YARN-3877
> URL: https://issues.apache.org/jira/browse/YARN-3877
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Steve Loughran
>Assignee: Varun Saxena
>Priority: Minor
> Attachments: YARN-3877.01.patch, YARN-3877.02.patch
>
>
> When {{YarnClientImpl.submitApplication}} spins waiting for the application 
> to be accepted, any interruption during its Sleep() calls are logged and 
> swallowed.
> this makes it hard to interrupt the thread during shutdown. Really it should 
> throw some form of exception and let the caller deal with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3784) Indicate preemption timout along with the list of containers to AM (preemption message)

2015-07-16 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3784:
--
Attachment: 0002-YARN-3784.patch

Uploading a new version of the patch. 

Initially the RM sent only the list of container IDs in the preemption message. 
This patch improves that to also include the timeout along with the container 
id. The new timeout is an optional param in the proto.
[~chris.douglas] Could you please take a look?
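For reviewers, a rough sketch of how an AM might consume such a message. The 
existing PreemptionMessage/PreemptionContract accessors below are the current 
API; the timeout accessor is hypothetical and depends on the final shape of 
this patch, so it is only shown as a comment:
{code}
import java.util.Set;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.PreemptionContainer;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;

public class PreemptionHandlerSketch {
  void onAllocate(AllocateResponse response) {
    PreemptionMessage msg = response.getPreemptionMessage();
    if (msg == null || msg.getContract() == null) {
      return;  // nothing marked for preemption in this heartbeat
    }
    Set<PreemptionContainer> marked = msg.getContract().getContainers();
    // Hypothetical accessor added by this patch: how long the AM has before
    // the RM may forcefully reclaim the listed containers.
    // long timeoutMs = msg.getContract().getPreemptionTimeout();
    for (PreemptionContainer c : marked) {
      // checkpoint or drain work on c.getId() within the advertised timeout
    }
  }
}
{code}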

> Indicate preemption timout along with the list of containers to AM 
> (preemption message)
> ---
>
> Key: YARN-3784
> URL: https://issues.apache.org/jira/browse/YARN-3784
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch
>
>
> Currently during preemption, AM is notified with a list of containers which 
> are marked for preemption. Introducing a timeout duration also along with 
> this container list so that AM can know how much time it will get to do a 
> graceful shutdown to its containers (assuming one of preemption policy is 
> loaded in AM).
> This will help in decommissioning NM scenarios, where NM will be 
> decommissioned after a timeout (also killing containers on it). This timeout 
> will be helpful to indicate AM that those containers can be killed by RM 
> forcefully after the timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails

2015-07-16 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629940#comment-14629940
 ] 

Ming Ma commented on YARN-2578:
---

Thanks [~iwasakims]. Is it similar to HADOOP-11252? Given your latest patch is 
in hadoop-common, it might be better to fix it as a HADOOP jira instead.

> NM does not failover timely if RM node network connection fails
> ---
>
> Key: YARN-2578
> URL: https://issues.apache.org/jira/browse/YARN-2578
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.1
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
> Attachments: YARN-2578.002.patch, YARN-2578.patch
>
>
> The NM does not fail over correctly when the network cable of the RM is 
> unplugged or the failure is simulated by a "service network stop" or a 
> firewall that drops all traffic on the node. The RM fails over to the standby 
> node when the failure is detected as expected. The NM should than re-register 
> with the new active RM. This re-register takes a long time (15 minutes or 
> more). Until then the cluster has no nodes for processing and applications 
> are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
> node 1: ZK, NN, JN, ZKFC, DN, RM, NM
> node 2: ZK, NN, JN, ZKFC, DN, RM, NM
> node 3: ZK, JN, DN, NM
> - start all services make sure they are in good health
> - kill the network connection of the RM that is active using one of the 
> network kills from above
> - observe the NN and RM failover
> - the DN's fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and traces show no change at all
> The stack traces of the NM all show the same set of threads. The main thread 
> which should be used in the re-register is the "Node Status Updater" This 
> thread is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in 
> Object.wait() [0x7f5a51fc1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at java.lang.Object.wait(Object.java:503)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>   - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the 
> ResourceTrackerPBClientImpl. The generated proxy does not time out and we 
> should be using a version which takes the RPC timeout (from the 
> configuration) as a parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-16 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-3932:
--

 Summary: SchedulerApplicationAttempt#getResourceUsageReport should 
be based on NodeLabel
 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt


Application Resource Report is shown wrong when a node label is used.

1. Submit an application with a NodeLabel
2. Check the RM UI for the resources used 
Allocated CPU VCores and Allocated Memory MB are always {{zero}}

{code}
 public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
AggregateAppResourceUsage runningResourceUsage =
getRunningAggregateAppResourceUsage();
Resource usedResourceClone =
Resources.clone(attemptResourceUsage.getUsed());
Resource reservedResourceClone =
Resources.clone(attemptResourceUsage.getReserved());
return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
reservedContainers.size(), usedResourceClone, reservedResourceClone,
Resources.add(usedResourceClone, reservedResourceClone),
runningResourceUsage.getMemorySeconds(),
runningResourceUsage.getVcoreSeconds());
  }
{code}
should be {{attemptResourceUsage.getUsed(label)}}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-16 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3932:
---
Attachment: ApplicationReport.jpg

> SchedulerApplicationAttempt#getResourceUsageReport should be based on 
> NodeLabel
> ---
>
> Key: YARN-3932
> URL: https://issues.apache.org/jira/browse/YARN-3932
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: ApplicationReport.jpg
>
>
> Application Resource Report shown wrong when node Label is used.
> 1.Submit application with NodeLabel
> 2.Check RM UI for resources used 
> Allocated CPU VCores and Allocated Memory MB is always {{zero}}
> {code}
>  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
> AggregateAppResourceUsage runningResourceUsage =
> getRunningAggregateAppResourceUsage();
> Resource usedResourceClone =
> Resources.clone(attemptResourceUsage.getUsed());
> Resource reservedResourceClone =
> Resources.clone(attemptResourceUsage.getReserved());
> return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
> reservedContainers.size(), usedResourceClone, reservedResourceClone,
> Resources.add(usedResourceClone, reservedResourceClone),
> runningResourceUsage.getMemorySeconds(),
> runningResourceUsage.getVcoreSeconds());
>   }
> {code}
> should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1644) RM-NM protocol changes and NodeStatusUpdater implementation to support container resizing

2015-07-16 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1644:

Attachment: YARN-1644-YARN-1197.4.patch

Updated this patch as the dependent patch has been updated.

> RM-NM protocol changes and NodeStatusUpdater implementation to support 
> container resizing
> -
>
> Key: YARN-1644
> URL: https://issues.apache.org/jira/browse/YARN-1644
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan
>Assignee: MENG DING
> Attachments: YARN-1644-YARN-1197.4.patch, YARN-1644.1.patch, 
> YARN-1644.2.patch, YARN-1644.3.patch, yarn-1644.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3929) Uncleaning option for local app log files with log-aggregation feature

2015-07-16 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629998#comment-14629998
 ] 

Xuan Gong commented on YARN-3929:
-

[~dongwook]
Does the configuration {{yarn.nodemanager.delete.debug-delay-sec}} satisfy your 
requirement?
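For reference, that property is set in yarn-site.xml; a minimal example (the 
3600-second value here is just an illustration):
{code}
<!-- Keep the NM's localized files and local app logs around for an hour
     after the application finishes, instead of deleting them right away. -->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>3600</value>
</property>
{code}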

> Uncleaning option for local app log files with log-aggregation feature
> --
>
> Key: YARN-3929
> URL: https://issues.apache.org/jira/browse/YARN-3929
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: log-aggregation
>Affects Versions: 2.4.0, 2.6.0
>Reporter: Dongwook Kwon
>Priority: Minor
> Attachments: YARN-3929.01.patch
>
>
> Although it makes sense to delete local app log files once AppLogAggregator 
> copied all files into remote location(HDFS), I have some use cases that need 
> to leave local app log files after it's copied to HDFS. Mostly it's for own 
> backup purpose. I would like to use log-aggregation feature of YARN and want 
> to back up app log files too. Without this option, files has to copy from 
> HDFS to local again. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()

2015-07-16 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-3893:

Issue Type: Sub-task  (was: Bug)
Parent: YARN-149

> Both RM in active state when Admin#transitionToActive failure from refeshAll()
> --
>
> Key: YARN-3893
> URL: https://issues.apache.org/jira/browse/YARN-3893
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, 
> 0003-YARN-3893.patch, 0004-YARN-3893.patch, yarn-site.xml
>
>
> Cases that can cause this.
> # Capacity scheduler xml is wrongly configured during switch
> # Refresh ACL failure due to configuration
> # Refresh User group failure due to configuration
> Continuously both RM will try to be active
> {code}
> dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin>
>  ./yarn rmadmin  -getServiceState rm1
> 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> active
> dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin>
>  ./yarn rmadmin  -getServiceState rm2
> 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> active
> {code}
> # Both Web UI active
> # Status shown as active for both RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api

2015-07-16 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-3931:
-
Assignee: kyungwan nam  (was: Naganarasimha G R)

> default-node-label-expression doesn’t apply when an application is submitted 
> by RM rest api
> ---
>
> Key: YARN-3931
> URL: https://issues.apache.org/jira/browse/YARN-3931
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
> Environment: hadoop-2.6.0
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>
> * 
> yarn.scheduler.capacity..default-node-label-expression=large_disk
> * submit an application using rest api without "app-node-label-expression”, 
> "am-container-node-label-expression”
> * RM doesn’t allocate containers to the hosts associated with large_disk node 
> label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api

2015-07-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630010#comment-14630010
 ] 

Wangda Tan commented on YARN-3931:
--

Thanks for raising the issue, [~kyungwan nam]. I just added you to the 
contributor list and assigned the JIRA to you.

> default-node-label-expression doesn’t apply when an application is submitted 
> by RM rest api
> ---
>
> Key: YARN-3931
> URL: https://issues.apache.org/jira/browse/YARN-3931
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
> Environment: hadoop-2.6.0
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>
> * 
> yarn.scheduler.capacity..default-node-label-expression=large_disk
> * submit an application using rest api without "app-node-label-expression”, 
> "am-container-node-label-expression”
> * RM doesn’t allocate containers to the hosts associated with large_disk node 
> label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown

2015-07-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630017#comment-14630017
 ] 

Wangda Tan commented on YARN-3930:
--

[~dian.fu], Thanks for working on the JIRA. Patch looks good, will commit soon.

> FileSystemNodeLabelsStore should make sure edit log file closed when 
> exception is thrown 
> -
>
> Key: YARN-3930
> URL: https://issues.apache.org/jira/browse/YARN-3930
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Dian Fu
>Assignee: Dian Fu
> Attachments: YARN-3930.001.patch
>
>
> When I test the node label feature in my local environment, I encountered the 
> following exception:
> {code}
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The reason is that HDFS throws an exception when calling 
> {{ensureAppendEditlogFile}} because of some reason which causes the edit log 
> output stream isn't closed. This caused that the next time we call 
> {{ensureAppendEditlogFile}}, lease recovery will failed because we are just 
> the lease holder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level

2015-07-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630022#comment-14630022
 ] 

Wangda Tan commented on YARN-3885:
--

Patch LGTM, +1, will commit soon. Thanks [~ajithshetty].

> ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 
> level
> --
>
> Key: YARN-3885
> URL: https://issues.apache.org/jira/browse/YARN-3885
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Blocker
> Attachments: YARN-3885.02.patch, YARN-3885.03.patch, 
> YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, 
> YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch
>
>
> when preemption policy is {{ProportionalCapacityPreemptionPolicy.cloneQueues}}
> this piece of code, to calculate {{untoucable}} doesnt consider al the 
> children, it considers only immediate childern



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2003) Support for Application priority : Changes in RM and Capacity Scheduler

2015-07-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630042#comment-14630042
 ] 

Wangda Tan commented on YARN-2003:
--

Thanks [~sunilg] for the update. A few more comments on the latest patch:
- I suggest deferring the consideration of queue checking. We are currently 
changing how queue mapping is done; ideally it should happen before submission 
to the scheduler (maybe before assigning the application priority), see YARN-3635.
- The assumption that the queue exists before submission to the scheduler is not 
always valid. With queue mapping, the scheduler can create the queue when 
accepting the application. I suggest removing the check for the queue's 
existence. Instead, you can have a private method that gets the priority by 
queue name; if the queue does not exist, you can assign the default priority to 
the application (see the sketch below).
- Comparison of priorities should use Priority.compareTo instead of ">/<".
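To make the second point concrete, a rough sketch of the kind of helper being 
suggested (the queue lookup and the per-queue priority accessor below are 
placeholders, not code from any patch):
{code}
// Hypothetical helper inside the scheduler: resolve the priority by queue name,
// falling back to the default when the queue does not exist yet (it may be
// created later by queue mapping). Accessor names are placeholders.
private Priority getPriorityByQueueName(String queueName) {
  CSQueue queue = getQueue(queueName);
  if (queue == null) {
    return Priority.newInstance(defaultApplicationPriority);
  }
  return queue.getDefaultApplicationPriority();   // placeholder accessor
}

// Clamping via Priority#compareTo rather than >/< :
// if (appPriority.compareTo(clusterMaxPriority) > 0) {
//   appPriority = clusterMaxPriority;
// }
{code}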

> Support for Application priority : Changes in RM and Capacity Scheduler
> ---
>
> Key: YARN-2003
> URL: https://issues.apache.org/jira/browse/YARN-2003
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 
> 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 
> 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 
> 0008-YARN-2003.patch, 0009-YARN-2003.patch, 0011-YARN-2003.patch, 
> 0012-YARN-2003.patch, 0013-YARN-2003.patch, 0014-YARN-2003.patch, 
> 0015-YARN-2003.patch, 0016-YARN-2003.patch, 0017-YARN-2003.patch, 
> 0018-YARN-2003.patch, 0019-YARN-2003.patch, 0020-YARN-2003.patch, 
> 0021-YARN-2003.patch, 0022-YARN-2003.patch
>
>
> AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from 
> Submission Context and store.
> Later this can be used by Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-16 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630090#comment-14630090
 ] 

Bibin A Chundatt commented on YARN-3932:


Hi [~leftnoteasy], I think we should iterate over {{liveContainers}} and sum 
the resources used. Any thoughts?

> SchedulerApplicationAttempt#getResourceUsageReport should be based on 
> NodeLabel
> ---
>
> Key: YARN-3932
> URL: https://issues.apache.org/jira/browse/YARN-3932
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: ApplicationReport.jpg
>
>
> Application Resource Report shown wrong when node Label is used.
> 1.Submit application with NodeLabel
> 2.Check RM UI for resources used 
> Allocated CPU VCores and Allocated Memory MB is always {{zero}}
> {code}
>  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
> AggregateAppResourceUsage runningResourceUsage =
> getRunningAggregateAppResourceUsage();
> Resource usedResourceClone =
> Resources.clone(attemptResourceUsage.getUsed());
> Resource reservedResourceClone =
> Resources.clone(attemptResourceUsage.getReserved());
> return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
> reservedContainers.size(), usedResourceClone, reservedResourceClone,
> Resources.add(usedResourceClone, reservedResourceClone),
> runningResourceUsage.getMemorySeconds(),
> runningResourceUsage.getVcoreSeconds());
>   }
> {code}
> should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3914) Entity created time should be part of the row key of entity table

2015-07-16 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630125#comment-14630125
 ] 

Sangjin Lee commented on YARN-3914:
---

[~zjshen], we have been discussing this. While adding entity creation time to 
the row key may solve this problem, the concern is that it may introduce others.

If the row key is 
(user/cluster/flow/run/app_id/entity_type/created_time/entity_id), then even 
the most basic query for (entity_type + entity_id) will get much more 
complicated, right? We cannot expect readers to provide the creation time every 
time they query for an entity by id.
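To make that concrete, a string-encoded illustration of the two layouts (the 
variables and the "!" separator are placeholders, not the real key encoding):
{code}
// With the current layout, (type, id) is enough to build the key for a get.
String keyCurrent = user + "!" + cluster + "!" + flow + "!" + run + "!" + appId
    + "!" + entityType + "!" + entityId;
// With created_time between type and id, a reader that only knows (type, id)
// can no longer build the key and has to scan all timestamps under that type.
String keyProposed = user + "!" + cluster + "!" + flow + "!" + run + "!" + appId
    + "!" + entityType + "!" + createdTime + "!" + entityId;
{code}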

Also, as you said, we cannot always accommodate different query vectors by 
adding more to the row key, or we would risk blowing up the row key size or 
breaking other queries. We should be really judicious about what goes into the 
row key...

I think it's reasonable to expect that the entity id order would be either 
completely or nearly identical to the chronological order (e.g. app id, or 
container id). So perhaps we could rely on the entity id order to help mitigate 
this problem.

Thoughts?

> Entity created time should be part of the row key of entity table
> -
>
> Key: YARN-3914
> URL: https://issues.apache.org/jira/browse/YARN-3914
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>
> Entity created time should be part of the row key of entity table, between 
> entity type and entity Id. The reason to have it is to index the entities. 
> Though we cannot index the entities for all kinds of information, indexing 
> them according to the created time is very necessary. Without it, every query 
> for the latest entities that belong to an application and a type will scan 
> through all the entities that belong to them. For example, if we want to list 
> the 100 latest started containers in an YARN app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3635) Get-queue-mapping should be a common interface of YarnScheduler

2015-07-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630139#comment-14630139
 ] 

Wangda Tan commented on YARN-3635:
--

Hi [~sandyr],
Thanks for your comments. I had actually read 
QueuePlacementPolicy/QueuePlacementRule from FS before working on this patch. 
The generic design of this patch is based on FS's queue placement policy 
structure, but with some changes.

To your comments:

bq. Is a common way of configuration proposed?
No common configuration; it only defines a set of common interfaces. Since 
FS/CS have very different ways of configuration, the rules are created by the 
individual schedulers; see CapacityScheduler#updatePlacementRules as an example.

bq. What steps are required for the Fair Scheduler to integrate with this?
1) Port the existing rules to the new APIs defined in the patch; this should be 
simple.
2) Change the configuration implementation to instantiate the newly defined 
PlacementRule; you may not need to change the existing configuration items 
themselves.
3) Change the FS workflow: with this patch, queue mapping happens before 
submission to the scheduler. Remove the queue-mapping logic from FS and create 
the queue if needed.

bq. Each placement rule gets the chance to assign the app to a queue, reject 
the app, or pass. If it passes, the next rule gets a chance.
The new APIs are very similar:
- non-null: the queue is determined
- null: not determined, pass to the next rule
- exception: the application is rejected
You can take a look at 
{{org.apache.hadoop.yarn.server.resourcemanager.placement.PlacementRule}}.
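As a rough sketch of that contract (the class, method name and signature here 
are illustrative only, not necessarily what the patch defines):
{code}
// Illustrative only: encodes the non-null / null / exception contract above.
public abstract class PlacementRuleSketch {
  /**
   * @return the target queue name if this rule determines the placement,
   *         or null to let the next rule decide.
   * @throws org.apache.hadoop.yarn.exceptions.YarnException
   *         if the application should be rejected.
   */
  public abstract String getQueueForApp(String user, String requestedQueue)
      throws org.apache.hadoop.yarn.exceptions.YarnException;
}
{code}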

bq. A placement rule can base its decision on:
bq.
Yes, you can do all of them with the new API except "The set of queues given in 
the Fair Scheduler configuration":
I considered the necessity of passing the set of queues in the interface. In 
existing implementations of QueuePlacementPolicy, the FS queues are only used to 
check the mapped queue's existence. I would prefer to delay that check to 
submission to the scheduler; see my next comment about the "create" flag for 
more details.
Another reason for not passing the queue name set via the interface is that 
queues are very dynamic. For example, if a user wants to submit an application 
to the queue with the lowest utilization, the queue name set may not be enough. 
I would prefer to let the rule get what it needs from the scheduler.

bq. Rules are marked as "terminal" if they will never pass. This helps to avoid 
misconfigurations where users place rules after terminal rules.
I'm not sure if it is necessary. I think whether a rule is terminal should be 
determined at runtime, but I'm OK with it if you think it's a must-have.

bq. Rules have a "create" attribute which determines whether they can create a 
new queue or whether they must assign to existing queues.
I think whether a queue can be created should be determined by the scheduler; 
it should be part of the scheduler configuration instead of the rule itself.
You can add "create" to your implemented rules without any issue, but I prefer 
not to expose it in the public interface.

bq. Currently the set of placement rules is limited to what's implemented in 
YARN. I.e. there's no public pluggable rule support.
Agreed, this is one thing we need to do in the future. For now, we can make 
queue mapping happen in a central place first.

bq. Are there places where my summary would not describe what's going on in 
this patch?
I think it covers most of my patch; you can also take a look at the patch to 
see if anything is unexpected :).

> Get-queue-mapping should be a common interface of YarnScheduler
> ---
>
> Key: YARN-3635
> URL: https://issues.apache.org/jira/browse/YARN-3635
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Wangda Tan
>Assignee: Tan, Wangda
> Attachments: YARN-3635.1.patch, YARN-3635.2.patch, YARN-3635.3.patch, 
> YARN-3635.4.patch, YARN-3635.5.patch, YARN-3635.6.patch
>
>
> Currently, both of fair/capacity scheduler support queue mapping, which makes 
> scheduler can change queue of an application after submitted to scheduler.
> One issue of doing this in specific scheduler is: If the queue after mapping 
> has different maximum_allocation/default-node-label-expression of the 
> original queue, {{validateAndCreateResourceRequest}} in RMAppManager checks 
> the wrong queue.
> I propose to make the queue mapping as a common interface of scheduler, and 
> RMAppManager set the queue after mapping before doing validations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers

2015-07-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630151#comment-14630151
 ] 

Hadoop QA commented on YARN-433:


\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply 
the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12740222/YARN-433.2.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 1ba2986 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8559/console |


This message was automatically generated.

> When RM is catching up with node updates then it should not expire acquired 
> containers
> --
>
> Key: YARN-433
> URL: https://issues.apache.org/jira/browse/YARN-433
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Xuan Gong
> Attachments: YARN-433.1.patch, YARN-433.2.patch
>
>
> RM expires containers that are not launched within some time of being 
> allocated. The default is 10mins. When an RM is not keeping up with node 
> updates then it may not be aware of new launched containers. If the expire 
> thread fires for such containers then the RM can expire them even though they 
> may have launched.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3784) Indicate preemption timout along with the list of containers to AM (preemption message)

2015-07-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630167#comment-14630167
 ] 

Wangda Tan commented on YARN-3784:
--

Beyond the timeout, another thing we may need to consider is: after a container 
is removed from the to-be-preempted list, should we notify the scheduler/AM 
about that? This could happen if other applications release containers, or 
other queues/applications cancel resource requests.

Currently ProportionalCapacityPreemptionPolicy can notify the scheduler many 
times for the same container; if we have to-preempt/remove-from-to-preempt 
events, we can also reduce the number of messages sent to the scheduler (which 
can otherwise lead to YARN-3508).

> Indicate preemption timout along with the list of containers to AM 
> (preemption message)
> ---
>
> Key: YARN-3784
> URL: https://issues.apache.org/jira/browse/YARN-3784
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-3784.patch, 0002-YARN-3784.patch
>
>
> Currently during preemption, AM is notified with a list of containers which 
> are marked for preemption. Introducing a timeout duration also along with 
> this container list so that AM can know how much time it will get to do a 
> graceful shutdown to its containers (assuming one of preemption policy is 
> loaded in AM).
> This will help in decommissioning NM scenarios, where NM will be 
> decommissioned after a timeout (also killing containers on it). This timeout 
> will be helpful to indicate AM that those containers can be killed by RM 
> forcefully after the timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3900) Protobuf layout of yarn_security_token causes errors in other protos that include it

2015-07-16 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630209#comment-14630209
 ] 

Anubhav Dhoot commented on YARN-3900:
-

This is needed for YARN-3736. Without this, the leveldb state store 
implementation of YARN-3736 actually causes a dump.

> Protobuf layout  of yarn_security_token causes errors in other protos that 
> include it
> -
>
> Key: YARN-3900
> URL: https://issues.apache.org/jira/browse/YARN-3900
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3900.001.patch, YARN-3900.001.patch
>
>
> Because of the subdirectory server used in 
> {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}}
>  there are errors in other protos that include them.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations

2015-07-16 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630251#comment-14630251
 ] 

Subru Krishnan commented on YARN-3656:
--

Thanks [~asuresh] for reviewing the patch. We did consider allowing declarative 
plugging of planners during the early stages of development, but decided against 
it to keep the code base simpler and easier to grok, as the current algorithms 
themselves are non-trivial. We are open to doing this in the future as & when 
the need arises.

> LowCost: A Cost-Based Placement Agent for YARN Reservations
> ---
>
> Key: YARN-3656
> URL: https://issues.apache.org/jira/browse/YARN-3656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Ishai Menache
>Assignee: Jonathan Yaniv
>  Labels: capacity-scheduler, resourcemanager
> Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.1.patch, 
> YARN-3656-v1.2.patch, YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf
>
>
> YARN-1051 enables SLA support by allowing users to reserve cluster capacity 
> ahead of time. YARN-1710 introduced a greedy agent for placing user 
> reservations. The greedy agent makes fast placement decisions but at the cost 
> of ignoring the cluster committed resources, which might result in blocking 
> the cluster resources for certain periods of time, and in turn rejecting some 
> arriving jobs.
> We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” 
> the demand of the job throughout the allowed time-window according to a 
> global, load-based cost function. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers

2015-07-16 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-433:
---
Attachment: YARN-433.3.patch

Rebased the patch.

> When RM is catching up with node updates then it should not expire acquired 
> containers
> --
>
> Key: YARN-433
> URL: https://issues.apache.org/jira/browse/YARN-433
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Xuan Gong
> Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch
>
>
> RM expires containers that are not launched within some time of being 
> allocated. The default is 10mins. When an RM is not keeping up with node 
> updates then it may not be aware of new launched containers. If the expire 
> thread fires for such containers then the RM can expire them even though they 
> may have launched.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3868) ContainerManager recovery for container resizing

2015-07-16 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-3868:

Attachment: YARN-3868-YARN-1197.3.patch

Updated the patch as the dependent patches have been updated.

> ContainerManager recovery for container resizing
> 
>
> Key: YARN-3868
> URL: https://issues.apache.org/jira/browse/YARN-3868
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-3868-YARN-1197.3.patch, YARN-3868.1.patch, 
> YARN-3868.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl

2015-07-16 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630275#comment-14630275
 ] 

Joep Rottinghuis commented on YARN-3908:


bq. In fact, I'm wondering if we should put info and events into a separate 
column family like what we did for configs/metrics?

In principle we should keep everything in the same column family (fewer store 
files) unless:
a) The items that we store require a different TTL, compression, etc. This is 
the case for metrics where we need a separate TTL.
b) The columns are rather significant in size, and in many queries they'll be 
skipped (and specifically not used in push-down predicates, i.e. column value 
filters, etc.). This is the case for configuration. If we have many queries 
that just retrieve info fields and skip configs, then iterating over just the 
rows in the info column family has the benefit of not needing to access the 
config store files.

Otherwise a separate column family just results in more store files and doesn't 
really gain us anything.
Given the current code setup, switching column families is almost trivial, so 
given that there are no functional differences, I'd say let's not even try 
to further optimize this until we have way more code in place.
Then we can run large batches of historical job history files and other inputs 
(perhaps porting data from ATS v1) and then we can see the potential benefit or 
downside.

The other reason not to do premature optimization is that I'm still thinking of 
adding a few more perf tweaks. Those would also just be performance 
optimizations, not functional differences, so they are also not a priority now. 
We should look at tuning all those things much later and together in a coherent 
way. Additional settings that we need to test are RPC compression, encoding of 
the store files, and/or compression of the same.

In short, let's focus on completing functionality and then tinker with these 
settings later. 
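
As a rough illustration of point (a) above, here is how a table with one default family plus a metrics family carrying its own TTL and compression could be declared with the plain HBase 1.x client API. This is a sketch with made-up table/family names, not the actual YARN-2928 schema code.

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;

public class SeparateFamilyTtlSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor table =
          new HTableDescriptor(TableName.valueOf("sketch.entity"));

      // Info family: basic fields plus events; default TTL, no extra tuning.
      table.addFamily(new HColumnDescriptor("i"));

      // Metrics family: its own (shorter) TTL and compression -- the one
      // case where a separate family clearly pays for the extra store files.
      HColumnDescriptor metrics = new HColumnDescriptor("m");
      metrics.setTimeToLive(30 * 24 * 3600);                  // 30 days (example)
      metrics.setCompressionType(Compression.Algorithm.GZ);
      table.addFamily(metrics);

      admin.createTable(table);
    }
  }
}
{code}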

> Bugs in HBaseTimelineWriterImpl
> ---
>
> Key: YARN-3908
> URL: https://issues.apache.org/jira/browse/YARN-3908
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Vrushali C
> Attachments: YARN-3908-YARN-2928.001.patch, 
> YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch
>
>
> 1. In HBaseTimelineWriterImpl, the info column family contains the basic 
> fields of a timeline entity plus events. However, entity#info map is not 
> stored at all.
> 2 event#timestamp is also not persisted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630283#comment-14630283
 ] 

Wangda Tan commented on YARN-3932:
--

[~bibinchundatt],
I think we can add a method such as getTotalUsed to the ResourceUsage class, which 
would be more efficient than iterating over all liveContainers. This can be done in 
the near term.

To make it fully correct, I think we need to return a usage-by-partition object to 
the application, which requires API changes.
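
A rough sketch of the getTotalUsed idea (a hypothetical standalone class, not the real ResourceUsage internals): keep usage per partition and sum over partitions, which is O(#partitions) instead of O(#liveContainers).

{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Hypothetical per-partition usage tracker illustrating the proposal.
public class PartitionedUsageSketch {
  private final Map<String, Resource> usedByPartition =
      new HashMap<String, Resource>();

  public synchronized void setUsed(String partition, Resource used) {
    usedByPartition.put(partition, used);
  }

  public synchronized Resource getUsed(String partition) {
    Resource used = usedByPartition.get(partition);
    return used == null ? Resources.none() : used;
  }

  // Proposed aggregate accessor: sum across partitions instead of walking
  // every live container of the attempt.
  public synchronized Resource getTotalUsed() {
    Resource total = Resources.createResource(0, 0);
    for (Resource used : usedByPartition.values()) {
      Resources.addTo(total, used);
    }
    return total;
  }
}
{code}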

> SchedulerApplicationAttempt#getResourceUsageReport should be based on 
> NodeLabel
> ---
>
> Key: YARN-3932
> URL: https://issues.apache.org/jira/browse/YARN-3932
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: ApplicationReport.jpg
>
>
> The Application Resource Report is shown incorrectly when a node label is used.
> 1. Submit an application with a NodeLabel
> 2. Check the RM UI for the resources used 
> Allocated CPU VCores and Allocated Memory MB are always {{zero}}
> {code}
>  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
> AggregateAppResourceUsage runningResourceUsage =
> getRunningAggregateAppResourceUsage();
> Resource usedResourceClone =
> Resources.clone(attemptResourceUsage.getUsed());
> Resource reservedResourceClone =
> Resources.clone(attemptResourceUsage.getReserved());
> return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
> reservedContainers.size(), usedResourceClone, reservedResourceClone,
> Resources.add(usedResourceClone, reservedResourceClone),
> runningResourceUsage.getMemorySeconds(),
> runningResourceUsage.getVcoreSeconds());
>   }
> {code}
> should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl

2015-07-16 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630289#comment-14630289
 ] 

Joep Rottinghuis commented on YARN-3908:


Patch looks good with one comment. I completely overlooked the event info map, 
because it isn't part of the javadoc on the EntityTable. I should have 
double-checked but didn't. Thanks for catching this.

[~sjlee0] I think it would be good to update the javadoc that describes the 
EntityTable in the EntityTable.java file.
The same is probably missing from the doc "Timeline service schema for native 
HBase tables" (not sure which jira the PDF for that doc is attached to), 
because I copied it from the code. I don't think that the application table has 
been copied yet, so it won't be missing from there. 

> Bugs in HBaseTimelineWriterImpl
> ---
>
> Key: YARN-3908
> URL: https://issues.apache.org/jira/browse/YARN-3908
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Vrushali C
> Attachments: YARN-3908-YARN-2928.001.patch, 
> YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch
>
>
> 1. In HBaseTimelineWriterImpl, the info column family contains the basic 
> fields of a timeline entity plus events. However, entity#info map is not 
> stored at all.
> 2 event#timestamp is also not persisted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3908) Bugs in HBaseTimelineWriterImpl

2015-07-16 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-3908:
--
Attachment: YARN-3908-YARN-2928.004.patch

v.4 patch posted

Thanks for your feedback [~jrottinghuis]!

I corrected the {{EntityTable}} javadoc to add the info key/value and the event 
timestamp.

I also changed {{ColumnHelper.readTimeseriesResults()}} to use a different 
generic type (V) not to be confused with the main class type (T).
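
For readers following along, the shape of that generics change is roughly the following (a hypothetical signature, not the actual ColumnHelper code): the class keeps its own type parameter T, while the method declares an independent parameter V for the values in the time series.

{code}
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical shape of the change: T stays the class-level parameter,
// V is a separate method-level parameter for the time-series values.
public class ColumnHelperSketch<T> {

  public <V> NavigableMap<String, NavigableMap<Long, V>> readTimeseriesResults() {
    // The real code deserializes HBase cells here; the sketch only shows
    // that V is declared independently of T.
    return new TreeMap<String, NavigableMap<Long, V>>();
  }
}
{code}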

> Bugs in HBaseTimelineWriterImpl
> ---
>
> Key: YARN-3908
> URL: https://issues.apache.org/jira/browse/YARN-3908
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Vrushali C
> Attachments: YARN-3908-YARN-2928.001.patch, 
> YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, 
> YARN-3908-YARN-2928.004.patch
>
>
> 1. In HBaseTimelineWriterImpl, the info column family contains the basic 
> fields of a timeline entity plus events. However, entity#info map is not 
> stored at all.
> 2 event#timestamp is also not persisted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3908) Bugs in HBaseTimelineWriterImpl

2015-07-16 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-3908:
--
Attachment: YARN-3908-YARN-2928.004.patch

Oops. Forgot ColumnPrefix.

> Bugs in HBaseTimelineWriterImpl
> ---
>
> Key: YARN-3908
> URL: https://issues.apache.org/jira/browse/YARN-3908
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Vrushali C
> Attachments: YARN-3908-YARN-2928.001.patch, 
> YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, 
> YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch
>
>
> 1. In HBaseTimelineWriterImpl, the info column family contains the basic 
> fields of a timeline entity plus events. However, entity#info map is not 
> stored at all.
> 2 event#timestamp is also not persisted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart

2015-07-16 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-3905:
-
Attachment: YARN-3905.001.patch

> Application History Server UI NPEs when accessing apps run after RM restart
> ---
>
> Key: YARN-3905
> URL: https://issues.apache.org/jira/browse/YARN-3905
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.0, 2.8.0, 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3905.001.patch
>
>
> From the Application History URL (http://RmHostName:8188/applicationhistory), 
> clicking on the application ID of an app that was run after the RM daemon has 
> been restarted results in a 500 error:
> {noformat}
> Sorry, got error 500
> Please consult RFC 2616 for meanings of the error code.
> {noformat}
> The stack trace is as follows:
> {code}
> 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO 
> applicationhistoryservice.FileSystemApplicationHistoryStore: Completed 
> reading history information of all application attempts of application 
> application_1436472584878_0001
> 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: 
> Failed to read the AM container of the application attempt 
> appattempt_1436472584878_0001_01.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart

2015-07-16 Thread Jonathan Eagles (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Eagles updated YARN-3905:
--
Target Version/s: 2.7.1,   (was: 2.7.1)

> Application History Server UI NPEs when accessing apps run after RM restart
> ---
>
> Key: YARN-3905
> URL: https://issues.apache.org/jira/browse/YARN-3905
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.0, 2.8.0, 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3905.001.patch
>
>
> From the Application History URL (http://RmHostName:8188/applicationhistory), 
> clicking on the application ID of an app that was run after the RM daemon has 
> been restarted results in a 500 error:
> {noformat}
> Sorry, got error 500
> Please consult RFC 2616 for meanings of the error code.
> {noformat}
> The stack trace is as follows:
> {code}
> 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO 
> applicationhistoryservice.FileSystemApplicationHistoryStore: Completed 
> reading history information of all application attempts of application 
> application_1436472584878_0001
> 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: 
> Failed to read the AM container of the application attempt 
> appattempt_1436472584878_0001_01.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart

2015-07-16 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630452#comment-14630452
 ] 

Jonathan Eagles commented on YARN-3905:
---

+1. [~eepayne], retargeting for 2.7.2 since 2.7.1 is already released.

> Application History Server UI NPEs when accessing apps run after RM restart
> ---
>
> Key: YARN-3905
> URL: https://issues.apache.org/jira/browse/YARN-3905
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.0, 2.8.0, 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3905.001.patch
>
>
> From the Application History URL (http://RmHostName:8188/applicationhistory), 
> clicking on the application ID of an app that was run after the RM daemon has 
> been restarted results in a 500 error:
> {noformat}
> Sorry, got error 500
> Please consult RFC 2616 for meanings of the error code.
> {noformat}
> The stack trace is as follows:
> {code}
> 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO 
> applicationhistoryservice.FileSystemApplicationHistoryStore: Completed 
> reading history information of all application attempts of application 
> application_1436472584878_0001
> 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: 
> Failed to read the AM container of the application attempt 
> appattempt_1436472584878_0001_01.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart

2015-07-16 Thread Jonathan Eagles (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Eagles updated YARN-3905:
--
Target Version/s: 2.7.2  (was: 2.7.1)

> Application History Server UI NPEs when accessing apps run after RM restart
> ---
>
> Key: YARN-3905
> URL: https://issues.apache.org/jira/browse/YARN-3905
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.0, 2.8.0, 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3905.001.patch
>
>
> From the Application History URL (http://RmHostName:8188/applicationhistory), 
> clicking on the application ID of an app that was run after the RM daemon has 
> been restarted results in a 500 error:
> {noformat}
> Sorry, got error 500
> Please consult RFC 2616 for meanings of the error code.
> {noformat}
> The stack trace is as follows:
> {code}
> 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO 
> applicationhistoryservice.FileSystemApplicationHistoryStore: Completed 
> reading history information of all application attempts of application 
> application_1436472584878_0001
> 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: 
> Failed to read the AM container of the application attempt 
> appattempt_1436472584878_0001_01.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart

2015-07-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630472#comment-14630472
 ] 

Hadoop QA commented on YARN-3905:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  17m 14s | Pre-patch trunk has 6 extant 
Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   8m 29s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 23s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 21s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 37s | The applied patch generated  1 
new checkstyle issues (total was 39, now 40). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 23s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 35s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m  9s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 25s | Tests passed in 
hadoop-yarn-server-common. |
| | |  40m 39s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12745708/YARN-3905.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 0bda84f |
| Pre-patch Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/8562/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8562/artifact/patchprocess/diffcheckstylehadoop-yarn-server-common.txt
 |
| hadoop-yarn-server-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8562/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8562/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8562/console |


This message was automatically generated.

> Application History Server UI NPEs when accessing apps run after RM restart
> ---
>
> Key: YARN-3905
> URL: https://issues.apache.org/jira/browse/YARN-3905
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.0, 2.8.0, 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3905.001.patch
>
>
> From the Application History URL (http://RmHostName:8188/applicationhistory), 
> clicking on the application ID of an app that was run after the RM daemon has 
> been restarted results in a 500 error:
> {noformat}
> Sorry, got error 500
> Please consult RFC 2616 for meanings of the error code.
> {noformat}
> The stack trace is as follows:
> {code}
> 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO 
> applicationhistoryservice.FileSystemApplicationHistoryStore: Completed 
> reading history information of all application attempts of application 
> application_1436472584878_0001
> 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: 
> Failed to read the AM container of the application attempt 
> appattempt_1436472584878_0001_01.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272)
> at 
> org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apach

[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers

2015-07-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630473#comment-14630473
 ] 

Hadoop QA commented on YARN-433:


\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  15m 52s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   9m  9s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  9s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 24s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |   0m  0s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 27s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 30s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |  55m 12s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  94m 44s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12745685/YARN-433.3.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 0bda84f |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/8560/artifact/patchprocess/whitespace.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8560/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8560/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8560/console |


This message was automatically generated.

> When RM is catching up with node updates then it should not expire acquired 
> containers
> --
>
> Key: YARN-433
> URL: https://issues.apache.org/jira/browse/YARN-433
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Xuan Gong
> Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch
>
>
> RM expires containers that are not launched within some time of being 
> allocated. The default is 10mins. When an RM is not keeping up with node 
> updates then it may not be aware of new launched containers. If the expire 
> thread fires for such containers then the RM can expire them even though they 
> may have launched.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3908) Bugs in HBaseTimelineWriterImpl

2015-07-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630479#comment-14630479
 ] 

Hadoop QA commented on YARN-3908:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  18m  5s | Findbugs (version ) appears to 
be broken on YARN-2928. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   8m 23s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 23s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 22s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 28s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 41s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 24s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 26s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   1m 21s | Tests passed in 
hadoop-yarn-server-timelineservice. |
| | |  45m  2s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12745704/YARN-3908-YARN-2928.004.patch
 |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / eb1932d |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8563/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-server-timelineservice test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8563/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8563/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8563/console |


This message was automatically generated.

> Bugs in HBaseTimelineWriterImpl
> ---
>
> Key: YARN-3908
> URL: https://issues.apache.org/jira/browse/YARN-3908
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Vrushali C
> Attachments: YARN-3908-YARN-2928.001.patch, 
> YARN-3908-YARN-2928.002.patch, YARN-3908-YARN-2928.003.patch, 
> YARN-3908-YARN-2928.004.patch, YARN-3908-YARN-2928.004.patch
>
>
> 1. In HBaseTimelineWriterImpl, the info column family contains the basic 
> fields of a timeline entity plus events. However, entity#info map is not 
> stored at all.
> 2 event#timestamp is also not persisted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630500#comment-14630500
 ] 

Hudson commented on YARN-3930:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8176 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8176/])
YARN-3930. FileSystemNodeLabelsStore should make sure edit log file closed when 
exception is thrown. (Dian Fu via wangda) (wangda: rev 
fa2b63ed162410ba05eadf211a1da068351b293a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java
* hadoop-yarn-project/CHANGES.txt


> FileSystemNodeLabelsStore should make sure edit log file closed when 
> exception is thrown 
> -
>
> Key: YARN-3930
> URL: https://issues.apache.org/jira/browse/YARN-3930
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Dian Fu
>Assignee: Dian Fu
> Fix For: 2.8.0
>
> Attachments: YARN-3930.001.patch
>
>
> When I tested the node label feature in my local environment, I encountered the 
> following exception:
> {code}
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The reason is that HDFS throws an exception when calling 
> {{ensureAppendEditlogFile}} for some reason, which leaves the edit log 
> output stream unclosed. As a result, the next time we call 
> {{ensureAppendEditlogFile}}, lease recovery fails because we ourselves are still 
> the lease holder.
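
A minimal sketch of the pattern the fix title calls for (a hypothetical helper, not the committed patch): make sure the append stream is closed even when a write fails, so the HDFS lease is not left dangling for the next append attempt.

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical helper illustrating the close-on-exception pattern.
public class EditLogAppendSketch {

  static void appendRecord(FileSystem fs, Path editLog, byte[] record)
      throws IOException {
    FSDataOutputStream out = null;
    try {
      out = fs.append(editLog);
      out.write(record);
      out.close();
      out = null;            // closed cleanly, nothing left to clean up
    } finally {
      // If append/write/close threw, close the stream so this client does
      // not keep holding the lease on the edit log file.
      IOUtils.cleanup(null, out);
    }
  }
}
{code}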



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level

2015-07-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630509#comment-14630509
 ] 

Hudson commented on YARN-3885:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8177 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8177/])
YARN-3885. ProportionalCapacityPreemptionPolicy doesn't preempt if queue is 
more than 2 level. (Ajith S via wangda) (wangda: rev 
3540d5fe4b1da942ea80c9e7ca1126b1abb8a68a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java
* hadoop-yarn-project/CHANGES.txt


> ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 
> level
> --
>
> Key: YARN-3885
> URL: https://issues.apache.org/jira/browse/YARN-3885
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Blocker
> Fix For: 2.8.0
>
> Attachments: YARN-3885.02.patch, YARN-3885.03.patch, 
> YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, 
> YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch
>
>
> when the preemption policy is {{ProportionalCapacityPreemptionPolicy.cloneQueues}},
> this piece of code, which calculates {{untoucable}}, doesn't consider all the 
> children; it considers only the immediate children
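
An illustration of the gist of the fix (stand-in types, not the real CSQueue/TempQueue structures used by ProportionalCapacityPreemptionPolicy): when summing the resources that must not be preempted, walk all descendants rather than only the immediate children.

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Stand-in queue node, illustrative only.
class QueueNodeSketch {
  final List<QueueNodeSketch> children = new ArrayList<QueueNodeSketch>();
  Resource untouchableHere = Resources.createResource(0, 0);

  // Sum over the whole subtree instead of stopping at the first level.
  Resource untouchableDeep() {
    Resource total = Resources.clone(untouchableHere);
    for (QueueNodeSketch child : children) {
      Resources.addTo(total, child.untouchableDeep());
    }
    return total;
  }
}
{code}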



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails

2015-07-16 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630532#comment-14630532
 ] 

Masatake Iwasaki commented on YARN-2578:


Yes, it is the same fix. I agree it should be fixed in a hadoop-common JIRA. Thanks.

> NM does not failover timely if RM node network connection fails
> ---
>
> Key: YARN-2578
> URL: https://issues.apache.org/jira/browse/YARN-2578
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.1
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
> Attachments: YARN-2578.002.patch, YARN-2578.patch
>
>
> The NM does not fail over correctly when the network cable of the RM is 
> unplugged or the failure is simulated by a "service network stop" or a 
> firewall that drops all traffic on the node. The RM fails over to the standby 
> node when the failure is detected as expected. The NM should then re-register 
> with the new active RM. This re-register takes a long time (15 minutes or 
> more). Until then the cluster has no nodes for processing and applications 
> are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
> node 1: ZK, NN, JN, ZKFC, DN, RM, NM
> node 2: ZK, NN, JN, ZKFC, DN, RM, NM
> node 3: ZK, JN, DN, NM
> - start all services and make sure they are in good health
> - kill the network connection of the RM that is active using one of the 
> network kills from above
> - observe the NN and RM failover
> - the DN's fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and traces show no change at all
> The stack traces of the NM all show the same set of threads. The main thread 
> which should be used in the re-register is the "Node Status Updater". This 
> thread is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in 
> Object.wait() [0x7f5a51fc1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at java.lang.Object.wait(Object.java:503)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>   - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the 
> ResourceTrackerPBClientImpl. The generated proxy does not time out and we 
> should be using a version which takes the RPC timeout (from the 
> configuration) as a parameter.
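
A rough sketch of the direction the last paragraph points at (illustrative, not the attached patch): build the proxy through the {{RPC.getProxy}} overload that accepts an explicit rpcTimeout, so a dead connection fails fast instead of blocking the heartbeat thread. The configuration key below is a placeholder, not a real Hadoop setting.

{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.net.NetUtils;
import org.apache.hadoop.security.UserGroupInformation;

// Illustrative helper: create a protocol proxy with an explicit RPC timeout
// instead of one that can wait on a dead connection indefinitely.
public class TimeoutProxySketch {

  static <T> T createProxyWithTimeout(Class<T> protocol, long version,
      InetSocketAddress rmAddress, Configuration conf) throws IOException {
    // "example.rpc.timeout.ms" is a placeholder key for this sketch.
    int rpcTimeoutMs = conf.getInt("example.rpc.timeout.ms", 60000);
    return RPC.getProxy(protocol, version, rmAddress,
        UserGroupInformation.getCurrentUser(), conf,
        NetUtils.getDefaultSocketFactory(conf), rpcTimeoutMs);
  }
}
{code}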



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3900) Protobuf layout of yarn_security_token causes errors in other protos that include it

2015-07-16 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3900:

Description: 
Because of the subdirectory "server" used in 
{{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}}
 there are errors in other protos that include it.
As per the docs http://sergei-ivanov.github.io/maven-protoc-plugin/usage.html 
{noformat} Any subdirectories under src/main/proto/ are treated as package 
structure for protobuf definition imports.{noformat}
 

  was:
Because of the subdirectory "server" used in 
{{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}}
 there are errors in other protos that include it.
 


> Protobuf layout  of yarn_security_token causes errors in other protos that 
> include it
> -
>
> Key: YARN-3900
> URL: https://issues.apache.org/jira/browse/YARN-3900
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3900.001.patch, YARN-3900.001.patch
>
>
> Because of the subdirectory "server" used in 
> {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}}
>  there are errors in other protos that include it.
> As per the docs http://sergei-ivanov.github.io/maven-protoc-plugin/usage.html 
> {noformat} Any subdirectories under src/main/proto/ are treated as package 
> structure for protobuf definition imports.{noformat}
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails

2015-07-16 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630535#comment-14630535
 ] 

Masatake Iwasaki commented on YARN-2578:


bq. 2. Would you tell me why Client.getRpcTimeout returns 0 if ipc.client.ping 
is false?

Just to make it clear that the timeout has no effect without setting 
{{ipc.client.ping}} to true.

> NM does not failover timely if RM node network connection fails
> ---
>
> Key: YARN-2578
> URL: https://issues.apache.org/jira/browse/YARN-2578
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.1
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
> Attachments: YARN-2578.002.patch, YARN-2578.patch
>
>
> The NM does not fail over correctly when the network cable of the RM is 
> unplugged or the failure is simulated by a "service network stop" or a 
> firewall that drops all traffic on the node. The RM fails over to the standby 
> node when the failure is detected as expected. The NM should then re-register 
> with the new active RM. This re-register takes a long time (15 minutes or 
> more). Until then the cluster has no nodes for processing and applications 
> are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
> node 1: ZK, NN, JN, ZKFC, DN, RM, NM
> node 2: ZK, NN, JN, ZKFC, DN, RM, NM
> node 3: ZK, JN, DN, NM
> - start all services and make sure they are in good health
> - kill the network connection of the RM that is active using one of the 
> network kills from above
> - observe the NN and RM failover
> - the DN's fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and traces show no change at all
> The stack traces of the NM all show the same set of threads. The main thread 
> which should be used in the re-register is the "Node Status Updater". This 
> thread is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in 
> Object.wait() [0x7f5a51fc1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at java.lang.Object.wait(Object.java:503)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>   - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the 
> ResourceTrackerPBClientImpl. The generated proxy does not time out and we 
> should be using a version which takes the RPC timeout (from the 
> configuration) as a parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3900) Protobuf layout of yarn_security_token causes errors in other protos that include it

2015-07-16 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3900:

Attachment: YARN-3900.002.patch

Updated patch for recent changes in ContainerTokenIdentifierProto

> Protobuf layout  of yarn_security_token causes errors in other protos that 
> include it
> -
>
> Key: YARN-3900
> URL: https://issues.apache.org/jira/browse/YARN-3900
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3900.001.patch, YARN-3900.001.patch, 
> YARN-3900.002.patch
>
>
> Because of the subdirectory "server" used in 
> {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}}
>  there are errors in other protos that include it.
> As per the docs http://sergei-ivanov.github.io/maven-protoc-plugin/usage.html 
> {noformat} Any subdirectories under src/main/proto/ are treated as package 
> structure for protobuf definition imports.{noformat}
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3906) split the application table from the entity table

2015-07-16 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630593#comment-14630593
 ] 

Sangjin Lee commented on YARN-3906:
---

The bulk of the work is done, but I'd like to wait until YARN-3908 is committed 
and then update the changes.

> split the application table from the entity table
> -
>
> Key: YARN-3906
> URL: https://issues.apache.org/jira/browse/YARN-3906
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
>
> Per discussions on YARN-3815, we need to split the application entities from 
> the main entity table into its own table (application).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3930) FileSystemNodeLabelsStore should make sure edit log file closed when exception is thrown

2015-07-16 Thread Dian Fu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630615#comment-14630615
 ] 

Dian Fu commented on YARN-3930:
---

Thanks [~leftnoteasy] for review and commit.

> FileSystemNodeLabelsStore should make sure edit log file closed when 
> exception is thrown 
> -
>
> Key: YARN-3930
> URL: https://issues.apache.org/jira/browse/YARN-3930
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Dian Fu
>Assignee: Dian Fu
> Fix For: 2.8.0
>
> Attachments: YARN-3930.001.patch
>
>
> When I tested the node label feature in my local environment, I encountered the 
> following exception:
> {code}
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2426)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2523)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2498)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:662)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:418)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.handleStoreEvent(CommonNodeLabelsManager.java:196)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:168)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager$ForwardingEventHandler.handle(CommonNodeLabelsManager.java:163)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:176)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The reason is that HDFS throws an exception when calling 
> {{ensureAppendEditlogFile}} for some reason, which leaves the edit log 
> output stream unclosed. As a result, the next time we call 
> {{ensureAppendEditlogFile}}, lease recovery fails because we ourselves are still 
> the lease holder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api

2015-07-16 Thread kyungwan nam (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-3931:
---
Attachment: YARN-3931.001.patch

I attached the patch. It works well in my cluster... :)

> default-node-label-expression doesn’t apply when an application is submitted 
> by RM rest api
> ---
>
> Key: YARN-3931
> URL: https://issues.apache.org/jira/browse/YARN-3931
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
> Environment: hadoop-2.6.0
>Reporter: kyungwan nam
>Assignee: kyungwan nam
> Attachments: YARN-3931.001.patch
>
>
> * 
> yarn.scheduler.capacity..default-node-label-expression=large_disk
> * submit an application using the REST API without "app-node-label-expression", 
> "am-container-node-label-expression"
> * RM doesn’t allocate containers to the hosts associated with large_disk node 
> label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3900) Protobuf layout of yarn_security_token causes errors in other protos that include it

2015-07-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630643#comment-14630643
 ] 

Hadoop QA commented on YARN-3900:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  15m 51s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 43s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 44s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 21s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 36s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 20s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   3m 59s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   1m 55s | Tests passed in 
hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |   3m 11s | Tests passed in 
hadoop-yarn-server-applicationhistoryservice. |
| {color:green}+1{color} | yarn tests |  51m  8s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  97m 24s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12745719/YARN-3900.002.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 3540d5f |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8564/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-applicationhistoryservice test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8564/artifact/patchprocess/testrun_hadoop-yarn-server-applicationhistoryservice.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8564/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8564/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8564/console |


This message was automatically generated.

> Protobuf layout  of yarn_security_token causes errors in other protos that 
> include it
> -
>
> Key: YARN-3900
> URL: https://issues.apache.org/jira/browse/YARN-3900
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3900.001.patch, YARN-3900.001.patch, 
> YARN-3900.002.patch
>
>
> Because of the subdirectory "server" used in 
> {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}}
>  there are errors in other protos that include it.
> As per the docs http://sergei-ivanov.github.io/maven-protoc-plugin/usage.html 
> {noformat} Any subdirectories under src/main/proto/ are treated as package 
> structure for protobuf definition imports.{noformat}
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3931) default-node-label-expression doesn’t apply when an application is submitted by RM rest api

2015-07-16 Thread Xianyin Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630659#comment-14630659
 ] 

Xianyin Xin commented on YARN-3931:
---

This reminds me of an earlier problem I have run into. Hi [~Naganarasimha], can we 
consider removing the "" node label expression in the code? It doesn't seem to make 
sense to set a node label to "". A node label expression should be either 
"some_label" or null.

Just an informal thought, what do you think?
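
A tiny sketch of the normalization being discussed (a hypothetical helper, not the attached patch): treat an empty node-label expression like an unset one, so the queue's default-node-label-expression can apply. The "fast_cpu" label below is made up for the example.

{code}
// Hypothetical helper, not part of the attached patch.
public final class NodeLabelExpressionSketch {

  private NodeLabelExpressionSketch() {
  }

  // Treat null and "" the same: fall back to the queue's default expression.
  public static String resolve(String requested, String queueDefault) {
    if (requested == null || requested.trim().isEmpty()) {
      return queueDefault;
    }
    return requested.trim();
  }

  public static void main(String[] args) {
    System.out.println(resolve("", "large_disk"));         // -> large_disk
    System.out.println(resolve("fast_cpu", "large_disk")); // -> fast_cpu
  }
}
{code}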

> default-node-label-expression doesn’t apply when an application is submitted 
> by RM rest api
> ---
>
> Key: YARN-3931
> URL: https://issues.apache.org/jira/browse/YARN-3931
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
> Environment: hadoop-2.6.0
>Reporter: kyungwan nam
>Assignee: kyungwan nam
> Attachments: YARN-3931.001.patch
>
>
> * 
> yarn.scheduler.capacity..default-node-label-expression=large_disk
> * submit an application using the REST API without "app-node-label-expression", 
> "am-container-node-label-expression"
> * RM doesn’t allocate containers to the hosts associated with large_disk node 
> label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3885) ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 level

2015-07-16 Thread Ajith S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630669#comment-14630669
 ] 

Ajith S commented on YARN-3885:
---

Thanks [~leftnoteasy] , [~xinxianyin] and [~sunilg] :)

> ProportionalCapacityPreemptionPolicy doesn't preempt if queue is more than 2 
> level
> --
>
> Key: YARN-3885
> URL: https://issues.apache.org/jira/browse/YARN-3885
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Blocker
> Fix For: 2.8.0
>
> Attachments: YARN-3885.02.patch, YARN-3885.03.patch, 
> YARN-3885.04.patch, YARN-3885.05.patch, YARN-3885.06.patch, 
> YARN-3885.07.patch, YARN-3885.08.patch, YARN-3885.patch
>
>
> when the preemption policy is {{ProportionalCapacityPreemptionPolicy.cloneQueues}},
> this piece of code, which calculates {{untoucable}}, doesn't consider all the 
> children; it considers only the immediate children



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3736) Persist the Plan information, ie. accepted reservations to the RMStateStore for failover

2015-07-16 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3736:

Attachment: YARN-3736.001.patch

Patch that adds the implementation of ReservationSystem state to all the state 
stores. Actually persisting the information is the next step.

> Persist the Plan information, ie. accepted reservations to the RMStateStore 
> for failover
> 
>
> Key: YARN-3736
> URL: https://issues.apache.org/jira/browse/YARN-3736
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Anubhav Dhoot
> Attachments: YARN-3736.001.patch
>
>
> We need to persist the current state of the plan, i.e. the accepted 
> ReservationAllocations & corresponding RLESparseResourceAllocations to the 
> RMStateStore so that we can recover them on RM failover. This involves making 
> all the reservation system data structures protobuf friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3897) "Too many links" in NM log dir

2015-07-16 Thread Hong Zhiguo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Zhiguo updated YARN-3897:
--
Description: 
Users need to keep container logs for more than one day. On some nodes of our busy 
cluster, the number of subdirs of {yarn.nodemanager.log-dirs} may reach 32000, 
which is the default limit of the ext3 file system. As a result, we got errors when 
initiating containers:
"Failed to create directory 
{yarn.nodemanager.log-dirs}/application_1435111082717_1341740 - Too many links"

Log aggregation is not an option for us because of the heavy pressure it puts on 
the NameNode. With a cluster of 5K nodes and 20K log files per node, it's not 
acceptable to aggregate so many files to HDFS.

Since ext3 is still widely used, we'd better do something to avoid such errors.

  was:
Users need to keep container logs for more than one day. On some nodes of our busy 
cluster, the number of subdirs of {yarn.nodemanager.log-dirs} may reach 32000, 
which is the default limit of the ext3 file system. As a result, we got errors when 
initiating containers:
"Failed to create directory 
{yarn.nodemanager.log-dirs}/logs/application_1435111082717_1341740 - Too many 
links"

Log aggregation is not an option for us because of the heavy pressure it puts on 
the NameNode. With a cluster of 5K nodes and 20K log files per node, it's not 
acceptable to aggregate so many files to HDFS.

Since ext3 is still widely used, we'd better do something to avoid such errors.


> "Too many links" in NM log dir
> --
>
> Key: YARN-3897
> URL: https://issues.apache.org/jira/browse/YARN-3897
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Hong Zhiguo
>Assignee: Hong Zhiguo
>Priority: Minor
>
> Users need to keep container logs for more than one day. On some nodes of our 
> busy cluster, the number of subdirs of {yarn.nodemanager.log-dirs} may reach 
> 32000, which is the default limit of the ext3 file system. As a result, we got 
> errors when initiating containers:
> "Failed to create directory 
> {yarn.nodemanager.log-dirs}/application_1435111082717_1341740 - Too many 
> links"
> Log aggregation is not an option for us because of the heavy pressure it puts 
> on the NameNode. With a cluster of 5K nodes and 20K log files per node, it's 
> not acceptable to aggregate so many files to HDFS.
> Since ext3 is still widely used, we'd better do something to avoid such errors.
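
One possible direction, sketched below purely for illustration (the bucket count 
and directory layout are assumptions, not the proposed fix): hash each 
application into a bounded set of intermediate subdirectories so that no single 
parent directory ever approaches the ext3 link limit.

{code}
// Illustrative sketch only: spread per-application log dirs across a bounded
// number of buckets so a single parent dir stays well under ext3's ~32000
// subdirectory limit.
import java.io.File;

class BucketedLogDirLayout {

  private static final int NUM_BUCKETS = 256;

  static File applicationLogDir(File logRoot, String applicationId) {
    // e.g. <log-dir>/bucket-0042/application_1435111082717_1341740
    int bucket = Math.floorMod(applicationId.hashCode(), NUM_BUCKETS);
    File bucketDir = new File(logRoot, String.format("bucket-%04d", bucket));
    return new File(bucketDir, applicationId);
  }

  public static void main(String[] args) {
    File dir = applicationLogDir(new File("/var/log/yarn"),
        "application_1435111082717_1341740");
    System.out.println(dir); // prints the bucketed path
  }
}
{code}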



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3736) Persist the Plan information, ie. accepted reservations to the RMStateStore for failover

2015-07-16 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-3736:

Attachment: YARN-3736.001.patch

> Persist the Plan information, ie. accepted reservations to the RMStateStore 
> for failover
> 
>
> Key: YARN-3736
> URL: https://issues.apache.org/jira/browse/YARN-3736
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Anubhav Dhoot
> Attachments: YARN-3736.001.patch, YARN-3736.001.patch
>
>
> We need to persist the current state of the plan, i.e. the accepted 
> ReservationAllocations & corresponding RLESparseResourceAllocations, to the 
> RMStateStore so that we can recover them on RM failover. This involves making 
> all the reservation system data structures protobuf friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2768) optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread

2015-07-16 Thread Hong Zhiguo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630688#comment-14630688
 ] 

Hong Zhiguo commented on YARN-2768:
---

[~kasha], could you please review the patch?

> optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% 
> of computing time of update thread
> 
>
> Key: YARN-2768
> URL: https://issues.apache.org/jira/browse/YARN-2768
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Hong Zhiguo
>Assignee: Hong Zhiguo
>Priority: Minor
> Attachments: YARN-2768.patch, profiling_FairScheduler_update.png
>
>
> See the attached picture of the profiling result. The clone of the Resource 
> object within Resources.multiply() takes up **85%** (19.2 / 22.6) of the CPU 
> time of FairScheduler.update().
> The code of FSAppAttempt.updateDemand:
> {code}
> public void updateDemand() {
> demand = Resources.createResource(0);
> // Demand is current consumption plus outstanding requests
> Resources.addTo(demand, app.getCurrentConsumption());
> // Add up outstanding resource requests
> synchronized (app) {
>   for (Priority p : app.getPriorities()) {
> for (ResourceRequest r : app.getResourceRequests(p).values()) {
>   Resource total = Resources.multiply(r.getCapability(), 
> r.getNumContainers());
>   Resources.addTo(demand, total);
> }
>   }
> }
>   }
> {code}
> The code of Resources.multiply:
> {code}
> public static Resource multiply(Resource lhs, double by) {
> return multiplyTo(clone(lhs), by);
> }
> {code}
> The clone could be skipped by directly updating the value of this.demand.
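
As a self-contained sketch of that idea (the {{Resource}} type below is a 
simplified stand-in, not the YARN class), the capability can be multiplied and 
added into the running demand in place, without allocating an intermediate copy 
per request:

{code}
// Simplified stand-in types; illustrates avoiding the per-request clone.
class Resource {
  long memory;
  long vcores;

  Resource(long memory, long vcores) {
    this.memory = memory;
    this.vcores = vcores;
  }
}

class DemandCalculator {

  // Before: demand += clone(capability) * numContainers  (allocates a copy per request)
  // After:  demand is updated in place, no intermediate Resource object.
  static void multiplyAndAddTo(Resource demand, Resource capability, int numContainers) {
    demand.memory += capability.memory * numContainers;
    demand.vcores += capability.vcores * numContainers;
  }

  public static void main(String[] args) {
    Resource demand = new Resource(0, 0);
    multiplyAndAddTo(demand, new Resource(1024, 1), 3);
    System.out.println(demand.memory + " MB, " + demand.vcores + " vcores"); // 3072 MB, 3 vcores
  }
}
{code}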



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)

2015-07-16 Thread Hong Zhiguo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630694#comment-14630694
 ] 

Hong Zhiguo commented on YARN-2306:
---

Hi [~rchiang], do you mean running the unit test in the patch against trunk?

> leak of reservation metrics (fair scheduler)
> 
>
> Key: YARN-2306
> URL: https://issues.apache.org/jira/browse/YARN-2306
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Hong Zhiguo
>Assignee: Hong Zhiguo
>Priority: Minor
> Attachments: YARN-2306-2.patch, YARN-2306.patch
>
>
> This only applies to the fair scheduler; the capacity scheduler is OK.
> When an appAttempt or node is removed, the reservation metrics 
> (reservedContainers, reservedMB, reservedVCores) are not reduced back.
> These are important metrics for administrators, and the wrong values may 
> confuse them.
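
A self-contained sketch of the bookkeeping this report is about (the types below 
are simplified stand-ins, not the FairScheduler classes): when an attempt goes 
away, its outstanding reservations need to be subtracted from the queue metrics, 
otherwise they stay inflated forever.

{code}
// Simplified stand-ins; shows releasing reservation metrics on removal.
import java.util.ArrayList;
import java.util.List;

class QueueMetrics {
  int reservedContainers;
  long reservedMB;
  long reservedVCores;

  void unreserve(long mb, long vcores) {
    reservedContainers--;
    reservedMB -= mb;
    reservedVCores -= vcores;
  }
}

class ReservedContainer {
  final long mb;
  final long vcores;

  ReservedContainer(long mb, long vcores) {
    this.mb = mb;
    this.vcores = vcores;
  }
}

class AppAttempt {
  final List<ReservedContainer> reservations = new ArrayList<>();

  // The point of the report: without this cleanup, the reserved* metrics
  // stay inflated after the attempt (or its node) is removed.
  void releaseReservations(QueueMetrics metrics) {
    for (ReservedContainer rc : reservations) {
      metrics.unreserve(rc.mb, rc.vcores);
    }
    reservations.clear();
  }

  public static void main(String[] args) {
    QueueMetrics metrics = new QueueMetrics();
    metrics.reservedContainers = 1;
    metrics.reservedMB = 2048;
    metrics.reservedVCores = 2;

    AppAttempt attempt = new AppAttempt();
    attempt.reservations.add(new ReservedContainer(2048, 2));
    attempt.releaseReservations(metrics);
    System.out.println(metrics.reservedContainers + ", " + metrics.reservedMB + " MB"); // 0, 0 MB
  }
}
{code}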



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3845) [YARN] YARN status in web ui does not show correctly in IE 11

2015-07-16 Thread Mohammad Shahid Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohammad Shahid Khan updated YARN-3845:
---
Attachment: YARN-3845.patch

> [YARN] YARN status in web ui does not show correctly in IE 11
> -
>
> Key: YARN-3845
> URL: https://issues.apache.org/jira/browse/YARN-3845
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jagadesh Kiran N
>Assignee: Mohammad Shahid Khan
>Priority: Trivial
> Attachments: IE11_yarn.gif, YARN-3845.patch
>
>
> In IE 11, the scheduler colors are not displayed correctly. In other browsers 
> they show correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)

2015-07-16 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630733#comment-14630733
 ] 

Ray Chiang commented on YARN-2306:
--

Heh.  That was two months ago.  I believe I was referring to the unit test.

> leak of reservation metrics (fair scheduler)
> 
>
> Key: YARN-2306
> URL: https://issues.apache.org/jira/browse/YARN-2306
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Hong Zhiguo
>Assignee: Hong Zhiguo
>Priority: Minor
> Attachments: YARN-2306-2.patch, YARN-2306.patch
>
>
> This only applies to the fair scheduler; the capacity scheduler is OK.
> When an appAttempt or node is removed, the reservation metrics 
> (reservedContainers, reservedMB, reservedVCores) are not reduced back.
> These are important metrics for administrators, and the wrong values may 
> confuse them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


  1   2   >