[jira] [Created] (YARN-8669) Yarn application has already ended! It might have been killed or unable to launch application master.

2018-08-15 Thread Bheemidi Vikram Reddy (JIRA)
Bheemidi Vikram Reddy created YARN-8669:
---

 Summary: Yarn application has already ended! It might have been 
killed or unable to launch application master.
 Key: YARN-8669
 URL: https://issues.apache.org/jira/browse/YARN-8669
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications/unmanaged-AM-launcher
Affects Versions: 2.7.3
 Environment: Ubuntu-16.04

RAM-32gb

Cores-8
Reporter: Bheemidi Vikram Reddy
 Attachments: yarn-testuser-resourcemanager-coea18.log

When I submit a Spark job to the YARN cluster through a Zeppelin notebook, the 
ApplicationMaster gets killed. Could someone please help me with the YARN configuration?






[jira] [Assigned] (YARN-8613) Old RM UI shows wrong vcores total value

2018-08-15 Thread Sen Zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sen Zhao reassigned YARN-8613:
--

Assignee: (was: Sen Zhao)

> Old RM UI shows wrong vcores total value
> 
>
> Key: YARN-8613
> URL: https://issues.apache.org/jira/browse/YARN-8613
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Akhil PB
>Priority: Major
> Attachments: Screen Shot 2018-08-02 at 12.12.41 PM.png, Screen Shot 
> 2018-08-02 at 12.16.53 PM.png, YARN-8613.001.patch
>
>







[jira] [Commented] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource

2018-08-15 Thread Yeliang Cang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581866#comment-16581866
 ] 

Yeliang Cang commented on YARN-8668:


Thanks [~leftnoteasy] for clarifying this; closing this Jira as not a problem.

> Inconsistency between capacity and fair scheduler in the aspect of computing 
> node available resource
> 
>
> Key: YARN-8668
> URL: https://issues.apache.org/jira/browse/YARN-8668
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yeliang Cang
>Assignee: Yeliang Cang
>Priority: Major
>  Labels: capacityscheduler
> Attachments: YARN-8668.001.patch
>
>
> We have observed that with CapacityScheduler and DefaultResourceCalculator, 
> when a node has a lot of memory and is running a heavy workload, the 
> available vcores of that node can become negative.
> I noticed that CapacityScheduler.java uses the code below to calculate the 
> available resources for allocating containers:
> {code}
> if (calculator.computeAvailableContainers(Resources
>  .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
>  minimumAllocation) <= 0) {
>  if (LOG.isDebugEnabled()) {
>  LOG.debug("This node or this node partition doesn't have available or"
>  + "killable resource");
>  }
> {code}
> while in fairscheduler FsAppAttempt.java, similar code was found:
> {code}
> // Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
> ...
> }
> {code}
> Why is there this inconsistency? I think we should use 
> Resources.fitsIn(smaller, bigger) in CapacityScheduler instead.
>  






[jira] [Commented] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts

2018-08-15 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581842#comment-16581842
 ] 

genericqa commented on YARN-8667:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 17s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
54s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
24s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 22s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 1 new + 63 unchanged - 1 fixed = 64 total (was 64) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 30s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 18m 
58s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
26s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 75m 29s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8667 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12935786/YARN-8667.001.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 397b6ab4ff6a 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 
08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7dc79a8 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/21610/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/21610/testReport/ |
| Max. process+thread count | 448 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 

[jira] [Updated] (YARN-8662) Fair Scheduler stops scheduling when a queue is configured only CPU and memory

2018-08-15 Thread Sen Zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sen Zhao updated YARN-8662:
---
Component/s: fairscheduler

> Fair Scheduler stops scheduling when a queue is configured only CPU and memory
> --
>
> Key: YARN-8662
> URL: https://issues.apache.org/jira/browse/YARN-8662
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Sen Zhao
>Assignee: Sen Zhao
>Priority: Major
> Attachments: NonResourceToSchedule.png, YARN-8662.001.patch
>
>
> Add a new resource type in resource-types.xml, e.g. resource1. 
> In the Fair Scheduler, when a queue's maxResources is configured like: 
> {code}4096 mb, 4 vcores{code}
> and an application is submitted that needs resources like:
> {code} 1536 mb, 1 vcores, 10 resource1{code}
> the application will stay pending, because there is no resource1 configured for this queue.






[jira] [Updated] (YARN-8597) Build Worker utility for MaWo Application

2018-08-15 Thread Yesha Vora (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesha Vora updated YARN-8597:
-
Attachment: YARN-8597.001.patch

> Build Worker utility for MaWo Application
> -
>
> Key: YARN-8597
> URL: https://issues.apache.org/jira/browse/YARN-8597
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Yesha Vora
>Assignee: Yesha Vora
>Priority: Major
> Attachments: YARN-8597.001.patch
>
>
> The worker is responsible for executing tasks. 
>  * Worker
>  ** Create a worker class which drives the worker life cycle
>  ** Create a WorkAssignment protocol. It should handle registering/deregistering the 
> worker and sending heartbeats 
>  ** Lifecycle: register the worker, run the setup task, get a task from the master and 
> execute it using a TaskRunner, run the teardown task
>  *  TaskRunner
>  ** Simple Task Runner: this runner should be able to execute a simple task
>  ** Composite Task Runner: this runner should be able to execute a composite 
> task
>  * TaskWallTimeLimiter
>  ** Create a utility which can abort a task if its execution time exceeds 
> the task timeout. 
>  
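For orientation only, here is a hypothetical Java sketch of the protocol and task shape implied by the list above; none of these names or signatures come from the actual MaWo code, they are illustrative guesses.

{code:java}
// Hypothetical shapes for the pieces listed in the description above.
public interface WorkAssignmentProtocolSketch {
  String registerWorker(String workerHost);   // returns a worker id
  void sendHeartbeat(String workerId);
  TaskSketch getTask(String workerId);        // next task from the master
  void deregisterWorker(String workerId);

  interface TaskSketch {
    String getCommand();
    long getWallTimeLimitMillis();            // enforced by TaskWallTimeLimiter
  }
}
{code}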






[jira] [Commented] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts

2018-08-15 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581802#comment-16581802
 ] 

Chandni Singh commented on YARN-8667:
-

Patch 1 contains a fix and a unit test. 

[~billie.rinaldi] [~eyang] please review

> Container Relaunch fails with "find: File system loop detected;" for tar ball 
> artifacts
> ---
>
> Key: YARN-8667
> URL: https://issues.apache.org/jira/browse/YARN-8667
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8667.001.patch
>
>
> A service is launched with tarball artifacts. If a container exits for 
> any reason, the container relaunch policy tries to relaunch the container on the same 
> node with the same container workspace. As a result, the container relaunch keeps 
> failing. 
> If the container relaunch max-retry policy is disabled, the container is never 
> launched on any other node either; it keeps retrying on the same node 
> manager and never succeeds.
> {code}
> Relaunching Container container_e05_1533635581781_0001_01_02. Remaining 
> retry attempts(after relaunch) : -4816.
> {code}
> There are two issues:
> # Container relaunch keeps failing
> # The log message is misleading






[jira] [Updated] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts

2018-08-15 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-8667:

Attachment: YARN-8667.001.patch

> Container Relaunch fails with "find: File system loop detected;" for tar ball 
> artifacts
> ---
>
> Key: YARN-8667
> URL: https://issues.apache.org/jira/browse/YARN-8667
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8667.001.patch
>
>
> A service is launched with tarball artifacts. If a container exits for 
> any reason, the container relaunch policy tries to relaunch the container on the same 
> node with the same container workspace. As a result, the container relaunch keeps 
> failing. 
> If the container relaunch max-retry policy is disabled, the container is never 
> launched on any other node either; it keeps retrying on the same node 
> manager and never succeeds.
> {code}
> Relaunching Container container_e05_1533635581781_0001_01_02. Remaining 
> retry attempts(after relaunch) : -4816.
> {code}
> There are two issues:
> # Container relaunch keeps failing
> # The log message is misleading






[jira] [Assigned] (YARN-8569) Create an interface to provide cluster information to application

2018-08-15 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang reassigned YARN-8569:
---

Assignee: Eric Yang

> Create an interface to provide cluster information to application
> -
>
> Key: YARN-8569
> URL: https://issues.apache.org/jira/browse/YARN-8569
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
>
> Some programs require container hostnames to be known for the application to run. 
>  For example, distributed TensorFlow requires a launch_command that looks like:
> {code}
> # On ps0.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=ps --task_index=0
> # On ps1.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=ps --task_index=1
> # On worker0.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=worker --task_index=0
> # On worker1.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=worker --task_index=1
> {code}
> This is a bit cumbersome to orchestrate via Distributed Shell or the YARN 
> services launch_command.  In addition, the dynamic parameters do not work 
> with the YARN flex command.  This is the classic pain point for application 
> developers attempting to automate system environment settings as parameters to the end 
> user application.
> It would be great if the YARN Docker integration could provide a simple option to 
> expose the hostnames of the YARN service via a mounted file.  The file content 
> would be updated when a flex command is performed.  This allows application 
> developers to consume system environment settings via a standard interface.  
> It is like /proc/devices for Linux, but for Hadoop.  This may involve 
> updating a file in the distributed cache and allowing the file to be mounted via 
> container-executor.
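As a sketch of how an application might consume such a mounted file, assuming a plain one-hostname-per-line format: both the mount path and the file format below are assumptions, since the JIRA only proposes the interface.

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ClusterInfoReaderSketch {
  public static void main(String[] args) throws IOException {
    // Hypothetical mount point; one hostname per line.
    List<String> hosts = Files.readAllLines(
        Paths.get("/hadoop/yarn/cluster-info/hostnames"), StandardCharsets.UTF_8);
    // A trainer wrapper could assemble --ps_hosts/--worker_hosts from this list.
    System.out.println("--worker_hosts=" + String.join(",", hosts));
  }
}
{code}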






[jira] [Commented] (YARN-8488) YARN service/components/instances should have SUCCEEDED/FAILED states

2018-08-15 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581598#comment-16581598
 ] 

Eric Yang commented on YARN-8488:
-

[~suma.shivaprasad], thank you for the patch.  A few minor nitpicks:
 # Introduce a synchronized boolean getTimelineServiceEnabled method to make this 
class thread safe.
 # The 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/component/Component.java
 change is unnecessary.
 # ComponentInstance.java, near line 265, } else {
 # It might be useful to pass a real diagnostic string to 
handleComponentInstanceRelaunch to make sure the downstream classes don't 
fail due to an NPE.

The new state works fine.
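As a rough sketch of the first nitpick (the class and field names here are hypothetical, not the actual patch code), a synchronized getter/setter pair keeps reads and writes of the flag consistent across threads:

{code:java}
public class ServiceContextSketch {
  private boolean timelineServiceEnabled;

  // Synchronized accessor pair: all reads and writes of the flag go through
  // the same monitor, so no thread sees a stale value.
  public synchronized boolean getTimelineServiceEnabled() {
    return timelineServiceEnabled;
  }

  public synchronized void setTimelineServiceEnabled(boolean enabled) {
    this.timelineServiceEnabled = enabled;
  }
}
{code}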

> YARN service/components/instances should have SUCCEEDED/FAILED states
> -
>
> Key: YARN-8488
> URL: https://issues.apache.org/jira/browse/YARN-8488
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-8488.1.patch, YARN-8488.2.patch, YARN-8488.3.patch, 
> YARN-8488.4.patch, YARN-8488.5.patch
>
>
> Existing YARN service has following states:
> {code} 
> public enum ServiceState {
>   ACCEPTED, STARTED, STABLE, STOPPED, FAILED, FLEX, UPGRADING,
>   UPGRADING_AUTO_FINALIZE;
> }
> {code} 
> Ideally we should add a "SUCCEEDED" state in order to support long-running 
> applications like TensorFlow.






[jira] [Comment Edited] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource

2018-08-15 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581574#comment-16581574
 ] 

Wangda Tan edited comment on YARN-8668 at 8/15/18 8:34 PM:
---

Thanks [~Cyl] for reporting the issue; this is by design in CS. 

Using computeAvailableContainers gives a correct result whether 
DominantResourceCalculator or DefaultResourceCalculator is enabled. Using 
fitsIn(res, res) only works correctly when DominantResourceCalculator is enabled. To me, 
the correct solution is to use fits(resourceCalculator, res, res).

I don't think a fix is required in CS.


was (Author: leftnoteasy):
Thanks [~Cyl] for reporting the issue; this is by design in CS. 

Using computeAvailableContainers gives a correct result whether 
DominantResourceCalculator or DefaultResourceCalculator is enabled. Using fitsIn 
only works correctly when DominantResourceCalculator is enabled.

I don't think a fix is required in CS.

> Inconsistency between capacity and fair scheduler in the aspect of computing 
> node available resource
> 
>
> Key: YARN-8668
> URL: https://issues.apache.org/jira/browse/YARN-8668
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yeliang Cang
>Assignee: Yeliang Cang
>Priority: Major
>  Labels: capacityscheduler
> Attachments: YARN-8668.001.patch
>
>
> We have observed that with CapacityScheduler and DefaultResourceCalculator, 
> when a node has a lot of memory and is running a heavy workload, the 
> available vcores of that node can become negative.
> I noticed that CapacityScheduler.java uses the code below to calculate the 
> available resources for allocating containers:
> {code}
> if (calculator.computeAvailableContainers(Resources
>  .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
>  minimumAllocation) <= 0) {
>  if (LOG.isDebugEnabled()) {
>  LOG.debug("This node or this node partition doesn't have available or"
>  + "killable resource");
>  }
> {code}
> while in fairscheduler FsAppAttempt.java, similar code was found:
> {code}
> // Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
> ...
> }
> {code}
> Why is there this inconsistency? I think we should use 
> Resources.fitsIn(smaller, bigger) in CapacityScheduler instead.
>  






[jira] [Commented] (YARN-8509) Total pending resource calculation in preemption should use user-limit factor instead of minimum-user-limit-percent

2018-08-15 Thread Zian Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581573#comment-16581573
 ] 

Zian Chen commented on YARN-8509:
-

Discussed offline with Eric and Wangda; I will upload a new patch to verify that the 
algorithm we provided here works as expected and does not cause any over-preemption.

> Total pending resource calculation in preemption should use user-limit factor 
> instead of minimum-user-limit-percent
> ---
>
> Key: YARN-8509
> URL: https://issues.apache.org/jira/browse/YARN-8509
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Attachments: YARN-8509.001.patch, YARN-8509.002.patch, 
> YARN-8509.003.patch
>
>
> In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate the total 
> pending resource based on the user-limit percent and user-limit factor, which 
> caps the pending resource for each user at the minimum of the user-limit pending and the 
> actual pending. This prevents the queue from taking more pending resource to 
> achieve queue balance after every queue is satisfied with its ideal allocation.
>   
>  We need to change the logic so that queue pending can go beyond the user limit.
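As a rough, illustrative calculation (toy numbers and a simplified formula, not the actual LeafQueue math), capping each user's counted pending at the minimum-user-limit-percent share can understate what the user may actually consume when user-limit-factor is greater than 1:

{code:java}
public class PendingCapSketch {
  public static void main(String[] args) {
    long queueCapacityMb = 100 * 1024;   // 100 GB queue capacity (toy value)
    double minUserLimitPct = 0.5;        // minimum-user-limit-percent = 50
    double userLimitFactor = 2.0;        // user-limit-factor = 2
    long actualPendingMb = 80 * 1024;    // the user's real pending demand

    long userLimitMb = (long) (queueCapacityMb * minUserLimitPct);      // 50 GB
    long cappedPendingMb = Math.min(actualPendingMb, userLimitMb);      // 50 GB counted
    long userLimitFactorCapMb = (long) (userLimitMb * userLimitFactor); // 100 GB allowed

    System.out.println("pending counted: " + cappedPendingMb
        + " MB, but the user could actually grow to " + userLimitFactorCapMb + " MB");
  }
}
{code}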






[jira] [Commented] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource

2018-08-15 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581574#comment-16581574
 ] 

Wangda Tan commented on YARN-8668:
--

Thanks [~Cyl] for reporting the issue; this is by design in CS. 

Using computeAvailableContainers gives a correct result whether 
DominantResourceCalculator or DefaultResourceCalculator is enabled. Using fitsIn 
only works correctly when DominantResourceCalculator is enabled.

I don't think a fix is required in CS.
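To make the difference concrete, here is a minimal, self-contained Java sketch (a toy model, not the actual Hadoop ResourceCalculator/Resources classes) of why a memory-only check in the spirit of DefaultResourceCalculator can keep allocating after vcores run out, while an all-dimension check in the spirit of Resources.fitsIn stops:

{code:java}
public class SchedulerCheckSketch {
  static final long MIN_ALLOC_MB = 1024;
  static final int MIN_ALLOC_VCORES = 1;

  // Memory-only check: vcores are ignored, so allocation may proceed even
  // when no vcores are left, driving available vcores negative.
  static boolean hasRoomMemoryOnly(long availMb, int availVcores) {
    return availMb / MIN_ALLOC_MB > 0;
  }

  // All-dimension check: every resource dimension must fit.
  static boolean hasRoomAllDimensions(long availMb, int availVcores) {
    return MIN_ALLOC_MB <= availMb && MIN_ALLOC_VCORES <= availVcores;
  }

  public static void main(String[] args) {
    long availMb = 64 * 1024; // plenty of memory left
    int availVcores = 0;      // but no vcores left
    System.out.println("memory-only check allows allocation: "
        + hasRoomMemoryOnly(availMb, availVcores));      // true
    System.out.println("all-dimension check allows allocation: "
        + hasRoomAllDimensions(availMb, availVcores));   // false
  }
}
{code}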

> Inconsistency between capacity and fair scheduler in the aspect of computing 
> node available resource
> 
>
> Key: YARN-8668
> URL: https://issues.apache.org/jira/browse/YARN-8668
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yeliang Cang
>Assignee: Yeliang Cang
>Priority: Major
>  Labels: capacityscheduler
> Attachments: YARN-8668.001.patch
>
>
> We have observed that with CapacityScheduler and DefaultResourceCalculator, 
> when a node has a lot of memory and is running a heavy workload, the 
> available vcores of that node can become negative.
> I noticed that CapacityScheduler.java uses the code below to calculate the 
> available resources for allocating containers:
> {code}
> if (calculator.computeAvailableContainers(Resources
>  .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
>  minimumAllocation) <= 0) {
>  if (LOG.isDebugEnabled()) {
>  LOG.debug("This node or this node partition doesn't have available or"
>  + "killable resource");
>  }
> {code}
> while in fairscheduler FsAppAttempt.java, similar code was found:
> {code}
> // Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
> ...
> }
> {code}
> Why is there this inconsistency? I think we should use 
> Resources.fitsIn(smaller, bigger) in CapacityScheduler instead.
>  






[jira] [Commented] (YARN-8474) sleeper service fails to launch with "Authentication Required"

2018-08-15 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581558#comment-16581558
 ] 

genericqa commented on YARN-8474:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 32m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 54s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
18s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 12s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-api:
 The patch generated 19 new + 4 unchanged - 0 fixed = 23 total (was 4) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 36s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
42s{color} | {color:green} hadoop-yarn-services-api in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 64m 38s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8474 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12935740/YARN-8474.006.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  xml  findbugs  checkstyle  |
| uname | Linux 6706e194e545 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / d951af2 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/21609/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-api.txt
 |
|  Test Results | 

[jira] [Commented] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts

2018-08-15 Thread Billie Rinaldi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581548#comment-16581548
 ] 

Billie Rinaldi commented on YARN-8667:
--

That sounds like the issue. Thanks for figuring out the problem, [~csingh]! It 
will be good to get this bug fixed.

> Container Relaunch fails with "find: File system loop detected;" for tar ball 
> artifacts
> ---
>
> Key: YARN-8667
> URL: https://issues.apache.org/jira/browse/YARN-8667
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Chandni Singh
>Priority: Major
>
> A service is launched with tarball artifacts. If a container exits for 
> any reason, the container relaunch policy tries to relaunch the container on the same 
> node with the same container workspace. As a result, the container relaunch keeps 
> failing. 
> If the container relaunch max-retry policy is disabled, the container is never 
> launched on any other node either; it keeps retrying on the same node 
> manager and never succeeds.
> {code}
> Relaunching Container container_e05_1533635581781_0001_01_02. Remaining 
> retry attempts(after relaunch) : -4816.
> {code}
> There are two issues:
> # Container relaunch keeps failing
> # The log message is misleading






[jira] [Commented] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts

2018-08-15 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581524#comment-16581524
 ] 

Chandni Singh commented on YARN-8667:
-

Before a relaunch, the container script and container tokens file are deleted from the 
container's working directory.
{code:java}
protected void cleanupContainerFiles(Path containerWorkDir) {
  LOG.debug("cleanup container {} files", containerWorkDir);
  // delete ContainerScriptPath
  deleteAsUser(new Path(containerWorkDir, CONTAINER_SCRIPT));
  // delete TokensPath
  deleteAsUser(new Path(containerWorkDir, FINAL_CONTAINER_TOKENS_FILE));
}{code}
Seems like we might have to delete any symlinks from the container's working 
directory as well?

cc. [~billie.rinaldi] [~shaneku...@gmail.com] [~eyang]
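For illustration only, a minimal java.nio sketch of the idea of removing stale symlinks from a container work directory before relaunch. The actual NodeManager cleanup goes through deleteAsUser rather than plain java.nio, so treat the helper name and approach here as assumptions, not the patch:

{code:java}
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SymlinkCleanupSketch {
  // Delete only the symbolic links directly under the container work dir,
  // leaving regular files and directories (e.g. localized artifacts) alone.
  static void deleteSymlinks(Path containerWorkDir) throws IOException {
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(containerWorkDir)) {
      for (Path entry : entries) {
        if (Files.isSymbolicLink(entry)) {
          Files.delete(entry);
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    deleteSymlinks(Paths.get(args[0]));
  }
}
{code}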

> Container Relaunch fails with "find: File system loop detected;" for tar ball 
> artifacts
> ---
>
> Key: YARN-8667
> URL: https://issues.apache.org/jira/browse/YARN-8667
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Chandni Singh
>Priority: Major
>
> A service is launched with tarball artifacts. If a container exits for 
> any reason, the container relaunch policy tries to relaunch the container on the same 
> node with the same container workspace. As a result, the container relaunch keeps 
> failing. 
> If the container relaunch max-retry policy is disabled, the container is never 
> launched on any other node either; it keeps retrying on the same node 
> manager and never succeeds.
> {code}
> Relaunching Container container_e05_1533635581781_0001_01_02. Remaining 
> retry attempts(after relaunch) : -4816.
> {code}
> There are two issues:
> # Container relaunch keeps failing
> # The log message is misleading






[jira] [Updated] (YARN-8474) sleeper service fails to launch with "Authentication Required"

2018-08-15 Thread Billie Rinaldi (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated YARN-8474:
-
Attachment: YARN-8474.006.patch

> sleeper service fails to launch with "Authentication Required"
> --
>
> Key: YARN-8474
> URL: https://issues.apache.org/jira/browse/YARN-8474
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Sumana Sathish
>Assignee: Billie Rinaldi
>Priority: Critical
> Attachments: YARN-8474.001.patch, YARN-8474.002.patch, 
> YARN-8474.003.patch, YARN-8474.004.patch, YARN-8474.005.patch, 
> YARN-8474.006.patch
>
>
> Sleeper job fails with Authentication required.
> {code}
> yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition 
> from local FS: /a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required
> {code}






[jira] [Commented] (YARN-8474) sleeper service fails to launch with "Authentication Required"

2018-08-15 Thread Billie Rinaldi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581488#comment-16581488
 ] 

Billie Rinaldi commented on YARN-8474:
--

Patch 6 fixes checkstyle issues.

> sleeper service fails to launch with "Authentication Required"
> --
>
> Key: YARN-8474
> URL: https://issues.apache.org/jira/browse/YARN-8474
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Sumana Sathish
>Assignee: Billie Rinaldi
>Priority: Critical
> Attachments: YARN-8474.001.patch, YARN-8474.002.patch, 
> YARN-8474.003.patch, YARN-8474.004.patch, YARN-8474.005.patch, 
> YARN-8474.006.patch
>
>
> Sleeper job fails with Authentication required.
> {code}
> yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition 
> from local FS: /a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required
> {code}






[jira] [Commented] (YARN-8474) sleeper service fails to launch with "Authentication Required"

2018-08-15 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581470#comment-16581470
 ] 

genericqa commented on YARN-8474:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 28m 
54s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 31s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
17s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 10s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-api:
 The patch generated 15 new + 4 unchanged - 0 fixed = 19 total (was 4) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 53s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
13s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
42s{color} | {color:green} hadoop-yarn-services-api in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 57m  0s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8474 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12935725/YARN-8474.005.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  xml  findbugs  checkstyle  |
| uname | Linux 134c2a1fa8d2 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 
08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / c918d88 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/21608/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-api.txt
 |
|  Test Results | 

[jira] [Assigned] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts

2018-08-15 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh reassigned YARN-8667:
---

Assignee: Chandni Singh

> Container Relaunch fails with "find: File system loop detected;" for tar ball 
> artifacts
> ---
>
> Key: YARN-8667
> URL: https://issues.apache.org/jira/browse/YARN-8667
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Chandni Singh
>Priority: Major
>
> A service is launched with tarball artifacts. If a container exits for 
> any reason, the container relaunch policy tries to relaunch the container on the same 
> node with the same container workspace. As a result, the container relaunch keeps 
> failing. 
> If the container relaunch max-retry policy is disabled, the container is never 
> launched on any other node either; it keeps retrying on the same node 
> manager and never succeeds.
> {code}
> Relaunching Container container_e05_1533635581781_0001_01_02. Remaining 
> retry attempts(after relaunch) : -4816.
> {code}
> There are two issues:
> # Container relaunch keeps failing
> # The log message is misleading






[jira] [Commented] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery

2018-08-15 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581419#comment-16581419
 ] 

Jason Lowe commented on YARN-8242:
--

Thanks for updating the patch!

bq. The problem/issue that I faced with that is seeking/skipping to next user 
entry in the localization state is complex, as we do not know who next user is 
or how much information (key/values) is associated with a respective user 
without iterating.

Rather than a full re-iteration, we can seek to a key that we know is after a 
user's localization entries but necessarily before any other user's entry.  
Seeking is very fast and done all the time during recovery, so it would be much 
faster than iterating.  For example, userA's private localization entries will 
have a key prefix of "Localization/private/userA/" and have entries with a 
prefix of either "Localization/private/userA/filecache/" or 
"Localization/private/userA/appcache/".  If we seek to a key that occurs 
lexicographically after those prefixes, like "Localization/private/userA/zzz", 
then we will have an iterator starting after the localization records for userA 
but necessarily before any user that occurs after userA lexicographically.  
That avoids the double-iteration performance problem and does not rely on 
approaches that would require the previous user iterator to be fully consumed 
to function properly.
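A minimal sketch of that seek trick, written directly against the org.iq80.leveldb API rather than the NM state store's LeveldbIterator wrapper, so the helper name and error handling here are assumptions, not the patch:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;

public class SeekPastUserSketch {
  // Position an iterator just past all of one user's private localization keys.
  // Key layout ("Localization/private/<user>/...") and the sentinel suffix
  // "zzz" follow the description above.
  static void dumpKeysAfterUser(DB db, String user) throws Exception {
    byte[] seekKey = ("Localization/private/" + user + "/zzz")
        .getBytes(StandardCharsets.UTF_8);
    try (DBIterator it = db.iterator()) {
      it.seek(seekKey);
      while (it.hasNext()) {
        Map.Entry<byte[], byte[]> entry = it.next();
        System.out.println(new String(entry.getKey(), StandardCharsets.UTF_8));
      }
    }
  }
}
{code}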

bq. So, reading LocalResourceTrackerState might require two different keys.

Yes, one way to solve that is to have two iterators for the two payloads, one for 
completed resources and one for started resources.  We know the prefix to seek 
for on each one, so they are easy to setup.

It's a bit trickier to do the full iteration for localized resource state, but 
it should be possible.  I would be fine with punting that to a followup JIRA 
since this current work is still a significant improvement over the old method 
of loading everything at once.

Other comments on the patch:

getLeveldbIterator calls constructors and methods that can throw 
DBException which is a runtime exception.  Those need to be caught and 
translated to IOException as was done with iterators before this patch.

Some lines were reformatted to split else blocks onto separate lines and remove 
spaces before opening braces which is inconsistent with the coding style.  New 
methods and conditionals were added without whitespace between the parameters 
and the opening brace.  Checkstyle is currently passing with false positives, 
otherwise I would expect it to complain.

typo: getConstainerStateIterator

Rather than redundantly re-parsing a container ID from the key, it would be 
cleaner and more intuitive to have RecoveredContainerState track the container 
ID.  RecoveredContainerState didn't need to explicitly track it before since it 
was always paired with a container ID in a map, but now that we're returning a 
series of objects via an iterator it makes sense to move that key into the 
value object, in this case the RecoveredContainerState.

This comment was not addressed, intentional?
bq. Nit: RCSIterator would be more readable as ContainerStateIterator, e.g.: 
getContainerStateIterator instead of getRCSIterator. Similar comments for the 
other acronym iterator classes.

getNextRecoveredLocalizationEntry implies it could be called for all types of 
localization entries but it only works for private resources.  The name should 
reflect that or it could simply be pulled into RURIterator#getNextItem directly.

getMasterKey is more complicated than it needs to be.  No iterator is needed since 
we can look up keys in the database directly, e.g.:
{code}
  private MasterKey getMasterKey(String dbKey) throws IOException {
byte[] data = db.get(bytes(dbKey));
if (data == null || data.length == 0) {
  return null;
}
return parseMasterKey(data);
  }
{code}

The synchronization on the various load methods for the memory state store is a 
false promise of safety as they return iterators that can access state 
asynchronously with other state store operations.  For real safety here it 
would need to return an iterator on a copy of the underlying state rather than 
an iterator on the state directly.  leveldb is async-safe but the memory store 
is not.

Why does TestNMLeveldbStateStoreService#loadContainersState explicitly check 
for and skip recovered containers without a start request?  Isn't it the job of 
the iterator to not return those types of entries?



> YARN NM: OOM error while reading back the state store on recovery
> -
>
> Key: YARN-8242
> URL: https://issues.apache.org/jira/browse/YARN-8242
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.6.0, 2.9.0, 2.6.5, 2.8.3, 3.1.0, 2.7.6, 3.0.2
>Reporter: Kanwaljeet 

[jira] [Updated] (YARN-8474) sleeper service fails to launch with "Authentication Required"

2018-08-15 Thread Billie Rinaldi (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated YARN-8474:
-
Attachment: YARN-8474.005.patch

> sleeper service fails to launch with "Authentication Required"
> --
>
> Key: YARN-8474
> URL: https://issues.apache.org/jira/browse/YARN-8474
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Sumana Sathish
>Assignee: Eric Yang
>Priority: Critical
> Attachments: YARN-8474.001.patch, YARN-8474.002.patch, 
> YARN-8474.003.patch, YARN-8474.004.patch, YARN-8474.005.patch
>
>
> Sleeper job fails with Authentication required.
> {code}
> yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition 
> from local FS: /a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required
> {code}






[jira] [Commented] (YARN-8474) sleeper service fails to launch with "Authentication Required"

2018-08-15 Thread Billie Rinaldi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581410#comment-16581410
 ] 

Billie Rinaldi commented on YARN-8474:
--

Attached patch 5 based on patch 4 plus dependency cleanup.

> sleeper service fails to launch with "Authentication Required"
> --
>
> Key: YARN-8474
> URL: https://issues.apache.org/jira/browse/YARN-8474
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Sumana Sathish
>Assignee: Billie Rinaldi
>Priority: Critical
> Attachments: YARN-8474.001.patch, YARN-8474.002.patch, 
> YARN-8474.003.patch, YARN-8474.004.patch, YARN-8474.005.patch
>
>
> Sleeper job fails with Authentication required.
> {code}
> yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition 
> from local FS: /a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required
> {code}






[jira] [Assigned] (YARN-8474) sleeper service fails to launch with "Authentication Required"

2018-08-15 Thread Billie Rinaldi (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi reassigned YARN-8474:


Assignee: Billie Rinaldi  (was: Eric Yang)

> sleeper service fails to launch with "Authentication Required"
> --
>
> Key: YARN-8474
> URL: https://issues.apache.org/jira/browse/YARN-8474
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Sumana Sathish
>Assignee: Billie Rinaldi
>Priority: Critical
> Attachments: YARN-8474.001.patch, YARN-8474.002.patch, 
> YARN-8474.003.patch, YARN-8474.004.patch, YARN-8474.005.patch
>
>
> Sleeper job fails with Authentication required.
> {code}
> yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition 
> from local FS: /a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required
> {code}






[jira] [Commented] (YARN-7708) [GPG] Load based policy generator

2018-08-15 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581348#comment-16581348
 ] 

Botong Huang commented on YARN-7708:


Committed to YARN-7402. Thanks [~youchen] for the patch! 

> [GPG] Load based policy generator
> -
>
> Key: YARN-7708
> URL: https://issues.apache.org/jira/browse/YARN-7708
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Young Chen
>Priority: Major
> Attachments: YARN-7708-YARN-7402.01.cumulative.patch, 
> YARN-7708-YARN-7402.01.patch, YARN-7708-YARN-7402.02.cumulative.patch, 
> YARN-7708-YARN-7402.02.patch, YARN-7708-YARN-7402.03.cumulative.patch, 
> YARN-7708-YARN-7402.03.patch, YARN-7708-YARN-7402.03.patch, 
> YARN-7708-YARN-7402.04.cumulative.patch, YARN-7708-YARN-7402.04.patch, 
> YARN-7708-YARN-7402.04.patch, YARN-7708-YARN-7402.05.cumulative.patch, 
> YARN-7708-YARN-7402.05.patch, YARN-7708-YARN-7402.06.cumulative.patch, 
> YARN-7708-YARN-7402.07.cumulative.patch
>
>
> This policy reads load from the "pendingQueueLength" metric and scales it 
> into a set of weights that influence the AMRMProxy and Router 
> behaviors.
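To make the idea concrete, a sketch of one possible scaling from pendingQueueLength values to normalized weights; the real policy generator's formula and configuration are not shown in this thread, so this is only an illustration:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class LoadWeightSketch {
  // Turn per-subcluster pending queue lengths into routing weights that favor
  // less-loaded subclusters.
  static Map<String, Double> toWeights(Map<String, Integer> pendingQueueLength) {
    Map<String, Double> weights = new HashMap<>();
    double total = 0.0;
    for (Map.Entry<String, Integer> e : pendingQueueLength.entrySet()) {
      double w = 1.0 / (1 + e.getValue()); // heavier backlog -> smaller weight
      weights.put(e.getKey(), w);
      total += w;
    }
    for (Map.Entry<String, Double> e : weights.entrySet()) {
      e.setValue(e.getValue() / total);    // normalize so weights sum to 1
    }
    return weights;
  }
}
{code}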






[jira] [Commented] (YARN-7708) [GPG] Load based policy generator

2018-08-15 Thread Young Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581323#comment-16581323
 ] 

Young Chen commented on YARN-7708:
--

Unit test failure is unrelated.

> [GPG] Load based policy generator
> -
>
> Key: YARN-7708
> URL: https://issues.apache.org/jira/browse/YARN-7708
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Carlo Curino
>Assignee: Young Chen
>Priority: Major
> Attachments: YARN-7708-YARN-7402.01.cumulative.patch, 
> YARN-7708-YARN-7402.01.patch, YARN-7708-YARN-7402.02.cumulative.patch, 
> YARN-7708-YARN-7402.02.patch, YARN-7708-YARN-7402.03.cumulative.patch, 
> YARN-7708-YARN-7402.03.patch, YARN-7708-YARN-7402.03.patch, 
> YARN-7708-YARN-7402.04.cumulative.patch, YARN-7708-YARN-7402.04.patch, 
> YARN-7708-YARN-7402.04.patch, YARN-7708-YARN-7402.05.cumulative.patch, 
> YARN-7708-YARN-7402.05.patch, YARN-7708-YARN-7402.06.cumulative.patch, 
> YARN-7708-YARN-7402.07.cumulative.patch
>
>
> This policy reads load from the "pendingQueueLength" metric and scales it 
> into a set of weights that influence the AMRMProxy and Router 
> behaviors.






[jira] [Commented] (YARN-8129) Improve error message for invalid value in fields attribute

2018-08-15 Thread Suma Shivaprasad (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581269#comment-16581269
 ] 

Suma Shivaprasad commented on YARN-8129:


Thanks for the patch, [~abmodi]. Patch LGTM. +1

> Improve error message for invalid value in fields attribute
> ---
>
> Key: YARN-8129
> URL: https://issues.apache.org/jira/browse/YARN-8129
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Reporter: Charan Hebri
>Assignee: Abhishek Modi
>Priority: Minor
> Attachments: YARN-8129.001.patch
>
>
> A query with an invalid value for the 'fields' attribute returns a message that 
> isn't very informative.
> Reader log,
> {noformat}
> 2018-04-09 08:59:46,069 INFO  reader.TimelineReaderWebServices 
> (TimelineReaderWebServices.java:getEntities(595)) - Received URL 
> /ws/v2/timeline/users/hrt_qa/flows/test_flow/apps?limit=3=INFOS from 
> user hrt_qa
> 2018-04-09 08:59:46,070 INFO  reader.TimelineReaderWebServices 
> (TimelineReaderWebServices.java:handleException(173)) - Processed URL 
> /ws/v2/timeline/users/hrt_qa/flows/test_flow/apps?limit=3=INFOS but 
> encountered exception (Took 1 ms.){noformat}
> Here INFOS is the invalid value for the fields attribute.
> Response,
> {noformat}
> {
>   "exception": "BadRequestException",
>   "message": "java.lang.Exception: No enum constant 
> org.apache.hadoop.yarn.server.timelineservice.storage.TimelineReader.Field.INFOS",
>   "javaClassName": "org.apache.hadoop.yarn.webapp.BadRequestException"
> }{noformat}
> The message shouldn't ideally contain the enum information.
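
For illustration only, a hedged Java sketch of the friendlier message the 
description asks for; the Field enum below is a stand-in rather than the real 
TimelineReader.Field, and the attached patch may take a different approach.

{code:java}
import java.util.Arrays;

// Hypothetical sketch, not the YARN-8129 patch: report the offending value
// and the accepted values instead of the raw "No enum constant ..." text.
public class FieldsParserSketch {

  // Stand-in enum used only for this example.
  enum Field { ALL, EVENTS, INFO, METRICS, CONFIGS, RELATES_TO, IS_RELATED_TO }

  static Field parseField(String value) {
    try {
      return Field.valueOf(value.trim().toUpperCase());
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Invalid fields value '" + value
          + "'. Valid values are " + Arrays.toString(Field.values()));
    }
  }

  public static void main(String[] args) {
    try {
      parseField("INFOS");
    } catch (IllegalArgumentException e) {
      // Invalid fields value 'INFOS'. Valid values are [ALL, EVENTS, ...]
      System.out.println(e.getMessage());
    }
  }
}
{code}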



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8474) sleeper service fails to launch with "Authentication Required"

2018-08-15 Thread Billie Rinaldi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581217#comment-16581217
 ] 

Billie Rinaldi commented on YARN-8474:
--

I have done some testing with patch 4 and it looks pretty good. It needs some 
dependency cleanup, because the services-api module has a lot of undeclared 
dependencies (only some of which are introduced by this patch). Also, I would 
suggest using javax.ws.rs.core.HttpHeaders instead of the org.apache.http 
version, since we already have javax.ws.rs:jsr311-api as a dependency.
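
A minimal sketch of the suggested swap, assuming the javax.ws.rs:jsr311-api jar 
is already on the classpath as noted above; both constants resolve to the 
literal header name "Authorization", so this is purely a dependency cleanup.

{code:java}
import javax.ws.rs.core.HttpHeaders;

// Illustrative only: use the JAX-RS constant instead of
// org.apache.http.HttpHeaders.AUTHORIZATION, avoiding the extra dependency.
public class HeaderConstantSketch {
  public static void main(String[] args) {
    String headerName = HttpHeaders.AUTHORIZATION; // "Authorization"
    // A hypothetical header line; the token value is a placeholder.
    System.out.println(headerName + ": Negotiate <token>");
  }
}
{code}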

> sleeper service fails to launch with "Authentication Required"
> --
>
> Key: YARN-8474
> URL: https://issues.apache.org/jira/browse/YARN-8474
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Sumana Sathish
>Assignee: Eric Yang
>Priority: Critical
> Attachments: YARN-8474.001.patch, YARN-8474.002.patch, 
> YARN-8474.003.patch, YARN-8474.004.patch
>
>
> Sleeper job fails with Authentication required.
> {code}
> yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition 
> from local FS: /a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8656) container-executor should not write cgroup tasks files for docker containers

2018-08-15 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581110#comment-16581110
 ] 

Jim Brennan commented on YARN-8656:
---

I am unable to reproduce the unit test failure in 
TestContainerManager#testLocalingResourceWhileContainerRunning. I don't think 
it is related to my change.

> container-executor should not write cgroup tasks files for docker containers
> 
>
> Key: YARN-8656
> URL: https://issues.apache.org/jira/browse/YARN-8656
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8656.001.patch, YARN-8656.002.patch
>
>
> If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker 
> run}} to ensure that all processes for the container are placed into a cgroup 
> under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. 
> Docker creates a cgroup there with the docker container id as the name and 
> all of the processes in the container go into that cgroup.
> container-executor has code in {{launch_docker_container_as_user()}} that 
> then cherry-picks the PID of the docker container (usually the launch shell) 
> and writes that into the 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively 
> moving it from 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}.  So you end up with 
> one process out of the container in the {{container_id}} cgroup, and the rest 
> in the {{container_id/docker_container_id}} cgroup.
> Since we are passing the {{--cgroup-parent}} to docker, there is no need to 
> manually write the container pid to the tasks file - we can just remove the 
> code that does this in the docker case.
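
For context, a hedged, illustrative Java sketch of what passing 
{{--cgroup-parent}} to {{docker run}} amounts to; the class and method names 
are invented, and the actual change discussed here lives in the C 
container-executor rather than in Java code.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: with --cgroup-parent, docker itself places every
// process of the container under the YARN container's cgroup, so no extra
// write to the cgroup tasks file should be needed.
public class DockerCgroupParentSketch {

  static List<String> buildDockerRunArgs(String image, String containerId,
      String cgroupsHierarchy) {
    List<String> args = new ArrayList<>();
    args.add("docker");
    args.add("run");
    // e.g. --cgroup-parent=/hadoop-yarn/container_..._000002 (path assumed)
    args.add("--cgroup-parent=" + cgroupsHierarchy + "/" + containerId);
    args.add("--name=" + containerId);
    args.add(image);
    return args;
  }

  public static void main(String[] args) {
    System.out.println(String.join(" ", buildDockerRunArgs("centos:7",
        "container_e04_1534244457405_0004_01_000002", "/hadoop-yarn")));
  }
}
{code}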



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource

2018-08-15 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-8668:
-
Labels: capacityscheduler  (was: )

> Inconsistency between capacity and fair scheduler in the aspect of computing 
> node available resource
> 
>
> Key: YARN-8668
> URL: https://issues.apache.org/jira/browse/YARN-8668
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yeliang Cang
>Assignee: Yeliang Cang
>Priority: Major
>  Labels: capacityscheduler
> Attachments: YARN-8668.001.patch
>
>
> We have observed that, given the CapacityScheduler and the 
> DefaultResourceCalculator, a node with plenty of memory that is running a 
> heavy workload can end up reporting a negative number of available vcores!
> I noticed that CapacityScheduler.java uses the code below to calculate the 
> available resources for allocating containers:
> {code}
> if (calculator.computeAvailableContainers(Resources
>  .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
>  minimumAllocation) <= 0) {
>  if (LOG.isDebugEnabled()) {
>  LOG.debug("This node or this node partition doesn't have available or"
>  + "killable resource");
>  }
> {code}
> while the fair scheduler's FSAppAttempt.java uses a different check:
> {code}
> // Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
> ...
> }
> {code}
> Why the inconsistency? I think we should use 
> Resources.fitsIn(smaller, bigger) in the CapacityScheduler as well.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-08-15 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8664:

Description: 
ResourceManager logs about exception is:
{code:java}
2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
11.13.73.101:51083
java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
        at 
org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
        at 
com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
        at 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
        at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
        at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
        at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
{code}
ApplicationMasterService#allocate calls AllocateResponse#setUpdatedNodes 
when an NM is lost, and AllocateResponse#getProto calls ResourcePBImpl#getProto 
to convert NodeReportPBImpl#capacity into its PB format. Because ResourcePBImpl 
is not thread safe and multiple AMs call allocate at the same time, 
ResourcePBImpl#getProto may throw NullPointerException or 
UnsupportedOperationException.
I wrote test code that reproduces the exception.
{code:java}
@Test
public void testResource1() throws InterruptedException {
  // Share a single ResourcePBImpl across threads, the same way the RM shares
  // a NodeReport capacity while serializing responses for several AMs.
  ResourcePBImpl resource = (ResourcePBImpl) Resource.newInstance(1, 1);
  for (int i = 0; i < 10; i++) {
    Thread thread = new PBThread(resource);
    thread.setName("t" + i);
    thread.start();
  }
  Thread.sleep(1);
}

class PBThread extends Thread {
  ResourcePBImpl resourcePB;

  public PBThread(ResourcePBImpl resourcePB) {
    this.resourcePB = resourcePB;
  }

  @Override
  public void run() {
    // Concurrent getProto() calls race in mergeLocalToProto and eventually
    // throw NullPointerException or UnsupportedOperationException.
    while (true) {
      this.resourcePB.getProto();
    }
  }
}
{code}
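
A hedged sketch of one possible mitigation, not the attached patch: give each 
per-AM response its own copy of the shared Resource so that concurrent 
getProto() calls never touch the same ResourcePBImpl instance. The class and 
method names below are invented for illustration.

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;

// Illustrative sketch only: defensively copy the node capability before it is
// handed to a per-AM AllocateResponse, so serializing one response cannot
// race with another AM serializing the same underlying ResourcePBImpl.
public class DefensiveCopySketch {

  static Resource copyOf(Resource shared) {
    // Resource.newInstance creates a fresh record (backed by its own
    // ResourcePBImpl); only memory and vcores are carried over here.
    return Resource.newInstance(shared.getMemorySize(),
        shared.getVirtualCores());
  }

  public static void main(String[] args) {
    Resource shared = Resource.newInstance(4096, 8);
    System.out.println(copyOf(shared)); // e.g. <memory:4096, vCores:8>
  }
}
{code}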

  was:
ResourceManager logs about exception is:
{code:java}
2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
11.13.73.101:51083
java.lang.NullPointerException
  

[jira] [Commented] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource

2018-08-15 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580937#comment-16580937
 ] 

genericqa commented on YARN-8668:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 27m 
16s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 31m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m  1s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
17s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch 
failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
16s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch 
failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 16s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
 8s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red}  0m 
18s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch 
failed. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red}  3m 
44s{color} | {color:red} patch has errors when building and testing our client 
artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
18s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch 
failed. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
14s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch 
failed. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 18s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
35s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 79m 35s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8668 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12935673/YARN-8668.001.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux c83e944cc88c 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 
08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 8dc07b4 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| mvninstall | 
https://builds.apache.org/job/PreCommit-YARN-Build/21607/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
| compile | 

[jira] [Commented] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-08-15 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580902#comment-16580902
 ] 

Weiwei Yang commented on YARN-8664:
---

Hi [~yangjiandan], yeah, it seems like the Jenkins env is broken on this 
branch; not sure why. I will check with some other folks about this and will 
keep you posted!

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.001.pathch, 
> YARN-8664-branch-2.8.2.001.patch, YARN-8664-branch-2.8.2.002.patch
>
>
> ResourceManager logs about exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate calls AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto calls 
> ResourcePBImpl#getProto to convert NodeReportPBImpl#capacity into its PB 
> format. Because ResourcePBImpl is not thread safe and 
> multiple AMs call allocate at the same time, ResourcePBImpl#getProto may 
> throw NullPointerException or UnsupportedOperationException.
> I wrote a test code 

[jira] [Commented] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource

2018-08-15 Thread Yeliang Cang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580892#comment-16580892
 ] 

Yeliang Cang commented on YARN-8668:


Submitted a patch to resolve this!

> Inconsistency between capacity and fair scheduler in the aspect of computing 
> node available resource
> 
>
> Key: YARN-8668
> URL: https://issues.apache.org/jira/browse/YARN-8668
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yeliang Cang
>Assignee: Yeliang Cang
>Priority: Major
> Attachments: YARN-8668.001.patch
>
>
> We have observed that, given the CapacityScheduler and the 
> DefaultResourceCalculator, a node with plenty of memory that is running a 
> heavy workload can end up reporting a negative number of available vcores!
> I noticed that CapacityScheduler.java uses the code below to calculate the 
> available resources for allocating containers:
> {code}
> if (calculator.computeAvailableContainers(Resources
>  .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
>  minimumAllocation) <= 0) {
>  if (LOG.isDebugEnabled()) {
>  LOG.debug("This node or this node partition doesn't have available or"
>  + "killable resource");
>  }
> {code}
> while the fair scheduler's FSAppAttempt.java uses a different check:
> {code}
> // Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
> ...
> }
> {code}
> Why the inconsistency? I think we should use 
> Resources.fitsIn(smaller, bigger) in the CapacityScheduler as well.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource

2018-08-15 Thread Yeliang Cang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yeliang Cang updated YARN-8668:
---
Attachment: YARN-8668.001.patch

> Inconsistency between capacity and fair scheduler in the aspect of computing 
> node available resource
> 
>
> Key: YARN-8668
> URL: https://issues.apache.org/jira/browse/YARN-8668
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yeliang Cang
>Assignee: Yeliang Cang
>Priority: Major
> Attachments: YARN-8668.001.patch
>
>
> We have observed that, given the CapacityScheduler and the 
> DefaultResourceCalculator, a node with plenty of memory that is running a 
> heavy workload can end up reporting a negative number of available vcores!
> I noticed that CapacityScheduler.java uses the code below to calculate the 
> available resources for allocating containers:
> {code}
> if (calculator.computeAvailableContainers(Resources
>  .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
>  minimumAllocation) <= 0) {
>  if (LOG.isDebugEnabled()) {
>  LOG.debug("This node or this node partition doesn't have available or"
>  + "killable resource");
>  }
> {code}
> while the fair scheduler's FSAppAttempt.java uses a different check:
> {code}
> // Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
> ...
> }
> {code}
> Why the inconsistency? I think we should use 
> Resources.fitsIn(smaller, bigger) in the CapacityScheduler as well.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource

2018-08-15 Thread Yeliang Cang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yeliang Cang updated YARN-8668:
---
Description: 
We have observed that given capacityScheduler and defaultResourceCalculor,  
when there are many memory resources in a node, running heavy workload, then 
the available vcores of this node will be negative!

I noticed that in capacityScheduler.java, use code below to calculate the 
available resources for allocating containers:

{code}

if (calculator.computeAvailableContainers(Resources
 .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
 minimumAllocation) <= 0) {
 if (LOG.isDebugEnabled()) {
 LOG.debug("This node or this node partition doesn't have available or"
 + "killable resource");
 }

{code}

while in fairscheduler FsAppAttempt.java, similar code was found:

{code}

// Can we allocate a container on this node?
if (Resources.fitsIn(capability, available)) {

...

}

{code}

Why is the inconsistency? I think we should use 
Resources.fitsIn(smaller,bigger) instead in capacityScheduler !!!
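
A hedged Java sketch of the check argued for above; the attached 
YARN-8668.001.patch may differ, and the helper name hasRoomFor is invented 
purely for illustration.

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Illustrative sketch only: Resources.fitsIn compares memory and vcores
// independently, so a node that is out of vcores is rejected even when
// plenty of memory remains.
public class FitsInCheckSketch {

  static boolean hasRoomFor(Resource minimumAllocation, Resource available) {
    return Resources.fitsIn(minimumAllocation, available);
  }

  public static void main(String[] args) {
    Resource min = Resource.newInstance(1024, 1);
    Resource available = Resource.newInstance(262144, 0); // memory left, no vcores
    System.out.println(hasRoomFor(min, available)); // false
  }
}
{code}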

 

> Inconsistency between capacity and fair scheduler in the aspect of computing 
> node available resource
> 
>
> Key: YARN-8668
> URL: https://issues.apache.org/jira/browse/YARN-8668
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yeliang Cang
>Assignee: Yeliang Cang
>Priority: Major
>
> We have observed that, given the CapacityScheduler and the 
> DefaultResourceCalculator, a node with plenty of memory that is running a 
> heavy workload can end up reporting a negative number of available vcores!
> I noticed that CapacityScheduler.java uses the code below to calculate the 
> available resources for allocating containers:
> {code}
> if (calculator.computeAvailableContainers(Resources
>  .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
>  minimumAllocation) <= 0) {
>  if (LOG.isDebugEnabled()) {
>  LOG.debug("This node or this node partition doesn't have available or"
>  + "killable resource");
>  }
> {code}
> while the fair scheduler's FSAppAttempt.java uses a different check:
> {code}
> // Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
> ...
> }
> {code}
> Why the inconsistency? I think we should use 
> Resources.fitsIn(smaller, bigger) in the CapacityScheduler as well.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource

2018-08-15 Thread Yeliang Cang (JIRA)
Yeliang Cang created YARN-8668:
--

 Summary: Inconsistency between capacity and fair scheduler in the 
aspect of computing node available resource
 Key: YARN-8668
 URL: https://issues.apache.org/jira/browse/YARN-8668
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yeliang Cang
Assignee: Yeliang Cang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-08-15 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580881#comment-16580881
 ] 

Jiandan Yang  commented on YARN-8664:
-

[~cheersyang] Jenkins is probably not OK.
Would you please fix it?

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.001.pathch, 
> YARN-8664-branch-2.8.2.001.patch, YARN-8664-branch-2.8.2.002.patch
>
>
> ResourceManager logs about exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate calls AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto calls 
> ResourcePBImpl#getProto to convert NodeReportPBImpl#capacity into its PB 
> format. Because ResourcePBImpl is not thread safe and 
> multiple AMs call allocate at the same time, ResourcePBImpl#getProto may 
> throw NullPointerException or UnsupportedOperationException.
> I wrote test code that reproduces the exception.
> {code:java}
> @Test
>   public void testResource1() throws 

[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized

2018-08-15 Thread Chen Yufei (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580862#comment-16580862
 ] 

Chen Yufei commented on YARN-8513:
--

We hit the infinite loop twice recently with 2.9.1; restarting the 
ResourceManager fixed the issue again.

As the cause of the problem is still not clear, we have upgraded to Hadoop 
3.1.0. I'll post further updates if we encounter this issue again.

> CapacityScheduler infinite loop when queue is near fully utilized
> -
>
> Key: YARN-8513
> URL: https://issues.apache.org/jira/browse/YARN-8513
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.1
> Environment: Ubuntu 14.04.5
> YARN is configured with one label and 5 queues.
>Reporter: Chen Yufei
>Priority: Major
> Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, 
> jstack-5.log, top-during-lock.log, top-when-normal.log
>
>
> Sometimes the ResourceManager does not respond to any request when a queue is 
> nearly fully utilized. Sending SIGTERM won't stop the RM; only SIGKILL can. 
> After an RM restart, it can recover running jobs and start accepting new ones.
>  
> Seems like CapacityScheduler is in an infinite loop printing out the 
> following log messages (more than 25,000 lines in a second):
>  
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.99816763 
> absoluteUsedCapacity=0.99816763 used= 
> cluster=}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1530619767030_1652_01 
> container=null 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
>  clusterResource= type=NODE_LOCAL 
> requestedPartition=}}
>  
> I have encountered this problem several times after upgrading to YARN 2.9.1, 
> while the same configuration works fine under version 2.7.3.
>  
> YARN-4477 is an infinite loop bug in FairScheduler, not sure if this is a 
> similar problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts

2018-08-15 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580751#comment-16580751
 ] 

Rohith Sharma K S commented on YARN-8667:
-

Container relaunch shares the same working directory. As a result, the launch 
container script tries to create a symlink that already exists, leading to this 
issue. To debug the issue, follow the steps below:
 # Launch a sleeper service with the spec below. Note that I am providing a 
TARBALL artifact.
{code}
curl --negotiate -u: -H "Content-Type: application/json" -X POST 
http://localhost:8088/app/v1/services?user.name=yarn-ats -d '{
  "name": "sleeper",
  "version": "1.0.0",
  "queue": "default",
  "artifact": {
"id": "/mapreduce/mapreduce.tar.gz",
"type": "TARBALL"
  },
  "components" :
  [
{
  "name": "sleeper1",
  "number_of_containers": 1,
  "launch_command": "sleep infinity",
  "resource": {
"cpus": 1,
"memory": "2048"
  }
}
  ]
}'
{code}
# After the sleeper service is launched, go to the container's working 
directory. There you will see the files below:
{noformat}
[root@ctr-e138-1518143905142-431547-01-04 
container_e04_1534244457405_0004_01_02]# ll
total 24
-rw-r--r-- 1 yarn hadoop7 Aug 15 05:47 container_tokens
-rwx-- 1 yarn hadoop  656 Aug 15 05:47 default_container_executor_session.sh
-rwx-- 1 yarn hadoop  711 Aug 15 05:47 default_container_executor.sh
-rwx-- 1 yarn hadoop 3817 Aug 15 05:47 launch_container.sh
lrwxrwxrwx 1 yarn hadoop  107 Aug 15 05:47 lib -> 
/hadoop/yarn/local/usercache/yarn-ats/appcache/application_1534244457405_0004/filecache/10/mapreduce.tar.gz
drwx--x--- 2 yarn hadoop 4096 Aug 15 05:47 tmp
{noformat}
# You can try to execute launch_container.sh again manually, which fails with 
the error below:
{code:}
find: File system loop detected; ‘./lib/mapreduce.tar.gz’ is part of the same 
file system loop as ‘./lib’.{code}

During container relaunch, the same working directory is shared and 
launch_container.sh is executed again. This causes the error above, which 
terminates launch_container.sh with exit code 1.
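
A hedged, illustrative Java sketch of one way to avoid the collision, not the 
eventual fix: only create the resource symlink when it is not already present 
in the shared work directory. The class and method names are invented.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only: a relaunch reuses the container work directory,
// so skip links that the previous launch already created instead of failing.
public class RelaunchSymlinkSketch {

  static void linkIfAbsent(Path linkInWorkDir, Path localizedResource)
      throws IOException {
    // Check the link itself (not its target), so a dangling link also counts
    // as already present and is not re-created.
    if (!Files.isSymbolicLink(linkInWorkDir)) {
      Files.createSymbolicLink(linkInWorkDir, localizedResource);
    }
  }
}
{code}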

> Container Relaunch fails with "find: File system loop detected;" for tar ball 
> artifacts
> ---
>
> Key: YARN-8667
> URL: https://issues.apache.org/jira/browse/YARN-8667
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Major
>
> The service is launched with a TARBALL artifact. If a container exits for any 
> reason, the container relaunch policy tries to relaunch the container on the 
> same node with the same container workspace. As a result, the container 
> relaunch keeps failing.
> If the container relaunch max-retry policy is disabled, the container is never 
> launched on any other node; it keeps retrying on the same node manager and 
> never succeeds.
> {code}
> Relaunching Container container_e05_1533635581781_0001_01_02. Remaining 
> retry attempts(after relaunch) : -4816.
> {code}
> There are two issues:
> # Container relaunch keeps failing
> # The log message is misleading



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts

2018-08-15 Thread Rohith Sharma K S (JIRA)
Rohith Sharma K S created YARN-8667:
---

 Summary: Container Relaunch fails with "find: File system loop 
detected;" for tar ball artifacts
 Key: YARN-8667
 URL: https://issues.apache.org/jira/browse/YARN-8667
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Rohith Sharma K S


The service is launched with a TARBALL artifact. If a container exits for any 
reason, the container relaunch policy tries to relaunch the container on the 
same node with the same container workspace. As a result, the container 
relaunch keeps failing.

If the container relaunch max-retry policy is disabled, the container is never 
launched on any other node; it keeps retrying on the same node manager and 
never succeeds.
{code}
Relaunching Container container_e05_1533635581781_0001_01_02. Remaining 
retry attempts(after relaunch) : -4816.
{code}

There are two issues:
# Container relaunch keeps failing
# The log message is misleading



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org