[jira] [Created] (YARN-8669) Yarn application has already ended! It might have been killed or unable to launch application master.
Bheemidi Vikram Reddy created YARN-8669: --- Summary: Yarn application has already ended! It might have been killed or unable to launch application master. Key: YARN-8669 URL: https://issues.apache.org/jira/browse/YARN-8669 Project: Hadoop YARN Issue Type: Bug Components: applications/unmanaged-AM-launcher Affects Versions: 2.7.3 Environment: Ubuntu-16.04 RAM-32gb Cores-8 Reporter: Bheemidi Vikram Reddy Attachments: yarn-testuser-resourcemanager-coea18.log When I submit a Spark job to the YARN cluster through a Zeppelin notebook, the AM is being killed. So please, can someone help me with the YARN configuration? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-8613) Old RM UI shows wrong vcores total value
[ https://issues.apache.org/jira/browse/YARN-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sen Zhao reassigned YARN-8613: -- Assignee: (was: Sen Zhao) > Old RM UI shows wrong vcores total value > > > Key: YARN-8613 > URL: https://issues.apache.org/jira/browse/YARN-8613 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Akhil PB >Priority: Major > Attachments: Screen Shot 2018-08-02 at 12.12.41 PM.png, Screen Shot > 2018-08-02 at 12.16.53 PM.png, YARN-8613.001.patch > >
[jira] [Commented] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource
[ https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581866#comment-16581866 ] Yeliang Cang commented on YARN-8668: Thanks [~leftnoteasy] for clarifying this; closing this Jira as not a problem! > Inconsistency between capacity and fair scheduler in the aspect of computing > node available resource > > > Key: YARN-8668 > URL: https://issues.apache.org/jira/browse/YARN-8668 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yeliang Cang >Assignee: Yeliang Cang >Priority: Major > Labels: capacityscheduler > Attachments: YARN-8668.001.patch > > > We have observed that, given CapacityScheduler and DefaultResourceCalculator, > when a node has plenty of memory and is running a heavy workload, the > available vcores of the node can become negative! > I noticed that CapacityScheduler.java uses the code below to calculate the > available resources for allocating containers: > {code} > if (calculator.computeAvailableContainers(Resources > .add(node.getUnallocatedResource(), node.getTotalKillableResources()), > minimumAllocation) <= 0) { > if (LOG.isDebugEnabled()) { > LOG.debug("This node or this node partition doesn't have available or" > + "killable resource"); > } > {code} > while in the fair scheduler's FSAppAttempt.java, similar code was found: > {code} > // Can we allocate a container on this node? > if (Resources.fitsIn(capability, available)) { > ... > } > {code} > Why is there this inconsistency? I think we should use > Resources.fitsIn(smaller, bigger) in CapacityScheduler instead! >
[jira] [Commented] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts
[ https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581842#comment-16581842 ] genericqa commented on YARN-8667: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 26s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 17s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 54s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 22s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 1 new + 63 unchanged - 1 fixed = 64 total (was 64) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 30s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 18m 58s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 26s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 75m 29s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 | | JIRA Issue | YARN-8667 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12935786/YARN-8667.001.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 397b6ab4ff6a 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 7dc79a8 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/21610/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21610/testReport/ | | Max. process+thread count | 448 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U:
[jira] [Updated] (YARN-8662) Fair Scheduler stops scheduling when a queue is configured with only CPU and memory
[ https://issues.apache.org/jira/browse/YARN-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sen Zhao updated YARN-8662: --- Component/s: fairscheduler > Fair Scheduler stops scheduling when a queue is configured with only CPU and > memory > -- > > Key: YARN-8662 > URL: https://issues.apache.org/jira/browse/YARN-8662 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Sen Zhao >Assignee: Sen Zhao >Priority: Major > Attachments: NonResourceToSchedule.png, YARN-8662.001.patch > > > Add a new resource type in resource-types.xml, e.g. resource1. > In the Fair Scheduler, when a queue's maxResources is configured like: > {code}4096 mb, 4 vcores{code} > and an application is submitted which needs resources like: > {code} 1536 mb, 1 vcores, 10 resource1{code} > the application will be pending, because there is no resource1 in this queue.
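The configuration described in the report above might look like the following sketch. The file and element names follow standard Hadoop conventions, but the queue name and resource values are illustrative, and whether a fair-scheduler maxResources string may name a custom resource type depends on the fix under discussion:

```xml
<!-- resource-types.xml: declare the custom resource type -->
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>resource1</value>
  </property>
</configuration>

<!-- fair-scheduler.xml: a queue cap that also names resource1 -->
<allocations>
  <queue name="root.q1">
    <maxResources>4096 mb, 4 vcores, 10 resource1</maxResources>
  </queue>
</allocations>
```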
[jira] [Updated] (YARN-8597) Build Worker utility for MaWo Application
[ https://issues.apache.org/jira/browse/YARN-8597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yesha Vora updated YARN-8597: - Attachment: YARN-8597.001.patch > Build Worker utility for MaWo Application > - > > Key: YARN-8597 > URL: https://issues.apache.org/jira/browse/YARN-8597 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Yesha Vora >Assignee: Yesha Vora >Priority: Major > Attachments: YARN-8597.001.patch > > > The worker is responsible for executing Tasks. > * Worker > ** Create a worker class which drives the worker life cycle > ** Create the WorkAssignment protocol. It should handle registering/deregistering > workers and sending heartbeats > ** Lifecycle: Register worker, Run Setup Task, Get Task from master and > execute it using TaskRunner, Run Teardown Task > * TaskRunner > ** Simple Task Runner: This runner should be able to execute a simple task > ** Composite Task Runner: This runner should be able to execute a composite > task > * TaskWallTimeLimiter > ** Create a utility which can abort a task if the execution time exceeds the > task timeout. >
[jira] [Commented] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts
[ https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581802#comment-16581802 ] Chandni Singh commented on YARN-8667: - Patch 1 contains a fix and a unit test. [~billie.rinaldi] [~eyang] please review. > Container Relaunch fails with "find: File system loop detected;" for tar ball > artifacts > --- > > Key: YARN-8667 > URL: https://issues.apache.org/jira/browse/YARN-8667 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8667.001.patch > > > Service is launched with TAR BALL artifacts. If a container exits for any > reason, the container relaunch policy tries to relaunch the container on the same > node with the same container work space. As a result, container relaunch keeps > failing. > If the container relaunch max-retry policy is disabled, then the container is > never launched on any other node either; it keeps retrying on the same node > manager, which never succeeds. > {code} > Relaunching Container container_e05_1533635581781_0001_01_02. Remaining > retry attempts(after relaunch) : -4816. > {code} > There are two issues > # Container relaunch keeps failing > # The log message is misleading
[jira] [Updated] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts
[ https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8667: Attachment: YARN-8667.001.patch > Container Relaunch fails with "find: File system loop detected;" for tar ball > artifacts > --- > > Key: YARN-8667 > URL: https://issues.apache.org/jira/browse/YARN-8667 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8667.001.patch > > > Service is launched with TAR BALL artifacts. If a container exits for any > reason, the container relaunch policy tries to relaunch the container on the same > node with the same container work space. As a result, container relaunch keeps > failing. > If the container relaunch max-retry policy is disabled, then the container is > never launched on any other node either; it keeps retrying on the same node > manager, which never succeeds. > {code} > Relaunching Container container_e05_1533635581781_0001_01_02. Remaining > retry attempts(after relaunch) : -4816. > {code} > There are two issues > # Container relaunch keeps failing > # The log message is misleading
[jira] [Assigned] (YARN-8569) Create an interface to provide cluster information to application
[ https://issues.apache.org/jira/browse/YARN-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang reassigned YARN-8569: --- Assignee: Eric Yang > Create an interface to provide cluster information to application > - > > Key: YARN-8569 > URL: https://issues.apache.org/jira/browse/YARN-8569 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: Docker > > Some programs require container hostnames to be known for the application to run. > For example, distributed TensorFlow requires a launch_command that looks like: > {code} > # On ps0.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=ps --task_index=0 > # On ps1.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=ps --task_index=1 > # On worker0.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=worker --task_index=0 > # On worker1.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=worker --task_index=1 > {code} > This is a bit cumbersome to orchestrate via Distributed Shell or the YARN > services launch_command. In addition, the dynamic parameters do not work > with the YARN flex command. This is a classic pain point for application > developers attempting to automate system environment settings as parameters to the > end user application. > It would be great if the YARN Docker integration could provide a simple option to > expose the hostnames of the yarn service via a mounted file. The file content > gets updated when a flex command is performed. 
This allows application > developers to consume system environment settings via a standard interface. > It is like /proc/devices on Linux, but for Hadoop. This may involve > updating a file in the distributed cache, and allowing mounting of the file via > container-executor.
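As a thought experiment, an application could consume such a mounted cluster-info file roughly as sketched below. The file format here (one "component host:port" pair per line) and the class/method names are assumptions for illustration, not a format defined by YARN:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turn lines of a mounted cluster-info file into
// TensorFlow-style host flags such as --ps_hosts and --worker_hosts.
public class ClusterInfoSketch {
    static Map<String, String> toHostFlags(List<String> lines) {
        // Group host:port entries by component name, preserving order.
        Map<String, List<String>> byComponent = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            byComponent.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        // Render each component as a single comma-separated flag value.
        Map<String, String> flags = new LinkedHashMap<>();
        byComponent.forEach((comp, hosts) ->
            flags.put("--" + comp + "_hosts", String.join(",", hosts)));
        return flags;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "ps ps0.example.com:2222",
            "ps ps1.example.com:2222",
            "worker worker0.example.com:2222");
        System.out.println(toHostFlags(lines));
        // {--ps_hosts=ps0.example.com:2222,ps1.example.com:2222, --worker_hosts=worker0.example.com:2222}
    }
}
```

Because the mounted file is rewritten on flex, re-reading it at task startup would pick up the current membership without any per-instance launch_command templating.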
[jira] [Commented] (YARN-8488) YARN service/components/instances should have SUCCEEDED/FAILED states
[ https://issues.apache.org/jira/browse/YARN-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581598#comment-16581598 ] Eric Yang commented on YARN-8488: - [~suma.shivaprasad], thank you for the patch. A few minor nitpicks: # Introduce a synchronized boolean getTimelineServiceEnabled method to make this class thread safe. # The hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/component/Component.java change is unnecessary. # ComponentInstance.java, near line 265, } else { # It might be useful to pass in a real diagnostic string to handleComponentInstanceRelaunch to make sure the downstream classes aren't failing due to an NPE. The new state works fine. > YARN service/components/instances should have SUCCEEDED/FAILED states > - > > Key: YARN-8488 > URL: https://issues.apache.org/jira/browse/YARN-8488 > Project: Hadoop YARN > Issue Type: Task > Components: yarn-native-services >Reporter: Wangda Tan >Assignee: Suma Shivaprasad >Priority: Major > Attachments: YARN-8488.1.patch, YARN-8488.2.patch, YARN-8488.3.patch, > YARN-8488.4.patch, YARN-8488.5.patch > > > Existing YARN service has the following states: > {code} > public enum ServiceState { > ACCEPTED, STARTED, STABLE, STOPPED, FAILED, FLEX, UPGRADING, > UPGRADING_AUTO_FINALIZE; > } > {code} > Ideally we should add a "SUCCEEDED" state in order to support long running > applications like TensorFlow.
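The first nitpick above can be sketched as follows; the class and field names are illustrative stand-ins, not the actual patch code:

```java
// Hypothetical sketch of the synchronized accessor suggested above.
// Synchronizing both the setter and the getter ensures reads always
// observe the latest write across threads.
public class TimelineFlagSketch {
    private boolean timelineServiceEnabled;

    public synchronized void setTimelineServiceEnabled(boolean enabled) {
        this.timelineServiceEnabled = enabled;
    }

    public synchronized boolean getTimelineServiceEnabled() {
        return timelineServiceEnabled;
    }
}
```

A `volatile` field would give the same visibility guarantee for a lone boolean; `synchronized` is the safer choice when the flag is read or written together with other state under the same lock.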
[jira] [Comment Edited] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource
[ https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581574#comment-16581574 ] Wangda Tan edited comment on YARN-8668 at 8/15/18 8:34 PM: --- Thanks [~Cyl] for reporting the issue; this is by design in CS. Using computeAvailableContainers gets the correct result with both DominantResourceCalculator and DefaultResourceCalculator enabled. Using fitsIn(res, res) only works when DominantResourceCalculator is enabled. To me, the correct solution is to use fitsIn(resourceCalculator, res, res). I don't think a fix is required in CS. was (Author: leftnoteasy): Thanks [~Cyl] for reporting the issue, this is by design in CS. Using computeAvailableContainers can get correct result when both DominantResourceCalculator and DefaultResourceCalculator enabled. Using fitsIn only works when DominantResourceCalculator is enabled. I don't think fix required in CS. > Inconsistency between capacity and fair scheduler in the aspect of computing > node available resource > > > Key: YARN-8668 > URL: https://issues.apache.org/jira/browse/YARN-8668 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yeliang Cang >Assignee: Yeliang Cang >Priority: Major > Labels: capacityscheduler > Attachments: YARN-8668.001.patch > > > We have observed that, given CapacityScheduler and DefaultResourceCalculator, > when a node has plenty of memory and is running a heavy workload, the > available vcores of the node can become negative! 
> I noticed that CapacityScheduler.java uses the code below to calculate the > available resources for allocating containers: > {code} > if (calculator.computeAvailableContainers(Resources > .add(node.getUnallocatedResource(), node.getTotalKillableResources()), > minimumAllocation) <= 0) { > if (LOG.isDebugEnabled()) { > LOG.debug("This node or this node partition doesn't have available or" > + "killable resource"); > } > {code} > while in the fair scheduler's FSAppAttempt.java, similar code was found: > {code} > // Can we allocate a container on this node? > if (Resources.fitsIn(capability, available)) { > ... > } > {code} > Why is there this inconsistency? I think we should use > Resources.fitsIn(smaller, bigger) in CapacityScheduler instead! >
[jira] [Commented] (YARN-8509) Total pending resource calculation in preemption should use user-limit factor instead of minimum-user-limit-percent
[ https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581573#comment-16581573 ] Zian Chen commented on YARN-8509: - Discussed offline with Eric and Wangda; will upload a new patch to verify that the algorithm we provided here works as expected and does not cause any over-preemption. > Total pending resource calculation in preemption should use user-limit factor > instead of minimum-user-limit-percent > --- > > Key: YARN-8509 > URL: https://issues.apache.org/jira/browse/YARN-8509 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > Attachments: YARN-8509.001.patch, YARN-8509.002.patch, > YARN-8509.003.patch > > > In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total > pending resource based on the user-limit percent and user-limit factor, which cap > the pending resource for each user to the minimum of the user-limit pending and > actual pending. This prevents the queue from taking more pending resource to > achieve queue balance after all queues are satisfied with their ideal allocation. > > We need to change the logic to let queue pending resources go beyond the user limit.
[jira] [Commented] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource
[ https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581574#comment-16581574 ] Wangda Tan commented on YARN-8668: -- Thanks [~Cyl] for reporting the issue; this is by design in CS. Using computeAvailableContainers gets the correct result with both DominantResourceCalculator and DefaultResourceCalculator enabled. Using fitsIn only works when DominantResourceCalculator is enabled. I don't think a fix is required in CS. > Inconsistency between capacity and fair scheduler in the aspect of computing > node available resource > > > Key: YARN-8668 > URL: https://issues.apache.org/jira/browse/YARN-8668 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yeliang Cang >Assignee: Yeliang Cang >Priority: Major > Labels: capacityscheduler > Attachments: YARN-8668.001.patch > > > We have observed that, given CapacityScheduler and DefaultResourceCalculator, > when a node has plenty of memory and is running a heavy workload, the > available vcores of the node can become negative! > I noticed that CapacityScheduler.java uses the code below to calculate the > available resources for allocating containers: > {code} > if (calculator.computeAvailableContainers(Resources > .add(node.getUnallocatedResource(), node.getTotalKillableResources()), > minimumAllocation) <= 0) { > if (LOG.isDebugEnabled()) { > LOG.debug("This node or this node partition doesn't have available or" > + "killable resource"); > } > {code} > while in the fair scheduler's FSAppAttempt.java, similar code was found: > {code} > // Can we allocate a container on this node? > if (Resources.fitsIn(capability, available)) { > ... > } > {code} > Why is there this inconsistency? I think we should use > Resources.fitsIn(smaller, bigger) in CapacityScheduler instead! >
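The calculator behavior discussed in this thread can be illustrated with a toy model. These classes are simplified stand-ins, not the actual Hadoop Resources/ResourceCalculator APIs: a memory-only check (the DefaultResourceCalculator style) can still report headroom when vcores are exhausted, which is how a node's available vcores can go negative, while an all-dimension fitsIn-style check rejects the allocation:

```java
// Toy model of the two availability checks discussed above.
public class SchedulerCheckSketch {
    // A node with 8192 MB free but 0 free vcores.
    static final long FREE_MB = 8192;
    static final int FREE_VCORES = 0;

    // Memory-only check: ignores vcores entirely, like a
    // DefaultResourceCalculator-based computeAvailableContainers.
    static boolean memoryOnlyCheck(long reqMb, int reqVcores) {
        return FREE_MB / reqMb > 0;
    }

    // All-dimension check: compares every resource, like fitsIn under
    // DominantResourceCalculator, so exhausted vcores block allocation.
    static boolean fitsIn(long reqMb, int reqVcores) {
        return reqMb <= FREE_MB && reqVcores <= FREE_VCORES;
    }

    public static void main(String[] args) {
        // Request: 1024 MB, 1 vcore.
        System.out.println(memoryOnlyCheck(1024, 1)); // true: would drive vcores negative
        System.out.println(fitsIn(1024, 1));          // false: no vcores left
    }
}
```

This is why the thread concludes that the right fix, if any, is a comparison parameterized by the configured calculator rather than a plain per-dimension fitsIn.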
[jira] [Commented] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581558#comment-16581558 ] genericqa commented on YARN-8474: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 34s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 32m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 54s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 34s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 12s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-api: The patch generated 19 new + 4 unchanged - 0 fixed = 23 total (was 4) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 36s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 42s{color} | {color:green} hadoop-yarn-services-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 64m 38s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 | | JIRA Issue | YARN-8474 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12935740/YARN-8474.006.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient xml findbugs checkstyle | | uname | Linux 6706e194e545 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / d951af2 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/21609/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-api.txt | | Test Results |
[jira] [Commented] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts
[ https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581548#comment-16581548 ] Billie Rinaldi commented on YARN-8667: -- That sounds like the issue. Thanks for figuring out the problem, [~csingh]! It will be good to get this bug fixed. > Container Relaunch fails with "find: File system loop detected;" for tar ball > artifacts > --- > > Key: YARN-8667 > URL: https://issues.apache.org/jira/browse/YARN-8667 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Chandni Singh >Priority: Major > > Service is launched with TAR BALL artifacts. If a container exits for any > reason, the container relaunch policy tries to relaunch the container on the same > node with the same container work space. As a result, container relaunch keeps > failing. > If the container relaunch max-retry policy is disabled, then the container is > never launched on any other node either; it keeps retrying on the same node > manager, which never succeeds. > {code} > Relaunching Container container_e05_1533635581781_0001_01_02. Remaining > retry attempts(after relaunch) : -4816. > {code} > There are two issues > # Container relaunch keeps failing > # The log message is misleading
[jira] [Commented] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts
[ https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581524#comment-16581524 ] Chandni Singh commented on YARN-8667: - Before relaunch, the container script and container tokens file are deleted from the container's working directory. {code:java} protected void cleanupContainerFiles(Path containerWorkDir) { LOG.debug("cleanup container {} files", containerWorkDir); // delete ContainerScriptPath deleteAsUser(new Path(containerWorkDir, CONTAINER_SCRIPT)); // delete TokensPath deleteAsUser(new Path(containerWorkDir, FINAL_CONTAINER_TOKENS_FILE)); }{code} It seems we might have to delete any symlinks from the container's working directory as well? cc. [~billie.rinaldi] [~shaneku...@gmail.com] [~eyang] > Container Relaunch fails with "find: File system loop detected;" for tar ball > artifacts > --- > > Key: YARN-8667 > URL: https://issues.apache.org/jira/browse/YARN-8667 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Chandni Singh >Priority: Major > > Service is launched with TAR BALL artifacts. If a container exits for any > reason, the container relaunch policy tries to relaunch the container on the same > node with the same container work space. As a result, container relaunch keeps > failing. > If the container relaunch max-retry policy is disabled, then the container is > never launched on any other node either; it keeps retrying on the same node > manager, which never succeeds. > {code} > Relaunching Container container_e05_1533635581781_0001_01_02. Remaining > retry attempts(after relaunch) : -4816. > {code} > There are two issues > # Container relaunch keeps failing > # The log message is misleading
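The extra cleanup step suggested in the comment above could look roughly like the sketch below. It uses plain java.nio.file rather than Hadoop's own file APIs, and the method name and placement are illustrative, not the actual patch:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: before relaunch, remove symlinks from the
// container work dir (e.g. links into localized tar-ball directories)
// so a later "find" over the work dir cannot loop through them.
public class SymlinkCleanupSketch {
    static int deleteSymlinks(Path workDir) throws IOException {
        int deleted = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(workDir)) {
            for (Path entry : entries) {
                // Delete only the link itself; the link target (the
                // localized resource) must survive for the relaunch.
                if (Files.isSymbolicLink(entry)) {
                    Files.delete(entry);
                    deleted++;
                }
            }
        }
        return deleted;
    }
}
```

Note that Files.delete on a symlink removes the link, never the target, which is the behavior needed here: the localized artifacts stay in place and only the stale per-launch links are cleared.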
[jira] [Updated] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Billie Rinaldi updated YARN-8474: - Attachment: YARN-8474.006.patch > sleeper service fails to launch with "Authentication Required" > -- > > Key: YARN-8474 > URL: https://issues.apache.org/jira/browse/YARN-8474 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.1.0 >Reporter: Sumana Sathish >Assignee: Billie Rinaldi >Priority: Critical > Attachments: YARN-8474.001.patch, YARN-8474.002.patch, > YARN-8474.003.patch, YARN-8474.004.patch, YARN-8474.005.patch, > YARN-8474.006.patch > > > Sleeper job fails with Authentication required. > {code} > yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition > from local FS: /a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581488#comment-16581488 ] Billie Rinaldi commented on YARN-8474: -- Patch 6 fixes checkstyle issues. > sleeper service fails to launch with "Authentication Required" > -- > > Key: YARN-8474 > URL: https://issues.apache.org/jira/browse/YARN-8474 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.1.0 >Reporter: Sumana Sathish >Assignee: Billie Rinaldi >Priority: Critical > Attachments: YARN-8474.001.patch, YARN-8474.002.patch, > YARN-8474.003.patch, YARN-8474.004.patch, YARN-8474.005.patch, > YARN-8474.006.patch > > > Sleeper job fails with Authentication required. > {code} > yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition > from local FS: /a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581470#comment-16581470 ] genericqa commented on YARN-8474: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 21s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 28m 54s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 31s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 17s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 10s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-api: The patch generated 15 new + 4 unchanged - 0 fixed = 19 total (was 4) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 53s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 13s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 42s{color} | {color:green} hadoop-yarn-services-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 57m 0s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 | | JIRA Issue | YARN-8474 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12935725/YARN-8474.005.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient xml findbugs checkstyle | | uname | Linux 134c2a1fa8d2 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / c918d88 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/21608/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-api.txt | | Test Results |
[jira] [Assigned] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts
[ https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh reassigned YARN-8667: --- Assignee: Chandni Singh > Container Relaunch fails with "find: File system loop detected;" for tar ball > artifacts > --- > > Key: YARN-8667 > URL: https://issues.apache.org/jira/browse/YARN-8667 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Chandni Singh >Priority: Major > > Service is launched with tar ball artifacts. If a container exits for any reason, the container relaunch policy tries to relaunch the container on the same node with the same container workspace. As a result, container relaunch keeps failing. > If the container relaunch max-retry policy is disabled, the container is never launched on any other node either; it keeps retrying on the same node manager, which never succeeds. > {code} > Relaunching Container container_e05_1533635581781_0001_01_02. Remaining > retry attempts(after relaunch) : -4816. > {code} > There are two issues: > # Container relaunch keeps failing > # The log message is misleading
[jira] [Commented] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery
[ https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581419#comment-16581419 ] Jason Lowe commented on YARN-8242: -- Thanks for updating the patch! bq. The problem/issue that I faced with that is seeking/skipping to next user entry in the localization state is complex, as we do not know who next user is or how much information (key/values) is associated with a respective user without iterating. Rather than a full re-iteration, we can seek to a key that we know is after a user's localization entries but necessarily before any other user's entry. Seeking is very fast and done all the time during recovery, so it would be much faster than iterating. For example, userA's private localization entries will have a key prefix of "Localization/private/userA/" and have entries with a prefix of either "Localization/private/userA/filecache/" or "Localization/private/userA/appcache/". If we seek to a key that occurs lexicographically after those prefixes, like "Localization/private/userA/zzz", then we will have an iterator starting after the localization records for userA but necessarily before any user that occurs after userA lexicographically. That avoids the double-iteration performance problem and does not rely on approaches that would require the previous user's iterator to be fully consumed to function properly. bq. So, reading LocalResourceTrackerState might require two different keys. Yes, one way to solve that is to have two iterators for the two payloads, one for completed resources and one for started resources. We know the prefix to seek for on each one, so they are easy to set up. It's a bit trickier to do the full iteration for localized resource state, but it should be possible. I would be fine with punting that to a followup JIRA since this current work is still a significant improvement over the old method of loading everything at once. 
Other comments on the patch: getLeveldbIterator calls constructors and methods that can throw DBException, which is a runtime exception. Those need to be caught and translated to IOException as was done with iterators before this patch. Some lines were reformatted to split else blocks onto separate lines and remove spaces before opening braces, which is inconsistent with the coding style. New methods and conditionals were added without whitespace between the parameters and the opening brace. Checkstyle is currently passing with false positives, otherwise I would expect it to complain. typo: getConstainerStateIterator Rather than redundantly re-parsing a container ID from the key, it would be cleaner and more intuitive to have RecoveredContainerState track the container ID. RecoveredContainerState didn't need to explicitly track it before since it was always paired with a container ID in a map, but now that we're returning a series of objects via an iterator it makes sense to move that key into the value object, in this case the RecoveredContainerState. This comment was not addressed; was that intentional? bq. Nit: RCSIterator would be more readable as ContainerStateIterator, e.g.: getContainerStateIterator instead of getRCSIterator. Similar comments for the other acronym iterator classes. getNextRecoveredLocalizationEntry implies it could be called for all types of localization entries but it only works for private resources. The name should reflect that or it could simply be pulled into RURIterator#getNextItem directly. getMasterKey is more complicated than it needs to be. 
No iterator needed since we can lookup keys in the database directly, e.g.: {code} private MasterKey getMasterKey(String dbKey) throws IOException { byte[] data = db.get(bytes(dbKey)); if (data == null || data.length == 0) { return null; } return parseMasterKey(data); } {code} The synchronization on the various load methods for the memory state store is a false promise of safety as they return iterators that can access state asynchronously with other state store operations. For real safety here it would need to return an iterator on a copy of the underlying state rather than an iterator on the state directly. leveldb is async-safe but the memory store is not. Why does TestNMLeveldbStateStoreService#loadContainersState explicitly check for and skip recovered containers without a start request? Isn't it the job of the iterator to not return those types of entries? > YARN NM: OOM error while reading back the state store on recovery > - > > Key: YARN-8242 > URL: https://issues.apache.org/jira/browse/YARN-8242 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 2.6.0, 2.9.0, 2.6.5, 2.8.3, 3.1.0, 2.7.6, 3.0.2 >Reporter: Kanwaljeet
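The seek-past-the-user idea from the comment above can be modeled with a sorted map standing in for the leveldb key space; the keys are illustrative and the map is only a stand-in for a leveldb iterator seek:

```java
import java.util.Map;
import java.util.NavigableMap;

// Models seeking past one user's localization records: a key like
// "Localization/private/<user>/zzz" sorts after that user's filecache/
// and appcache/ entries but before the next user's prefix, so a single
// seek lands on the next user's first record without iterating.
public class SeekSketch {
  // Returns the first key at or after the synthetic seek key, i.e. the
  // start of the next user's records; null if no user follows.
  public static String seekPastUser(NavigableMap<String, byte[]> db, String user) {
    String seekKey = "Localization/private/" + user + "/zzz";
    Map.Entry<String, byte[]> next = db.ceilingEntry(seekKey);
    return next == null ? null : next.getKey();
  }
}
```

With a real leveldb iterator the same effect comes from seek(bytes(seekKey)); the sorted-map ceilingEntry is just the in-memory analogue.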
[jira] [Updated] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Billie Rinaldi updated YARN-8474: - Attachment: YARN-8474.005.patch > sleeper service fails to launch with "Authentication Required" > -- > > Key: YARN-8474 > URL: https://issues.apache.org/jira/browse/YARN-8474 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.1.0 >Reporter: Sumana Sathish >Assignee: Eric Yang >Priority: Critical > Attachments: YARN-8474.001.patch, YARN-8474.002.patch, > YARN-8474.003.patch, YARN-8474.004.patch, YARN-8474.005.patch > > > Sleeper job fails with Authentication required. > {code} > yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition > from local FS: /a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581410#comment-16581410 ] Billie Rinaldi commented on YARN-8474: -- Attached patch 5 based on patch 4 plus dependency cleanup. > sleeper service fails to launch with "Authentication Required" > -- > > Key: YARN-8474 > URL: https://issues.apache.org/jira/browse/YARN-8474 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.1.0 >Reporter: Sumana Sathish >Assignee: Billie Rinaldi >Priority: Critical > Attachments: YARN-8474.001.patch, YARN-8474.002.patch, > YARN-8474.003.patch, YARN-8474.004.patch, YARN-8474.005.patch > > > Sleeper job fails with Authentication required. > {code} > yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition > from local FS: /a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Billie Rinaldi reassigned YARN-8474: Assignee: Billie Rinaldi (was: Eric Yang) > sleeper service fails to launch with "Authentication Required" > -- > > Key: YARN-8474 > URL: https://issues.apache.org/jira/browse/YARN-8474 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.1.0 >Reporter: Sumana Sathish >Assignee: Billie Rinaldi >Priority: Critical > Attachments: YARN-8474.001.patch, YARN-8474.002.patch, > YARN-8474.003.patch, YARN-8474.004.patch, YARN-8474.005.patch > > > Sleeper job fails with Authentication required. > {code} > yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition > from local FS: /a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7708) [GPG] Load based policy generator
[ https://issues.apache.org/jira/browse/YARN-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581348#comment-16581348 ] Botong Huang commented on YARN-7708: Committed to YARN-7402. Thanks [~youchen] for the patch! > [GPG] Load based policy generator > - > > Key: YARN-7708 > URL: https://issues.apache.org/jira/browse/YARN-7708 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Carlo Curino >Assignee: Young Chen >Priority: Major > Attachments: YARN-7708-YARN-7402.01.cumulative.patch, > YARN-7708-YARN-7402.01.patch, YARN-7708-YARN-7402.02.cumulative.patch, > YARN-7708-YARN-7402.02.patch, YARN-7708-YARN-7402.03.cumulative.patch, > YARN-7708-YARN-7402.03.patch, YARN-7708-YARN-7402.03.patch, > YARN-7708-YARN-7402.04.cumulative.patch, YARN-7708-YARN-7402.04.patch, > YARN-7708-YARN-7402.04.patch, YARN-7708-YARN-7402.05.cumulative.patch, > YARN-7708-YARN-7402.05.patch, YARN-7708-YARN-7402.06.cumulative.patch, > YARN-7708-YARN-7402.07.cumulative.patch > > > This policy reads load from the "pendingQueueLength" metrics and provides > scaling into a set of weights that influence the AMRMProxy and Router > behaviors. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7708) [GPG] Load based policy generator
[ https://issues.apache.org/jira/browse/YARN-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581323#comment-16581323 ] Young Chen commented on YARN-7708: -- Unit test failure is unrelated. > [GPG] Load based policy generator > - > > Key: YARN-7708 > URL: https://issues.apache.org/jira/browse/YARN-7708 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Carlo Curino >Assignee: Young Chen >Priority: Major > Attachments: YARN-7708-YARN-7402.01.cumulative.patch, > YARN-7708-YARN-7402.01.patch, YARN-7708-YARN-7402.02.cumulative.patch, > YARN-7708-YARN-7402.02.patch, YARN-7708-YARN-7402.03.cumulative.patch, > YARN-7708-YARN-7402.03.patch, YARN-7708-YARN-7402.03.patch, > YARN-7708-YARN-7402.04.cumulative.patch, YARN-7708-YARN-7402.04.patch, > YARN-7708-YARN-7402.04.patch, YARN-7708-YARN-7402.05.cumulative.patch, > YARN-7708-YARN-7402.05.patch, YARN-7708-YARN-7402.06.cumulative.patch, > YARN-7708-YARN-7402.07.cumulative.patch > > > This policy reads load from the "pendingQueueLength" metrics and provides > scaling into a set of weights that influence the AMRMProxy and Router > behaviors. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8129) Improve error message for invalid value in fields attribute
[ https://issues.apache.org/jira/browse/YARN-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581269#comment-16581269 ] Suma Shivaprasad commented on YARN-8129: Thanks for the patch [~abmodi] Patch LGTM . +1 > Improve error message for invalid value in fields attribute > --- > > Key: YARN-8129 > URL: https://issues.apache.org/jira/browse/YARN-8129 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Reporter: Charan Hebri >Assignee: Abhishek Modi >Priority: Minor > Attachments: YARN-8129.001.patch > > > Query with invalid values for the 'fields' attributes throws a message that > isn't very informative. > Reader log, > {noformat} > 2018-04-09 08:59:46,069 INFO reader.TimelineReaderWebServices > (TimelineReaderWebServices.java:getEntities(595)) - Received URL > /ws/v2/timeline/users/hrt_qa/flows/test_flow/apps?limit=3=INFOS from > user hrt_qa > 2018-04-09 08:59:46,070 INFO reader.TimelineReaderWebServices > (TimelineReaderWebServices.java:handleException(173)) - Processed URL > /ws/v2/timeline/users/hrt_qa/flows/test_flow/apps?limit=3=INFOS but > encountered exception (Took 1 ms.){noformat} > Here INFOS is the invalid value for the fields attribute. > Response, > {noformat} > { > "exception": "BadRequestException", > "message": "java.lang.Exception: No enum constant > org.apache.hadoop.yarn.server.timelineservice.storage.TimelineReader.Field.INFOS", > "javaClassName": "org.apache.hadoop.yarn.webapp.BadRequestException" > }{noformat} > The message shouldn't ideally contain the enum information. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
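A minimal sketch of the improvement discussed in YARN-8129: catch the raw "No enum constant" failure from Enum.valueOf and rephrase it so the response names the bad value and the attribute without leaking enum internals. The enum and class names below are illustrative, not the actual TimelineReader types:

```java
import java.util.Arrays;

// Sketch: parse the "fields" query attribute with a user-facing error message.
public class FieldsParser {
  enum Field { ALL, EVENTS, INFO, CONFIGS, METRICS, RELATES_TO, IS_RELATED_TO }

  public static Field parseField(String raw) {
    try {
      return Field.valueOf(raw.trim().toUpperCase());
    } catch (IllegalArgumentException e) {
      // Re-throw with a message that names the offending value and the
      // legal choices instead of the internal enum constant lookup failure.
      throw new IllegalArgumentException("Invalid value '" + raw
          + "' for the 'fields' attribute. Allowed values: "
          + Arrays.toString(Field.values()));
    }
  }
}
```

With this, a request like fields=INFOS yields "Invalid value 'INFOS' for the 'fields' attribute..." rather than the "No enum constant ...Field.INFOS" message shown in the report.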
[jira] [Commented] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581217#comment-16581217 ] Billie Rinaldi commented on YARN-8474: -- I have done some testing with patch 4 and it looks pretty good. It needs some dependency cleanup, because the services-api module has a lot of undeclared dependencies (only some of which are introduced by this patch). Also, I would suggest using javax.ws.rs.core.HttpHeaders instead of the org.apache.http version, since we already have javax.ws.rs:jsr311-api as a dependency. > sleeper service fails to launch with "Authentication Required" > -- > > Key: YARN-8474 > URL: https://issues.apache.org/jira/browse/YARN-8474 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.1.0 >Reporter: Sumana Sathish >Assignee: Eric Yang >Priority: Critical > Attachments: YARN-8474.001.patch, YARN-8474.002.patch, > YARN-8474.003.patch, YARN-8474.004.patch > > > Sleeper job fails with Authentication required. > {code} > yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition > from local FS: /a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8656) container-executor should not write cgroup tasks files for docker containers
[ https://issues.apache.org/jira/browse/YARN-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581110#comment-16581110 ] Jim Brennan commented on YARN-8656: --- I am unable to repro the unit test failure in TestContainerManager#testLocalingResourceWhileContainerRunning. I don't think it is related to my change. > container-executor should not write cgroup tasks files for docker containers > > > Key: YARN-8656 > URL: https://issues.apache.org/jira/browse/YARN-8656 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8656.001.patch, YARN-8656.002.patch > > > If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker > run}} to ensure that all processes for the container are placed into a cgroup > under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. > Docker creates a cgroup there with the docker container id as the name and > all of the processes in the container go into that cgroup. > container-executor has code in {{launch_docker_container_as_user()}} that > then cherry-picks the PID of the docker container (usually the launch shell) > and writes that into the > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively > moving it from > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. So you end up with > one process out of the container in the {{container_id}} cgroup, and the rest > in the {{container_id/docker_container_id}} cgroup. > Since we are passing the {{--cgroup-parent}} to docker, there is no need to > manually write the container pid to the tasks file - we can just remove the > code that does this in the docker case. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource
[ https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-8668: - Labels: capacityscheduler (was: ) > Inconsistency between capacity and fair scheduler in the aspect of computing > node available resource > > > Key: YARN-8668 > URL: https://issues.apache.org/jira/browse/YARN-8668 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yeliang Cang >Assignee: Yeliang Cang >Priority: Major > Labels: capacityscheduler > Attachments: YARN-8668.001.patch > > > We have observed that given capacityScheduler and defaultResourceCalculor, > when there are many memory resources in a node, running heavy workload, then > the available vcores of this node will be negative! > I noticed that in capacityScheduler.java, use code below to calculate the > available resources for allocating containers: > {code} > if (calculator.computeAvailableContainers(Resources > .add(node.getUnallocatedResource(), node.getTotalKillableResources()), > minimumAllocation) <= 0) { > if (LOG.isDebugEnabled()) { > LOG.debug("This node or this node partition doesn't have available or" > + "killable resource"); > } > {code} > while in fairscheduler FsAppAttempt.java, similar code was found: > {code} > // Can we allocate a container on this node? > if (Resources.fitsIn(capability, available)) { > ... > } > {code} > Why is the inconsistency? I think we should use > Resources.fitsIn(smaller,bigger) instead in capacityScheduler !!! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8664: Description: ResourceManager logs about exception is: {code:java} 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 11.13.73.101:51083 java.lang.NullPointerException at org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) at org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) at org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) at org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) {code} ApplicationMasterService#allocate calls AllocateResponse#setUpdatedNodes when an NM is lost, and AllocateResponse#getProto calls ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into PB format. Because ResourcePBImpl is not thread safe and multiple AMs call allocate at the same time, ResourcePBImpl#getProto may throw NullPointerException or UnsupportedOperationException. I wrote test code that can reproduce the exception. 
{code:java} @Test public void testResource1() throws InterruptedException { ResourcePBImpl resource = (ResourcePBImpl) Resource.newInstance(1, 1); for (int i =0;i<10;i++ ) { Thread thread = new PBThread(resource); thread.setName("t"+i); thread.start(); } Thread.sleep(1); } class PBThread extends Thread { ResourcePBImpl resourcePB; public PBThread(ResourcePBImpl resourcePB) { this.resourcePB = resourcePB; } @Override public void run() { while(true) { this.resourcePB.getProto(); } } } {code} was: ResourceManager logs about exception is: {code:java} 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 11.13.73.101:51083 java.lang.NullPointerException
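The failure mode above can be illustrated with a small model, not the actual Hadoop fix: a builder-backed object whose getProto() mutates shared state (merging local fields, then discarding the builder) is unsafe when one instance is reachable from several AM handler threads, and guarding the merge-and-build step with synchronization is one way to make it safe. StringBuilder here merely stands in for the protobuf builder:

```java
// Illustrative model of a PBImpl-style lazily built proto. The unsynchronized
// version of getProto() can NPE when thread A nulls the builder while thread B
// is mid-merge, which mirrors the stack trace reported above.
public class SafeProtoHolder {
  private StringBuilder builder = new StringBuilder("memory=1,vcores=1");
  private String proto;

  // Synchronized so a concurrent caller never observes the builder mid-merge.
  public synchronized String getProto() {
    if (proto == null) {
      proto = builder.toString();
      builder = null; // mimic the PBImpl discarding its builder after build()
    }
    return proto;
  }
}
```

Alternatives with the same effect include building the proto eagerly so no shared mutation remains, or never sharing one Resource instance across responses.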
[jira] [Commented] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource
[ https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580937#comment-16580937 ] genericqa commented on YARN-8668: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 27m 16s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 31m 52s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 1s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 17s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 17s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 16s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 16s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 8s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 18s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:red}-1{color} | {color:red} shadedclient {color} | {color:red} 3m 44s{color} | {color:red} patch has errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 18s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 14s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 18s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 35s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 79m 35s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 | | JIRA Issue | YARN-8668 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12935673/YARN-8668.001.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux c83e944cc88c 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 8dc07b4 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC1 | | mvninstall | https://builds.apache.org/job/PreCommit-YARN-Build/21607/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | compile |
[jira] [Commented] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580902#comment-16580902 ] Weiwei Yang commented on YARN-8664: --- Hi [~yangjiandan] Yeah, seems like the Jenkins env is broken on this branch, not sure why; I will check with some other folks about this. Will keep you posted!
> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.8.2
> Reporter: Jiandan Yang
> Assignee: Jiandan Yang
> Priority: Major
> Attachments: YARN-8664-branch-2.8.001.pathch, YARN-8664-branch-2.8.2.001.patch, YARN-8664-branch-2.8.2.002.patch
[jira] [Commented] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource
[ https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580892#comment-16580892 ] Yeliang Cang commented on YARN-8668: Submitted a patch to resolve this!
> Inconsistency between capacity and fair scheduler in the aspect of computing node available resource
> 
> Key: YARN-8668
> URL: https://issues.apache.org/jira/browse/YARN-8668
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yeliang Cang
> Assignee: Yeliang Cang
> Priority: Major
> Attachments: YARN-8668.001.patch
>
> We have observed that, given CapacityScheduler and DefaultResourceCalculator, when a node has a lot of memory and runs a heavy workload, the available vcores of the node can go negative!
> I noticed that CapacityScheduler.java uses the code below to calculate the available resources when allocating containers:
> {code}
> if (calculator.computeAvailableContainers(Resources
>     .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
>     minimumAllocation) <= 0) {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("This node or this node partition doesn't have available or"
>         + "killable resource");
>   }
> {code}
> while the FairScheduler's FSAppAttempt.java has similar code:
> {code}
> // Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
>   ...
> }
> {code}
> Why the inconsistency? I think we should use Resources.fitsIn(smaller, bigger) in CapacityScheduler as well!
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
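The divergence the reporter describes can be illustrated with a small, self-contained sketch (simplified stand-ins, not the actual Hadoop classes): DefaultResourceCalculator-style math considers only memory, so the capacity-scheduler check can succeed on a node whose vcores are exhausted, while a fitsIn-style check compares every dimension:

```java
// Simplified stand-ins for Resource, DefaultResourceCalculator, and
// Resources.fitsIn -- illustrative only, not the real Hadoop classes.
class Res {
    final long memoryMB;
    final int vcores;
    Res(long memoryMB, int vcores) { this.memoryMB = memoryMB; this.vcores = vcores; }
}

class SchedulerChecks {
    // DefaultResourceCalculator-style: container count derived from memory alone.
    static long availableContainersByMemory(Res available, Res minAlloc) {
        return available.memoryMB / minAlloc.memoryMB;
    }

    // Resources.fitsIn-style: every dimension of "smaller" must fit in "bigger".
    static boolean fitsIn(Res smaller, Res bigger) {
        return smaller.memoryMB <= bigger.memoryMB && smaller.vcores <= bigger.vcores;
    }
}
```

On a node with memory left but zero unallocated vcores, the memory-only check still reports room for more containers, which is how a node's available vcores can be driven negative; the fitsIn check would refuse the allocation.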
[jira] [Updated] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource
[ https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yeliang Cang updated YARN-8668: --- Attachment: YARN-8668.001.patch
> Inconsistency between capacity and fair scheduler in the aspect of computing node available resource
> 
> Key: YARN-8668
> URL: https://issues.apache.org/jira/browse/YARN-8668
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yeliang Cang
> Assignee: Yeliang Cang
> Priority: Major
> Attachments: YARN-8668.001.patch
[jira] [Updated] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource
[ https://issues.apache.org/jira/browse/YARN-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yeliang Cang updated YARN-8668: --- Description:
We have observed that, given CapacityScheduler and DefaultResourceCalculator, when a node has a lot of memory and runs a heavy workload, the available vcores of the node can go negative!
I noticed that CapacityScheduler.java uses the code below to calculate the available resources when allocating containers:
{code}
if (calculator.computeAvailableContainers(Resources
    .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
    minimumAllocation) <= 0) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("This node or this node partition doesn't have available or"
        + "killable resource");
  }
{code}
while the FairScheduler's FSAppAttempt.java has similar code:
{code}
// Can we allocate a container on this node?
if (Resources.fitsIn(capability, available)) {
  ...
}
{code}
Why the inconsistency? I think we should use Resources.fitsIn(smaller, bigger) in CapacityScheduler as well!
> Inconsistency between capacity and fair scheduler in the aspect of computing node available resource
> 
> Key: YARN-8668
> URL: https://issues.apache.org/jira/browse/YARN-8668
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yeliang Cang
> Assignee: Yeliang Cang
> Priority: Major
[jira] [Created] (YARN-8668) Inconsistency between capacity and fair scheduler in the aspect of computing node available resource
Yeliang Cang created YARN-8668: -- Summary: Inconsistency between capacity and fair scheduler in the aspect of computing node available resource Key: YARN-8668 URL: https://issues.apache.org/jira/browse/YARN-8668 Project: Hadoop YARN Issue Type: Bug Reporter: Yeliang Cang Assignee: Yeliang Cang
[jira] [Commented] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580881#comment-16580881 ] Jiandan Yang commented on YARN-8664: - [~cheersyang] Jenkins is probably not OK. Would you please fix it?
> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.8.2
> Reporter: Jiandan Yang
> Assignee: Jiandan Yang
> Priority: Major
> Attachments: YARN-8664-branch-2.8.001.pathch, YARN-8664-branch-2.8.2.001.patch, YARN-8664-branch-2.8.2.002.patch
[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized
[ https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580862#comment-16580862 ] Chen Yufei commented on YARN-8513: -- We hit the infinite loop twice recently with 2.9.1; restarting the ResourceManager fixed the issue again. As the cause of the problem is still not clear, we have upgraded to Hadoop 3.1.0. I'll give further updates in case we encounter this issue again.
> CapacityScheduler infinite loop when queue is near fully utilized
> -
> Key: YARN-8513
> URL: https://issues.apache.org/jira/browse/YARN-8513
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, yarn
> Affects Versions: 2.9.1
> Environment: Ubuntu 14.04.5
> YARN is configured with one label and 5 queues.
> Reporter: Chen Yufei
> Priority: Major
> Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, jstack-5.log, top-during-lock.log, top-when-normal.log
>
> The ResourceManager sometimes does not respond to any request when a queue is near fully utilized. Sending SIGTERM won't stop the RM; only SIGKILL can. After an RM restart, it can recover running jobs and start accepting new ones.
> Seems like CapacityScheduler is in an infinite loop printing out the following log messages (more than 25,000 lines in a second):
> {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.99816763 absoluteUsedCapacity=0.99816763 used= cluster=}}
> {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1530619767030_1652_01 container=null queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943 clusterResource= type=NODE_LOCAL requestedPartition=}}
> I encountered this problem several times after upgrading to YARN 2.9.1, while the same configuration works fine under version 2.7.3.
> YARN-4477 is an infinite loop bug in FairScheduler; not sure if this is a similar problem.
[jira] [Commented] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts
[ https://issues.apache.org/jira/browse/YARN-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580751#comment-16580751 ] Rohith Sharma K S commented on YARN-8667: - Container relaunch shares the same working directory. As a result, the launch container script tries to create a symlink that already exists, which leads to this issue. To debug the issue, follow the steps below:
# Launch a sleeper service with the spec below. Note that I am providing a TARBALL artifact.
{code}
curl --negotiate -u: -H "Content-Type: application/json" -X POST http://localhost:8088/app/v1/services?user.name=yarn-ats -d '{
  "name": "sleeper",
  "version": "1.0.0",
  "queue": "default",
  "artifact": {
    "id": "/mapreduce/mapreduce.tar.gz",
    "type": "TARBALL"
  },
  "components" : [
    {
      "name": "sleeper1",
      "number_of_containers": 1,
      "launch_command": "sleep infinity",
      "resource": {
        "cpus": 1,
        "memory": "2048"
      }
    }
  ]
}'
{code}
# After the sleeper service is launched, go to the working directory of the container. There you will see the files below:
{noformat}
[root@ctr-e138-1518143905142-431547-01-04 container_e04_1534244457405_0004_01_02]# ll
total 24
-rw-r--r-- 1 yarn hadoop    7 Aug 15 05:47 container_tokens
-rwx------ 1 yarn hadoop  656 Aug 15 05:47 default_container_executor_session.sh
-rwx------ 1 yarn hadoop  711 Aug 15 05:47 default_container_executor.sh
-rwx------ 1 yarn hadoop 3817 Aug 15 05:47 launch_container.sh
lrwxrwxrwx 1 yarn hadoop  107 Aug 15 05:47 lib -> /hadoop/yarn/local/usercache/yarn-ats/appcache/application_1534244457405_0004/filecache/10/mapreduce.tar.gz
drwx--x--- 2 yarn hadoop 4096 Aug 15 05:47 tmp
{noformat}
# If you execute launch_container.sh again manually, it fails with the error below:
{code}
find: File system loop detected; ‘./lib/mapreduce.tar.gz’ is part of the same file system loop as ‘./lib’.
{code}
On container relaunch the same working directory is shared, which executes launch_container.sh again.
This causes an error that terminates launch_container.sh with exit code 1.
> Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts
> ---
> Key: YARN-8667
> URL: https://issues.apache.org/jira/browse/YARN-8667
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Rohith Sharma K S
> Priority: Major
>
> The service is launched with TARBALL artifacts. If a container exits for any reason, the container relaunch policy tries to relaunch the container on the same node with the same container workspace. As a result, the container relaunch keeps failing.
> If the container relaunch max-retry policy is disabled, the container is never launched on any other node either; rather, it keeps retrying on the same NodeManager, which never succeeds.
> {code}
> Relaunching Container container_e05_1533635581781_0001_01_02. Remaining retry attempts(after relaunch) : -4816.
> {code}
> There are two issues:
> # The container relaunch keeps failing
> # The log message is misleading
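The failure mode above can be sketched in isolation. A minimal illustration (hypothetical helper, not actual NodeManager code) of why reusing the same work directory breaks the second launch attempt: the "link the localized tarball" step is not idempotent.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of the relaunch problem: the launch script's
// "create symlink to the localized artifact" step succeeds on first launch,
// but fails on relaunch because the work directory (and the link) still exist.
class RelaunchSketch {
    static boolean linkLocalizedArtifact(Path workDir, Path localizedTarball) throws IOException {
        Path link = workDir.resolve("lib");
        try {
            Files.createSymbolicLink(link, localizedTarball);
            return true;   // first launch: link created
        } catch (FileAlreadyExistsException e) {
            return false;  // relaunch in the same workdir: step fails
        }
    }
}
```

A fix along the lines discussed in the thread would either clean the work directory before relaunch or make the link step idempotent (delete, then create).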
[jira] [Created] (YARN-8667) Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts
Rohith Sharma K S created YARN-8667: --- Summary: Container Relaunch fails with "find: File system loop detected;" for tar ball artifacts Key: YARN-8667 URL: https://issues.apache.org/jira/browse/YARN-8667 Project: Hadoop YARN Issue Type: Bug Reporter: Rohith Sharma K S
The service is launched with TARBALL artifacts. If a container exits for any reason, the container relaunch policy tries to relaunch the container on the same node with the same container workspace. As a result, the container relaunch keeps failing.
If the container relaunch max-retry policy is disabled, the container is never launched on any other node either; rather, it keeps retrying on the same NodeManager, which never succeeds.
{code}
Relaunching Container container_e05_1533635581781_0001_01_02. Remaining retry attempts(after relaunch) : -4816.
{code}
There are two issues:
# The container relaunch keeps failing
# The log message is misleading