[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124527#comment-17124527 ] Tao Yang commented on YARN-10293: - Thanks [~wangda] for your confirmation. I think the proposed change can solve the problem for heartbeat-driven scheduling but not for async scheduling, since async scheduling may still get stuck in a loop that chooses the first of the candidate nodes and then re-reserves on it, as mentioned in YARN-9598. However, if what we want for this issue is just to fix the problem for heartbeat-driven scenarios, with a more complete solution to follow later, the change is fine with me for now. In our internal version, we have already removed this check to support allocating OPPORTUNISTIC containers in the main scheduling process. > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement (YARN-10259) > > > Key: YARN-10293 > URL: https://issues.apache.org/jira/browse/YARN-10293 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10293-001.patch, YARN-10293-002.patch, > YARN-10293-003-WIP.patch > > > Reserved Containers are not allocated from the available space of other nodes in the > CandidateNodeSet in MultiNodePlacement. YARN-10259 fixed two related issues: > https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987 > Found one more bug in the CapacityScheduler.java code which causes the > same issue with a slightly different repro. > *Repro:* > *Nodes : Available : Used* > Node1 - 8GB, 8vcores - 8GB, 8cores > Node2 - 8GB, 8vcores - 8GB, 8cores > Node3 - 8GB, 8vcores - 8GB, 8cores > Queues -> A and B, both 50% capacity, 100% max capacity > MultiNode enabled + Preemption enabled > 1. JobA submitted to queue A, which used the full cluster: 24GB and 24 vcores > 2. JobB submitted to queue B with an AM size of 1GB > {code} > 2020-05-21 12:12:27,313 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest > IP=172.27.160.139 OPERATION=Submit Application Request > TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 > CALLERCONTEXT=CLI QUEUENAME=dummy > {code} > 3. Preemption happens and the used capacity is less than 1.0f > {code} > 2020-05-21 12:12:48,222 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: > Non-AM container preempted, current > appAttemptId=appattempt_1590046667304_0004_01, > containerId=container_e09_1590046667304_0004_01_24, > resource= > {code} > 4. 
JobB gets a Reserved Container as part of > CapacityScheduler#allocateOrReserveNewContainer > {code} > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to > RESERVED > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > {code} > *Why did RegularContainerAllocator reserve the container when the used capacity > was <= 1.0f?* > {code} > The reason is that even though the container is preempted, the NodeManager still has to > stop the container, heartbeat, and report the updated available and unallocated > resources to the ResourceManager. > {code} > 5. Now no new allocation happens and the reserved container stays reserved. > After the reservation the used capacity becomes 1.0f, the calls below run in a loop, and > no new allocate or reserve happens. The reserved container cannot be > allocated because the reserved node does not have space. Node2 has space for 1GB, > 1vcore, but CapacityScheduler#allocateOrReserveNewContainers is not getting > called, causing the hang. > *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> > CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container > on node* > {code} > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Trying to fulfill reservation for application application_1590046667304_0005 > on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > assignContainers: partition= #applications=1 > 2020-05-21 12:13:33,242 INFO >
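To make the hang concrete, here is a minimal, self-contained sketch of the control flow described in the repro. The class, method, and node names only mirror the discussion above; the bodies are illustrative, not the actual CapacityScheduler implementation. The node holding the reservation is always retried first and short-circuits the cycle, so the path that could place the container on node2 is never reached.

{code:java}
// Hypothetical, simplified model of the multi-node scheduling cycle above.
// Not the real CapacityScheduler code; it only illustrates the short-circuit.
import java.util.List;

public class MultiNodeLoopSketch {

  static class Node {
    final String name;
    final int availableMb;
    final boolean hasReservedContainer;
    Node(String name, int availableMb, boolean reserved) {
      this.name = name;
      this.availableMb = availableMb;
      this.hasReservedContainer = reserved;
    }
  }

  // Mirrors the loop in the repro: the node holding the reservation is always
  // handled first, so "allocate or reserve new containers" is never reached
  // even though another candidate has room.
  static void scheduleOnce(List<Node> candidates, int requestMb) {
    for (Node node : candidates) {
      if (node.hasReservedContainer) {
        if (node.availableMb >= requestMb) {
          System.out.println("allocate reserved container on " + node.name);
        } else {
          System.out.println("re-reserve on " + node.name + " (no space)");
        }
        return; // short-circuit: remaining candidates are never considered
      }
    }
    System.out.println("allocateOrReserveNewContainers over all candidates");
  }

  public static void main(String[] args) {
    List<Node> nodes = List.of(
        new Node("node3", 0, true),      // holds the reservation, still full
        new Node("node2", 1024, false)); // has 1GB free, but never reached
    scheduleOnce(nodes, 1024);           // prints the re-reserve branch
  }
}
{code}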
[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124511#comment-17124511 ] Hadoop QA commented on YARN-10300: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 50s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 56s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 7s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 18m 37s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 9s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 5s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 14s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 4s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 52m 43s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 39s{color} | {color:red} The patch generated 17 ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}126m 31s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerOvercommit | | | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26105/artifact/out/Dockerfile | | JIRA Issue | YARN-10300 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004661/YARN-10300.002.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 27ca2cb7f6cd 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 6288e15118f | | Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 | |
[jira] [Commented] (YARN-10302) Support custom packing algorithm for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124480#comment-17124480 ] Hadoop QA commented on YARN-10302: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 42s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 30m 26s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 49s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 11s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 36s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 41s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 39s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 50s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 40s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 87m 45s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 33s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}158m 51s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26104/artifact/out/Dockerfile | | JIRA Issue | YARN-10302 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004658/YARN-10302-trunk.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 737e05239109 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 6288e15118f | | Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 | | unit |
[jira] [Commented] (YARN-9903) Support reservations continue looking for Node Labels
[ https://issues.apache.org/jira/browse/YARN-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124478#comment-17124478 ] Hadoop QA commented on YARN-9903: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 29s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 17s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 43s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 39s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 38s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 1s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 45s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 87m 22s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 36s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}151m 29s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26103/artifact/out/Dockerfile | | JIRA Issue | YARN-9903 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004653/YARN-9903.002.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux e1f75a293e55 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 6288e15118f | | Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 | | unit |
[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124461#comment-17124461 ] Eric Badger commented on YARN-10300: Added some null checks to patch 002 > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch, YARN-10300.002.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10300: --- Attachment: YARN-10300.002.patch > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch, YARN-10300.002.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124450#comment-17124450 ] Eric Badger commented on YARN-10300: bq. Eric Badger can you explain under what circumstances attempt.getMasterContainer().getNodeId().getHost() will succeed where attempt.getHost() fails? {{attempt.getHost()}} grabs {{host}} within RMAppAttemptImpl.java. This is set during {{AMRegisteredTransition}}, so it will be "N/A" until the AM registers with the NM during the first heartbeat. {{attempt.getMasterContainer()}} grabs {{masterContainer}} within RMAppAttemptImpl.java. This is set during {{AMContainerAllocatedTransition}}. So from the time between when the container is allocated until the time of the first AM heartbeat, {{masterContainer}} will be set, but {{host}} will be "N/A" bq. Also, do we need to add some null checks for these? getMasterContainer(), getNodeId(), getHost()? Probably wouldn't hurt. bq. Note that the attempt.getHost() defaults to "N/A" before it is set - what do we get if NodeID().getHost() isn't valid yet? Is that even a possibility? I don't know if it's possible to be invalid at the start or not. The {{Container}} is going to be created via {{newInstance}}, which requires a {{NodeId}} as a parameter. But that could potentially be sent in as null. But I think it will either be the correct nodeId or will be null, which I can interpret as "N/A". There are so many places that containers are instantiated in the scheduler that it'd be pretty tough to see if all of the cases have the nodeID set initially. I can add in the null checks and default the string to "N/A" if any of them don't exist. > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
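For reference, a minimal sketch of the null-check chain discussed above, assuming the usual attempt -> masterContainer -> nodeId -> host accessors. The tiny interfaces below only stand in for the real RM types, and the actual YARN-10300 patch may be structured differently.

{code:java}
// Hypothetical sketch of the null-safe lookup; falls back to "N/A" whenever
// the master container, its NodeId, or the host is not available yet.
public class AppMasterHostSketch {
  interface NodeId { String getHost(); }
  interface Container { NodeId getNodeId(); }
  interface Attempt { Container getMasterContainer(); }

  static String appMasterHost(Attempt attempt) {
    if (attempt == null
        || attempt.getMasterContainer() == null
        || attempt.getMasterContainer().getNodeId() == null
        || attempt.getMasterContainer().getNodeId().getHost() == null) {
      return "N/A";
    }
    return attempt.getMasterContainer().getNodeId().getHost();
  }

  public static void main(String[] args) {
    // Before the container is allocated there is no master container at all.
    System.out.println(appMasterHost(() -> null)); // prints N/A
  }
}
{code}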
[jira] [Commented] (YARN-10283) Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used
[ https://issues.apache.org/jira/browse/YARN-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124440#comment-17124440 ] Jim Brennan commented on YARN-10283: [~pbacsko] I downloaded your patch and your test case and verified that I see the same behavior as you. reproWithoutNodeLabels fails unless I set minimum-allocation-mb=1024, and reproTestWithNodeLabels succeeds. I have uploaded a patch for YARN-9903 based on an internal change that we've been running with for a long time. I verified that with the YARN-9903 patch, I get the same results for your repro test cases as with the YARN-10283 patch. Please feel free to pull in the additional changes from YARN-9903 into this patch. I think it addresses the comment from [~tarunparimi]. The YARN-9903 patch does not include your changes to FiCaSchedulerApp. I'm not certain that change is needed, but it might be. [~epayne], can you take a look as well? > Capacity Scheduler: starvation occurs if a higher priority queue is full and > node labels are used > - > > Key: YARN-10283 > URL: https://issues.apache.org/jira/browse/YARN-10283 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10283-POC01.patch, YARN-10283-ReproTest.patch, > YARN-10283-ReproTest2.patch > > > Recently we've been investigating a scenario where applications submitted to > a lower priority queue could not get scheduled because a higher priority > queue in the same hierarchy could not satisfy the allocation request. Both > queues belonged to the same partition. > If we disabled node labels, the problem disappeared. > The problem is that {{RegularContainerAllocator}} always allocated a > container for the request, even if it should not have. > *Example:* > * Cluster total resources: 3 nodes, 15GB, 24 vcores (5GB / 8 vcore per node) > * Partition "shared" was created with 2 nodes > * "root.lowprio" (priority = 20) and "root.highprio" (priority = 40) were > added to the partition > * Both queues have a limit of > * Using DominantResourceCalculator > Setup: > Submit distributed shell application to highprio with switches > "-num_containers 3 -container_vcores 4". The memory allocation is 512MB per > container. > Chain of events: > 1. Queue is filled with containers until it reaches usage vCores:5> > 2. A node update event is pushed to CS from a node which is part of the > partition > 3. {{AbstractCSQueue.canAssignToQueue()}} returns true because it's smaller > than the current limit resource > 4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an > allocated container for > 5. But we can't commit the resource request because we would have 9 vcores in > total, violating the limit. > The problem is that we always try to assign a container for the same > application in each heartbeat from "highprio". Applications in "lowprio" > cannot make progress. > *Problem:* > {{RegularContainerAllocator.assignContainer()}} does not handle this case > well. We only reject the allocation if this condition is satisfied: > {noformat} > if (rmContainer == null && reservationsContinueLooking > && node.getLabels().isEmpty()) { > {noformat} > But if we have node labels, we enter a different code path and succeed with > the allocation if there's room for a container. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
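To restate the quoted condition and the direction the YARN-9903 patch takes, here is an illustrative sketch only; the booleans below are simplified stand-ins for the real RegularContainerAllocator state. The allocation-rejection check currently runs only for unlabeled nodes, so on a labeled node the full high-priority queue keeps producing allocations it cannot commit.

{code:java}
// Illustrative only: the current check from the description vs. a relaxed
// form that also applies reservations-continue-looking to labeled nodes.
public class ContinueLookingConditionSketch {

  // Current behaviour: the reject/unreserve logic is reachable only when the
  // node has no labels, i.e. node.getLabels().isEmpty() in the real code.
  static boolean rejectionCheckApplies(boolean hasRmContainer,
      boolean reservationsContinueLooking, boolean nodeHasLabels) {
    return !hasRmContainer && reservationsContinueLooking && !nodeHasLabels;
  }

  // Relaxed behaviour (the YARN-9903 direction): drop the label restriction
  // so the same check runs for labeled partitions too.
  static boolean relaxedRejectionCheckApplies(boolean hasRmContainer,
      boolean reservationsContinueLooking) {
    return !hasRmContainer && reservationsContinueLooking;
  }

  public static void main(String[] args) {
    // Labeled node today: the check is skipped and the allocation goes ahead.
    System.out.println(rejectionCheckApplies(false, true, true));  // false
    // With the relaxed form the check applies, so an uncommittable allocation
    // can be rejected and lower-priority queues can make progress.
    System.out.println(relaxedRejectionCheckApplies(false, true)); // true
  }
}
{code}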
[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124421#comment-17124421 ] Jim Brennan commented on YARN-10300: [~ebadger] can you explain under what circumstances {{attempt.getMasterContainer().getNodeId().getHost()}} will succeed where {{attempt.getHost()}} fails? Also, do we need to add some null checks for these? getMasterContainer(), getNodeId(), getHost()? Note that the attempt.getHost() defaults to "N/A" before it is set - what do we get if NodeID().getHost() isn't valid yet? Is that even a possibility? > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10302) Support custom packing algorithm for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124348#comment-17124348 ] William W. Graham Jr commented on YARN-10302: - Done. Also corrected the checkstyles and findbugs errors flagged in https://github.com/apache/hadoop/pull/2044 > Support custom packing algorithm for FairScheduler > -- > > Key: YARN-10302 > URL: https://issues.apache.org/jira/browse/YARN-10302 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: William W. Graham Jr >Priority: Major > Attachments: YARN-10302-trunk.001.patch > > > The {{FairScheduler}} class allocates containers to nodes based on the node > with the most available memory[0]. Create the ability to instead configure a > custom packing algorithm with different logic. For instance for effective > auto scaling, a bin packing algorithm might be a better choice. > 0 - > https://github.com/apache/hadoop/blob/56b7571131b0af03b32bf1c5673c32634652df21/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1034-L1043 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
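A rough sketch of what a pluggable packing policy could look like; all names below are hypothetical, not existing FairScheduler APIs. The idea is that the scheduler asks the policy for a node ordering instead of hard-coding "most available memory first".

{code:java}
// Hypothetical strategy hook for YARN-10302; NodeInfo and NodePackingPolicy
// are illustrative names, not part of the FairScheduler code base.
import java.util.Comparator;
import java.util.List;

public class PackingPolicySketch {

  static class NodeInfo {
    final String host;
    final long availableMemoryMb;
    NodeInfo(String host, long availableMemoryMb) {
      this.host = host;
      this.availableMemoryMb = availableMemoryMb;
    }
  }

  /** Orders candidate nodes; the scheduler would pick the first (the minimum). */
  interface NodePackingPolicy {
    Comparator<NodeInfo> nodeOrder();
  }

  // Spread (today's behaviour in spirit): most available memory first.
  static final NodePackingPolicy SPREAD =
      () -> Comparator.comparingLong((NodeInfo n) -> n.availableMemoryMb).reversed();

  // Bin packing (friendlier to autoscaling): least available memory first.
  // A real policy would also verify that the request still fits on the node.
  static final NodePackingPolicy BIN_PACK =
      () -> Comparator.comparingLong((NodeInfo n) -> n.availableMemoryMb);

  public static void main(String[] args) {
    List<NodeInfo> nodes =
        List.of(new NodeInfo("n1", 4096), new NodeInfo("n2", 1024));
    NodeInfo pick = nodes.stream().min(BIN_PACK.nodeOrder()).orElseThrow();
    System.out.println("bin-pack picks " + pick.host); // n2
  }
}
{code}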
[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124328#comment-17124328 ] Eric Badger commented on YARN-10300: The unit tests pass for me locally. [~epayne], [~Jim_Brennan], could you take a look at this patch? > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9903) Support reservations continue looking for Node Labels
[ https://issues.apache.org/jira/browse/YARN-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124325#comment-17124325 ] Jim Brennan commented on YARN-9903: --- Uploaded patch 002, which removes an extraneous file. > Support reservations continue looking for Node Labels > - > > Key: YARN-9903 > URL: https://issues.apache.org/jira/browse/YARN-9903 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Tarun Parimi >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-9903.001.patch, YARN-9903.002.patch > > > YARN-1769 brought in reservations continue looking feature which improves the > several resource reservation scenarios. However, it is not handled currently > when nodes have a label assigned to them. This is useful and in many cases > necessary even for Node Labels. So we should look to support this for node > labels also. > For example, in AbstractCSQueue.java, we have the below TODO. > {code:java} > // TODO, now only consider reservation cases when the node has no label > if (this.reservationsContinueLooking && nodePartition.equals( > RMNodeLabelsManager.NO_LABEL) && Resources.greaterThan( resourceCalculator, > clusterResource, resourceCouldBeUnreserved, Resources.none())) { > {code} > cc [~sunilg] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9903) Support reservations continue looking for Node Labels
[ https://issues.apache.org/jira/browse/YARN-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-9903: -- Attachment: YARN-9903.002.patch > Support reservations continue looking for Node Labels > - > > Key: YARN-9903 > URL: https://issues.apache.org/jira/browse/YARN-9903 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Tarun Parimi >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-9903.001.patch, YARN-9903.002.patch > > > YARN-1769 brought in reservations continue looking feature which improves the > several resource reservation scenarios. However, it is not handled currently > when nodes have a label assigned to them. This is useful and in many cases > necessary even for Node Labels. So we should look to support this for node > labels also. > For example, in AbstractCSQueue.java, we have the below TODO. > {code:java} > // TODO, now only consider reservation cases when the node has no label > if (this.reservationsContinueLooking && nodePartition.equals( > RMNodeLabelsManager.NO_LABEL) && Resources.greaterThan( resourceCalculator, > clusterResource, resourceCouldBeUnreserved, Resources.none())) { > {code} > cc [~sunilg] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124195#comment-17124195 ] Wangda Tan commented on YARN-10293: --- [~Tao Yang], the suggestion totally make sense to me. When we have done the initial global scheduling framework, the goal is to make it compatible to the previous behavior, I agree to make additional steps to overhaul reservation logic under the context of global scheduling is a good idea. Now the code is very hard to read and understand. I think we can do this step by step, first, let's fix low hanging fruits like this Jira. (I hope to get idea from you about the proposed change: https://issues.apache.org/jira/browse/YARN-10293?focusedCommentId=17121419=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17121419 And [~prabhujoseph] if you have time/bandwidth, can you take a look into reservation related logic + preemption + unreserve + global scheduling and see what we can optimize here? > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement (YARN-10259) > > > Key: YARN-10293 > URL: https://issues.apache.org/jira/browse/YARN-10293 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10293-001.patch, YARN-10293-002.patch, > YARN-10293-003-WIP.patch > > > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues > related to it > https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987 > Have found one more bug in the CapacityScheduler.java code which causes the > same issue with slight difference in the repro. > *Repro:* > *Nodes : Available : Used* > Node1 - 8GB, 8vcores - 8GB. 8cores > Node2 - 8GB, 8vcores - 8GB. 8cores > Node3 - 8GB, 8vcores - 8GB. 8cores > Queues -> A and B both 50% capacity, 100% max capacity > MultiNode enabled + Preemption enabled > 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores > 2. JobB Submitted to B queue with AM size of 1GB > {code} > 2020-05-21 12:12:27,313 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest > IP=172.27.160.139 OPERATION=Submit Application Request > TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 > CALLERCONTEXT=CLI QUEUENAME=dummy > {code} > 3. Preemption happens and used capacity is lesser than 1.0f > {code} > 2020-05-21 12:12:48,222 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: > Non-AM container preempted, current > appAttemptId=appattempt_1590046667304_0004_01, > containerId=container_e09_1590046667304_0004_01_24, > resource= > {code} > 4. 
JobB gets a Reserved Container as part of > CapacityScheduler#allocateOrReserveNewContainer > {code} > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to > RESERVED > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > {code} > *Why RegularContainerAllocator reserved the container when the used capacity > is <= 1.0f ?* > {code} > The reason is even though the container is preempted - nodemanager has to > stop the container and heartbeat and update the available and unallocated > resources to ResourceManager. > {code} > 5. Now, no new allocation happens and reserved container stays at reserved. > After reservation the used capacity becomes 1.0f, below will be in a loop and > no new allocate or reserve happens. The reserved container cannot be > allocated as reserved node does not have space. node2 has space for 1GB, > 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting > called causing the Hang. > *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> > CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container > on node* > {code} > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Trying to fulfill reservation for application application_1590046667304_0005 > on node:
[jira] [Commented] (YARN-10286) PendingContainers bugs in the scheduler outputs
[ https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124175#comment-17124175 ] Hadoop QA commented on YARN-10286: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 21s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} markdownlint {color} | {color:blue} 0m 0s{color} | {color:blue} markdownlint was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} branch-3.3 Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 3m 15s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 29m 23s{color} | {color:green} branch-3.3 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 52s{color} | {color:green} branch-3.3 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 27s{color} | {color:green} branch-3.3 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 33s{color} | {color:green} branch-3.3 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 19m 10s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 4s{color} | {color:green} branch-3.3 passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 48s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 28s{color} | {color:blue} branch/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site no findbugs output file (findbugsXml.xml) {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 23s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 56s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 56s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 22s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 1 new + 3 unchanged - 0 fixed = 4 total (was 3) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 46s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 22s{color} | {color:blue} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site has no data from findbugs {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 92m 25s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 20s{color} | {color:green} hadoop-yarn-site in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 46s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}192m 17s{color} | {color:black} {color} | \\ \\ || Subsystem
[jira] [Commented] (YARN-10296) Make ContainerPBImpl#getId/setId synchronized
[ https://issues.apache.org/jira/browse/YARN-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124121#comment-17124121 ] Wangda Tan commented on YARN-10296: --- [~bteke], It makes sense to convert the other methods which use containerId to synchronized as well, for example {{getProto}}. Performance should not be a big concern here, as multiple threads are not likely to access the same Container object (because we have many container objects in memory). > Make ContainerPBImpl#getId/setId synchronized > - > > Key: YARN-10296 > URL: https://issues.apache.org/jira/browse/YARN-10296 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Minor > Attachments: YARN-10296.001.patch > > > ContainerPBImpl getId and setId methods can be accessed from multiple > threads. In order to avoid any simultaneous accesses and race conditions > these methods should be synchronized. > The idea came from the issue described in YARN-10295; however, that patch is > only applicable to branch-3.2 and 3.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
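As a trimmed-down sketch of the idea (the real ContainerPBImpl has many more fields and a protobuf builder, so this is illustrative only): every method touching the shared containerId field synchronizes on the same monitor, which is also what makes the FindBugs IS2_INCONSISTENT_SYNC warning discussed later in this digest go away.

{code:java}
// Hypothetical, trimmed-down illustration of YARN-10296; not the actual
// ContainerPBImpl. The point is that all accesses to containerId share a lock.
public class ContainerIdSyncSketch {

  private Object containerId; // stands in for ContainerPBImpl's cached id

  public synchronized Object getId() {
    return containerId;
  }

  public synchronized void setId(Object id) {
    this.containerId = id;
  }

  // Per the suggestion above, other users of the field (e.g. a getProto-style
  // method that serializes current state) should be synchronized as well, so
  // no thread observes a half-updated view of the object.
  public synchronized String buildProtoLikeView() {
    return "Container{id=" + containerId + "}";
  }
}
{code}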
[jira] [Commented] (YARN-10254) CapacityScheduler incorrect User Group Mapping after leaf queue change
[ https://issues.apache.org/jira/browse/YARN-10254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124044#comment-17124044 ] Hudson commented on YARN-10254: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18315 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18315/]) YARN-10254. CapacityScheduler incorrect User Group Mapping after leaf (snemeth: rev b5efdea4fd385453fd9f9da7106e908d2a9a3812) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/placement/UserGroupMappingPlacementRule.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacitySchedulerQueueMappingFactory.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/placement/TestUserGroupMappingPlacementRule.java > CapacityScheduler incorrect User Group Mapping after leaf queue change > -- > > Key: YARN-10254 > URL: https://issues.apache.org/jira/browse/YARN-10254 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10254.001.patch, YARN-10254.002.patch, > YARN-10254.003.patch, YARN-10254.004.patch, YARN-10254.005.patch, > YARN-10254.branch-3.3.001.patch > > > YARN-9879 and YARN-10198 introduced some major changes to user group mapping, > and some of them unfortunately had some negative impact on the way mapping > works. > In some cases incorrect PlacementContexts were created, where full queue path > was passed as leaf queue name. This affects how the yarn cli app list > displays the queues. > u:%user:%primary_group.%user mapping fails with an incorrect validation error > when the %primary_group parent queue was a managed parent. > Group based rules in certain cases are mapped to root.[primary_group] rules, > loosing the ability to create deeper structures. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9097) Investigate why GpuDiscoverer methods are synchronized
[ https://issues.apache.org/jira/browse/YARN-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123963#comment-17123963 ] Bilwa S T edited comment on YARN-9097 at 6/2/20, 4:47 PM: -- Hi [~snemeth] Thanks for raising this issue. I could see that getGpuDeviceInformation is called on NM init and from GpuResourcePlugin#getNMResourceInfo, which is already synchronized, and getGpusUsableByYarn is also called on NM init. Hence all three methods need not be synchronized. I think even GpuResourceHandlerImpl#postComplete need not be synchronized, as GpuResourceAllocator#unassignGpus is already synchronized, and we can make GpuResourceAllocator#usedDevices a ConcurrentHashMap and GpuResourceAllocator#allowedGpuDevices a ConcurrentHashSet, as order need not be maintained. Thoughts? was (Author: bilwast): Hi [~snemeth] Thanks for raising this issue. I could see that getGpuDeviceInformation is called on NM init and GpuResourcePlugin#getNMResourceInfo which is already synchronized and getGpusUsableByYarn is also called on NM init. Hence all three methods need not be synchronized I think even GpuResourceHandlerImpl#postComplete need not be synchronized as GpuResourceAllocator#unassignGpus is already synchronized > Investigate why GpuDiscoverer methods are synchronized > -- > > Key: YARN-9097 > URL: https://issues.apache.org/jira/browse/YARN-9097 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Bilwa S T >Priority: Minor > Labels: newbie, newbie++ > > GpuDiscoverer.initialize surely shouldn't have been synchronized. > Please also investigate why getGpuDeviceInformation / getGpusUsableByYarn are > synchronized. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
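A small sketch of the direction suggested above, assuming the collections can simply be swapped for concurrent ones. Field and method names are simplified stand-ins for GpuResourceAllocator, and since the JDK has no ConcurrentHashSet class, ConcurrentHashMap.newKeySet() is used for the set.

{code:java}
// Illustrative only: replace method-level synchronization with concurrent
// collections where each individual map/set operation just needs to be atomic.
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class GpuAllocatorSketch {

  // usedDevices: GPU device -> container id; iteration order does not matter.
  private final Map<String, Long> usedDevices = new ConcurrentHashMap<>();
  // allowedGpuDevices: concurrent set view backed by a ConcurrentHashMap.
  private final Set<String> allowedGpuDevices = ConcurrentHashMap.newKeySet();

  void addAllowedDevice(String device) {
    allowedGpuDevices.add(device);
  }

  boolean assignGpu(String device, long containerId) {
    // putIfAbsent makes the check-and-assign atomic without a synchronized block.
    return allowedGpuDevices.contains(device)
        && usedDevices.putIfAbsent(device, containerId) == null;
  }

  void unassignGpus(long containerId) {
    // Per-entry removal is atomic; no lock over the whole allocator is needed.
    usedDevices.values().removeIf(id -> id == containerId);
  }
}
{code}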
[jira] [Commented] (YARN-10296) Make ContainerPBImpl#getId/setId synchronized
[ https://issues.apache.org/jira/browse/YARN-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124042#comment-17124042 ] Benjamin Teke commented on YARN-10296: -- Hi [~leftnoteasy], I wanted to ask your opinion on this task. Making the getId/setId synchronized triggers the findBugs issue: IS2_INCONSISTENT_SYNC. It is valid based on the code, as the class's other methods do access the containerId with no synchronisation. Is it worth the extra effort to refactor those methods? I'm having concerns about performance. Thanks! > Make ContainerPBImpl#getId/setId synchronized > - > > Key: YARN-10296 > URL: https://issues.apache.org/jira/browse/YARN-10296 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Minor > Attachments: YARN-10296.001.patch > > > ContainerPBImpl getId and setId methods can be accessed from multiple > threads. In order to avoid any simultaneous accesses and race conditions > these methods should be synchronized. > The idea came from the issue described in YARN-10295, however that patch is > only applicable to branch-3.2 and 3.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10254) CapacityScheduler incorrect User Group Mapping after leaf queue change
[ https://issues.apache.org/jira/browse/YARN-10254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124018#comment-17124018 ] Szilard Nemeth commented on YARN-10254: --- Thanks [~shuzirra] for working on this. Checked the latest patches for both, LGTM and committed it to trunk and branch-3.3. Thanks [~pbacsko] for the review. Resolving jira. > CapacityScheduler incorrect User Group Mapping after leaf queue change > -- > > Key: YARN-10254 > URL: https://issues.apache.org/jira/browse/YARN-10254 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10254.001.patch, YARN-10254.002.patch, > YARN-10254.003.patch, YARN-10254.004.patch, YARN-10254.005.patch, > YARN-10254.branch-3.3.001.patch > > > YARN-9879 and YARN-10198 introduced some major changes to user group mapping, > and some of them unfortunately had some negative impact on the way mapping > works. > In some cases incorrect PlacementContexts were created, where full queue path > was passed as leaf queue name. This affects how the yarn cli app list > displays the queues. > u:%user:%primary_group.%user mapping fails with an incorrect validation error > when the %primary_group parent queue was a managed parent. > Group based rules in certain cases are mapped to root.[primary_group] rules, > loosing the ability to create deeper structures. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10254) CapacityScheduler incorrect User Group Mapping after leaf queue change
[ https://issues.apache.org/jira/browse/YARN-10254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10254: -- Fix Version/s: 3.3.1 3.4.0 > CapacityScheduler incorrect User Group Mapping after leaf queue change > -- > > Key: YARN-10254 > URL: https://issues.apache.org/jira/browse/YARN-10254 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10254.001.patch, YARN-10254.002.patch, > YARN-10254.003.patch, YARN-10254.004.patch, YARN-10254.005.patch, > YARN-10254.branch-3.3.001.patch > > > YARN-9879 and YARN-10198 introduced some major changes to user group mapping, > and some of them unfortunately had some negative impact on the way mapping > works. > In some cases incorrect PlacementContexts were created, where full queue path > was passed as leaf queue name. This affects how the yarn cli app list > displays the queues. > u:%user:%primary_group.%user mapping fails with an incorrect validation error > when the %primary_group parent queue was a managed parent. > Group based rules in certain cases are mapped to root.[primary_group] rules, > loosing the ability to create deeper structures. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
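To make the mapping-rule syntax in the description concrete, here is a small hypothetical example of how a rule such as u:%user:%primary_group.%user resolves for one user. The real UserGroupMappingPlacementRule handles far more (managed parents, validation, %secondary_group, full queue paths); this only shows placeholder expansion.

{code:java}
// Hypothetical placeholder-expansion example; not the real placement-rule code.
public class QueueMappingSketch {

  // Resolves a simple "u:<source>:<queue>" rule for the given user and group.
  static String resolve(String rule, String user, String primaryGroup) {
    String[] parts = rule.split(":");
    if (parts.length != 3 || !"u".equals(parts[0])) {
      throw new IllegalArgumentException("only simple u:<source>:<queue> rules handled here");
    }
    String source = parts[1].replace("%user", user);
    if (!source.equals(user)) {
      return null; // rule does not apply to this user
    }
    return parts[2]
        .replace("%primary_group", primaryGroup)
        .replace("%user", user);
  }

  public static void main(String[] args) {
    // For user "alice" with primary group "analysts" this yields
    // "analysts.alice": a leaf queue named after the user under the
    // primary-group parent queue, which is the scenario YARN-10254 fixes
    // when that parent is a managed parent queue.
    System.out.println(resolve("u:%user:%primary_group.%user", "alice", "analysts"));
  }
}
{code}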
[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
[ https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123965#comment-17123965 ] Hadoop QA commented on YARN-10284: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 36m 19s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} branch-3.3 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 31m 15s{color} | {color:green} branch-3.3 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s{color} | {color:green} branch-3.3 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s{color} | {color:green} branch-3.3 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 40s{color} | {color:green} branch-3.3 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 14s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} branch-3.3 passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 16s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 15s{color} | {color:green} branch-3.3 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 37s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 22s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 35s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}110m 50s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26101/artifact/out/Dockerfile | | JIRA Issue | YARN-10284 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004630/YARN-10284.branch-3.3.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 50a8656be957 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | branch-3.3 / 910d88e | | Default Java | Private Build-1.8.0_252-8u252-b09-1~16.04-b09 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/26101/testReport/ | | Max. process+thread count | 340 (vs. ulimit of 5500) | | modules | C:
[jira] [Commented] (YARN-9097) Investigate why GpuDiscoverer methods are synchronized
[ https://issues.apache.org/jira/browse/YARN-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123963#comment-17123963 ] Bilwa S T commented on YARN-9097: - Hi [~snemeth], thanks for raising this issue. I could see that getGpuDeviceInformation is called on NM init and from GpuResourcePlugin#getNMResourceInfo, which is already synchronized, and getGpusUsableByYarn is also called on NM init. Hence all three methods need not be synchronized. I think even GpuResourceHandlerImpl#postComplete need not be synchronized, as GpuResourceAllocator#unassignGpus is already synchronized. > Investigate why GpuDiscoverer methods are synchronized > -- > > Key: YARN-9097 > URL: https://issues.apache.org/jira/browse/YARN-9097 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Bilwa S T >Priority: Minor > Labels: newbie, newbie++ > > GpuDiscoverer.initialize surely shouldn't have been synchronized. > Please also investigate why getGpuDeviceInformation / getGpusUsableByYarn are > synchronized. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
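To make the reasoning above concrete, a simplified sketch (assumed fields and structure, not the real GpuDiscoverer): a method invoked only from single-threaded NodeManager init, or only through a caller that already holds a lock, gains nothing from its own synchronized keyword.
{code:java}
import java.util.Arrays;
import java.util.List;

// Simplified illustration of the argument above; not the real GpuDiscoverer,
// and the fields here are assumed.
public class GpuDiscovererSketch {
  private volatile boolean initialized;
  private List<Integer> usableGpus;

  // Invoked once during single-threaded NM init, so a "synchronized" keyword
  // here would not protect against any access that can actually happen.
  public void initialize() {
    usableGpus = Arrays.asList(0, 1);
    initialized = true;
  }

  // If every caller is either the init path or an already-synchronized
  // wrapper (e.g. GpuResourcePlugin#getNMResourceInfo per the comment above),
  // this method does not need its own lock either.
  public List<Integer> getGpusUsableByYarn() {
    if (!initialized) {
      throw new IllegalStateException("GPU discovery has not run yet");
    }
    return usableGpus;
  }
}
{code}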
[jira] [Commented] (YARN-10286) PendingContainers bugs in the scheduler outputs
[ https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123910#comment-17123910 ] Andras Gyori commented on YARN-10286: - The patch smoothly applied to both branch-3.3 and 3.2. > PendingContainers bugs in the scheduler outputs > --- > > Key: YARN-10286 > URL: https://issues.apache.org/jira/browse/YARN-10286 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Andras Gyori >Priority: Critical > Fix For: 3.4.0 > > Attachments: YARN-10286.001.patch, YARN-10286.branch-3.2.001.patch, > YARN-10286.branch-3.3.001.patch > > > There are some problems with the {{ws/v1/cluster/scheduler}} output of the > ResourceManager: > Even though the pendingContainers field is > [documented|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API] > in the FairScheduler output, it is > [missing|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java] > from the actual output. In case of the CapacityScheduler this field > [exists|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CapacitySchedulerQueueInfo.java], > but it is missing from the documentation. Let's fix both cases! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
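For reference, a hedged sketch of the kind of DAO change the issue describes for the FairScheduler REST output (this is not the committed patch; the QueueMetrics accessor name is an assumption):
{code:java}
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlRootElement;

import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;

// Sketch only -- illustrates exposing the already-tracked pendingContainers
// metric in the REST DAO; getPendingContainers() on QueueMetrics is assumed.
@XmlRootElement
@XmlAccessorType(XmlAccessType.FIELD)
public class FairSchedulerQueueInfoSketch {
  private int pendingContainers;

  public FairSchedulerQueueInfoSketch() {
    // JAXB requires a no-arg constructor
  }

  public FairSchedulerQueueInfoSketch(QueueMetrics metrics) {
    this.pendingContainers = metrics.getPendingContainers();
  }

  public int getPendingContainers() {
    return pendingContainers;
  }
}
{code}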
[jira] [Updated] (YARN-10286) PendingContainers bugs in the scheduler outputs
[ https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Gyori updated YARN-10286: Attachment: YARN-10286.branch-3.3.001.patch > PendingContainers bugs in the scheduler outputs > --- > > Key: YARN-10286 > URL: https://issues.apache.org/jira/browse/YARN-10286 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Andras Gyori >Priority: Critical > Fix For: 3.4.0 > > Attachments: YARN-10286.001.patch, YARN-10286.branch-3.2.001.patch, > YARN-10286.branch-3.3.001.patch > > > There are some problems with the {{ws/v1/cluster/scheduler}} output of the > ResourceManager: > Even though the pendingContainers field is > [documented|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API] > in the FairScheduler output, it is > [missing|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java] > from the actual output. In case of the CapacityScheduler this field > [exists|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CapacitySchedulerQueueInfo.java], > but it is missing from the documentation. Let's fix both cases! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10286) PendingContainers bugs in the scheduler outputs
[ https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Gyori updated YARN-10286: Attachment: YARN-10286.branch-3.2.001.patch > PendingContainers bugs in the scheduler outputs > --- > > Key: YARN-10286 > URL: https://issues.apache.org/jira/browse/YARN-10286 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Andras Gyori >Priority: Critical > Fix For: 3.4.0 > > Attachments: YARN-10286.001.patch, YARN-10286.branch-3.2.001.patch, > YARN-10286.branch-3.3.001.patch > > > There are some problems with the {{ws/v1/cluster/scheduler}} output of the > ResourceManager: > Even though the pendingContainers field is > [documented|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API] > in the FairScheduler output, it is > [missing|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java] > from the actual output. In case of the CapacityScheduler this field > [exists|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CapacitySchedulerQueueInfo.java], > but it is missing from the documentation. Let's fix both cases! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
[ https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123851#comment-17123851 ] Hudson commented on YARN-10284: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18314 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18314/]) YARN-10284. Add lazy initialization of (snemeth: rev aa6d13455b9435fb6c6d8f942c2b278dfada8f0c) * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/webapp/TestLogServlet.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/LogServlet.java > Add lazy initialization of LogAggregationFileControllerFactory in LogServlet > > > Key: YARN-10284 > URL: https://issues.apache.org/jira/browse/YARN-10284 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, yarn >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10284.001.patch, YARN-10284.002.patch, > YARN-10284.003.patch, YARN-10284.004.patch, YARN-10284.branch-3.3.001.patch > > > Suppose the {{mapred}} user has no access to the remote folder. Pinging the > JHS if it's online in every few seconds will produce the following entry in > the log: > {noformat} > 2020-05-19 00:17:20,331 WARN > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController: > Unable to determine if the filesystem supports append operation > java.nio.file.AccessDeniedException: test-bucket: > org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role > for the group(s) associated with the authenticated user. (user: mapred) > at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204) > [...] > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.(LogAggregationFileControllerFactory.java:139) > at > org.apache.hadoop.yarn.server.webapp.LogServlet.(LogServlet.java:66) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.(HsWebServices.java:99) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance() > at > com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40) > [...] > at > org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938) > at java.lang.Thread.run(Thread.java:748) > {noformat} > We should only create the {{LogAggregationFactory}} instance when we actually > need it, not every time the {{LogServlet}} object is instantiated (so > definitely not in the constructor). In this way we prevent pressure on the > S3A auth side, especially if the authentication request is a costly operation. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
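As a general illustration of the lazy-initialization idea described above, a sketch under the assumption (taken from the quoted stack trace) that the factory is constructed from a Configuration; this is not the committed LogServlet change, and the holder class and method names here are made up:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory;

// Generic lazy-initialization sketch -- not the committed change.
public class LazyFactoryHolder {
  private final Configuration conf;
  private volatile LogAggregationFileControllerFactory factory;

  public LazyFactoryHolder(Configuration conf) {
    // No factory creation (and therefore no remote-filesystem/auth calls) here.
    this.conf = conf;
  }

  // Double-checked locking: the factory is built on the first request that
  // actually needs it, so an inaccessible remote log dir no longer produces
  // warnings on every liveness ping of the servlet.
  public LogAggregationFileControllerFactory getOrCreateFactory() {
    if (factory == null) {
      synchronized (this) {
        if (factory == null) {
          factory = new LogAggregationFileControllerFactory(conf);
        }
      }
    }
    return factory;
  }
}
{code}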
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123844#comment-17123844 ] Benjamin Teke commented on YARN-10295: -- Test issues are related to YARN-10249 not this patch. > CapacityScheduler NPE can cause apps to get stuck without resources > --- > > Key: YARN-10295 > URL: https://issues.apache.org/jira/browse/YARN-10295 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.1.0, 3.2.0 >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Attachments: YARN-10295.001.branch-3.1.patch, > YARN-10295.001.branch-3.2.patch > > > When the CapacityScheduler Asynchronous scheduling is enabled there is an > edge-case where a NullPointerException can cause the scheduler thread to exit > and the apps to get stuck without allocated resources. Consider the following > log: > {code:java} > 2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp > (FiCaSchedulerApp.java:apply(681)) - Reserved > container=container_e10_1590502305306_0660_01_000115, on node=host: > ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 > available= used= with > resource= > 2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp > (FiCaSchedulerApp.java:internalUnreserve(743)) - Application > application_1590502305306_0660 unreserved on node host: > ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 > available= used=, currently > has 0 at priority 11; currentReservation on node-label= > 2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler > (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted > 2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler > (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread > Thread[Thread-4953,5,main] threw an Exception. > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593) > {code} > A container gets allocated on a host, but the host doesn't have enough > memory, so after a short while it gets unreserved. However because the > scheduler thread is running asynchronously it might have entered into the > following if block located in > [CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602], > because at the time _node.getReservedContainer()_ wasn't null. Calling it a > second time for getting the ApplicationAttemptId would be an NPE, as the > container got unreserved in the meantime. 
> {code:java} > // Do not schedule if there are any reservations to fulfill on the node > if (node.getReservedContainer() != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Skipping scheduling since node " + node.getNodeID() > + " is reserved by application " + node.getReservedContainer() > .getContainerId().getApplicationAttemptId()); > } > return null; > } > {code} > A fix would be to store the container object before the if block. > Only branch-3.1/3.2 is affected, because the newer branches have YARN-9664 > which indirectly fixed this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
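A sketch of the fix described above, based on the quoted snippet: reading the reserved container into a local variable once means a concurrent unreserve between the null check and the use can no longer trigger the NPE (the surrounding method is omitted).
{code:java}
// Do not schedule if there are any reservations to fulfill on the node.
// Read the reserved container once so a concurrent unreserve cannot make
// a second read return null.
RMContainer reservedContainer = node.getReservedContainer();
if (reservedContainer != null) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping scheduling since node " + node.getNodeID()
        + " is reserved by application " + reservedContainer
        .getContainerId().getApplicationAttemptId());
  }
  return null;
}
{code}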
[jira] [Updated] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
[ https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Antal updated YARN-10284: -- Attachment: YARN-10284.branch-3.3.001.patch > Add lazy initialization of LogAggregationFileControllerFactory in LogServlet > > > Key: YARN-10284 > URL: https://issues.apache.org/jira/browse/YARN-10284 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, yarn >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10284.001.patch, YARN-10284.002.patch, > YARN-10284.003.patch, YARN-10284.004.patch, YARN-10284.branch-3.3.001.patch > > > Suppose the {{mapred}} user has no access to the remote folder. Pinging the > JHS if it's online in every few seconds will produce the following entry in > the log: > {noformat} > 2020-05-19 00:17:20,331 WARN > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController: > Unable to determine if the filesystem supports append operation > java.nio.file.AccessDeniedException: test-bucket: > org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role > for the group(s) associated with the authenticated user. (user: mapred) > at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204) > [...] > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.(LogAggregationFileControllerFactory.java:139) > at > org.apache.hadoop.yarn.server.webapp.LogServlet.(LogServlet.java:66) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.(HsWebServices.java:99) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance() > at > com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40) > [...] > at > org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938) > at java.lang.Thread.run(Thread.java:748) > {noformat} > We should only create the {{LogAggregationFactory}} instance when we actually > need it, not every time the {{LogServlet}} object is instantiated (so > definitely not in the constructor). In this way we prevent pressure on the > S3A auth side, especially if the authentication request is a costly operation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
[ https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123841#comment-17123841 ] Adam Antal commented on YARN-10284: --- Thanks for committing this [~snemeth]. Uploaded patch for branch 3.3. > Add lazy initialization of LogAggregationFileControllerFactory in LogServlet > > > Key: YARN-10284 > URL: https://issues.apache.org/jira/browse/YARN-10284 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, yarn >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10284.001.patch, YARN-10284.002.patch, > YARN-10284.003.patch, YARN-10284.004.patch, YARN-10284.branch-3.3.001.patch > > > Suppose the {{mapred}} user has no access to the remote folder. Pinging the > JHS if it's online in every few seconds will produce the following entry in > the log: > {noformat} > 2020-05-19 00:17:20,331 WARN > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController: > Unable to determine if the filesystem supports append operation > java.nio.file.AccessDeniedException: test-bucket: > org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role > for the group(s) associated with the authenticated user. (user: mapred) > at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204) > [...] > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.(LogAggregationFileControllerFactory.java:139) > at > org.apache.hadoop.yarn.server.webapp.LogServlet.(LogServlet.java:66) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.(HsWebServices.java:99) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance() > at > com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40) > [...] > at > org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938) > at java.lang.Thread.run(Thread.java:748) > {noformat} > We should only create the {{LogAggregationFactory}} instance when we actually > need it, not every time the {{LogServlet}} object is instantiated (so > definitely not in the constructor). In this way we prevent pressure on the > S3A auth side, especially if the authentication request is a costly operation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123839#comment-17123839 ] Hadoop QA commented on YARN-10293: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 41s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 7s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 50s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 52s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 0s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 1s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 57s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 52s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 52s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 32s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 99 unchanged - 0 fixed = 100 total (was 99) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 48s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 51s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 93m 28s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 35s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}162m 26s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler | | | hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26100/artifact/out/Dockerfile | | JIRA Issue | YARN-10293 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004610/YARN-10293-003-WIP.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 3352849ee112 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk /
[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
[ https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123830#comment-17123830 ] Szilard Nemeth commented on YARN-10284: --- Hi [~adam.antal], Thanks for working on this. LGTM, committed to trunk. Thanks [~gandras] for the review. [~adam.antal] Do you plan to backport this to 3.3 or any older branches? Thanks. > Add lazy initialization of LogAggregationFileControllerFactory in LogServlet > > > Key: YARN-10284 > URL: https://issues.apache.org/jira/browse/YARN-10284 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, yarn >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Attachments: YARN-10284.001.patch, YARN-10284.002.patch, > YARN-10284.003.patch, YARN-10284.004.patch > > > Suppose the {{mapred}} user has no access to the remote folder. Pinging the > JHS if it's online in every few seconds will produce the following entry in > the log: > {noformat} > 2020-05-19 00:17:20,331 WARN > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController: > Unable to determine if the filesystem supports append operation > java.nio.file.AccessDeniedException: test-bucket: > org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role > for the group(s) associated with the authenticated user. (user: mapred) > at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204) > [...] > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.(LogAggregationFileControllerFactory.java:139) > at > org.apache.hadoop.yarn.server.webapp.LogServlet.(LogServlet.java:66) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.(HsWebServices.java:99) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance() > at > com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40) > [...] > at > org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938) > at java.lang.Thread.run(Thread.java:748) > {noformat} > We should only create the {{LogAggregationFactory}} instance when we actually > need it, not every time the {{LogServlet}} object is instantiated (so > definitely not in the constructor). In this way we prevent pressure on the > S3A auth side, especially if the authentication request is a costly operation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
[ https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10284: -- Fix Version/s: 3.4.0 > Add lazy initialization of LogAggregationFileControllerFactory in LogServlet > > > Key: YARN-10284 > URL: https://issues.apache.org/jira/browse/YARN-10284 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, yarn >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10284.001.patch, YARN-10284.002.patch, > YARN-10284.003.patch, YARN-10284.004.patch > > > Suppose the {{mapred}} user has no access to the remote folder. Pinging the > JHS if it's online in every few seconds will produce the following entry in > the log: > {noformat} > 2020-05-19 00:17:20,331 WARN > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController: > Unable to determine if the filesystem supports append operation > java.nio.file.AccessDeniedException: test-bucket: > org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role > for the group(s) associated with the authenticated user. (user: mapred) > at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204) > [...] > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.(LogAggregationFileControllerFactory.java:139) > at > org.apache.hadoop.yarn.server.webapp.LogServlet.(LogServlet.java:66) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.(HsWebServices.java:99) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance() > at > com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40) > [...] > at > org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938) > at java.lang.Thread.run(Thread.java:748) > {noformat} > We should only create the {{LogAggregationFactory}} instance when we actually > need it, not every time the {{LogServlet}} object is instantiated (so > definitely not in the constructor). In this way we prevent pressure on the > S3A auth side, especially if the authentication request is a costly operation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10286) PendingContainers bugs in the scheduler outputs
[ https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123827#comment-17123827 ] Hudson commented on YARN-10286: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18313 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18313/]) YARN-10286. PendingContainers bugs in the scheduler outputs. Contributed (snemeth: rev e0a0741ac86dcc4f98ec2f9739b70b3697a4d0c0) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md > PendingContainers bugs in the scheduler outputs > --- > > Key: YARN-10286 > URL: https://issues.apache.org/jira/browse/YARN-10286 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Andras Gyori >Priority: Critical > Fix For: 3.4.0 > > Attachments: YARN-10286.001.patch > > > There are some problems with the {{ws/v1/cluster/scheduler}} output of the > ResourceManager: > Even though the pendingContainers field is > [documented|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API] > in the FairScheduler output, it is > [missing|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java] > from the actual output. In case of the CapacityScheduler this field > [exists|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CapacitySchedulerQueueInfo.java], > but it is missing from the documentation. Let's fix both cases! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10286) PendingContainers bugs in the scheduler outputs
[ https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123798#comment-17123798 ] Szilard Nemeth commented on YARN-10286: --- Hi [~gandras], Thanks for working on this. Patch LGTM, committed to trunk. Thanks [~adam.antal] for the review. [~gandras] Can you please assess whether branch-3.3 and branch-3.2 needs this fix as well? Thanks > PendingContainers bugs in the scheduler outputs > --- > > Key: YARN-10286 > URL: https://issues.apache.org/jira/browse/YARN-10286 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Andras Gyori >Priority: Critical > Fix For: 3.4.0 > > Attachments: YARN-10286.001.patch > > > There are some problems with the {{ws/v1/cluster/scheduler}} output of the > ResourceManager: > Even though the pendingContainers field is > [documented|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API] > in the FairScheduler output, it is > [missing|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java] > from the actual output. In case of the CapacityScheduler this field > [exists|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CapacitySchedulerQueueInfo.java], > but it is missing from the documentation. Let's fix both cases! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10286) PendingContainers bugs in the scheduler outputs
[ https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10286: -- Fix Version/s: 3.4.0 > PendingContainers bugs in the scheduler outputs > --- > > Key: YARN-10286 > URL: https://issues.apache.org/jira/browse/YARN-10286 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Andras Gyori >Priority: Critical > Fix For: 3.4.0 > > Attachments: YARN-10286.001.patch > > > There are some problems with the {{ws/v1/cluster/scheduler}} output of the > ResourceManager: > Even though the pendingContainers field is > [documented|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API] > in the FairScheduler output, it is > [missing|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java] > from the actual output. In case of the CapacityScheduler this field > [exists|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CapacitySchedulerQueueInfo.java], > but it is missing from the documentation. Let's fix both cases! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query
[ https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123766#comment-17123766 ] Hadoop QA commented on YARN-10304: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 3m 21s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 3m 28s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 31m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 18m 2s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 50s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 13s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 20m 58s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 1s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 19s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 0s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 23s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 17m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 17m 18s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 2m 51s{color} | {color:orange} root: The patch generated 8 new + 11 unchanged - 0 fixed = 19 total (was 11) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 25s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 25s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 3s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 40s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 47s{color} | {color:green} hadoop-mapreduce-client-hs in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 47s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}143m 33s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26099/artifact/out/Dockerfile | | JIRA Issue | YARN-10304 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004561/YARN-10304.001.patch | | Optional Tests | dupname
[jira] [Commented] (YARN-10274) Merge QueueMapping and QueueMappingEntity
[ https://issues.apache.org/jira/browse/YARN-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123751#comment-17123751 ] Hadoop QA commented on YARN-10274: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 2m 13s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 41s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 39s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 1s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 19m 29s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 40s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 38s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 31s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 5 new + 64 unchanged - 0 fixed = 69 total (was 64) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 33s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 10s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 93m 32s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}166m 56s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler | | | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26098/artifact/out/Dockerfile | | JIRA Issue | YARN-10274 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004605/YARN-10274.002.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 55b270dd07a7 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool |
[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123686#comment-17123686 ] Tao Yang commented on YARN-10293: - Hi, [~prabhujoseph], [~wangda] This problem is similar to YARN-9598, which was in dispute, so there has been no further progress. In my opinion, YARN-9598 and this issue may be just parts of the reservation problems; it's better to refactor the reservation logic to be compatible with the scheduling framework, which has been updated a lot by the global scheduler, especially the multi-node lookup mechanism. At least we should rethink all the referenced logic in the scheduling cycle to have a more complete solution for the current reservation handling. Thoughts? > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement (YARN-10259) > > > Key: YARN-10293 > URL: https://issues.apache.org/jira/browse/YARN-10293 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10293-001.patch, YARN-10293-002.patch, > YARN-10293-003-WIP.patch > > > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues > related to it > https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987 > Have found one more bug in the CapacityScheduler.java code which causes the > same issue with slight difference in the repro. > *Repro:* > *Nodes : Available : Used* > Node1 - 8GB, 8vcores - 8GB. 8cores > Node2 - 8GB, 8vcores - 8GB. 8cores > Node3 - 8GB, 8vcores - 8GB. 8cores > Queues -> A and B both 50% capacity, 100% max capacity > MultiNode enabled + Preemption enabled > 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores > 2. JobB Submitted to B queue with AM size of 1GB > {code} > 2020-05-21 12:12:27,313 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest > IP=172.27.160.139 OPERATION=Submit Application Request > TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 > CALLERCONTEXT=CLI QUEUENAME=dummy > {code} > 3. Preemption happens and used capacity is lesser than 1.0f > {code} > 2020-05-21 12:12:48,222 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: > Non-AM container preempted, current > appAttemptId=appattempt_1590046667304_0004_01, > containerId=container_e09_1590046667304_0004_01_24, > resource= > {code} > 4. JobB gets a Reserved Container as part of > CapacityScheduler#allocateOrReserveNewContainer > {code} > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to > RESERVED > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > {code} > *Why RegularContainerAllocator reserved the container when the used capacity > is <= 1.0f ?* > {code} > The reason is even though the container is preempted - nodemanager has to > stop the container and heartbeat and update the available and unallocated > resources to ResourceManager. > {code} > 5. 
[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123652#comment-17123652 ] Prabhu Joseph commented on YARN-10293: -- Have attached a patch that removes the if condition. Will do some functional testing in a test cluster. Thanks.
[jira] [Updated] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10293: - Attachment: YARN-10293-003-WIP.patch > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement (YARN-10259) > > > Key: YARN-10293 > URL: https://issues.apache.org/jira/browse/YARN-10293 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10293-001.patch, YARN-10293-002.patch, > YARN-10293-003-WIP.patch > > > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues > related to it > https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987 > Have found one more bug in the CapacityScheduler.java code which causes the > same issue with slight difference in the repro. > *Repro:* > *Nodes : Available : Used* > Node1 - 8GB, 8vcores - 8GB. 8cores > Node2 - 8GB, 8vcores - 8GB. 8cores > Node3 - 8GB, 8vcores - 8GB. 8cores > Queues -> A and B both 50% capacity, 100% max capacity > MultiNode enabled + Preemption enabled > 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores > 2. JobB Submitted to B queue with AM size of 1GB > {code} > 2020-05-21 12:12:27,313 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest > IP=172.27.160.139 OPERATION=Submit Application Request > TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 > CALLERCONTEXT=CLI QUEUENAME=dummy > {code} > 3. Preemption happens and used capacity is lesser than 1.0f > {code} > 2020-05-21 12:12:48,222 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: > Non-AM container preempted, current > appAttemptId=appattempt_1590046667304_0004_01, > containerId=container_e09_1590046667304_0004_01_24, > resource= > {code} > 4. JobB gets a Reserved Container as part of > CapacityScheduler#allocateOrReserveNewContainer > {code} > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to > RESERVED > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > {code} > *Why RegularContainerAllocator reserved the container when the used capacity > is <= 1.0f ?* > {code} > The reason is even though the container is preempted - nodemanager has to > stop the container and heartbeat and update the available and unallocated > resources to ResourceManager. > {code} > 5. Now, no new allocation happens and reserved container stays at reserved. > After reservation the used capacity becomes 1.0f, below will be in a loop and > no new allocate or reserve happens. The reserved container cannot be > allocated as reserved node does not have space. node2 has space for 1GB, > 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting > called causing the Hang. 
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> > CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container > on node* > {code} > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Trying to fulfill reservation for application application_1590046667304_0005 > on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > assignContainers: partition= #applications=1 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > 2020-05-21 12:13:33,243 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Allocation proposal accepted > {code} > CapacityScheduler#allocateOrReserveNewContainers won't be called as below > check in allocateContainersOnMultiNodes fails > {code} > if
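To make the reported hang easier to follow, below is a rough, from-memory paraphrase of the kind of check being referred to in CapacityScheduler#allocateContainersOnMultiNodes; the exact condition, field names and signatures are illustrative and should be verified against trunk. Once the reservation itself drives the partition's used capacity to 1.0f, the early return keeps the scheduler from ever reaching allocateOrReserveNewContainers, so node2's free 1GB/1vcore is never tried and the reserved container is simply re-reserved on the full node. Per the comment above, the attached patch removes this condition.
{code}
// Illustrative paraphrase only -- not the exact trunk source of
// CapacityScheduler#allocateContainersOnMultiNodes.
private CSAssignment allocateContainersOnMultiNodes(
    CandidateNodeSet<FiCaSchedulerNode> candidates) {
  // The reservation itself pushes used capacity to 1.0f, so this guard
  // short-circuits every subsequent scheduling attempt: only the existing
  // reservation is retried, and it can never fit on the reserved node.
  if (getRootQueue().getQueueCapacities()
          .getUsedCapacity(candidates.getPartition()) >= 1.0f
      && preemptionManager.getKillableResource(
          CapacitySchedulerConfiguration.ROOT, candidates.getPartition())
          .equals(Resources.none())) {
    return null;
  }
  // Never reached in the scenario above, so the other candidate nodes
  // (e.g. node2 with 1GB/1vcore free) are never offered the request.
  return allocateOrReserveNewContainers(candidates, false);
}
{code}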
[jira] [Updated] (YARN-10304) Create an endpoint for remote application log directory path query
[ https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Antal updated YARN-10304: -- Parent: YARN-10025 Issue Type: Sub-task (was: Improvement) > Create an endpoint for remote application log directory path query > -- > > Key: YARN-10304 > URL: https://issues.apache.org/jira/browse/YARN-10304 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10304.001.patch > > > The logic of the aggregated log directory path determination (currently based > on configuration) is scattered around the codebase and duplicated multiple > times. By providing a separate class for creating the path for a specific > user, it allows for an abstraction over this logic. This could be used in > place of the previously duplicated logic, moreover, we could provide an > endpoint to query this path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10274) Merge QueueMapping and QueueMappingEntity
[ https://issues.apache.org/jira/browse/YARN-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gergely Pollak updated YARN-10274: -- Attachment: YARN-10274.002.patch > Merge QueueMapping and QueueMappingEntity > - > > Key: YARN-10274 > URL: https://issues.apache.org/jira/browse/YARN-10274 > Project: Hadoop YARN > Issue Type: Task > Components: yarn >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Attachments: YARN-10274.001.patch, YARN-10274.002.patch > > > The role, usage and internal behaviour of these classes are almost identical, > but it makes no sense to keep both of them. One is used by UserGroup > placement rule definitions the other is used by Application placement rules. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10305) Lost system-credentials when restarting RM
[ https://issues.apache.org/jira/browse/YARN-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123589#comment-17123589 ] Hadoop QA commented on YARN-10305: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 19s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 54s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 39s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 1s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 18m 47s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 18s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 14s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 43s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 3 new + 94 unchanged - 0 fixed = 97 total (was 94) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 18m 15s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 22s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}101m 32s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 34s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}175m 36s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler | | | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26097/artifact/out/Dockerfile | | JIRA Issue | YARN-10305 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004564/YARN-10305.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux db6d91116414 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool |
[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
[ https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123426#comment-17123426 ] Andras Gyori commented on YARN-10284: - Last patch LGTM (non-binding), I like the caching approach +1. > Add lazy initialization of LogAggregationFileControllerFactory in LogServlet > > > Key: YARN-10284 > URL: https://issues.apache.org/jira/browse/YARN-10284 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, yarn >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Attachments: YARN-10284.001.patch, YARN-10284.002.patch, > YARN-10284.003.patch, YARN-10284.004.patch > > > Suppose the {{mapred}} user has no access to the remote folder. Pinging the > JHS if it's online in every few seconds will produce the following entry in > the log: > {noformat} > 2020-05-19 00:17:20,331 WARN > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController: > Unable to determine if the filesystem supports append operation > java.nio.file.AccessDeniedException: test-bucket: > org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role > for the group(s) associated with the authenticated user. (user: mapred) > at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204) > [...] > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.(LogAggregationFileControllerFactory.java:139) > at > org.apache.hadoop.yarn.server.webapp.LogServlet.(LogServlet.java:66) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.(HsWebServices.java:99) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance() > at > com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40) > [...] > at > org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938) > at java.lang.Thread.run(Thread.java:748) > {noformat} > We should only create the {{LogAggregationFactory}} instance when we actually > need it, not every time the {{LogServlet}} object is instantiated (so > definitely not in the constructor). In this way we prevent pressure on the > S3A auth side, especially if the authentication request is a costly operation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
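For readers skimming the thread, a minimal sketch of the lazy, cached initialization being discussed is shown below. It assumes a simplified LogServlet constructor and field names of its own; it illustrates the approach, not the content of the 004 patch.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory;

public class LogServlet {
  private final Configuration conf;
  // No longer created in the constructor, so merely instantiating the
  // servlet does not touch the remote filesystem (and cannot fail on S3A auth).
  private volatile LogAggregationFileControllerFactory factory;

  public LogServlet(Configuration conf) {
    this.conf = conf;
  }

  // The first real log request pays the (possibly failing) initialization
  // cost; later requests reuse the cached instance.
  private LogAggregationFileControllerFactory getOrCreateFactory() {
    LogAggregationFileControllerFactory f = factory;
    if (f == null) {
      synchronized (this) {
        f = factory;
        if (f == null) {
          f = new LogAggregationFileControllerFactory(conf);
          factory = f;
        }
      }
    }
    return f;
  }
}
{code}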
[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query
[ https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123405#comment-17123405 ] Andras Gyori commented on YARN-10304: - An early draft patch has been uploaded that provides the foundation of the logic. For now, it only contains the bare minimum of the path creation, without involving any bucket- or group-based naming convention. Whether to include anything more is up for discussion. > Create an endpoint for remote application log directory path query > -- > > Key: YARN-10304 > URL: https://issues.apache.org/jira/browse/YARN-10304 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10304.001.patch > > > The logic of the aggregated log directory path determination (currently based > on configuration) is scattered around the codebase and duplicated multiple > times. By providing a separate class for creating the path for a specific > user, it allows for an abstraction over this logic. This could be used in > place of the previously duplicated logic, moreover, we could provide an > endpoint to query this path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
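As a rough illustration of the abstraction proposed here (class and method names below are hypothetical, not the ones in the 001 patch), the scattered path logic could be centralised along these lines, reusing the existing remote-log-dir configuration keys and ignoring bucket- or group-based layouts for now.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hypothetical helper: one place that knows how the per-user aggregated
// log directory is derived from configuration.
public class RemoteAppLogDirPathBuilder {
  private final Path remoteRootLogDir;
  private final String suffix;

  public RemoteAppLogDirPathBuilder(Configuration conf) {
    this.remoteRootLogDir = new Path(conf.get(
        YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
        YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR));
    this.suffix = conf.get(
        YarnConfiguration.NM_REMOTE_APP_LOG_DIR_SUFFIX,
        YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR_SUFFIX);
  }

  // {remoteRootLogDir}/{user}/{suffix} -- the bare minimum described in the
  // comment above; an endpoint could simply return this path.
  public Path getRemoteLogDirForUser(String user) {
    return new Path(new Path(remoteRootLogDir, user), suffix);
  }
}
{code}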
[jira] [Updated] (YARN-10304) Create an endpoint for remote application log directory path query
[ https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Gyori updated YARN-10304: Attachment: YARN-10304.001.patch > Create an endpoint for remote application log directory path query > -- > > Key: YARN-10304 > URL: https://issues.apache.org/jira/browse/YARN-10304 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10304.001.patch > > > The logic of the aggregated log directory path determination (currently based > on configuration) is scattered around the codebase and duplicated multiple > times. By providing a separate class for creating the path for a specific > user, it allows for an abstraction over this logic. This could be used in > place of the previously duplicated logic, moreover, we could provide an > endpoint to query this path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
[ https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123401#comment-17123401 ] Hadoop QA commented on YARN-10284: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 18s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 41s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 11s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 15s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 15s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 30s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 64m 53s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26096/artifact/out/Dockerfile | | JIRA Issue | YARN-10284 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004558/YARN-10284.004.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 244f95bd5d3a 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 9fe4c37c25b | | Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/26096/testReport/ | | Max. process+thread count | 314 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common U:
[jira] [Created] (YARN-10305) Lost system-credentials when restarting RM
kyungwan nam created YARN-10305: --- Summary: Lost system-credentials when restarting RM Key: YARN-10305 URL: https://issues.apache.org/jira/browse/YARN-10305 Project: Hadoop YARN Issue Type: Bug Reporter: kyungwan nam Assignee: kyungwan nam System-credentials were introduced in YARN-2704 to keep long-running apps working. I've met a situation where the system-credentials are lost when restarting the RM. After that, if an app's AM is stopped, restarting the AM fails because the NMs no longer have the HDFS delegation token that is needed for resource localization. The app has a couple of delegation tokens, including a timeline-server token and an HDFS delegation token. When restarting the RM, the RM will request a new HDFS delegation token for an app that was submitted long ago (this was fixed by YARN-5098). But if an app has several delegation tokens and an exception occurs for the token processed first, the remaining tokens are not processed. I think that is why the system-credentials are lost. Here are the RM's logs at the time of restarting the RM. {code} 2020-05-19 14:25:05,712 WARN security.DelegationTokenRenewer (DelegationTokenRenewer.java:handleDTRenewerAppRecoverEvent(955)) - Unable to add the application to the delegation token renewer on recovery. java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, Service: 10.1.1.1:8190, Ident: (TIMELINE_DELEGATION_TOKEN owner=test-admin, renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, sequenceNumber=2193, masterKeyId=340) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:503) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: HTTP status [403], message [org.apache.hadoop.security.token.SecretManager$InvalidToken: yarn tried to renew an expired token (TIMELINE_DELEGATION_TOKEN owner=test-admin, renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, sequenceNumber=2193, masterKeyId=340) max expiration date: 2020-04-16 10:26:03,258+0900 currentTime: 2020-05-19 14:25:05,700+0900] at org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:166) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:319) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:235) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:437) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:247) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:227) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) at org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientRetryOpForOperateDelegationToken.run(TimelineConnector.java:431) at org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientConnectionRetry.retryOn(TimelineConnector.java:334) at org.apache.hadoop.yarn.client.api.impl.TimelineConnector.operateDelegationToken(TimelineConnector.java:218) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:250) at org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81) at org.apache.hadoop.security.token.Token.renew(Token.java:512) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:629) at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:626) at java.security.AccessController.doPrivileged(Native Method) at
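A minimal sketch of the fix idea implied by the report: renew each recovered token independently, so that one token that can no longer be renewed (here the timeline token past its max lifetime) does not stop the remaining tokens, such as the HDFS delegation token, from being processed. The helper class below is hypothetical and only illustrates the control flow, not the actual DelegationTokenRenewer change.
{code}
import java.io.IOException;
import java.util.Collection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.token.Token;

public final class TokenRecoverySketch {
  /** Renew every token we can; skip the ones that fail instead of aborting. */
  public static void renewAll(Collection<Token<?>> tokens, Configuration conf) {
    for (Token<?> token : tokens) {
      try {
        token.renew(conf);   // throws for tokens that are already expired
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();   // preserve interrupt status
        return;
      } catch (IOException ioe) {
        // Log and continue so the HDFS delegation token is still handled
        // even if the timeline token could not be renewed.
        System.err.println("Skipping token " + token.getKind() + ": " + ioe);
      }
    }
  }
}
{code}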
[jira] [Created] (YARN-10304) Create an endpoint for remote application log directory path query
Andras Gyori created YARN-10304: --- Summary: Create an endpoint for remote application log directory path query Key: YARN-10304 URL: https://issues.apache.org/jira/browse/YARN-10304 Project: Hadoop YARN Issue Type: Improvement Reporter: Andras Gyori Assignee: Andras Gyori The logic of the aggregated log directory path determination (currently based on configuration) is scattered around the codebase and duplicated multiple times. By providing a separate class for creating the path for a specific user, it allows for an abstraction over this logic. This could be used in place of the previously duplicated logic, moreover, we could provide an endpoint to query this path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org