[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-02 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124527#comment-17124527
 ] 

Tao Yang commented on YARN-10293:
-

Thanks [~wangda] for your confirmation.
I think the proposed change can solve the problem for heartbeat-driven 
scheduling but not async scheduling, since it may still keep in a loop that 
chooses the first one of candidate nodes then do re-reservation as mentioned in 
YARN-9598.
However, if what we want for this issue is just to fix this problem for 
heartbeat-driven scenarios, and later will have a more complete solution, the 
change is fine to me for now. In our internal version, we already remove this 
check to support allocating OPPORTUNISTIC containers in the main scheduling 
process.

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and reserved container stays at reserved.
> After reservation the used capacity becomes 1.0f, below will be in a loop and 
> no new allocate or reserve happens. The reserved container cannot be 
> allocated as reserved node does not have space. node2 has space for 1GB, 
> 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting 
> called causing the Hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> 

[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124511#comment-17124511
 ] 

Hadoop QA commented on YARN-10300:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
50s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
18m 37s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m  
9s{color} | {color:blue} Used deprecated FindBugs config; considering switching 
to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
5s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 14s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
4s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 52m 43s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
39s{color} | {color:red} The patch generated 17 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}126m 31s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerOvercommit |
|   | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26105/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10300 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13004661/YARN-10300.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 27ca2cb7f6cd 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 
10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 6288e15118f |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| 

[jira] [Commented] (YARN-10302) Support custom packing algorithm for FairScheduler

2020-06-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124480#comment-17124480
 ] 

Hadoop QA commented on YARN-10302:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
42s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 30m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 11s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
36s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
41s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
39s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 50s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
40s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 87m 45s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}158m 51s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26104/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10302 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13004658/YARN-10302-trunk.001.patch
 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 737e05239109 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 6288e15118f |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| unit | 

[jira] [Commented] (YARN-9903) Support reservations continue looking for Node Labels

2020-06-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124478#comment-17124478
 ] 

Hadoop QA commented on YARN-9903:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
29s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 43s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
39s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
38s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m  1s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
45s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 87m 22s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
36s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}151m 29s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26103/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-9903 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13004653/YARN-9903.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux e1f75a293e55 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 
10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 6288e15118f |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| unit | 

[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-02 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124461#comment-17124461
 ] 

Eric Badger commented on YARN-10300:


Added some null checks to patch 002

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch, YARN-10300.002.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-02 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10300:
---
Attachment: YARN-10300.002.patch

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch, YARN-10300.002.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-02 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124450#comment-17124450
 ] 

Eric Badger commented on YARN-10300:


bq. Eric Badger can you explain under what circumstances 
attempt.getMasterContainer().getNodeId().getHost() will succeed where 
attempt.getHost() fails?

{{attempt.getHost()}} grabs {{host}} within RMAppAttemptImpl.java. This is set 
during {{AMRegisteredTransition}}, so it will be "N/A" until the AM registers 
with the NM during the first heartbeat. 

{{attempt.getMasterContainer()}} grabs {{masterContainer}} within 
RMAppAttemptImpl.java. This is set during {{AMContainerAllocatedTransition}}. 

So from the time between when the container is allocated until the time of the 
first AM heartbeat, {{masterContainer}} will be set, but {{host}} will be "N/A"

bq. Also, do we need to add some null checks for these? getMasterContainer(), 
getNodeId(), getHost()?
Probably wouldn't hurt. 

bq. Note that the attempt.getHost() defaults to "N/A" before it is set - what 
do we get if NodeID().getHost() isn't valid yet? Is that even a possibility?
I don't know if it's possible to be invalid at the start or not. The 
{{Container}} is going to be created via {{newInstance}}, which requires a 
{{NodeId}} as a parameter. But that could potentially be sent in as null. But I 
think it will either be the correct nodeId or will be null, which I can 
interpret as "N/A". There are so many places that containers are instantiated 
in the scheduler that it'd be pretty tough to see if all of the cases have the 
nodeID set initially. 

I can add in the null checks and default the string to "N/A" if any of them 
don't exist. 

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10283) Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used

2020-06-02 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124440#comment-17124440
 ] 

Jim Brennan commented on YARN-10283:


[~pbacsko] I downloaded your patch and your test case and verified that I see 
the same behavior as you.
reproWithoutNodeLabels fails unless I set minimum-allocation-mb=1024, and 
reproTestWithNodeLabels succeeds.

I have uploaded a patch for YARN-9903 based on an internal change that we've 
been running with for a long time.  I verified that with the YARN-9903 patch, I 
get the same results for your repro test cases as with the YARN-10283 patch.

Please feel free to pull in the additional changes from YARN-9903 into this 
patch.  I think it addresses the comment from [~tarunparimi].   The YARN-9903 
patch does not include your changes to FiCaSchedulerApp.  I'm not certain that 
change is needed, but it might be.

[~epayne], can you take a look as well?



> Capacity Scheduler: starvation occurs if a higher priority queue is full and 
> node labels are used
> -
>
> Key: YARN-10283
> URL: https://issues.apache.org/jira/browse/YARN-10283
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10283-POC01.patch, YARN-10283-ReproTest.patch, 
> YARN-10283-ReproTest2.patch
>
>
> Recently we've been investigating a scenario where applications submitted to 
> a lower priority queue could not get scheduled because a higher priority 
> queue in the same hierarchy could now satisfy the allocation request. Both 
> queue belonged to the same partition.
> If we disabled node labels, the problem disappeared.
> The problem is that {{RegularContainerAllocator}} always allocated a 
> container for the request, even if it should not have.
> *Example:*
> * Cluster total resources: 3 nodes, 15GB, 24 vcores (5GB / 8 vcore per node)
> * Partition "shared" was created with 2 nodes
> * "root.lowprio" (priority = 20) and "root.highprio" (priorty = 40) were 
> added to the partition
> * Both queues have a limit of 
> * Using DominantResourceCalculator
> Setup:
> Submit distributed shell application to highprio with switches 
> "-num_containers 3 -container_vcores 4". The memory allocation is 512MB per 
> container.
> Chain of events:
> 1. Queue is filled with contaners until it reaches usage  vCores:5>
> 2. A node update event is pushed to CS from a node which is part of the 
> partition
> 2. {{AbstractCSQueue.canAssignToQueue()}} returns true because it's smaller 
> than the current limit resource 
> 3. Then {{LeafQueue.assignContainers()}} runs successfully and gets an 
> allocated container for 
> 4. But we can't commit the resource request because we would have 9 vcores in 
> total, violating the limit.
> The problem is that we always try to assign container for the same 
> application in each heartbeat from "highprio". Applications in "lowprio" 
> cannot make progress.
> *Problem:*
> {{RegularContainerAllocator.assignContainer()}} does not handle this case 
> well. We only reject allocation if this condition is satisfied:
> {noformat}
>  if (rmContainer == null && reservationsContinueLooking
>   && node.getLabels().isEmpty()) {
> {noformat}
> But if we have node labels, we enter a different code path and succeed with 
> the allocation if there's room for a container.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-02 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124421#comment-17124421
 ] 

Jim Brennan commented on YARN-10300:


[~ebadger] can you explain under what circumstances  
{{attempt.getMasterContainer().getNodeId().getHost()}} will succeed where 
{{attempt.getHost()}} fails?
Also, do we need to add some null checks for these?  getMasterContainer(), 
getNodeId(), getHost()?
Note that the attempt.getHost() defaults to "N/A" before it is set - what do we 
get if NodeID().getHost() isn't valid yet?  Is that even a possibility?


> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10302) Support custom packing algorithm for FairScheduler

2020-06-02 Thread William W. Graham Jr (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124348#comment-17124348
 ] 

William W. Graham Jr commented on YARN-10302:
-

Done. Also corrected the checkstyles and findbugs errors flagged in 
https://github.com/apache/hadoop/pull/2044

> Support custom packing algorithm for FairScheduler
> --
>
> Key: YARN-10302
> URL: https://issues.apache.org/jira/browse/YARN-10302
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: William W. Graham Jr
>Priority: Major
> Attachments: YARN-10302-trunk.001.patch
>
>
> The {{FairScheduler}} class allocates containers to nodes based on the node 
> with the most available memory[0]. Create the ability to instead configure a 
> custom packing algorithm with different logic. For instance for effective 
> auto scaling, a bin packing algorithm might be a better choice.
> 0 - 
> https://github.com/apache/hadoop/blob/56b7571131b0af03b32bf1c5673c32634652df21/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1034-L1043



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-02 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124328#comment-17124328
 ] 

Eric Badger commented on YARN-10300:


The unit tests pass for me locally. [~epayne], [~Jim_Brennan], could you take a 
look at this patch?

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9903) Support reservations continue looking for Node Labels

2020-06-02 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124325#comment-17124325
 ] 

Jim Brennan commented on YARN-9903:
---

Uploaded patch 002, which removes an extraneous file.


> Support reservations continue looking for Node Labels
> -
>
> Key: YARN-9903
> URL: https://issues.apache.org/jira/browse/YARN-9903
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Tarun Parimi
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-9903.001.patch, YARN-9903.002.patch
>
>
> YARN-1769 brought in reservations continue looking feature which improves the 
> several resource reservation scenarios. However, it is not handled currently 
> when nodes have a label assigned to them. This is useful and in many cases 
> necessary even for Node Labels. So we should look to support this for node 
> labels also.
> For example, in AbstractCSQueue.java, we have the below TODO.
> {code:java}
> // TODO, now only consider reservation cases when the node has no label 
> if (this.reservationsContinueLooking && nodePartition.equals( 
> RMNodeLabelsManager.NO_LABEL) && Resources.greaterThan( resourceCalculator, 
> clusterResource, resourceCouldBeUnreserved, Resources.none())) {
> {code}
> cc [~sunilg]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9903) Support reservations continue looking for Node Labels

2020-06-02 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9903:
--
Attachment: YARN-9903.002.patch

> Support reservations continue looking for Node Labels
> -
>
> Key: YARN-9903
> URL: https://issues.apache.org/jira/browse/YARN-9903
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Tarun Parimi
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-9903.001.patch, YARN-9903.002.patch
>
>
> YARN-1769 brought in reservations continue looking feature which improves the 
> several resource reservation scenarios. However, it is not handled currently 
> when nodes have a label assigned to them. This is useful and in many cases 
> necessary even for Node Labels. So we should look to support this for node 
> labels also.
> For example, in AbstractCSQueue.java, we have the below TODO.
> {code:java}
> // TODO, now only consider reservation cases when the node has no label 
> if (this.reservationsContinueLooking && nodePartition.equals( 
> RMNodeLabelsManager.NO_LABEL) && Resources.greaterThan( resourceCalculator, 
> clusterResource, resourceCouldBeUnreserved, Resources.none())) {
> {code}
> cc [~sunilg]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-02 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124195#comment-17124195
 ] 

Wangda Tan commented on YARN-10293:
---

[~Tao Yang], the suggestion totally make sense to me. When we have done the 
initial global scheduling framework, the goal is to make it compatible to the 
previous behavior, I agree to make additional steps to overhaul reservation 
logic under the context of global scheduling is a good idea. Now the code is 
very hard to read and understand.

I think we can do this step by step, first, let's fix low hanging fruits like 
this Jira. (I hope to get idea from you about the proposed change: 
https://issues.apache.org/jira/browse/YARN-10293?focusedCommentId=17121419=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17121419

And [~prabhujoseph] if you have time/bandwidth, can you take a look into 
reservation related logic + preemption + unreserve + global scheduling and see 
what we can optimize here?

 

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and reserved container stays at reserved.
> After reservation the used capacity becomes 1.0f, below will be in a loop and 
> no new allocate or reserve happens. The reserved container cannot be 
> allocated as reserved node does not have space. node2 has space for 1GB, 
> 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting 
> called causing the Hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: 

[jira] [Commented] (YARN-10286) PendingContainers bugs in the scheduler outputs

2020-06-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124175#comment-17124175
 ] 

Hadoop QA commented on YARN-10286:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:blue}0{color} | {color:blue} markdownlint {color} | {color:blue}  0m  
0s{color} | {color:blue} markdownlint was not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-3.3 Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  3m 
15s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 29m 
23s{color} | {color:green} branch-3.3 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m 
52s{color} | {color:green} branch-3.3 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
27s{color} | {color:green} branch-3.3 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
33s{color} | {color:green} branch-3.3 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
19m 10s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
4s{color} | {color:green} branch-3.3 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
48s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m 
28s{color} | {color:blue} 
branch/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site no findbugs output file 
(findbugsXml.xml) {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
23s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m 
56s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 22s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 1 new + 3 unchanged - 0 fixed = 4 total (was 3) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 46s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m 
22s{color} | {color:blue} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site has 
no data from findbugs {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 92m 
25s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
20s{color} | {color:green} hadoop-yarn-site in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
46s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}192m 17s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem 

[jira] [Commented] (YARN-10296) Make ContainerPBImpl#getId/setId synchronized

2020-06-02 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124121#comment-17124121
 ] 

Wangda Tan commented on YARN-10296:
---

[~bteke], 

It makes sense to covert other method which uses containerId to synchronized as 
well, for example, {{getProto}}, performance should not be a big concern here, 
as multiple threads not likely access same Container object. (Because we have 
many container objects in the memory).

> Make ContainerPBImpl#getId/setId synchronized
> -
>
> Key: YARN-10296
> URL: https://issues.apache.org/jira/browse/YARN-10296
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Minor
> Attachments: YARN-10296.001.patch
>
>
> ContainerPBImpl getId and setId methods can be accessed from multiple 
> threads. In order to avoid any simultaneous accesses and race conditions 
> these methods should be synchronized.
> The idea came from the issue described in YARN-10295, however that patch is 
> only applicable to branch-3.2 and 3.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10254) CapacityScheduler incorrect User Group Mapping after leaf queue change

2020-06-02 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124044#comment-17124044
 ] 

Hudson commented on YARN-10254:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18315 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18315/])
YARN-10254. CapacityScheduler incorrect User Group Mapping after leaf (snemeth: 
rev b5efdea4fd385453fd9f9da7106e908d2a9a3812)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/placement/UserGroupMappingPlacementRule.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacitySchedulerQueueMappingFactory.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/placement/TestUserGroupMappingPlacementRule.java


> CapacityScheduler incorrect User Group Mapping after leaf queue change
> --
>
> Key: YARN-10254
> URL: https://issues.apache.org/jira/browse/YARN-10254
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10254.001.patch, YARN-10254.002.patch, 
> YARN-10254.003.patch, YARN-10254.004.patch, YARN-10254.005.patch, 
> YARN-10254.branch-3.3.001.patch
>
>
> YARN-9879 and YARN-10198 introduced some major changes to user group mapping, 
> and some of them unfortunately had some negative impact on the way mapping 
> works.
> In some cases incorrect PlacementContexts were created, where full queue path 
> was passed as leaf queue name. This affects how the yarn cli app list 
> displays the queues.
> u:%user:%primary_group.%user mapping fails with an incorrect validation error 
> when the %primary_group parent queue was a managed parent.
> Group based rules in certain cases are mapped to root.[primary_group] rules, 
> loosing the ability to create deeper structures.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9097) Investigate why GpuDiscoverer methods are synchronized

2020-06-02 Thread Bilwa S T (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123963#comment-17123963
 ] 

Bilwa S T edited comment on YARN-9097 at 6/2/20, 4:47 PM:
--

Hi [~snemeth]
Thanks for raising this issue. I could see that getGpuDeviceInformation is 
called on NM init and GpuResourcePlugin#getNMResourceInfo which is already 
synchronized and getGpusUsableByYarn is also called on NM init. Hence all three 
methods need not be synchronized
I think even GpuResourceHandlerImpl#postComplete need not be synchronized as 
GpuResourceAllocator#unassignGpus is already synchronized and we can make 
GpuResourceAllocator#usedDevices as ConcurrentHashMap and 
GpuResourceAllocator#allowedGpuDevices as ConcurrentHashSet as order need not 
be maintained. Thoughts?


was (Author: bilwast):
Hi [~snemeth]
Thanks for raising this issue. I could see that getGpuDeviceInformation is 
called on NM init and GpuResourcePlugin#getNMResourceInfo which is already 
synchronized and getGpusUsableByYarn is also called on NM init. Hence all three 
methods need not be synchronized
I think even GpuResourceHandlerImpl#postComplete need not be synchronized as 
GpuResourceAllocator#unassignGpus is already synchronized

> Investigate why GpuDiscoverer methods are synchronized
> --
>
> Key: YARN-9097
> URL: https://issues.apache.org/jira/browse/YARN-9097
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Bilwa S T
>Priority: Minor
>  Labels: newbie, newbie++
>
> GpuDiscoverer.initialize surely shouldn't have been synchronized.
> Please also investigate why getGpuDeviceInformation / getGpusUsableByYarn are 
> synchronized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10296) Make ContainerPBImpl#getId/setId synchronized

2020-06-02 Thread Benjamin Teke (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124042#comment-17124042
 ] 

Benjamin Teke commented on YARN-10296:
--

Hi [~leftnoteasy],

I wanted to ask your opinion on this task. Making the getId/setId synchronized 
triggers the findBugs issue: IS2_INCONSISTENT_SYNC. It is valid based on the 
code, as the class's other methods do access the containerId with no 
synchronisation. Is it worth the extra effort to refactor those methods? I'm 
having concerns about performance.

Thanks!

> Make ContainerPBImpl#getId/setId synchronized
> -
>
> Key: YARN-10296
> URL: https://issues.apache.org/jira/browse/YARN-10296
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Minor
> Attachments: YARN-10296.001.patch
>
>
> ContainerPBImpl getId and setId methods can be accessed from multiple 
> threads. In order to avoid any simultaneous accesses and race conditions 
> these methods should be synchronized.
> The idea came from the issue described in YARN-10295, however that patch is 
> only applicable to branch-3.2 and 3.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10254) CapacityScheduler incorrect User Group Mapping after leaf queue change

2020-06-02 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124018#comment-17124018
 ] 

Szilard Nemeth commented on YARN-10254:
---

Thanks [~shuzirra] for working on this.
Checked the latest patches for both, LGTM and committed it to trunk and 
branch-3.3.
Thanks [~pbacsko] for the review.

Resolving jira.

> CapacityScheduler incorrect User Group Mapping after leaf queue change
> --
>
> Key: YARN-10254
> URL: https://issues.apache.org/jira/browse/YARN-10254
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10254.001.patch, YARN-10254.002.patch, 
> YARN-10254.003.patch, YARN-10254.004.patch, YARN-10254.005.patch, 
> YARN-10254.branch-3.3.001.patch
>
>
> YARN-9879 and YARN-10198 introduced some major changes to user group mapping, 
> and some of them unfortunately had some negative impact on the way mapping 
> works.
> In some cases incorrect PlacementContexts were created, where full queue path 
> was passed as leaf queue name. This affects how the yarn cli app list 
> displays the queues.
> u:%user:%primary_group.%user mapping fails with an incorrect validation error 
> when the %primary_group parent queue was a managed parent.
> Group based rules in certain cases are mapped to root.[primary_group] rules, 
> loosing the ability to create deeper structures.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10254) CapacityScheduler incorrect User Group Mapping after leaf queue change

2020-06-02 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10254:
--
Fix Version/s: 3.3.1
   3.4.0

> CapacityScheduler incorrect User Group Mapping after leaf queue change
> --
>
> Key: YARN-10254
> URL: https://issues.apache.org/jira/browse/YARN-10254
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10254.001.patch, YARN-10254.002.patch, 
> YARN-10254.003.patch, YARN-10254.004.patch, YARN-10254.005.patch, 
> YARN-10254.branch-3.3.001.patch
>
>
> YARN-9879 and YARN-10198 introduced some major changes to user group mapping, 
> and some of them unfortunately had some negative impact on the way mapping 
> works.
> In some cases incorrect PlacementContexts were created, where full queue path 
> was passed as leaf queue name. This affects how the yarn cli app list 
> displays the queues.
> u:%user:%primary_group.%user mapping fails with an incorrect validation error 
> when the %primary_group parent queue was a managed parent.
> Group based rules in certain cases are mapped to root.[primary_group] rules, 
> loosing the ability to create deeper structures.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet

2020-06-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123965#comment-17123965
 ] 

Hadoop QA commented on YARN-10284:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 36m 
19s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} branch-3.3 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 31m 
15s{color} | {color:green} branch-3.3 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
35s{color} | {color:green} branch-3.3 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
22s{color} | {color:green} branch-3.3 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
40s{color} | {color:green} branch-3.3 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m 14s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} branch-3.3 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
16s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
15s{color} | {color:green} branch-3.3 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 37s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
22s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
35s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}110m 50s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26101/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10284 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13004630/YARN-10284.branch-3.3.001.patch
 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 50a8656be957 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 
10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | branch-3.3 / 910d88e |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~16.04-b09 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/26101/testReport/ |
| Max. process+thread count | 340 (vs. ulimit of 5500) |
| modules | C: 

[jira] [Commented] (YARN-9097) Investigate why GpuDiscoverer methods are synchronized

2020-06-02 Thread Bilwa S T (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123963#comment-17123963
 ] 

Bilwa S T commented on YARN-9097:
-

Hi [~snemeth]
Thanks for raising this issue. I could see that getGpuDeviceInformation is 
called on NM init and GpuResourcePlugin#getNMResourceInfo which is already 
synchronized and getGpusUsableByYarn is also called on NM init. Hence all three 
methods need not be synchronized
I think even GpuResourceHandlerImpl#postComplete need not be synchronized as 
GpuResourceAllocator#unassignGpus is already synchronized

> Investigate why GpuDiscoverer methods are synchronized
> --
>
> Key: YARN-9097
> URL: https://issues.apache.org/jira/browse/YARN-9097
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Bilwa S T
>Priority: Minor
>  Labels: newbie, newbie++
>
> GpuDiscoverer.initialize surely shouldn't have been synchronized.
> Please also investigate why getGpuDeviceInformation / getGpusUsableByYarn are 
> synchronized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10286) PendingContainers bugs in the scheduler outputs

2020-06-02 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123910#comment-17123910
 ] 

Andras Gyori commented on YARN-10286:
-

The patch smoothly applied to both branch-3.3 and 3.2.

> PendingContainers bugs in the scheduler outputs
> ---
>
> Key: YARN-10286
> URL: https://issues.apache.org/jira/browse/YARN-10286
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: YARN-10286.001.patch, YARN-10286.branch-3.2.001.patch, 
> YARN-10286.branch-3.3.001.patch
>
>
> There are some problems with the  {{ws/v1/cluster/scheduler}} output of the 
> ResourceManager:
> Even though the pendingContainers field is 
> [documented|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API]
>  in the FairScheduler output, it is 
> [missing|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java]
>  from the actual output. In case of the CapacityScheduler this field 
> [exists|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CapacitySchedulerQueueInfo.java],
>  but it is missing from the documentation. Let's fix both cases!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10286) PendingContainers bugs in the scheduler outputs

2020-06-02 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10286:

Attachment: YARN-10286.branch-3.3.001.patch

> PendingContainers bugs in the scheduler outputs
> ---
>
> Key: YARN-10286
> URL: https://issues.apache.org/jira/browse/YARN-10286
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: YARN-10286.001.patch, YARN-10286.branch-3.2.001.patch, 
> YARN-10286.branch-3.3.001.patch
>
>
> There are some problems with the  {{ws/v1/cluster/scheduler}} output of the 
> ResourceManager:
> Even though the pendingContainers field is 
> [documented|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API]
>  in the FairScheduler output, it is 
> [missing|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java]
>  from the actual output. In case of the CapacityScheduler this field 
> [exists|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CapacitySchedulerQueueInfo.java],
>  but it is missing from the documentation. Let's fix both cases!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10286) PendingContainers bugs in the scheduler outputs

2020-06-02 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10286:

Attachment: YARN-10286.branch-3.2.001.patch

> PendingContainers bugs in the scheduler outputs
> ---
>
> Key: YARN-10286
> URL: https://issues.apache.org/jira/browse/YARN-10286
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: YARN-10286.001.patch, YARN-10286.branch-3.2.001.patch, 
> YARN-10286.branch-3.3.001.patch
>
>
> There are some problems with the  {{ws/v1/cluster/scheduler}} output of the 
> ResourceManager:
> Even though the pendingContainers field is 
> [documented|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API]
>  in the FairScheduler output, it is 
> [missing|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java]
>  from the actual output. In case of the CapacityScheduler this field 
> [exists|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CapacitySchedulerQueueInfo.java],
>  but it is missing from the documentation. Let's fix both cases!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet

2020-06-02 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123851#comment-17123851
 ] 

Hudson commented on YARN-10284:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18314 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18314/])
YARN-10284. Add lazy initialization of (snemeth: rev 
aa6d13455b9435fb6c6d8f942c2b278dfada8f0c)
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/webapp/TestLogServlet.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/LogServlet.java


> Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
> 
>
> Key: YARN-10284
> URL: https://issues.apache.org/jira/browse/YARN-10284
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10284.001.patch, YARN-10284.002.patch, 
> YARN-10284.003.patch, YARN-10284.004.patch, YARN-10284.branch-3.3.001.patch
>
>
> Suppose the {{mapred}} user has no access to the remote folder. Pinging the 
> JHS if it's online in every few seconds will produce the following entry in 
> the log:
> {noformat}
> 2020-05-19 00:17:20,331 WARN 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController:
>  Unable to determine if the filesystem supports append operation
> java.nio.file.AccessDeniedException: test-bucket: 
> org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role 
> for the group(s) associated with the authenticated user. (user: mapred)
>   at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204)
> [...]
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.(LogAggregationFileControllerFactory.java:139)
>   at 
> org.apache.hadoop.yarn.server.webapp.LogServlet.(LogServlet.java:66)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.(HsWebServices.java:99)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance()
>   at 
> com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40)
> [...]
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> We should only create the {{LogAggregationFactory}} instance when we actually 
> need it, not every time the {{LogServlet}} object is instantiated (so 
> definitely not in the constructor). In this way we prevent pressure on the 
> S3A auth side, especially if the authentication request is a costly operation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-06-02 Thread Benjamin Teke (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123844#comment-17123844
 ] 

Benjamin Teke commented on YARN-10295:
--

Test issues are related to YARN-10249 not this patch.

> CapacityScheduler NPE can cause apps to get stuck without resources
> ---
>
> Key: YARN-10295
> URL: https://issues.apache.org/jira/browse/YARN-10295
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10295.001.branch-3.1.patch, 
> YARN-10295.001.branch-3.2.patch
>
>
> When the CapacityScheduler Asynchronous scheduling is enabled there is an 
> edge-case where a NullPointerException can cause the scheduler thread to exit 
> and the apps to get stuck without allocated resources. Consider the following 
> log:
> {code:java}
> 2020-05-27 10:13:49,106 INFO  fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:apply(681)) - Reserved 
> container=container_e10_1590502305306_0660_01_000115, on node=host: 
> ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 
> available= used= with 
> resource=
> 2020-05-27 10:13:49,134 INFO  fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:internalUnreserve(743)) - Application 
> application_1590502305306_0660 unreserved  on node host: 
> ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 
> available= used=, currently 
> has 0 at priority 11; currentReservation  on node-label=
> 2020-05-27 10:13:49,134 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
> 2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler 
> (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread 
> Thread[Thread-4953,5,main] threw an Exception.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
> {code}
> A container gets allocated on a host, but the host doesn't have enough 
> memory, so after a short while it gets unreserved. However because the 
> scheduler thread is running asynchronously it might have entered into the 
> following if block located in 
> [CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602],
>  because at the time _node.getReservedContainer()_ wasn't null. Calling it a 
> second time for getting the ApplicationAttemptId would be an NPE, as the 
> container got unreserved in the meantime.
> {code:java}
> // Do not schedule if there are any reservations to fulfill on the node
> if (node.getReservedContainer() != null) {
> if (LOG.isDebugEnabled()) {
> LOG.debug("Skipping scheduling since node " + node.getNodeID()
> + " is reserved by application " + node.getReservedContainer()
> .getContainerId().getApplicationAttemptId());
>  }
>  return null;
> }
> {code}
> A fix would be to store the container object before the if block. 
> Only branch-3.1/3.2 is affected, because the newer branches have YARN-9664 
> which indirectly fixed this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet

2020-06-02 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10284:
--
Attachment: YARN-10284.branch-3.3.001.patch

> Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
> 
>
> Key: YARN-10284
> URL: https://issues.apache.org/jira/browse/YARN-10284
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10284.001.patch, YARN-10284.002.patch, 
> YARN-10284.003.patch, YARN-10284.004.patch, YARN-10284.branch-3.3.001.patch
>
>
> Suppose the {{mapred}} user has no access to the remote folder. Pinging the 
> JHS if it's online in every few seconds will produce the following entry in 
> the log:
> {noformat}
> 2020-05-19 00:17:20,331 WARN 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController:
>  Unable to determine if the filesystem supports append operation
> java.nio.file.AccessDeniedException: test-bucket: 
> org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role 
> for the group(s) associated with the authenticated user. (user: mapred)
>   at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204)
> [...]
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.(LogAggregationFileControllerFactory.java:139)
>   at 
> org.apache.hadoop.yarn.server.webapp.LogServlet.(LogServlet.java:66)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.(HsWebServices.java:99)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance()
>   at 
> com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40)
> [...]
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> We should only create the {{LogAggregationFactory}} instance when we actually 
> need it, not every time the {{LogServlet}} object is instantiated (so 
> definitely not in the constructor). In this way we prevent pressure on the 
> S3A auth side, especially if the authentication request is a costly operation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet

2020-06-02 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123841#comment-17123841
 ] 

Adam Antal commented on YARN-10284:
---

Thanks for committing this [~snemeth]. Uploaded patch for branch 3.3.

> Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
> 
>
> Key: YARN-10284
> URL: https://issues.apache.org/jira/browse/YARN-10284
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10284.001.patch, YARN-10284.002.patch, 
> YARN-10284.003.patch, YARN-10284.004.patch, YARN-10284.branch-3.3.001.patch
>
>
> Suppose the {{mapred}} user has no access to the remote folder. Pinging the 
> JHS if it's online in every few seconds will produce the following entry in 
> the log:
> {noformat}
> 2020-05-19 00:17:20,331 WARN 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController:
>  Unable to determine if the filesystem supports append operation
> java.nio.file.AccessDeniedException: test-bucket: 
> org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role 
> for the group(s) associated with the authenticated user. (user: mapred)
>   at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204)
> [...]
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.(LogAggregationFileControllerFactory.java:139)
>   at 
> org.apache.hadoop.yarn.server.webapp.LogServlet.(LogServlet.java:66)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.(HsWebServices.java:99)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance()
>   at 
> com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40)
> [...]
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> We should only create the {{LogAggregationFactory}} instance when we actually 
> need it, not every time the {{LogServlet}} object is instantiated (so 
> definitely not in the constructor). In this way we prevent pressure on the 
> S3A auth side, especially if the authentication request is a costly operation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123839#comment-17123839
 ] 

Hadoop QA commented on YARN-10293:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
41s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 
 7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m  0s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m  
1s{color} | {color:blue} Used deprecated FindBugs config; considering switching 
to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
57s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 32s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 1 new + 99 unchanged - 0 fixed = 100 total (was 99) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 48s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 93m 28s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
35s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}162m 26s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler |
|   | hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy 
|
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26100/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10293 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13004610/YARN-10293-003-WIP.patch
 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 3352849ee112 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 
10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 

[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet

2020-06-02 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123830#comment-17123830
 ] 

Szilard Nemeth commented on YARN-10284:
---

Hi [~adam.antal],
Thanks for working on this.
LGTM, committed to trunk.

Thanks [~gandras] for the review.

[~adam.antal] Do you plan to backport this to 3.3 or any older branches?
Thanks.

> Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
> 
>
> Key: YARN-10284
> URL: https://issues.apache.org/jira/browse/YARN-10284
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-10284.001.patch, YARN-10284.002.patch, 
> YARN-10284.003.patch, YARN-10284.004.patch
>
>
> Suppose the {{mapred}} user has no access to the remote folder. Pinging the 
> JHS if it's online in every few seconds will produce the following entry in 
> the log:
> {noformat}
> 2020-05-19 00:17:20,331 WARN 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController:
>  Unable to determine if the filesystem supports append operation
> java.nio.file.AccessDeniedException: test-bucket: 
> org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role 
> for the group(s) associated with the authenticated user. (user: mapred)
>   at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204)
> [...]
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.(LogAggregationFileControllerFactory.java:139)
>   at 
> org.apache.hadoop.yarn.server.webapp.LogServlet.(LogServlet.java:66)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.(HsWebServices.java:99)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance()
>   at 
> com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40)
> [...]
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> We should only create the {{LogAggregationFactory}} instance when we actually 
> need it, not every time the {{LogServlet}} object is instantiated (so 
> definitely not in the constructor). In this way we prevent pressure on the 
> S3A auth side, especially if the authentication request is a costly operation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet

2020-06-02 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10284:
--
Fix Version/s: 3.4.0

> Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
> 
>
> Key: YARN-10284
> URL: https://issues.apache.org/jira/browse/YARN-10284
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10284.001.patch, YARN-10284.002.patch, 
> YARN-10284.003.patch, YARN-10284.004.patch
>
>
> Suppose the {{mapred}} user has no access to the remote folder. Pinging the 
> JHS if it's online in every few seconds will produce the following entry in 
> the log:
> {noformat}
> 2020-05-19 00:17:20,331 WARN 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController:
>  Unable to determine if the filesystem supports append operation
> java.nio.file.AccessDeniedException: test-bucket: 
> org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role 
> for the group(s) associated with the authenticated user. (user: mapred)
>   at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204)
> [...]
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.(LogAggregationFileControllerFactory.java:139)
>   at 
> org.apache.hadoop.yarn.server.webapp.LogServlet.(LogServlet.java:66)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.(HsWebServices.java:99)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance()
>   at 
> com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40)
> [...]
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> We should only create the {{LogAggregationFactory}} instance when we actually 
> need it, not every time the {{LogServlet}} object is instantiated (so 
> definitely not in the constructor). In this way we prevent pressure on the 
> S3A auth side, especially if the authentication request is a costly operation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10286) PendingContainers bugs in the scheduler outputs

2020-06-02 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123827#comment-17123827
 ] 

Hudson commented on YARN-10286:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18313 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18313/])
YARN-10286. PendingContainers bugs in the scheduler outputs. Contributed 
(snemeth: rev e0a0741ac86dcc4f98ec2f9739b70b3697a4d0c0)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md


> PendingContainers bugs in the scheduler outputs
> ---
>
> Key: YARN-10286
> URL: https://issues.apache.org/jira/browse/YARN-10286
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: YARN-10286.001.patch
>
>
> There are some problems with the  {{ws/v1/cluster/scheduler}} output of the 
> ResourceManager:
> Even though the pendingContainers field is 
> [documented|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API]
>  in the FairScheduler output, it is 
> [missing|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java]
>  from the actual output. In case of the CapacityScheduler this field 
> [exists|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CapacitySchedulerQueueInfo.java],
>  but it is missing from the documentation. Let's fix both cases!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10286) PendingContainers bugs in the scheduler outputs

2020-06-02 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123798#comment-17123798
 ] 

Szilard Nemeth commented on YARN-10286:
---

Hi [~gandras],
Thanks for working on this.

Patch LGTM, committed to trunk.

Thanks [~adam.antal] for the review.

[~gandras] Can you please assess whether branch-3.3 and branch-3.2 needs this 
fix as well?
Thanks

> PendingContainers bugs in the scheduler outputs
> ---
>
> Key: YARN-10286
> URL: https://issues.apache.org/jira/browse/YARN-10286
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: YARN-10286.001.patch
>
>
> There are some problems with the  {{ws/v1/cluster/scheduler}} output of the 
> ResourceManager:
> Even though the pendingContainers field is 
> [documented|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API]
>  in the FairScheduler output, it is 
> [missing|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java]
>  from the actual output. In case of the CapacityScheduler this field 
> [exists|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CapacitySchedulerQueueInfo.java],
>  but it is missing from the documentation. Let's fix both cases!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10286) PendingContainers bugs in the scheduler outputs

2020-06-02 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10286:
--
Fix Version/s: 3.4.0

> PendingContainers bugs in the scheduler outputs
> ---
>
> Key: YARN-10286
> URL: https://issues.apache.org/jira/browse/YARN-10286
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Andras Gyori
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: YARN-10286.001.patch
>
>
> There are some problems with the  {{ws/v1/cluster/scheduler}} output of the 
> ResourceManager:
> Even though the pendingContainers field is 
> [documented|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API]
>  in the FairScheduler output, it is 
> [missing|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerQueueInfo.java]
>  from the actual output. In case of the CapacityScheduler this field 
> [exists|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CapacitySchedulerQueueInfo.java],
>  but it is missing from the documentation. Let's fix both cases!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query

2020-06-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123766#comment-17123766
 ] 

Hadoop QA commented on YARN-10304:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  3m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  3m 
28s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 31m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 18m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
20m 58s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m  
1s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
19s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m  
0s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
23s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 17m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 17m 
18s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
2m 51s{color} | {color:orange} root: The patch generated 8 new + 11 unchanged - 
0 fixed = 19 total (was 11) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 25s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
25s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  4m  
3s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
40s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
47s{color} | {color:green} hadoop-mapreduce-client-hs in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
47s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}143m 33s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26099/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10304 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13004561/YARN-10304.001.patch |
| Optional Tests | dupname 

[jira] [Commented] (YARN-10274) Merge QueueMapping and QueueMappingEntity

2020-06-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123751#comment-17123751
 ] 

Hadoop QA commented on YARN-10274:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  2m 
13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
19m 29s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
40s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
38s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 31s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 5 new + 64 unchanged - 0 fixed = 69 total (was 64) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 33s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
10s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 93m 32s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}166m 56s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler |
|   | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26098/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10274 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13004605/YARN-10274.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 55b270dd07a7 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 
10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | 

[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-02 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123686#comment-17123686
 ] 

Tao Yang commented on YARN-10293:
-

Hi, [~prabhujoseph], [~wangda]
This problem is similar to YARN-9598, which was in dispute so there's no 
further progress. In my opinion, YARN-9598 and this issue may just parts of 
reservation problems, it's better to refactor the reservation logic again to 
compatible with the scheduling framework which has been updated a lot by global 
scheduler, especially for multi-nodes lookup mechanism. At least we should 
rethink all referenced logic in scheduling cycle to have a more complete 
solution for current reservation. Thoughts?

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and reserved container stays at reserved.
> After reservation the used capacity becomes 1.0f, below will be in a loop and 
> no new allocate or reserve happens. The reserved container cannot be 
> allocated as reserved node does not have space. node2 has space for 1GB, 
> 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting 
> called causing the Hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> 

[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-02 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123652#comment-17123652
 ] 

Prabhu Joseph commented on YARN-10293:
--

Have attached a patch with removing the if condition. Will do some functional 
testing in test cluster. Thanks.

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and reserved container stays at reserved.
> After reservation the used capacity becomes 1.0f, below will be in a loop and 
> no new allocate or reserve happens. The reserved container cannot be 
> allocated as reserved node does not have space. node2 has space for 1GB, 
> 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting 
> called causing the Hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> 2020-05-21 12:13:33,243 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Allocation proposal accepted
> {code}
> 

[jira] [Updated] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-02 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10293:
-
Attachment: YARN-10293-003-WIP.patch

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and reserved container stays at reserved.
> After reservation the used capacity becomes 1.0f, below will be in a loop and 
> no new allocate or reserve happens. The reserved container cannot be 
> allocated as reserved node does not have space. node2 has space for 1GB, 
> 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting 
> called causing the Hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> 2020-05-21 12:13:33,243 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Allocation proposal accepted
> {code}
> CapacityScheduler#allocateOrReserveNewContainers won't be called as below 
> check in allocateContainersOnMultiNodes fails
> {code}
>  if 

[jira] [Updated] (YARN-10304) Create an endpoint for remote application log directory path query

2020-06-02 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10304:
--
Parent: YARN-10025
Issue Type: Sub-task  (was: Improvement)

> Create an endpoint for remote application log directory path query
> --
>
> Key: YARN-10304
> URL: https://issues.apache.org/jira/browse/YARN-10304
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10304.001.patch
>
>
> The logic of the aggregated log directory path determination (currently based 
> on configuration) is scattered around the codebase and duplicated multiple 
> times. By providing a separate class for creating the path for a specific 
> user, it allows for an abstraction over this logic. This could be used in 
> place of the previously duplicated logic, moreover, we could provide an 
> endpoint to query this path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10274) Merge QueueMapping and QueueMappingEntity

2020-06-02 Thread Gergely Pollak (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergely Pollak updated YARN-10274:
--
Attachment: YARN-10274.002.patch

> Merge QueueMapping and QueueMappingEntity
> -
>
> Key: YARN-10274
> URL: https://issues.apache.org/jira/browse/YARN-10274
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-10274.001.patch, YARN-10274.002.patch
>
>
> The role, usage and internal behaviour of these classes are almost identical, 
> but it makes no sense to keep both of them. One is used by UserGroup 
> placement rule definitions the other is used by Application placement rules.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10305) Lost system-credentials when restarting RM

2020-06-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123589#comment-17123589
 ] 

Hadoop QA commented on YARN-10305:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
19s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
54s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
18m 47s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m 
18s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
14s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 43s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 3 new + 94 unchanged - 0 fixed = 97 total (was 94) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
18m 15s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
22s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}101m 32s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
34s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}175m 36s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler |
|   | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26097/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10305 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13004564/YARN-10305.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux db6d91116414 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 
10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | 

[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet

2020-06-02 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123426#comment-17123426
 ] 

Andras Gyori commented on YARN-10284:
-

Last patch LGTM (non-binding), I like the caching approach +1.

> Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
> 
>
> Key: YARN-10284
> URL: https://issues.apache.org/jira/browse/YARN-10284
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-10284.001.patch, YARN-10284.002.patch, 
> YARN-10284.003.patch, YARN-10284.004.patch
>
>
> Suppose the {{mapred}} user has no access to the remote folder. Pinging the 
> JHS if it's online in every few seconds will produce the following entry in 
> the log:
> {noformat}
> 2020-05-19 00:17:20,331 WARN 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController:
>  Unable to determine if the filesystem supports append operation
> java.nio.file.AccessDeniedException: test-bucket: 
> org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role 
> for the group(s) associated with the authenticated user. (user: mapred)
>   at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204)
> [...]
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.(LogAggregationFileControllerFactory.java:139)
>   at 
> org.apache.hadoop.yarn.server.webapp.LogServlet.(LogServlet.java:66)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.(HsWebServices.java:99)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance()
>   at 
> com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40)
> [...]
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> We should only create the {{LogAggregationFactory}} instance when we actually 
> need it, not every time the {{LogServlet}} object is instantiated (so 
> definitely not in the constructor). In this way we prevent pressure on the 
> S3A auth side, especially if the authentication request is a costly operation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10304) Create an endpoint for remote application log directory path query

2020-06-02 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123405#comment-17123405
 ] 

Andras Gyori commented on YARN-10304:
-

An early draft patch has been uploaded, that provides the foundation of the 
logic. As for now, it only contains the bare minimum of the path creation, 
without involving any bucket or group based naming convention. Whether to 
include anything more is up for discussion.

> Create an endpoint for remote application log directory path query
> --
>
> Key: YARN-10304
> URL: https://issues.apache.org/jira/browse/YARN-10304
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10304.001.patch
>
>
> The logic of the aggregated log directory path determination (currently based 
> on configuration) is scattered around the codebase and duplicated multiple 
> times. By providing a separate class for creating the path for a specific 
> user, it allows for an abstraction over this logic. This could be used in 
> place of the previously duplicated logic, moreover, we could provide an 
> endpoint to query this path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10304) Create an endpoint for remote application log directory path query

2020-06-02 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10304:

Attachment: YARN-10304.001.patch

> Create an endpoint for remote application log directory path query
> --
>
> Key: YARN-10304
> URL: https://issues.apache.org/jira/browse/YARN-10304
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10304.001.patch
>
>
> The logic of the aggregated log directory path determination (currently based 
> on configuration) is scattered around the codebase and duplicated multiple 
> times. By providing a separate class for creating the path for a specific 
> user, it allows for an abstraction over this logic. This could be used in 
> place of the previously duplicated logic, moreover, we could provide an 
> endpoint to query this path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet

2020-06-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123401#comment-17123401
 ] 

Hadoop QA commented on YARN-10284:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
18s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 41s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
11s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
8s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 15s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
30s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 64m 53s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-YARN-Build/26096/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-10284 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13004558/YARN-10284.004.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 244f95bd5d3a 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 
10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 9fe4c37c25b |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/26096/testReport/ |
| Max. process+thread count | 314 (vs. ulimit of 5500) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common U: 

[jira] [Created] (YARN-10305) Lost system-credentials when restarting RM

2020-06-02 Thread kyungwan nam (Jira)
kyungwan nam created YARN-10305:
---

 Summary: Lost system-credentials when restarting RM
 Key: YARN-10305
 URL: https://issues.apache.org/jira/browse/YARN-10305
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam


System-credentials introduced in YARN-2704, it makes it to keep the 
long-running apps.
I’ve met a situation where system-credentials lost when restarting RM.
Since then, if an app’s AM is stopped, restarting AM will be failed because NMs 
do not have HDFS delegation token which is needed for resource localization.


The app has a couple of delegation token including timeline-server token and 
HDFS delegation token.
When restarting RM, RM will request a new HDFS delegation token for an app that 
was submitted long ago. (It's fixed by YARN-5098)
But, If an app has a couple of delegation token and an exception occur in the 
token processed first, the next tokens are not processed.
I think that’s why lost system-credentials.

Here are RM’s logs at the time of restarting RM.
{code}
2020-05-19 14:25:05,712 WARN  security.DelegationTokenRenewer 
(DelegationTokenRenewer.java:handleDTRenewerAppRecoverEvent(955)) - Unable to 
add the application to the delegation token renewer on recovery.
java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, 
Service: 10.1.1.1:8190, Ident: (TIMELINE_DELEGATION_TOKEN owner=test-admin, 
renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, 
sequenceNumber=2193, masterKeyId=340)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:503)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleDTRenewerAppRecoverEvent(DelegationTokenRenewer.java:953)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:79)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:912)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: HTTP status [403], message 
[org.apache.hadoop.security.token.SecretManager$InvalidToken: yarn tried to 
renew an expired token (TIMELINE_DELEGATION_TOKEN owner=test-admin, 
renewer=yarn, realUser=yarn, issueDate=1586136363258, maxDate=1587000363258, 
sequenceNumber=2193, masterKeyId=340) max expiration date: 2020-04-16 
10:26:03,258+0900 currentTime: 2020-05-19 14:25:05,700+0900]
at 
org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:166)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:319)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:235)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:437)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:247)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$2.run(TimelineClientImpl.java:227)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientRetryOpForOperateDelegationToken.run(TimelineConnector.java:431)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineConnector$TimelineClientConnectionRetry.retryOn(TimelineConnector.java:334)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineConnector.operateDelegationToken(TimelineConnector.java:218)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:250)
at 
org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81)
at org.apache.hadoop.security.token.Token.renew(Token.java:512)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:629)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:626)
at java.security.AccessController.doPrivileged(Native Method)
at 

[jira] [Created] (YARN-10304) Create an endpoint for remote application log directory path query

2020-06-02 Thread Andras Gyori (Jira)
Andras Gyori created YARN-10304:
---

 Summary: Create an endpoint for remote application log directory 
path query
 Key: YARN-10304
 URL: https://issues.apache.org/jira/browse/YARN-10304
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Andras Gyori
Assignee: Andras Gyori


The logic of the aggregated log directory path determination (currently based 
on configuration) is scattered around the codebase and duplicated multiple 
times. By providing a separate class for creating the path for a specific user, 
it allows for an abstraction over this logic. This could be used in place of 
the previously duplicated logic, moreover, we could provide an endpoint to 
query this path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org