[jira] [Commented] (YARN-10341) Yarn Service Container Completed event doesn't get processed
[ https://issues.apache.org/jira/browse/YARN-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154242#comment-17154242 ] Bilwa S T commented on YARN-10341: -- Fixed Checkstyle issues > Yarn Service Container Completed event doesn't get processed > - > > Key: YARN-10341 > URL: https://issues.apache.org/jira/browse/YARN-10341 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Bilwa S T > Assignee: Bilwa S T > Priority: Critical > Attachments: YARN-10341.001.patch, YARN-10341.002.patch, YARN-10341.003.patch, YARN-10341.004.patch > > > If there are 10 workers running and containers get killed, after a while we see that there are just 9 workers running. This is because the CONTAINER_COMPLETED event is not processed on the AM side. > The issue is in the code below: > {code:java} > public void onContainersCompleted(List<ContainerStatus> statuses) { > for (ContainerStatus status : statuses) { > ContainerId containerId = status.getContainerId(); > ComponentInstance instance = liveInstances.get(status.getContainerId()); > if (instance == null) { > LOG.warn( > "Container {} Completed. No component instance exists. exitStatus={}. diagnostics={} ", > containerId, status.getExitStatus(), status.getDiagnostics()); > return; > } > ComponentEvent event = > new ComponentEvent(instance.getCompName(), CONTAINER_COMPLETED) > .setStatus(status).setInstance(instance) > .setContainerId(containerId); > dispatcher.getEventHandler().handle(event); > } > {code} > If a component instance doesn't exist for a container, the loop never reaches the remaining containers because the method returns instead of continuing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
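The fix implied by the description is to skip the unknown container rather than return from the method, so the remaining statuses still produce CONTAINER_COMPLETED events. Below is a minimal sketch of the corrected loop, reusing the names from the quoted snippet (liveInstances, dispatcher, LOG, and the surrounding class's imports); the actual patch may differ in detail:
{code:java}
public void onContainersCompleted(List<ContainerStatus> statuses) {
  for (ContainerStatus status : statuses) {
    ContainerId containerId = status.getContainerId();
    ComponentInstance instance = liveInstances.get(containerId);
    if (instance == null) {
      LOG.warn("Container {} Completed. No component instance exists. exitStatus={}. diagnostics={} ",
          containerId, status.getExitStatus(), status.getDiagnostics());
      // Skip only this container: the remaining completed containers
      // must still generate CONTAINER_COMPLETED events.
      continue;
    }
    ComponentEvent event =
        new ComponentEvent(instance.getCompName(), CONTAINER_COMPLETED)
            .setStatus(status).setInstance(instance)
            .setContainerId(containerId);
    dispatcher.getEventHandler().handle(event);
  }
}
{code}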
[jira] [Commented] (YARN-10341) Yarn Service Container Completed event doesn't get processed
[ https://issues.apache.org/jira/browse/YARN-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154235#comment-17154235 ] Hadoop QA commented on YARN-10341: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 19s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 23s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 0m 55s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 53s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 44s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 27s{color} | {color:green} hadoop-yarn-services-core in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 81m 8s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26262/artifact/out/Dockerfile | | JIRA Issue | YARN-10341 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13007346/YARN-10341.004.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 760c568266eb 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 10d218934c9 | | Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/26262/testReport/ | | Max. process+thread count | 777 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core
[jira] [Commented] (YARN-10324) Fetch data from NodeManager may cause read timeout when disk is busy
[ https://issues.apache.org/jira/browse/YARN-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154217#comment-17154217 ] Yao Guangdong commented on YARN-10324: -- Fixed a bug where the cached memory size was calculated incorrectly. Added new patch YARN-10324.002.patch > Fetch data from NodeManager may cause read timeout when disk is busy > --- > > Key: YARN-10324 > URL: https://issues.apache.org/jira/browse/YARN-10324 > Project: Hadoop YARN > Issue Type: Improvement > Components: auxservices > Affects Versions: 2.7.0, 3.2.1 > Reporter: Yao Guangdong > Priority: Minor > Labels: patch > Attachments: YARN-10324.001.patch, YARN-10324.002.patch > > > As the cluster size grows, the time a Reduce spends fetching Map results from the NodeManager gets longer and longer. We often see WARN logs like the following in the reducer's logs. > {quote}2020-06-19 15:43:15,522 WARN [fetcher#8] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to TX-196-168-211.com:13562 with 5 map outputs > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) > at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587) > at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492) > at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:434) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:400) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:271) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:330) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198) > {quote} > Checking the NodeManager server, we found that disk IO utilization and the number of connections became very high when the read timeouts happened. Our analysis: with 20,000 maps and 1,000 reduces, the NodeManager performs 20 million IO stream operations during the shuffle phase. When each reduce fetches only a small amount of data from the map output files, disk IO utilization becomes very high on a big cluster, read timeouts happen frequently, and applications take longer to finish. > ShuffleHandler already has an IndexCache for the file.out.index files. We want to turn many small IOs into one big IO to reduce the number of small disk IO operations, so we try to cache all the data of a small map output file (file.out) in memory when the first fetch request comes in; subsequent fetch requests then only read from memory, avoiding disk IO. After we cached the data in memory, the read timeouts disappeared. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
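The caching idea described above can be sketched in isolation. The class below is illustrative only and is not the YARN-10324 patch: SmallFileCache and its two thresholds are invented names, and ShuffleHandler's real integration point would differ. It shows the core trade: read a small file.out once as one big IO, then serve later fetches from memory.
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative sketch: cache whole small map output files in memory so that
 *  repeated shuffle fetches hit RAM instead of issuing many small disk reads. */
public class SmallFileCache {
  private final ConcurrentHashMap<String, byte[]> cache = new ConcurrentHashMap<>();
  private final AtomicLong cachedBytes = new AtomicLong();
  private final long maxFileBytes;   // only files up to this size are cached
  private final long maxTotalBytes;  // stop caching once this budget is used

  public SmallFileCache(long maxFileBytes, long maxTotalBytes) {
    this.maxFileBytes = maxFileBytes;
    this.maxTotalBytes = maxTotalBytes;
  }

  /** Returns the file contents, served from memory after the first read. */
  public byte[] read(String pathStr) throws IOException {
    byte[] cached = cache.get(pathStr);
    if (cached != null) {
      return cached; // later fetches avoid the disk entirely
    }
    Path path = Paths.get(pathStr);
    long size = Files.size(path);
    byte[] data = Files.readAllBytes(path); // one big IO instead of many small ones
    if (size <= maxFileBytes && cachedBytes.get() + size <= maxTotalBytes) {
      if (cache.putIfAbsent(pathStr, data) == null) {
        cachedBytes.addAndGet(size);
      }
    }
    return data;
  }
}
{code}
A production version would also need eviction and accounting tied to the shuffle lifecycle (e.g. dropping a file once all reducers have fetched it), which is presumably what the "cached memory calculated incorrectly" fix in patch 002 is about.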
[jira] [Updated] (YARN-10324) Fetch data from NodeManager may cause read timeout when disk is busy
[ https://issues.apache.org/jira/browse/YARN-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yao Guangdong updated YARN-10324: - Attachment: YARN-10324.002.patch > Fetch data from NodeManager may cause read timeout when disk is busy > --- > > Key: YARN-10324 > URL: https://issues.apache.org/jira/browse/YARN-10324 > Project: Hadoop YARN > Issue Type: Improvement > Components: auxservices > Affects Versions: 2.7.0, 3.2.1 > Reporter: Yao Guangdong > Priority: Minor > Labels: patch > Attachments: YARN-10324.001.patch, YARN-10324.002.patch > > > As the cluster size grows, the time a Reduce spends fetching Map results from the NodeManager gets longer and longer. We often see WARN logs like the following in the reducer's logs. > {quote}2020-06-19 15:43:15,522 WARN [fetcher#8] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to TX-196-168-211.com:13562 with 5 map outputs > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) > at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587) > at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492) > at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:434) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:400) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:271) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:330) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198) > {quote} > Checking the NodeManager server, we found that disk IO utilization and the number of connections became very high when the read timeouts happened. Our analysis: with 20,000 maps and 1,000 reduces, the NodeManager performs 20 million IO stream operations during the shuffle phase. When each reduce fetches only a small amount of data from the map output files, disk IO utilization becomes very high on a big cluster, read timeouts happen frequently, and applications take longer to finish. > ShuffleHandler already has an IndexCache for the file.out.index files. We want to turn many small IOs into one big IO to reduce the number of small disk IO operations, so we try to cache all the data of a small map output file (file.out) in memory when the first fetch request comes in; subsequent fetch requests then only read from memory, avoiding disk IO. After we cached the data in memory, the read timeouts disappeared. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10341) Yarn Service Container Completed event doesn't get processed
[ https://issues.apache.org/jira/browse/YARN-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated YARN-10341: - Attachment: YARN-10341.004.patch > Yarn Service Container Completed event doesn't get processed > - > > Key: YARN-10341 > URL: https://issues.apache.org/jira/browse/YARN-10341 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Bilwa S T > Assignee: Bilwa S T > Priority: Critical > Attachments: YARN-10341.001.patch, YARN-10341.002.patch, YARN-10341.003.patch, YARN-10341.004.patch > > > If there are 10 workers running and containers get killed, after a while we see that there are just 9 workers running. This is because the CONTAINER_COMPLETED event is not processed on the AM side. > The issue is in the code below: > {code:java} > public void onContainersCompleted(List<ContainerStatus> statuses) { > for (ContainerStatus status : statuses) { > ContainerId containerId = status.getContainerId(); > ComponentInstance instance = liveInstances.get(status.getContainerId()); > if (instance == null) { > LOG.warn( > "Container {} Completed. No component instance exists. exitStatus={}. diagnostics={} ", > containerId, status.getExitStatus(), status.getDiagnostics()); > return; > } > ComponentEvent event = > new ComponentEvent(instance.getCompName(), CONTAINER_COMPLETED) > .setStatus(status).setInstance(instance) > .setContainerId(containerId); > dispatcher.getEventHandler().handle(event); > } > {code} > If a component instance doesn't exist for a container, the loop never reaches the remaining containers because the method returns instead of continuing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10344) Sync netty versions in hadoop-yarn-csi
[ https://issues.apache.org/jira/browse/YARN-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154190#comment-17154190 ] Hudson commented on YARN-10344: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18420 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18420/]) YARN-10344. Sync netty versions in hadoop-yarn-csi. (#2126) (github: rev 10d218934c9bc143bf8578c92cdbd6df6a4d3b98) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-csi/pom.xml > Sync netty versions in hadoop-yarn-csi > -- > > Key: YARN-10344 > URL: https://issues.apache.org/jira/browse/YARN-10344 > Project: Hadoop YARN > Issue Type: Bug > Components: build >Affects Versions: 3.3.0 >Reporter: Akira Ajisaka >Assignee: Akira Ajisaka >Priority: Major > Fix For: 3.4.0, 3.3.1 > > > netty-all is now 4.1.50.Final but the other netty libraries are 4.1.42.Final: > {noformat} > [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ hadoop-yarn-csi > --- > [INFO] org.apache.hadoop:hadoop-yarn-csi:jar:3.3.0 > [INFO] +- com.google.guava:guava:jar:20.0:compile > [INFO] +- com.google.protobuf:protobuf-java:jar:3.6.1:compile > [INFO] +- io.netty:netty-all:jar:4.1.50.Final:compile > [INFO] +- io.grpc:grpc-core:jar:1.26.0:compile > [INFO] | +- io.grpc:grpc-api:jar:1.26.0:compile (version selected from > constraint [1.26.0,1.26.0]) > [INFO] | | +- io.grpc:grpc-context:jar:1.26.0:compile > [INFO] | | +- > com.google.errorprone:error_prone_annotations:jar:2.3.3:compile > [INFO] | | \- org.codehaus.mojo:animal-sniffer-annotations:jar:1.17:compile > [INFO] | +- com.google.code.gson:gson:jar:2.2.4:compile > [INFO] | +- com.google.android:annotations:jar:4.1.1.4:compile > [INFO] | +- io.perfmark:perfmark-api:jar:0.19.0:compile > [INFO] | +- io.opencensus:opencensus-api:jar:0.24.0:compile > [INFO] | \- io.opencensus:opencensus-contrib-grpc-metrics:jar:0.24.0:compile > [INFO] +- io.grpc:grpc-protobuf:jar:1.26.0:compile > [INFO] | +- com.google.api.grpc:proto-google-common-protos:jar:1.12.0:compile > [INFO] | \- io.grpc:grpc-protobuf-lite:jar:1.26.0:compile > [INFO] +- io.grpc:grpc-stub:jar:1.26.0:compile > [INFO] +- io.grpc:grpc-netty:jar:1.26.0:compile > [INFO] | +- io.netty:netty-codec-http2:jar:4.1.42.Final:compile (version > selected from constraint [4.1.42.Final,4.1.42.Final]) > [INFO] | | +- io.netty:netty-common:jar:4.1.42.Final:compile > [INFO] | | +- io.netty:netty-buffer:jar:4.1.42.Final:compile > [INFO] | | +- io.netty:netty-transport:jar:4.1.42.Final:compile > [INFO] | | | \- io.netty:netty-resolver:jar:4.1.42.Final:compile > [INFO] | | +- io.netty:netty-codec:jar:4.1.42.Final:compile > [INFO] | | +- io.netty:netty-handler:jar:4.1.42.Final:compile > [INFO] | | \- io.netty:netty-codec-http:jar:4.1.42.Final:compile > [INFO] | \- io.netty:netty-handler-proxy:jar:4.1.42.Final:compile > [INFO] | \- io.netty:netty-codec-socks:jar:4.1.42.Final:compile > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
[ https://issues.apache.org/jira/browse/YARN-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154189#comment-17154189 ] Masatake Iwasaki commented on YARN-10347: - Since this was introduced by the backported patch of YARN-10022, only branch-3.1 and branch-3.2 are affected. I'm going to cherry-pick this to branch-3.2. The backported patch added a duplicated call to [ReentrantReadWriteLock.WriteLock#lock|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/ReentrantReadWriteLock.WriteLock.html#lock--]. Each {{lock()}} call increases the hold count by one, so if it is called twice, the lock stays held until {{unlock()}} is called twice. > Fix double locking in CapacityScheduler#reinitialize in branch-3.1 > -- > > Key: YARN-10347 > URL: https://issues.apache.org/jira/browse/YARN-10347 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler > Affects Versions: 3.1.4 > Reporter: Masatake Iwasaki > Assignee: Masatake Iwasaki > Priority: Critical > Attachments: YARN-10347-branch-3.1.001.patch > > > Double locking blocks other threads in the ResourceManager that are waiting for the lock. > I found the issue while testing hadoop-3.1.4-RC2 on an RM-HA enabled deployment. The ResourceManager blocks on {{submitApplication}} waiting for the lock when I run example MR applications. > {noformat} > "IPC Server handler 45 on default port 8032" #211 daemon prio=5 os_prio=0 tid=0x7f0e45a40200 nid=0x418 waiting on condition [0x7f0e14abe000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x85d56510> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) > at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521) > at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417) > at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342) > at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678) > at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277) > at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563) > at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943) > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
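The hold-count behaviour described above is easy to demonstrate outside YARN: after two lock() calls, a single unlock() leaves the write lock held, and readers stay parked exactly as the IPC handler does in the stack trace. A minimal standalone demo of the JDK semantics (not code from the patch):
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DoubleLockDemo {
  public static void main(String[] args) throws InterruptedException {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    lock.writeLock().lock();
    lock.writeLock().lock(); // reentrant: hold count is now 2
    System.out.println("write hold count = " + lock.getWriteHoldCount()); // 2

    lock.writeLock().unlock(); // hold count back to 1, write lock still held
    System.out.println("write hold count = " + lock.getWriteHoldCount()); // 1

    Thread reader = new Thread(() -> {
      lock.readLock().lock(); // blocks until the write hold count reaches 0
      try {
        System.out.println("reader acquired the read lock");
      } finally {
        lock.readLock().unlock();
      }
    });
    reader.start();
    Thread.sleep(200); // the reader is parked here, like the IPC handler above

    lock.writeLock().unlock(); // the second unlock releases the lock for real
    reader.join();
  }
}
{code}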
[jira] [Commented] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
[ https://issues.apache.org/jira/browse/YARN-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154183#comment-17154183 ] Brahma Reddy Battula commented on YARN-10347: - [~iwasakims], thanks for reporting this. Will the same be applicable to trunk and other versions? And could you give more details on this? > Fix double locking in CapacityScheduler#reinitialize in branch-3.1 > -- > > Key: YARN-10347 > URL: https://issues.apache.org/jira/browse/YARN-10347 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler > Affects Versions: 3.1.4 > Reporter: Masatake Iwasaki > Assignee: Masatake Iwasaki > Priority: Critical > Attachments: YARN-10347-branch-3.1.001.patch > > > Double locking blocks other threads in the ResourceManager that are waiting for the lock. > I found the issue while testing hadoop-3.1.4-RC2 on an RM-HA enabled deployment. The ResourceManager blocks on {{submitApplication}} waiting for the lock when I run example MR applications. > {noformat} > "IPC Server handler 45 on default port 8032" #211 daemon prio=5 os_prio=0 tid=0x7f0e45a40200 nid=0x418 waiting on condition [0x7f0e14abe000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x85d56510> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) > at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521) > at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417) > at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342) > at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678) > at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277) > at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563) > at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943) > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10324) Fetch data from NodeManager may cause read timeout when disk is busy
[ https://issues.apache.org/jira/browse/YARN-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154155#comment-17154155 ] Brahma Reddy Battula commented on YARN-10324: - Updated the target version to 3.3.1 as 3.3.0 is about to be released. > Fetch data from NodeManager may cause read timeout when disk is busy > --- > > Key: YARN-10324 > URL: https://issues.apache.org/jira/browse/YARN-10324 > Project: Hadoop YARN > Issue Type: Improvement > Components: auxservices > Affects Versions: 2.7.0, 3.2.1 > Reporter: Yao Guangdong > Priority: Minor > Labels: patch > Attachments: YARN-10324.001.patch > > > As the cluster size grows, the time a Reduce spends fetching Map results from the NodeManager gets longer and longer. We often see WARN logs like the following in the reducer's logs. > {quote}2020-06-19 15:43:15,522 WARN [fetcher#8] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to TX-196-168-211.com:13562 with 5 map outputs > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) > at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587) > at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492) > at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:434) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:400) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:271) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:330) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198) > {quote} > Checking the NodeManager server, we found that disk IO utilization and the number of connections became very high when the read timeouts happened. Our analysis: with 20,000 maps and 1,000 reduces, the NodeManager performs 20 million IO stream operations during the shuffle phase. When each reduce fetches only a small amount of data from the map output files, disk IO utilization becomes very high on a big cluster, read timeouts happen frequently, and applications take longer to finish. > ShuffleHandler already has an IndexCache for the file.out.index files. We want to turn many small IOs into one big IO to reduce the number of small disk IO operations, so we try to cache all the data of a small map output file (file.out) in memory when the first fetch request comes in; subsequent fetch requests then only read from memory, avoiding disk IO. After we cached the data in memory, the read timeouts disappeared. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10324) Fetch data from NodeManager may cause read timeout when disk is busy
[ https://issues.apache.org/jira/browse/YARN-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-10324: Target Version/s: 2.7.8, 3.3.1 (was: 2.7.8, 3.3.0) > Fetch data from NodeManager may cause read timeout when disk is busy > --- > > Key: YARN-10324 > URL: https://issues.apache.org/jira/browse/YARN-10324 > Project: Hadoop YARN > Issue Type: Improvement > Components: auxservices > Affects Versions: 2.7.0, 3.2.1 > Reporter: Yao Guangdong > Priority: Minor > Labels: patch > Attachments: YARN-10324.001.patch > > > As the cluster size grows, the time a Reduce spends fetching Map results from the NodeManager gets longer and longer. We often see WARN logs like the following in the reducer's logs. > {quote}2020-06-19 15:43:15,522 WARN [fetcher#8] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to TX-196-168-211.com:13562 with 5 map outputs > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) > at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587) > at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492) > at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:434) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:400) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:271) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:330) > at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198) > {quote} > Checking the NodeManager server, we found that disk IO utilization and the number of connections became very high when the read timeouts happened. Our analysis: with 20,000 maps and 1,000 reduces, the NodeManager performs 20 million IO stream operations during the shuffle phase. When each reduce fetches only a small amount of data from the map output files, disk IO utilization becomes very high on a big cluster, read timeouts happen frequently, and applications take longer to finish. > ShuffleHandler already has an IndexCache for the file.out.index files. We want to turn many small IOs into one big IO to reduce the number of small disk IO operations, so we try to cache all the data of a small map output file (file.out) in memory when the first fetch request comes in; subsequent fetch requests then only read from memory, avoiding disk IO. After we cached the data in memory, the read timeouts disappeared. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10348) Allow RM to always cancel tokens after app completes
[ https://issues.apache.org/jira/browse/YARN-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154105#comment-17154105 ] Hadoop QA commented on YARN-10348: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 2m 36s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 34s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 12m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 9s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 21m 46s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 6s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 49s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 32s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 23s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 27s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 49s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 58s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 58s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 1s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 92m 58s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 43s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}208m 55s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.o
[jira] [Commented] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
[ https://issues.apache.org/jira/browse/YARN-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154081#comment-17154081 ] Masatake Iwasaki commented on YARN-10347: - Committed to branch-3.1. > Fix double locking in CapacityScheduler#reinitialize in branch-3.1 > -- > > Key: YARN-10347 > URL: https://issues.apache.org/jira/browse/YARN-10347 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler > Affects Versions: 3.1.4 > Reporter: Masatake Iwasaki > Assignee: Masatake Iwasaki > Priority: Critical > Attachments: YARN-10347-branch-3.1.001.patch > > > Double locking blocks other threads in the ResourceManager that are waiting for the lock. > I found the issue while testing hadoop-3.1.4-RC2 on an RM-HA enabled deployment. The ResourceManager blocks on {{submitApplication}} waiting for the lock when I run example MR applications. > {noformat} > "IPC Server handler 45 on default port 8032" #211 daemon prio=5 os_prio=0 tid=0x7f0e45a40200 nid=0x418 waiting on condition [0x7f0e14abe000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x85d56510> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) > at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521) > at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417) > at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342) > at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678) > at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277) > at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563) > at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943) > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
[ https://issues.apache.org/jira/browse/YARN-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154075#comment-17154075 ] Masatake Iwasaki commented on YARN-10347: - I got no relevant failure. {noformat} [INFO] Results: [INFO] [ERROR] Failures: [ERROR] TestApplicationMasterService.testUpdateTrackingUrl:984 expected:<[hadoop.apache.org]> but was:<[N/A]> [INFO] [ERROR] Tests run: 2453, Failures: 1, Errors: 0, Skipped: 8 {noformat} TestApplicationMasterService.testUpdateTrackingUrl looks like a flaky one; I cannot reproduce the failure by rerunning the test. {noformat} $ mvn test -Dtest=TestApplicationMasterService ... [INFO] Running org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService [INFO] Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 13.082 s - in org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService [INFO] [INFO] Results: [INFO] [INFO] Tests run: 15, Failures: 0, Errors: 0, Skipped: 0 {noformat} > Fix double locking in CapacityScheduler#reinitialize in branch-3.1 > -- > > Key: YARN-10347 > URL: https://issues.apache.org/jira/browse/YARN-10347 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler > Affects Versions: 3.1.4 > Reporter: Masatake Iwasaki > Assignee: Masatake Iwasaki > Priority: Critical > Attachments: YARN-10347-branch-3.1.001.patch > > > Double locking blocks other threads in the ResourceManager that are waiting for the lock. > I found the issue while testing hadoop-3.1.4-RC2 on an RM-HA enabled deployment. The ResourceManager blocks on {{submitApplication}} waiting for the lock when I run example MR applications. > {noformat} > "IPC Server handler 45 on default port 8032" #211 daemon prio=5 os_prio=0 tid=0x7f0e45a40200 nid=0x418 waiting on condition [0x7f0e14abe000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x85d56510> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) > at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521) > at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417) > at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342) > at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678) > at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277) > at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563) > at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943) > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
[ https://issues.apache.org/jira/browse/YARN-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154071#comment-17154071 ] Masatake Iwasaki commented on YARN-10347: - Thanks, [~ayushtkn]. I'm running {{mvn test}} in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager on my local machine. I will commit the patch after checking the result. Since I can reproduce the docker failure by running ./start-build-env.sh, I filed HADOOP-17120. > Fix double locking in CapacityScheduler#reinitialize in branch-3.1 > -- > > Key: YARN-10347 > URL: https://issues.apache.org/jira/browse/YARN-10347 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler > Affects Versions: 3.1.4 > Reporter: Masatake Iwasaki > Assignee: Masatake Iwasaki > Priority: Critical > Attachments: YARN-10347-branch-3.1.001.patch > > > Double locking blocks other threads in the ResourceManager that are waiting for the lock. > I found the issue while testing hadoop-3.1.4-RC2 on an RM-HA enabled deployment. The ResourceManager blocks on {{submitApplication}} waiting for the lock when I run example MR applications. > {noformat} > "IPC Server handler 45 on default port 8032" #211 daemon prio=5 os_prio=0 tid=0x7f0e45a40200 nid=0x418 waiting on condition [0x7f0e14abe000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x85d56510> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) > at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521) > at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417) > at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342) > at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678) > at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277) > at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563) > at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943) > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail:
yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
[ https://issues.apache.org/jira/browse/YARN-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154062#comment-17154062 ] Ayush Saxena commented on YARN-10347: - hmm, need to see what triggered this at Jenkins. Anyway, the fix seems pretty straightforward. The original trunk patch in YARN-10022 didn't have it; it got added during the backport. +1, thanks [~iwasakims] for the find. This should ideally unblock the 3.1.4 release as well? cc. [~gabor.bota] > Fix double locking in CapacityScheduler#reinitialize in branch-3.1 > -- > > Key: YARN-10347 > URL: https://issues.apache.org/jira/browse/YARN-10347 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler > Affects Versions: 3.1.4 > Reporter: Masatake Iwasaki > Assignee: Masatake Iwasaki > Priority: Critical > Attachments: YARN-10347-branch-3.1.001.patch > > > Double locking blocks other threads in the ResourceManager that are waiting for the lock. > I found the issue while testing hadoop-3.1.4-RC2 on an RM-HA enabled deployment. The ResourceManager blocks on {{submitApplication}} waiting for the lock when I run example MR applications. > {noformat} > "IPC Server handler 45 on default port 8032" #211 daemon prio=5 os_prio=0 tid=0x7f0e45a40200 nid=0x418 waiting on condition [0x7f0e14abe000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x85d56510> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) > at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) > at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521) > at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417) > at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342) > at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678) > at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277) > at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563) > at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943) > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail:
yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10348) Allow RM to always cancel tokens after app completes
[ https://issues.apache.org/jira/browse/YARN-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154048#comment-17154048 ] Jim Brennan commented on YARN-10348: I have verified that TestDelegationTokenRenewer.testTokenThreadTimeout fails intermittently if I run it in a loop, with or without my change. > Allow RM to always cancel tokens after app completes > > > Key: YARN-10348 > URL: https://issues.apache.org/jira/browse/YARN-10348 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn > Affects Versions: 2.10.0, 3.1.3 > Reporter: Jim Brennan > Assignee: Jim Brennan > Priority: Major > Attachments: YARN-10348.001.patch, YARN-10348.002.patch > > > (Note: this change was originally done on our internal branch by [~daryn].) > The RM currently has an option for a client to specify disabling token cancellation when a job completes. This feature was an initial attempt to address the use case of a job launching sub-jobs (i.e. the oozie launcher) where the original job finishes prior to the sub-jobs' completion - e.g. the original job's completion triggered premature cancellation of tokens still needed by the sub-jobs. > Many years ago, [~daryn] added a more robust implementation that ref counts tokens ([YARN-3055]). This prevented premature cancellation of a token until all apps using it complete, and removed the need for a client to specify cancel=false. Unfortunately the config option was not removed. > We have seen cases where oozie "java actions" and some users were explicitly disabling token cancellation. This can lead to a buildup of defunct tokens that may overwhelm the ZK buffer used by the KDC's backing store, at which point the KMS fails to connect to ZK and is unable to issue/validate new tokens - rendering the KDC able to authenticate only pre-existing tokens. Production incidents have occurred due to this buffer size issue. > To avoid these issues, the RM should have the option to ignore/override the client's request to not cancel tokens. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
[ https://issues.apache.org/jira/browse/YARN-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-10347: Description: Double locking blocks another threads in ResourceManager waiting for the lock. I found the issue on testing hadoop-3.1.4-RC2 with RM-HA enabled deployment. ResourceManager blocks on {{submitApplication}} waiting for the lock when I run example MR applications. {noformat} "IPC Server handler 45 on default port 8032" #211 daemon prio=5 os_prio=0 tid=0x7f0e45a40200 nid=0x418 waiting on condition [0x7f0e14abe000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x85d56510> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943) {noformat} was:Double locking blocks another threads in ResourceManager waiting for the lock. > Fix double locking in CapacityScheduler#reinitialize in branch-3.1 > -- > > Key: YARN-10347 > URL: https://issues.apache.org/jira/browse/YARN-10347 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.1.4 >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Critical > Attachments: YARN-10347-branch-3.1.001.patch > > > Double locking blocks another threads in ResourceManager waiting for the lock. > I found the issue on testing hadoop-3.1.4-RC2 with RM-HA enabled deployment. > ResourceManager blocks on {{submitApplication}} waiting for the lock when I > run example MR applications. 
[jira] [Commented] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
[ https://issues.apache.org/jira/browse/YARN-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154034#comment-17154034 ] Masatake Iwasaki commented on YARN-10347: - Yes. The QA build failed during Docker image creation while installing Python packages. It does not seem to be related to the patch. {noformat} Step 26/32 : RUN pip2 install configparser==4.0.2 pylint==1.9.2 ... The command '/bin/sh -c pip2 install configparser==4.0.2 pylint==1.9.2' returned a non-zero code: 1 ERROR: Docker failed to build yetus/hadoop:d84386ccf7a. {noformat} > Fix double locking in CapacityScheduler#reinitialize in branch-3.1 > -- > > Key: YARN-10347 > URL: https://issues.apache.org/jira/browse/YARN-10347 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.1.4 >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Critical > Attachments: YARN-10347-branch-3.1.001.patch > > > Double locking blocks other threads in the ResourceManager waiting for the lock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10348) Allow RM to always cancel tokens after app completes
[ https://issues.apache.org/jira/browse/YARN-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154011#comment-17154011 ] Jim Brennan commented on YARN-10348: patch 002 addresses the checkstyle issues and adds an entry to yarn-default.xml to fix TestYarnConfigurationFields. I have not been able to repro the TestDelegationTokenRenewer.testTokenThreadTimeout() failure. It looks like the same timeout reported in [YARN-10155]. It appears to still fail intermittently. I don't think it is related to this patch. TestCapacityOverTimePolicy and TestFairSchedulerPreemption failures are unrelated to this change. > Allow RM to always cancel tokens after app completes > > > Key: YARN-10348 > URL: https://issues.apache.org/jira/browse/YARN-10348 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.10.0, 3.1.3 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10348.001.patch, YARN-10348.002.patch > > > (Note: this change was originally done on our internal branch by [~daryn]). > The RM currently has an option for a client to specify disabling token > cancellation when a job completes. This feature was an initial attempt to > address the use case of a job launching sub-jobs (ie. oozie launcher) and the > original job finishing prior to the sub-job(s) completion - ex. original job > completion triggered premature cancellation of tokens needed by the sub-jobs. > Many years ago, [~daryn] added a more robust implementation to ref count > tokens ([YARN-3055]). This prevented premature cancellation of the token > until all apps using the token complete, and invalidated the need for a > client to specify cancel=false. Unfortunately the config option was not > removed. > We have seen cases where oozie "java actions" and some users were explicitly > disabling token cancellation. This can lead to a buildup of defunct tokens > that may overwhelm the ZK buffer used by the KDC's backing store. At which > point the KMS fails to connect to ZK and is unable to issue/validate new > tokens - rendering the KDC only able to authenticate pre-existing tokens. > Production incidents have occurred due to the buffer size issue. > To avoid these issues, the RM should have the option to ignore/override the > client's request to not cancel tokens. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
[ https://issues.apache.org/jira/browse/YARN-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154009#comment-17154009 ] Ayush Saxena edited comment on YARN-10347 at 7/8/20, 8:59 PM: -- Thanx [~iwasakims] for the fix. The changes seem fair enough, but Jenkins doesn't seem to be behaving as expected. Any idea? was (Author: ayushtkn): Thanx [~iwasakims] for the fix. Changes seems fair enough, but the jenkins doesn't seems behaving expected. Any idea? > Fix double locking in CapacityScheduler#reinitialize in branch-3.1 > -- > > Key: YARN-10347 > URL: https://issues.apache.org/jira/browse/YARN-10347 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.1.4 >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Critical > Attachments: YARN-10347-branch-3.1.001.patch > > > Double locking blocks other threads in the ResourceManager waiting for the lock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
[ https://issues.apache.org/jira/browse/YARN-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154009#comment-17154009 ] Ayush Saxena commented on YARN-10347: - Thanx [~iwasakims] for the fix. The changes seem fair enough, but Jenkins doesn't seem to be behaving as expected. Any idea? > Fix double locking in CapacityScheduler#reinitialize in branch-3.1 > -- > > Key: YARN-10347 > URL: https://issues.apache.org/jira/browse/YARN-10347 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.1.4 >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Critical > Attachments: YARN-10347-branch-3.1.001.patch > > > Double locking blocks other threads in the ResourceManager waiting for the lock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10348) Allow RM to always cancel tokens after app completes
[ https://issues.apache.org/jira/browse/YARN-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-10348: --- Attachment: YARN-10348.002.patch > Allow RM to always cancel tokens after app completes > > > Key: YARN-10348 > URL: https://issues.apache.org/jira/browse/YARN-10348 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.10.0, 3.1.3 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10348.001.patch, YARN-10348.002.patch > > > (Note: this change was originally done on our internal branch by [~daryn]). > The RM currently has an option for a client to specify disabling token > cancellation when a job completes. This feature was an initial attempt to > address the use case of a job launching sub-jobs (ie. oozie launcher) and the > original job finishing prior to the sub-job(s) completion - ex. original job > completion triggered premature cancellation of tokens needed by the sub-jobs. > Many years ago, [~daryn] added a more robust implementation to ref count > tokens ([YARN-3055]). This prevented premature cancellation of the token > until all apps using the token complete, and invalidated the need for a > client to specify cancel=false. Unfortunately the config option was not > removed. > We have seen cases where oozie "java actions" and some users were explicitly > disabling token cancellation. This can lead to a buildup of defunct tokens > that may overwhelm the ZK buffer used by the KDC's backing store. At which > point the KMS fails to connect to ZK and is unable to issue/validate new > tokens - rendering the KDC only able to authenticate pre-existing tokens. > Production incidents have occurred due to the buffer size issue. > To avoid these issues, the RM should have the option to ignore/override the > client's request to not cancel tokens. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
[ https://issues.apache.org/jira/browse/YARN-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154001#comment-17154001 ] Hadoop QA commented on YARN-10347: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} docker {color} | {color:red} 0m 32s{color} | {color:red} Docker failed to build yetus/hadoop:d84386ccf7a. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-10347 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13007309/YARN-10347-branch-3.1.001.patch | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/26260/console | | versions | git=2.17.1 | | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org | This message was automatically generated. > Fix double locking in CapacityScheduler#reinitialize in branch-3.1 > -- > > Key: YARN-10347 > URL: https://issues.apache.org/jira/browse/YARN-10347 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.1.4 >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Critical > Attachments: YARN-10347-branch-3.1.001.patch > > > Double locking blocks other threads in the ResourceManager waiting for the lock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10348) Allow RM to always cancel tokens after app completes
[ https://issues.apache.org/jira/browse/YARN-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153974#comment-17153974 ] Hadoop QA commented on YARN-10348: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 32s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 28s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 26s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 12m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 22m 7s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 32s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 7s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 12s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 26s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 8m 58s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 35s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 2 new + 301 unchanged - 0 fixed = 303 total (was 301) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 50s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 48s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 56s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 58s{color} | {color:red} hadoop-yarn-api in the patch failed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 94m 55s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 42s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}202m 14s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.conf.TestYarnConfigurationFields | | | hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy | | | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.
[jira] [Commented] (YARN-10348) Allow RM to always cancel tokens after app completes
[ https://issues.apache.org/jira/browse/YARN-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153754#comment-17153754 ] Jim Brennan commented on YARN-10348: patch 001 adds a new YARN configuration property: {noformat} public static final String RM_DELEGATION_TOKEN_ALWAYS_CANCEL = RM_PREFIX + "delegation-token.always-cancel"; public static final boolean DEFAULT_RM_DELEGATION_TOKEN_ALWAYS_CANCEL = false; {noformat} Internally we default this to true, but to maintain compatibility I've set it to false in this patch. If this property is true, we effectively ignore the {{shouldCancelAtEnd}} parameter that came from the client. We have been running with this change in production internally for about two years. > Allow RM to always cancel tokens after app completes > > > Key: YARN-10348 > URL: https://issues.apache.org/jira/browse/YARN-10348 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.10.0, 3.1.3 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10348.001.patch > > > (Note: this change was originally done on our internal branch by [~daryn]). > The RM currently has an option for a client to specify disabling token > cancellation when a job completes. This feature was an initial attempt to > address the use case of a job launching sub-jobs (ie. oozie launcher) and the > original job finishing prior to the sub-job(s) completion - ex. original job > completion triggered premature cancellation of tokens needed by the sub-jobs. > Many years ago, [~daryn] added a more robust implementation to ref count > tokens ([YARN-3055]). This prevented premature cancellation of the token > until all apps using the token complete, and invalidated the need for a > client to specify cancel=false. Unfortunately the config option was not > removed. > We have seen cases where oozie "java actions" and some users were explicitly > disabling token cancellation. This can lead to a buildup of defunct tokens > that may overwhelm the ZK buffer used by the KDC's backing store. At which > point the KMS fails to connect to ZK and is unable to issue/validate new > tokens - rendering the KDC only able to authenticate pre-existing tokens. > Production incidents have occurred due to the buffer size issue. > To avoid these issues, the RM should have the option to ignore/override the > client's request to not cancel tokens. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
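To make the semantics concrete, here is a minimal sketch of the override logic the property implies. The class and method names below are illustrative rather than the actual patch contents; only the property key (RM_PREFIX resolves to "yarn.resourcemanager.") and its false default come from the comment above.
{code:java}
import org.apache.hadoop.conf.Configuration;

// Illustrative sketch, not the patch: how an "always cancel" switch can
// override the client's cancel=false request in the token renewer path.
public final class TokenCancelPolicy {
  private final boolean alwaysCancel;

  public TokenCancelPolicy(Configuration conf) {
    // Property described in the comment above; defaults to false for
    // compatibility with existing clients.
    this.alwaysCancel = conf.getBoolean(
        "yarn.resourcemanager.delegation-token.always-cancel", false);
  }

  /** True if tokens should be cancelled when the application completes. */
  public boolean shouldCancelAtEnd(boolean clientRequestedCancel) {
    // With always-cancel enabled, the client's shouldCancelAtEnd=false
    // is effectively ignored.
    return alwaysCancel || clientRequestedCancel;
  }
}
{code}
Either way, the token ref-counting from YARN-3055 still prevents a token shared by live applications from being cancelled early.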
[jira] [Created] (YARN-10348) Allow RM to always cancel tokens after app completes
Jim Brennan created YARN-10348: -- Summary: Allow RM to always cancel tokens after app completes Key: YARN-10348 URL: https://issues.apache.org/jira/browse/YARN-10348 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.1.3, 2.10.0 Reporter: Jim Brennan Assignee: Jim Brennan (Note: this change was originally done on our internal branch by [~daryn]). The RM currently has an option for a client to specify disabling token cancellation when a job completes. This feature was an initial attempt to address the use case of a job launching sub-jobs (ie. oozie launcher) and the original job finishing prior to the sub-job(s) completion - ex. original job completion triggered premature cancellation of tokens needed by the sub-jobs. Many years ago, [~daryn] added a more robust implementation to ref count tokens ([YARN-3055]). This prevented premature cancellation of the token until all apps using the token complete, and invalidated the need for a client to specify cancel=false. Unfortunately the config option was not removed. We have seen cases where oozie "java actions" and some users were explicitly disabling token cancellation. This can lead to a buildup of defunct tokens that may overwhelm the ZK buffer used by the KDC's backing store. At which point the KMS fails to connect to ZK and is unable to issue/validate new tokens - rendering the KDC only able to authenticate pre-existing tokens. Production incidents have occurred due to the buffer size issue. To avoid these issues, the RM should have the option to ignore/override the client's request to not cancel tokens. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
[ https://issues.apache.org/jira/browse/YARN-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153714#comment-17153714 ] Hadoop QA commented on YARN-10347: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} docker {color} | {color:red} 14m 5s{color} | {color:red} Docker failed to build yetus/hadoop:d84386ccf7a. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-10347 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13007309/YARN-10347-branch-3.1.001.patch | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/26258/console | | versions | git=2.17.1 | | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org | This message was automatically generated. > Fix double locking in CapacityScheduler#reinitialize in branch-3.1 > -- > > Key: YARN-10347 > URL: https://issues.apache.org/jira/browse/YARN-10347 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.1.4 >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Critical > Attachments: YARN-10347-branch-3.1.001.patch > > > Double locking blocks other threads in the ResourceManager waiting for the lock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10341) Yarn Service Container Completed event doesn't get processed
[ https://issues.apache.org/jira/browse/YARN-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153700#comment-17153700 ] Hadoop QA commented on YARN-10341: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 19s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 42s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 0m 57s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 54s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 15s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core: The patch generated 2 new + 15 unchanged - 0 fixed = 17 total (was 15) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 28s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 37s{color} | {color:green} hadoop-yarn-services-core in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 31s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 84m 7s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26257/artifact/out/Dockerfile | | JIRA Issue | YARN-10341 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13007305/YARN-10341.003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux eb15ff0a25e6 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 3a4d05b8504 | | Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/26257/artifact/out/diff-checkstyle
[jira] [Updated] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
[ https://issues.apache.org/jira/browse/YARN-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-10347: Attachment: YARN-10347-branch-3.1.001.patch > Fix double locking in CapacityScheduler#reinitialize in branch-3.1 > -- > > Key: YARN-10347 > URL: https://issues.apache.org/jira/browse/YARN-10347 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.1.4 >Reporter: Masatake Iwasaki >Assignee: Masatake Iwasaki >Priority: Critical > Attachments: YARN-10347-branch-3.1.001.patch > > > Double locking blocks other threads in the ResourceManager waiting for the lock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10347) Fix double locking in CapacityScheduler#reinitialize in branch-3.1
Masatake Iwasaki created YARN-10347: --- Summary: Fix double locking in CapacityScheduler#reinitialize in branch-3.1 Key: YARN-10347 URL: https://issues.apache.org/jira/browse/YARN-10347 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Affects Versions: 3.1.4 Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Double locking blocks other threads in the ResourceManager waiting for the lock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
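The failure mode is easy to reproduce in isolation. Below is a generic illustration, not the branch-3.1 CapacityScheduler code: {{ReentrantReadWriteLock}} is reentrant, so acquiring the write lock twice succeeds, but each acquisition needs a matching unlock. If one release is missing, the hold count stays at 1 and every later read-lock attempt, such as the {{checkAndGetApplicationPriority}} call in the jstack above, parks forever.
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Generic illustration of the double-locking bug; not the actual
// CapacityScheduler code.
public class DoubleLockDemo {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  public void reinitialize() {
    lock.writeLock().lock();     // first acquisition
    try {
      lock.writeLock().lock();   // accidental second acquisition; it
                                 // succeeds because the lock is reentrant
      // ... reload configuration, rebuild queues ...
    } finally {
      lock.writeLock().unlock(); // releases only one of the two holds
    }
    // Hold count is still 1 here: the write lock is never fully released,
    // so every subsequent readLock().lock() parks indefinitely.
  }
}
{code}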
[jira] [Updated] (YARN-10341) Yarn Service Container Completed event doesn't get processed
[ https://issues.apache.org/jira/browse/YARN-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated YARN-10341: - Attachment: YARN-10341.003.patch > Yarn Service Container Completed event doesn't get processed > - > > Key: YARN-10341 > URL: https://issues.apache.org/jira/browse/YARN-10341 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Critical > Attachments: YARN-10341.001.patch, YARN-10341.002.patch, > YARN-10341.003.patch > > > If there are 10 workers running and containers get killed, after a while we > see that there are just 9 workers running. This is because the CONTAINER_COMPLETED > event is not processed on the AM side. > The issue is in the below code: > {code:java} > public void onContainersCompleted(List<ContainerStatus> statuses) { > for (ContainerStatus status : statuses) { > ContainerId containerId = status.getContainerId(); > ComponentInstance instance = > liveInstances.get(status.getContainerId()); > if (instance == null) { > LOG.warn( > "Container {} Completed. No component instance exists. > exitStatus={}. diagnostics={} ", > containerId, status.getExitStatus(), status.getDiagnostics()); > return; > } > ComponentEvent event = > new ComponentEvent(instance.getCompName(), CONTAINER_COMPLETED) > .setStatus(status).setInstance(instance) > .setContainerId(containerId); > dispatcher.getEventHandler().handle(event); > } > {code} > If a component instance doesn't exist for a container, it doesn't iterate over > the remaining containers, as it returns from the method -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10341) Yarn Service Container Completed event doesn't get processed
[ https://issues.apache.org/jira/browse/YARN-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153590#comment-17153590 ] Hadoop QA commented on YARN-10341: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 21s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 18s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 0m 55s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 52s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 15s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core: The patch generated 4 new + 15 unchanged - 0 fixed = 19 total (was 15) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 25s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 19m 25s{color} | {color:red} hadoop-yarn-services-core in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 81m 1s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.service.TestServiceAM | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26256/artifact/out/Dockerfile | | JIRA Issue | YARN-10341 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13007298/YARN-10341.002.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 13923a5cf4ff 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 3a4d05b8504 | | Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 | | checkstyle | https://b
[jira] [Updated] (YARN-10341) Yarn Service Container Completed event doesn't get processed
[ https://issues.apache.org/jira/browse/YARN-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated YARN-10341: - Attachment: YARN-10341.002.patch > Yarn Service Container Completed event doesn't get processed > - > > Key: YARN-10341 > URL: https://issues.apache.org/jira/browse/YARN-10341 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Critical > Attachments: YARN-10341.001.patch, YARN-10341.002.patch > > > If there are 10 workers running and containers get killed, after a while we > see that there are just 9 workers running. This is because the CONTAINER_COMPLETED > event is not processed on the AM side. > The issue is in the below code: > {code:java} > public void onContainersCompleted(List<ContainerStatus> statuses) { > for (ContainerStatus status : statuses) { > ContainerId containerId = status.getContainerId(); > ComponentInstance instance = > liveInstances.get(status.getContainerId()); > if (instance == null) { > LOG.warn( > "Container {} Completed. No component instance exists. > exitStatus={}. diagnostics={} ", > containerId, status.getExitStatus(), status.getDiagnostics()); > return; > } > ComponentEvent event = > new ComponentEvent(instance.getCompName(), CONTAINER_COMPLETED) > .setStatus(status).setInstance(instance) > .setContainerId(containerId); > dispatcher.getEventHandler().handle(event); > } > {code} > If a component instance doesn't exist for a container, it doesn't iterate over > the remaining containers, as it returns from the method -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10341) Yarn Service Container Completed event doesn't get processed
[ https://issues.apache.org/jira/browse/YARN-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153525#comment-17153525 ] Bilwa S T commented on YARN-10341: -- Thanks [~brahmareddy] [~billie] for reviewing. I have added a testcase in patch 002 > Yarn Service Container Completed event doesn't get processed > - > > Key: YARN-10341 > URL: https://issues.apache.org/jira/browse/YARN-10341 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Critical > Attachments: YARN-10341.001.patch, YARN-10341.002.patch > > > If there are 10 workers running and containers get killed, after a while we > see that there are just 9 workers running. This is because the CONTAINER_COMPLETED > event is not processed on the AM side. > The issue is in the below code: > {code:java} > public void onContainersCompleted(List<ContainerStatus> statuses) { > for (ContainerStatus status : statuses) { > ContainerId containerId = status.getContainerId(); > ComponentInstance instance = > liveInstances.get(status.getContainerId()); > if (instance == null) { > LOG.warn( > "Container {} Completed. No component instance exists. > exitStatus={}. diagnostics={} ", > containerId, status.getExitStatus(), status.getDiagnostics()); > return; > } > ComponentEvent event = > new ComponentEvent(instance.getCompName(), CONTAINER_COMPLETED) > .setStatus(status).setInstance(instance) > .setContainerId(containerId); > dispatcher.getEventHandler().handle(event); > } > {code} > If a component instance doesn't exist for a container, it doesn't iterate over > the remaining containers, as it returns from the method -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
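For readers following the thread, the fix implied by the description is a one-word change in the AM callback: replace {{return}} with {{continue}} so that one unmatched container no longer aborts the whole batch. A minimal sketch, assuming the surrounding {{liveInstances}}, {{dispatcher}}, and {{LOG}} fields of the service AM (the authoritative change is in the attached patches):
{code:java}
@Override
public void onContainersCompleted(List<ContainerStatus> statuses) {
  for (ContainerStatus status : statuses) {
    ContainerId containerId = status.getContainerId();
    ComponentInstance instance = liveInstances.get(containerId);
    if (instance == null) {
      LOG.warn(
          "Container {} Completed. No component instance exists. "
              + "exitStatus={}. diagnostics={} ",
          containerId, status.getExitStatus(), status.getDiagnostics());
      // continue, not return: skip only this status so the remaining
      // completed containers in the batch are still processed.
      continue;
    }
    ComponentEvent event =
        new ComponentEvent(instance.getCompName(), CONTAINER_COMPLETED)
            .setStatus(status).setInstance(instance)
            .setContainerId(containerId);
    dispatcher.getEventHandler().handle(event);
  }
}
{code}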
[jira] [Commented] (YARN-8047) RMWebApp make external class pluggable
[ https://issues.apache.org/jira/browse/YARN-8047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153466#comment-17153466 ] Bilwa S T commented on YARN-8047: - Thanks [~prabhujoseph]. Raised YARN-10346 for adding a testcase. > RMWebApp make external class pluggable > -- > > Key: YARN-8047 > URL: https://issues.apache.org/jira/browse/YARN-8047 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Bilwa S T >Priority: Minor > Fix For: 3.4.0 > > Attachments: YARN-8047-001.patch, YARN-8047-002.patch, > YARN-8047-003.patch, YARN-8047.004.patch, YARN-8047.005.patch, > YARN-8047.006.patch > > > This Jira should make sure we are able to plug in web services and web pages > of the scheduler in the ResourceManager > * RMWebApp: allow binding external classes > * RMController: allow plugging in scheduler classes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10346) Add testcase for RMWebApp make external class pluggable
Bilwa S T created YARN-10346: Summary: Add testcase for RMWebApp make external class pluggable Key: YARN-10346 URL: https://issues.apache.org/jira/browse/YARN-10346 Project: Hadoop YARN Issue Type: Bug Reporter: Bilwa S T Assignee: Bilwa S T Add a testcase for Jira YARN-8047 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path
[ https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153405#comment-17153405 ] Zhankun Tang commented on YARN-10333: - LGTM. +1. Thanks for your contribution! [~prabhujoseph], [~sunilg] > YarnClient obtain Delegation Token for Log Aggregation Path > --- > > Key: YARN-10333 > URL: https://issues.apache.org/jira/browse/YARN-10333 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10333-001.patch, YARN-10333-002.patch, > YARN-10333-003.patch > > > There are use cases where the Yarn log aggregation path is configured to a > FileSystem like S3 or ABFS that is different from what is configured in > fs.defaultFS (HDFS). Log aggregation fails because the client has a token only > for fs.defaultFS and not for the log aggregation path. > This Jira is to improve YarnClient by obtaining a delegation token for the log > aggregation path and adding it to the Credentials of the Container Launch Context, > similar to how it does for the Timeline delegation token. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path
[ https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153326#comment-17153326 ] Sunil G commented on YARN-10333: This change looks fine to me. cc [~ztang] [~bibinchundatt] [~rohithsharmaks] thoughts? > YarnClient obtain Delegation Token for Log Aggregation Path > --- > > Key: YARN-10333 > URL: https://issues.apache.org/jira/browse/YARN-10333 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10333-001.patch, YARN-10333-002.patch, > YARN-10333-003.patch > > > There are use cases where the Yarn log aggregation path is configured to a > FileSystem like S3 or ABFS that is different from what is configured in > fs.defaultFS (HDFS). Log aggregation fails because the client has a token only > for fs.defaultFS and not for the log aggregation path. > This Jira is to improve YarnClient by obtaining a delegation token for the log > aggregation path and adding it to the Credentials of the Container Launch Context, > similar to how it does for the Timeline delegation token. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
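A minimal sketch of what obtaining the extra token can look like on the client side. The helper below is illustrative, assuming the remote log directory is read from {{yarn.nodemanager.remote-app-log-dir}} and that the caller serializes {{credentials}} into the ContainerLaunchContext:
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative helper: fetch a delegation token for the log aggregation
// filesystem (possibly S3/ABFS, distinct from fs.defaultFS) so the app's
// logs can be aggregated on a secure cluster.
public final class LogAggregationTokenHelper {
  public static void addLogAggregationToken(Configuration conf,
      String renewer, Credentials credentials) throws IOException {
    Path remoteLogDir = new Path(conf.get(
        YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
        YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR));
    FileSystem logFs = remoteLogDir.getFileSystem(conf);
    // No-op if the filesystem issues no tokens or the credentials already
    // hold a valid token for it.
    logFs.addDelegationTokens(renewer, credentials);
  }
}
{code}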
[jira] [Updated] (YARN-10340) HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol
[ https://issues.apache.org/jira/browse/YARN-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10340: - Parent: YARN-10025 Issue Type: Sub-task (was: Bug) > HsWebServices getContainerReport uses loginUser instead of remoteUser to > access ApplicationClientProtocol > - > > Key: YARN-10340 > URL: https://issues.apache.org/jira/browse/YARN-10340 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Prabhu Joseph >Assignee: Tarun Parimi >Priority: Major > > HsWebServices getContainerReport uses loginUser instead of remoteUser to > access ApplicationClientProtocol > > [http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs|http://pjoseph-secure-1.pjoseph-secure.root.hwx.site:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs] > While accessing the above link as the systest user, the request fails saying the > mapred user does not have access to the job > > {code:java} > 2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: > Could not obtain node HTTP address from provider. > javax.ws.rs.WebApplicationException: > org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have > privilege to see this application application_1593997842459_0214 > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882) > at > org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowThrowable(WebServices.java:544) > at > org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:530) > at > org.apache.hadoop.yarn.server.webapp.WebServices.getContainer(WebServices.java:405) > at > org.apache.hadoop.yarn.server.webapp.WebServices.getNodeHttpAddress(WebServices.java:373) > at > org.apache.hadoop.yarn.server.webapp.LogServlet.getContainerLogsInfo(LogServlet.java:268) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.getContainerLogs(HsWebServices.java:461) > > {code} > On analyzing this, we found that WebServices#getContainer uses doAs with a UGI created by > createRemoteUser(end user) to access RM#ApplicationClientProtocol, which does > not work. It needs to use createProxyUser instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
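The crux is the difference between the two UGI constructions. A sketch under the assumption that the history server's login user is configured as a proxy user for end users (hadoop.proxyuser.* settings); the class and method here are illustrative:
{code:java}
import java.io.IOException;
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

// Illustrative sketch, not the HsWebServices code.
public final class ProxyUserSketch {
  public static void callAsEndUser(String remoteUser)
      throws IOException, InterruptedException {
    // Broken pattern (shown for contrast): createRemoteUser yields a UGI
    // with no Kerberos credentials, so the RPC to the RM ends up
    // authenticating as the daemon's login user (mapred) instead of the
    // web caller.
    UserGroupInformation broken =
        UserGroupInformation.createRemoteUser(remoteUser);

    // Working pattern: a proxy UGI keeps the daemon's login credentials
    // as the real user while the RM sees and authorizes the end user.
    UserGroupInformation proxy = UserGroupInformation.createProxyUser(
        remoteUser, UserGroupInformation.getLoginUser());
    proxy.doAs((PrivilegedExceptionAction<Void>) () -> {
      // invoke ApplicationClientProtocol#getContainerReport(...) here
      return null;
    });
  }
}
{code}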
[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10345: - Parent: YARN-10025 Issue Type: Sub-task (was: Bug) > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Critical > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10345: - Affects Version/s: (was: 3.2.0) 3.2.2 3.3.0 > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Critical > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10345: - Description: HsWebServices containerlogs does not honor ACLs. User who does not have permission to view a job is allowed to view the job logs for completed jobs from YARN UI2 through HsWebServices. *Repro:* Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + HistoryServer runs as mapred # Run a sample MR job using systest user # Once the job is complete, access the job logs using hue user from YARN UI2. !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! YARN CLI works fine and does not allow hue user to view systest user job logs. {code:java} [hue@pjoseph-cm-2 /]$ [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at rmhostname:8032 Permission denied: user=hue, access=EXECUTE, inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) {code} was: HsWebServices containerlogs does not honor ACLs. User who does not have permission to view a job is allowed to view the job logs from YARN UI2 through HsWebServices. *Repro:* Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + HistoryServer runs as mapred # Run a sample MR job using systest user # Once the job is complete, access the job logs using hue user from YARN UI2. !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! YARN CLI works fine and does not allow hue user to view systest user job logs. {code:java} [hue@pjoseph-cm-2 /]$ [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at rmhostname:8032 Permission denied: user=hue, access=EXECUTE, inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) {code} > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.2.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Critical > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prabhu Joseph updated YARN-10345:
---------------------------------
    Description:
HsWebServices containerlogs does not honor ACLs. A user who does not have permission to view a job can still view the job's logs from YARN UI2 through HsWebServices.

*Repro:*
Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + HistoryServer runs as mapred
# Run a sample MR job as the systest user.
# Once the job is complete, access the job logs as the hue user from YARN UI2.

!Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300!

The YARN CLI works fine and does not allow the hue user to view the systest user's job logs.
{code:java}
[hue@pjoseph-cm-2 /]$
[hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at rmhostname:8032
Permission denied: user=hue, access=EXECUTE, inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
{code}

  was:
HsWebServices containerlogs does not honor ACLs. A user who does not have permission to view a job can still view the job's logs from YARN UI2 through HsWebServices.

*Repro:*
Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + HistoryServer runs as mapred
1. Run a sample MR job as the systest user.
2. Once the job is complete, access the job logs as the hue user from YARN UI2.

The YARN CLI works fine.
{code}
[hue@pjoseph-cm-2 /]$
[hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at rmhostname:8032
Permission denied: user=hue, access=EXECUTE, inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
{code}

> HsWebServices containerlogs does not honor ACLs for completed jobs
> ------------------------------------------------------------------
>
>                 Key: YARN-10345
>                 URL: https://issues.apache.org/jira/browse/YARN-10345
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.2.0, 3.4.0
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>            Priority: Critical
>         Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png
>
> HsWebServices containerlogs does not honor ACLs. A user who does not have permission to view a job can still view the job's logs from YARN UI2 through HsWebServices.
> *Repro:*
> Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + HistoryServer runs as mapred
> # Run a sample MR job as the systest user.
> # Once the job is complete, access the job logs as the hue user from YARN UI2.
> !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300!
>
> The YARN CLI works fine and does not allow the hue user to view the systest user's job logs.
> {code:java}
> [hue@pjoseph-cm-2 /]$
> [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at rmhostname:8032
> Permission denied: user=hue, access=EXECUTE, inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
>     at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
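For context on the fix direction: the authorization check that the CLI path enforces, and that the HsWebServices path misses for completed jobs, follows the standard ApplicationACLsManager pattern used elsewhere in YARN. The sketch below is purely illustrative and is not the committed patch; the ContainerLogAclCheck wrapper and its canViewLogs method are hypothetical names.

{code:java}
// Illustrative sketch only, not the YARN-10345 patch. It shows the standard
// ApplicationACLsManager check a log-serving endpoint needs before streaming
// aggregated logs. ContainerLogAclCheck and canViewLogs are hypothetical names.
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.security.ApplicationAccessType;
import org.apache.hadoop.yarn.server.security.ApplicationACLsManager;

public class ContainerLogAclCheck {

  private final ApplicationACLsManager aclsManager;

  public ContainerLogAclCheck(ApplicationACLsManager aclsManager) {
    this.aclsManager = aclsManager;
  }

  // checkAccess returns true when ACLs are disabled, the caller is an admin,
  // the caller owns the application, or the app's VIEW_APP ACL admits them.
  public boolean canViewLogs(UserGroupInformation callerUgi, String appOwner,
      ApplicationId appId) {
    if (callerUgi == null) {
      return false; // no authenticated caller, deny on a secure cluster
    }
    return aclsManager.checkAccess(callerUgi, ApplicationAccessType.VIEW_APP,
        appOwner, appId);
  }
}
{code}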
[jira] [Created] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
Prabhu Joseph created YARN-10345:
------------------------------------

             Summary: HsWebServices containerlogs does not honor ACLs for completed jobs
                 Key: YARN-10345
                 URL: https://issues.apache.org/jira/browse/YARN-10345
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
    Affects Versions: 3.2.0, 3.4.0
            Reporter: Prabhu Joseph
            Assignee: Prabhu Joseph
         Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png

HsWebServices containerlogs does not honor ACLs. A user who does not have permission to view a job can still view the job's logs from YARN UI2 through HsWebServices.

*Repro:*
Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + HistoryServer runs as mapred
1. Run a sample MR job as the systest user.
2. Once the job is complete, access the job logs as the hue user from YARN UI2.

The YARN CLI works fine.
{code}
[hue@pjoseph-cm-2 /]$
[hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at rmhostname:8032
Permission denied: user=hue, access=EXECUTE, inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
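Incidentally, the CLI denial in the repro comes from plain HDFS permission checking rather than from YARN ACLs: traversing into the per-user aggregated-log directory /tmp/logs/systest requires the EXECUTE bit, and drwxrwx--- grants nothing to "other". A minimal sketch of that rule follows, assuming hue is neither the owner (systest) nor a member of the hadoop group; the WhyCliFails class is a hypothetical name.

{code:java}
// Minimal sketch, assuming the caller is neither the owner nor in the group.
// WhyCliFails is a hypothetical name; FsPermission and FsAction are real
// Hadoop APIs from hadoop-common.
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class WhyCliFails {
  public static void main(String[] args) {
    // /tmp/logs/systest is systest:hadoop:drwxrwx---
    FsPermission perm = FsPermission.valueOf("drwxrwx---");
    // hue is neither the owner nor in the hadoop group, so the permission
    // check falls through to "other", which carries no EXECUTE bit.
    boolean canTraverse = perm.getOtherAction().implies(FsAction.EXECUTE);
    System.out.println(canTraverse); // false -> Permission denied: access=EXECUTE
  }
}
{code}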
[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prabhu Joseph updated YARN-10345:
---------------------------------
    Attachment: Screen Shot 2020-07-08 at 12.54.21 PM.png

> HsWebServices containerlogs does not honor ACLs for completed jobs
> ------------------------------------------------------------------
>
>                 Key: YARN-10345
>                 URL: https://issues.apache.org/jira/browse/YARN-10345
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.2.0, 3.4.0
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>            Priority: Critical
>         Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png
>
> HsWebServices containerlogs does not honor ACLs. A user who does not have permission to view a job can still view the job's logs from YARN UI2 through HsWebServices.
> *Repro:*
> Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + HistoryServer runs as mapred
> 1. Run a sample MR job as the systest user.
> 2. Once the job is complete, access the job logs as the hue user from YARN UI2.
> The YARN CLI works fine.
> {code}
> [hue@pjoseph-cm-2 /]$
> [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at rmhostname:8032
> Permission denied: user=hue, access=EXECUTE, inode="/tmp/logs/systest":systest:hadoop:drwxrwx---
>     at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8047) RMWebApp make external class pluggable
[ https://issues.apache.org/jira/browse/YARN-8047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153315#comment-17153315 ]

Hudson commented on YARN-8047:
------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18418 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18418/])
YARN-8047. RMWebApp make external class pluggable. (pjoseph: rev 3a4d05b850449c51a13f3a15fe0d756fdf50b4b2)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebApp.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RmController.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml

> RMWebApp make external class pluggable
> ---------------------------------------
>
>                 Key: YARN-8047
>                 URL: https://issues.apache.org/jira/browse/YARN-8047
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bibin Chundatt
>            Assignee: Bilwa S T
>            Priority: Minor
>             Fix For: 3.4.0
>
>         Attachments: YARN-8047-001.patch, YARN-8047-002.patch, YARN-8047-003.patch, YARN-8047.004.patch, YARN-8047.005.patch, YARN-8047.006.patch
>
>
> This Jira should make it possible to plug in the scheduler's web services and web pages in the ResourceManager:
> * RMWebApp allows binding external classes
> * RMController allows plugging in scheduler classes

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
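For readers following the change, the edited files suggest the usual Hadoop idiom for pluggable classes: a key defined in YarnConfiguration (documented in yarn-default.xml) resolved via Configuration.getClass and instantiated with ReflectionUtils. A minimal sketch of that pattern follows; the PLUGGABLE_CONTROLLER_KEY property name is hypothetical, not the actual key the patch adds.

{code:java}
// Minimal sketch of the Configuration.getClass + ReflectionUtils.newInstance
// idiom, not the committed code. PLUGGABLE_CONTROLLER_KEY is a hypothetical
// key name; the real one lives in YarnConfiguration / yarn-default.xml.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

public class PluggableWebAppLoader {

  // Hypothetical key name, for illustration only.
  static final String PLUGGABLE_CONTROLLER_KEY =
      "yarn.resourcemanager.webapp.custom-controller.class";

  // Loads the class named in the configuration, falling back to defaultClass
  // when the key is unset. ReflectionUtils.newInstance also injects the
  // Configuration when the loaded class implements Configurable.
  static <T> T loadController(Configuration conf,
      Class<? extends T> defaultClass, Class<T> xface) {
    Class<? extends T> clazz =
        conf.getClass(PLUGGABLE_CONTROLLER_KEY, defaultClass, xface);
    return ReflectionUtils.newInstance(clazz, conf);
  }
}
{code}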
[jira] [Commented] (YARN-8047) RMWebApp make external class pluggable
[ https://issues.apache.org/jira/browse/YARN-8047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153309#comment-17153309 ]

Prabhu Joseph commented on YARN-8047:
-------------------------------------

Thanks [~BilwaST] for the patch. I have committed the latest patch [^YARN-8047.006.patch] to trunk. Could you file a separate Jira to handle the test case?

> RMWebApp make external class pluggable
> ---------------------------------------
>
>                 Key: YARN-8047
>                 URL: https://issues.apache.org/jira/browse/YARN-8047
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bibin Chundatt
>            Assignee: Bilwa S T
>            Priority: Minor
>         Attachments: YARN-8047-001.patch, YARN-8047-002.patch, YARN-8047-003.patch, YARN-8047.004.patch, YARN-8047.005.patch, YARN-8047.006.patch
>
>
> This Jira should make it possible to plug in the scheduler's web services and web pages in the ResourceManager:
> * RMWebApp allows binding external classes
> * RMController allows plugging in scheduler classes

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org