[ https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342525#comment-17342525 ]
Hadoop QA commented on YARN-10642:
----------------------------------

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 1s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} branch-3.1 Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 2s{color} | {color:green}{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s{color} | {color:green}{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s{color} | {color:green}{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 46s{color} | {color:green}{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 7s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 48s{color} | {color:green}{color} | {color:green} branch-3.1 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 15m 39s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green} 1m 44s{color} | {color:green}{color} | {color:green} branch-3.1 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 40s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 36s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 23s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 54s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 43s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green} 1m 52s{color} | {color:green}{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} || ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 58s{color} | {color:green}{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 30s{color} | {color:green}{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 76m 15s{color} | {color:black}{color} | {color:black}{color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/973/artifact/out/Dockerfile |
| JIRA Issue | YARN-10642 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13025310/YARN-10642-branch-3.1.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle spotbugs |
| uname | Linux 94469edd1854 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | branch-3.1 / 39bf9e270e3 |
| Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 |
| Test Results | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/973/testReport/ |
| Max. process+thread count | 410 (vs. 
ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common |
| Console output | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/973/console |
| versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |

This message was automatically generated.

> Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-10642
>                 URL: https://issues.apache.org/jira/browse/YARN-10642
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 3.2.1
>            Reporter: zhengchenyu
>            Assignee: zhengchenyu
>            Priority: Critical
>             Fix For: 3.4.0, 3.3.1, 3.2.3
>
>         Attachments: MockForDeadLoop.java, YARN-10642-branch-3.1.001.patch,
> YARN-10642-branch-3.2.001.patch, YARN-10642-branch-3.2.002.patch,
> YARN-10642-branch-3.3.001.patch, YARN-10642.001.patch, YARN-10642.002.patch,
> YARN-10642.003.patch, YARN-10642.004.patch, YARN-10642.005.patch,
> deadloop.png, debugfornode.png, put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days, and
> YARN clients could not submit applications. I captured a jstack dump the
> second time and found the cause.
> After analyzing the whole jstack output, I found many threads stuck because
> they could not acquire LinkedBlockingQueue.putLock. (Note: the full analysis
> is omitted for lack of space.)
> The cause is that one thread holds the putLock all the time:
> printEventQueueDetails calls forEachRemaining, which holds both putLock and
> takeLock. The AsyncDispatcher then gets stuck.
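For readers who do not have the YARN source at hand, here is a minimal sketch of the kind of call the report points at: streaming a LinkedBlockingQueue and collecting per-type counts. The class name `QueueDetailsSketch`, the `EventType` enum, and the `countByType` helper are hypothetical, and the grouping collector is an assumption; the report only states that printEventQueueDetails calls "eventQueue.stream().collect".

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Illustrative only: hypothetical names, not the AsyncDispatcher source. */
final class QueueDetailsSketch {

    /** Hypothetical stand-in for the dispatcher's event types. */
    enum EventType { NODE_UPDATE, APP_ATTEMPT_ADDED }

    /**
     * Counts queued events per type by streaming the queue.
     * The terminal collect() call is what drives the queue's spliterator;
     * the queue itself is not modified.
     */
    static Map<EventType, Long> countByType(BlockingQueue<EventType> queue) {
        return queue.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        BlockingQueue<EventType> queue = new LinkedBlockingQueue<>();
        queue.add(EventType.NODE_UPDATE);
        queue.add(EventType.NODE_UPDATE);
        queue.add(EventType.APP_ATTEMPT_ADDED);
        System.out.println(countByType(queue));
    }
}
```

Run single-threaded like this, the call completes normally. The hang in the report additionally needs another thread calling take() on the same queue while the stream is being consumed.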
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
>     java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
>     java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>     java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>     java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>     java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>     org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
>     org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
>     org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
>     org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
>     org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
>     org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>     org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
>     org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>     org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>     org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>     org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>     org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
>     org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
>     java.security.AccessController.doPrivileged(Native Method)
> {code}
> I analyzed LinkedBlockingQueue's source code and found that forEachRemaining
> in LinkedBlockingQueue.LBQSpliterator may get stuck when forEachRemaining and
> take() are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, whose
> "eventQueue.stream().collect" call invokes forEachRemaining.
> Why does it get stuck? "put.png" shows how put("a") works, and "take.png"
> shows how take() works. Special note: a removed Node points to itself to help
> GC!
> The key code is in forEachRemaining, which LBQSpliterator uses to visit every
> Node. After it reads the item value from a Node, it releases the lock. If
> take() is called at this moment, the variable 'p' in forEachRemaining may
> point to a Node that points to itself, and forEachRemaining then loops
> forever. See "deadloop.png".
> A simple unit test, MockForDeadLoop.java, reproduces the problem by making
> forEachRemaining run slower than take().
> I debugged MockForDeadLoop.java and saw a Node that points to itself. See
> "debugfornode.png".
> Environment:
> OS: CentOS Linux release 7.5.1804 (Core)
> JDK: jdk1.8.0_281

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
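The self-linked Node described in the quoted report can be modelled without any threads. The sketch below is a simplified, single-threaded stand-in and not the JDK source: `TinyQueue`, `Node`, `walkTerminates`, and `resumeAfterTakes` are hypothetical names, there is no locking, and a step bound is added so the demonstration itself cannot hang. It only shows why a traversal that has paused on a node and then follows `next` never reaches null once that node has been removed and linked to itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified single-threaded model of the self-link problem; not the JDK source. */
final class SelfLinkedNodeDemo {

    /** Singly linked node holding one item. */
    static final class Node {
        String item;
        Node next;
        Node(String item) { this.item = item; }
    }

    /** Tiny unsynchronized queue with a dummy head node (hypothetical). */
    static final class TinyQueue {
        Node head = new Node(null);
        Node tail = head;

        void put(String value) {
            Node node = new Node(value);
            tail.next = node;
            tail = node;
        }

        /** Removes the first element; the old head is linked to itself ("help GC"). */
        String take() {
            Node oldHead = head;
            Node first = oldHead.next;
            oldHead.next = oldHead; // the removed node now points to itself
            head = first;
            String value = first.item;
            first.item = null;
            return value;
        }
    }

    /**
     * Follows next pointers from p until null, collecting non-null items.
     * Returns false if more than maxSteps nodes are visited, which stands in
     * for "the loop would never end".
     */
    static boolean walkTerminates(Node p, int maxSteps, List<String> seen) {
        int steps = 0;
        while (p != null) {
            if (++steps > maxSteps) {
                return false;
            }
            if (p.item != null) {
                seen.add(p.item);
            }
            p = p.next;
        }
        return true;
    }

    /**
     * Scenario: a traversal has consumed "a" and paused with p on the node for "b";
     * then `takes` removals happen; then the traversal resumes from p.
     */
    static boolean resumeAfterTakes(int takes) {
        TinyQueue queue = new TinyQueue();
        queue.put("a");
        queue.put("b");
        queue.put("c");
        Node paused = queue.head.next.next; // node holding "b"
        for (int i = 0; i < takes; i++) {
            queue.take();
        }
        return walkTerminates(paused, 1_000, new ArrayList<>());
    }

    public static void main(String[] args) {
        System.out.println("no take:     terminates = " + resumeAfterTakes(0));
        System.out.println("three takes: terminates = " + resumeAfterTakes(3));
    }
}
```

With no removals, or with only two, the paused node still leads to the end of the list and the walk terminates. After the third removal the paused node has itself been unlinked and points to itself, so the walk only stops because of the artificial step bound.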