[ https://issues.apache.org/jira/browse/YARN-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15363545#comment-15363545 ]
Hudson commented on YARN-5214:
------------------------------
SUCCESS: Integrated in Hadoop-trunk-Commit #10052 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/10052/])
YARN-5214. Fixed locking in DirectoryCollection to avoid hanging NMs (vinodkv: rev ce9c006430d13a28bc1ca57c5c70cc1b7cba1692)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java
> Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
> --------------------------------------------------------------------------------------------
>
> Key: YARN-5214
> URL: https://issues.apache.org/jira/browse/YARN-5214
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-5214-v2.patch, YARN-5214-v3.patch, YARN-5214.patch
>
>
> In one cluster, we noticed that the NM's heartbeat to the RM suddenly stopped; after a while the node was marked LOST by the RM. From the log, the NM daemon was still running, but a jstack dump showed that the NM's NodeStatusUpdater thread was blocked (a simplified sketch of the locking shape involved follows the two traces below):
> 1. The Node Status Updater thread is blocked waiting on monitor 0x000000008065eae8:
> {noformat}
> "Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa
> waiting for monitor entry [0x00007f035945a000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
> - waiting to lock <0x000000008065eae8> (a
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
> at
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> 2. The actual holder of this lock is DiskHealthMonitor:
> {noformat}
> "DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000
> nid=0x26bd runnable [0x00007f035e511000]
> java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createDirectory(Native Method)
> at java.io.File.mkdir(File.java:1316)
> at
> org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
> at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
> at
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
> at
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
> at
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
> - locked <0x000000008065eae8> (a
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
> at
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
> at
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
> at
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}
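> For context, here is a minimal sketch of the coarse-grained locking shape that produces the two traces above. The class and method names are taken from the stack traces; the fields and method bodies are illustrative, not the actual Hadoop source:
> {noformat}
> import java.util.ArrayList;
> import java.util.List;
>
> // Simplified shape of DirectoryCollection before the fix (illustrative).
> class DirectoryCollection {
>   private final List<String> localDirs = new ArrayList<>();
>   private final List<String> errorDirs = new ArrayList<>();
>
>   // Held by DiskHealthMonitor-Timer: slow mkdir/permission probes run on
>   // every disk while the instance monitor is held.
>   synchronized void checkDirs() {
>     for (String dir : localDirs) {
>       testDir(dir); // disk I/O; can stall for a long time under heavy load
>     }
>   }
>
>   // Called on the heartbeat path (NodeStatusUpdater -> getDisksHealthReport):
>   // a cheap read, but it queues behind checkDirs() on the same monitor.
>   synchronized List<String> getFailedDirs() {
>     return new ArrayList<>(errorDirs);
>   }
>
>   private void testDir(String dir) {
>     // stand-in for DiskChecker.checkDir(new File(dir)): mkdir + r/w/x probes
>   }
> }
> {noformat}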
> The disk operations in checkDirs can take much longer than expected, especially under high IO throughput, so we should use fine-grained locking for the related operations here. The same issue was raised and fixed on the HDFS side in HDFS-7489, and we should probably apply a similar fix here.
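> One possible shape for such a fix, in the spirit of HDFS-7489: replace the object monitor with a ReadWriteLock, run the slow disk probes with no lock held, and take the lock only to snapshot or publish the shared lists. This is a hedged sketch of the approach, not the committed patch:
> {noformat}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.locks.ReentrantReadWriteLock;
>
> // Illustrative fine-grained locking for DirectoryCollection.
> class DirectoryCollection {
>   private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
>   private List<String> localDirs = new ArrayList<>();
>   private List<String> errorDirs = new ArrayList<>();
>
>   void checkDirs() {
>     List<String> snapshot;
>     lock.readLock().lock();
>     try {
>       snapshot = new ArrayList<>(localDirs); // copy shared state under the lock
>     } finally {
>       lock.readLock().unlock();
>     }
>
>     List<String> failed = testDirs(snapshot); // slow disk I/O, no lock held
>
>     lock.writeLock().lock();
>     try {
>       errorDirs = failed; // publish results; the critical section stays short
>     } finally {
>       lock.writeLock().unlock();
>     }
>   }
>
>   List<String> getFailedDirs() {
>     lock.readLock().lock();
>     try {
>       return new ArrayList<>(errorDirs); // heartbeat path no longer waits on disk I/O
>     } finally {
>       lock.readLock().unlock();
>     }
>   }
>
>   private List<String> testDirs(List<String> dirs) {
>     return new ArrayList<>(); // stand-in for the DiskChecker probes
>   }
> }
> {noformat}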