[
https://issues.apache.org/jira/browse/YARN-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323073#comment-15323073
]
Junping Du commented on YARN-5214:
----------------------------------
Thanks [~nroberts] for sharing the solution on this!
I agree that fixing the root cause of this particular issue may require
configuring the deadline IO scheduler in Linux; otherwise, excessively long IO
waits will cause other serious problems as well, e.g. we also noticed the
ResourceLocalizationService getting blocked.
On the other hand, we should ask whether hanging the NM heartbeat or the
localizer whenever IO is busy under a misconfigured IO scheduler is behavior we
really want: at a minimum, we could replace the synchronized method lock with a
lock we can try with a timeout, and print a useful debug log if the wait runs
too long, as sketched below.
Maybe we can go further with the same principle as HDFS-9239 and release
unnecessary locks on the NM-RM heartbeat path as much as possible? Thoughts?
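For illustration, a minimal sketch of that try-lock idea (hypothetical class and
field names, not the actual DirectoryCollection code):
{noformat}
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.logging.Logger;

// Sketch only: guard the shared dir state with an explicit ReentrantLock so that
// readers on the heartbeat path can time out and log instead of blocking silently.
class DirectoryCollectionSketch {
  private static final Logger LOG = Logger.getLogger("DirectoryCollectionSketch");

  private final ReentrantLock lock = new ReentrantLock();
  private List<String> failedDirs = Collections.emptyList();

  List<String> getFailedDirs() throws InterruptedException {
    // Hypothetical 5-second budget; the real threshold would need tuning.
    if (!lock.tryLock(5, TimeUnit.SECONDS)) {
      LOG.warning("DirectoryCollection lock held > 5s; a disk check is likely stuck on slow IO");
      lock.lock(); // we still need the data, so fall back to a blocking acquire
    }
    try {
      return failedDirs;
    } finally {
      lock.unlock();
    }
  }
}
{noformat}
The caller still gets the data either way; the only difference is that a stuck
checkDirs() becomes visible in the NM log instead of silently stalling the
heartbeat.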
> Pending on synchronized method DirectoryCollection#checkDirs can hang NM's
> NodeStatusUpdater
> --------------------------------------------------------------------------------------------
>
> Key: YARN-5214
> URL: https://issues.apache.org/jira/browse/YARN-5214
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
>
> In one cluster, we noticed that the NM's heartbeat to the RM suddenly stopped; after a
> while the node was marked LOST by the RM. The logs show the NM daemon was still running,
> but jstack shows the NM's NodeStatusUpdater thread is blocked:
> 1. The Node Status Updater thread is blocked waiting on monitor 0x000000008065eae8:
> {noformat}
> "Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa
> waiting for monitor entry [0x00007f035945a000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
> - waiting to lock <0x000000008065eae8> (a
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
> at
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> 2. The actual holder of this lock is DiskHealthMonitor:
> {noformat}
> "DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000
> nid=0x26bd runnable [0x00007f035e511000]
> java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createDirectory(Native Method)
> at java.io.File.mkdir(File.java:1316)
> at
> org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
> at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
> at
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
> at
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
> at
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
> - locked <0x000000008065eae8> (a
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
> at
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
> at
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
> at
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}
> This disk operation can take much longer than expected, especially under high
> IO throughput, so we should use fine-grained locking for the related
> operations here.
> The same issue was raised and fixed on the HDFS side in HDFS-7489, and we
> probably need a similar fix here, along the lines of the sketch below.
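>
> For illustration, a minimal sketch of such a fine-grained lock (hypothetical names, not
> the HDFS-7489 patch): copy the directory list under the lock, run the slow disk checks
> with the lock released, and re-take the lock only to publish the results, so
> getFailedDirs() on the heartbeat path never waits behind disk IO.
> {noformat}
> import java.io.File;
> import java.util.ArrayList;
> import java.util.List;
>
> class CheckDirsSketch {
>   private final List<String> localDirs = new ArrayList<>();
>   private List<String> failedDirs = new ArrayList<>();
>
>   void checkDirs() {
>     List<String> snapshot;
>     synchronized (this) {              // short critical section: copy the dir list only
>       snapshot = new ArrayList<>(localDirs);
>     }
>     List<String> newlyFailed = new ArrayList<>();
>     for (String dir : snapshot) {      // slow mkdir/disk IO runs outside the lock
>       File f = new File(dir);
>       if (!f.mkdirs() && !f.isDirectory()) {
>         newlyFailed.add(dir);
>       }
>     }
>     synchronized (this) {              // short critical section: publish the results
>       failedDirs = newlyFailed;
>     }
>   }
>
>   synchronized List<String> getFailedDirs() {  // cheap read, never waits on disk IO
>     return new ArrayList<>(failedDirs);
>   }
> }
> {noformat}
> With this shape, a slow disk only delays how fresh the failed-dir list is; it no longer
> determines whether the heartbeat thread can read it.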
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]