[jira] [Commented] (YARN-5214) Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater

Junping Du (JIRA) Tue, 21 Jun 2016 11:56:27 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342451#comment-15342451
 ]


Junping Du commented on YARN-5214:
----------------------------------

Thanks [~leftnoteasy] and [~vinodkv] for review and comments!
bq.  dirsChangeListeners doesn't change except on service-start and stop. So no 
need to grab the global / read / write lock, we can simply make it use a 
thread-safe collection?
That's good point. Will get rid of lock for dirsChangeListeners in next patch.

bq. In createNonExistentDirs(), you don't need to make a manual copy of 
localDirs, the iterator() method already does it for you.
I would like to get rid of CopyOnWriteArrayList and replace with normal 
ArrayList given we already have read/write lock on related dirs. In 
createNonExistentDirs(), explicitly copy localDirs() under a read lock and make 
createDir() lock free is something more clear. Isn't it?

bq.  Can you not use the java8 stuff (diamond operator etc), so that this patch 
can be backported to the older releases?
Diamond operator is supported since Java 7...

> Pending on synchronized method DirectoryCollection#checkDirs can hang NM's 
> NodeStatusUpdater
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-5214
>                 URL: https://issues.apache.org/jira/browse/YARN-5214
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-5214.patch
>
>
> In one cluster, we notice NM's heartbeat to RM is suddenly stopped and wait a 
> while and marked LOST by RM. From the log, the NM daemon is still running, 
> but jstack hints NM's NodeStatusUpdater thread get blocked:
> 1.  Node Status Updater thread get blocked by 0x000000008065eae8 
> {noformat}
> "Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa 
> waiting for monitor entry [0x00007f035945a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
>         - waiting to lock <0x000000008065eae8> (a 
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> 2. The actual holder of this lock is DiskHealthMonitor:
> {noformat}
> "DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000 
> nid=0x26bd runnable [0x00007f035e511000]
>    java.lang.Thread.State: RUNNABLE
>         at java.io.UnixFileSystem.createDirectory(Native Method)
>         at java.io.File.mkdir(File.java:1316)
>         at 
> org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
>         - locked <0x000000008065eae8> (a 
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
> {noformat}
> This disk operation could take longer time than expectation especially in 
> high IO throughput case and we should have fine-grained lock for related 
> operations here. 
> The same issue on HDFS get raised and fixed in HDFS-7489, and we probably 
> should have similar fix here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (YARN-5214) Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater

Reply via email to