[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576761#comment-14576761
 ] 

zhihai xu commented on YARN-3591:
---------------------------------

Hi [~lavkesh], thanks for the update.
IMHO, although storing local error directories in the NM state store will be 
implemented in a separate follow-up JIRA, it would be good for this patch to 
accommodate it. Upon NM start, we can treat the error dirs stored in the NM 
state store as the previous error dirs.
{{DirectoryCollection#checkDirs}} is already called in 
{{LocalDirsHandlerService#serviceInit}} before 
{{registerLocalDirsChangeListener}} is called in 
{{ResourceLocalizationService#serviceStart}}. {{onDirsChanged}} is called for 
the first time in {{registerLocalDirsChangeListener}}. So we already have the 
previous error dirs when {{onDirsChanged}} is called for the first time; we 
just need the current error dirs to calculate newErrorDirs and 
newRepairedDirs, which is what my proposal #4 implements.
So instead of adding three APIs ({{getDiskNewErrorDirs}}, 
{{getDiskNewRepairedDirs}} and {{getErrorDirs}}) to DirectoryCollection, we can 
add just one API, {{getErrorDirs}}. That makes the interface simpler and the 
code cleaner.
Also, even with three APIs, when {{onDirsChanged}} is called for the 
first time you would still need to recalculate newErrorDirs and newRepairedDirs 
based on the error dirs stored in the NM state store.
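
To illustrate why a single {{getErrorDirs}} could be enough, here is a minimal 
sketch of deriving both lists from the previous and current error dirs. The 
class name, method names and the use of plain String paths are assumptions for 
illustration only; they are not taken from the patch or from the existing 
DirectoryCollection code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: computes the two deltas that
// onDirsChanged would need from the previous and current error dirs.
class ErrorDirsDiff {

  // Dirs that are in error now but were not in error before.
  static List<String> newErrorDirs(List<String> previousErrorDirs,
                                   List<String> currentErrorDirs) {
    return difference(currentErrorDirs, previousErrorDirs);
  }

  // Dirs that were in error before but are no longer in error.
  static List<String> newRepairedDirs(List<String> previousErrorDirs,
                                      List<String> currentErrorDirs) {
    return difference(previousErrorDirs, currentErrorDirs);
  }

  // Elements of 'from' that are not in 'exclude', keeping the order of 'from'.
  private static List<String> difference(List<String> from,
                                         List<String> exclude) {
    Set<String> excluded = new HashSet<>(exclude);
    List<String> result = new ArrayList<>();
    for (String dir : from) {
      if (!excluded.contains(dir)) {
        result.add(dir);
      }
    }
    return result;
  }
}
```

On the first {{onDirsChanged}} call after NM start, the "previous" list in this 
sketch would be whatever was loaded from the NM state store, and the "current" 
list would be the result of {{getErrorDirs}}.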

bq. upon start we can do a cleanUpLocalDir on the errordirs.
We needn't do that, because we can handle it in {{onDirsChanged}}.

As [~sunilg] suggested, it would be better to change the 
checkLocalizedResources implementation to call removeResource on those 
localized resources whose parent is present in newErrorDirs, because that 
gives better performance.
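
A rough sketch of that idea, assuming the localized resources are tracked as 
plain path strings: drop every tracked path that lives under one of the new 
error dirs, using only path comparison and no disk access. The class and method 
names here are hypothetical, and the real code would call removeResource on the 
tracker rather than remove entries from a collection.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of Hadoop.
class ErrorDirResourceFilter {

  // Removes from 'localizedPaths' every path located under any dir in
  // 'newErrorDirs' and returns the removed paths. 'localizedPaths' must be
  // a mutable collection.
  static List<String> removeResourcesUnder(Collection<String> localizedPaths,
                                           Collection<String> newErrorDirs) {
    List<Path> errorRoots = new ArrayList<>();
    for (String dir : newErrorDirs) {
      errorRoots.add(Paths.get(dir));
    }
    List<String> removed = new ArrayList<>();
    Iterator<String> it = localizedPaths.iterator();
    while (it.hasNext()) {
      String localized = it.next();
      Path path = Paths.get(localized);
      for (Path root : errorRoots) {
        // Path#startsWith compares whole name elements, so /disk1 does not
        // match /disk10.
        if (path.startsWith(root)) {
          removed.add(localized);
          it.remove();
          break;
        }
      }
    }
    return removed;
  }
}
```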

By the way, {{checkAndInitializeLocalDirs}} should be called after 
{{cleanUpLocalDir}}, because once a directory is cleaned up, it needs to be 
reinitialized.

> Resource Localisation on a bad disk causes subsequent containers failure 
> -------------------------------------------------------------------------
>
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch
>
>
> It happens when a resource is localised on a disk and that disk goes bad 
> after localisation. NM keeps paths for localised resources in memory. At the 
> time of a resource request, isResourcePresent(rsrc) is called, which calls 
> file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and 
> file.exists() returns true. But at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because 
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good it should return an array of 
> paths with length at least 1.
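
The file.list() proposal in the description above could look roughly like the 
following. This is only a sketch: the class name is hypothetical, and the 
method takes a java.io.File instead of the rsrc argument used by the real 
isResourcePresent. It relies on the documented behaviour that File#list 
returns null when the path is not a directory or an I/O error occurs.

```java
import java.io.File;

// Hypothetical helper, not part of Hadoop.
class ParentListCheck {

  // Returns true only if the parent directory of the localized file can be
  // listed and contains at least one entry.
  static boolean isResourcePresent(File localizedFile) {
    File parent = localizedFile.getParentFile();
    if (parent == null) {
      return false;
    }
    // File#list returns null if 'parent' is not a directory or if an
    // I/O error occurs while reading it.
    String[] children = parent.list();
    return children != null && children.length >= 1;
  }
}
```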



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
