[
https://issues.apache.org/jira/browse/YARN-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16183000#comment-16183000
]
Jason Lowe commented on YARN-7244:
----------------------------------
bq. The only potential issue I see is that, once a set of dirs is pulled
from LocalDirAllocator#ctx.localDirs, these dirs will be validated again only
when one more getLocalPathForWrite/Read is invoked. So there could be a window
where we may get stale dirs.
I wouldn't worry too much about that window. Think of the much larger window a
container gets, since it is told only once, on startup, what the list of valid
dirs is. I think we're fine as long as aux services are notified fairly soon
after a disk fails. It doesn't have to be instantaneous or atomic.
Alternatively, we could make a pull API where the aux service directly calls
the NM's LocalDirsHandlerService to get a path to read or a path to write. Then
the aux service doesn't even have to manage the directories itself if all it
cares about is finding a place to write or read.
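
To make the pull idea concrete, here is a rough sketch of what it might look like. Everything in it is hypothetical: LocalPathProvider, ToyDirsHandler, ToyAuxService and their methods are illustrative names and not existing YARN APIs, and the round-robin choice of write dir is just an assumption to keep the example short. The point is only that the aux service asks for a path on every request instead of caching a dir list.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical pull-style API the NM could expose to aux services.
interface LocalPathProvider {
    Path getLocalPathForWrite(String relativePath);
    Optional<Path> getLocalPathForRead(String relativePath);
}

// Toy stand-in for the NM-side service that knows which dirs are currently good.
final class ToyDirsHandler implements LocalPathProvider {
    private final CopyOnWriteArrayList<Path> goodDirs = new CopyOnWriteArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    void dirAdded(Path dir) { goodDirs.addIfAbsent(dir); }   // disk added or recovered
    void dirFailed(Path dir) { goodDirs.remove(dir); }       // disk marked bad

    @Override
    public Path getLocalPathForWrite(String relativePath) {
        // Work on a snapshot so a concurrent add/remove cannot shift indexes.
        List<Path> snapshot = new ArrayList<>(goodDirs);
        if (snapshot.isEmpty()) {
            throw new IllegalStateException("no good local dirs available");
        }
        int i = Math.floorMod(next.getAndIncrement(), snapshot.size());
        return snapshot.get(i).resolve(relativePath);
    }

    @Override
    public Optional<Path> getLocalPathForRead(String relativePath) {
        // Search only the dirs that are good right now.
        for (Path dir : goodDirs) {
            Path candidate = dir.resolve(relativePath);
            if (Files.exists(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}

// Toy aux service: holds no dir list of its own, it asks on every request.
final class ToyAuxService {
    private final LocalPathProvider nm;

    ToyAuxService(LocalPathProvider nm) { this.nm = nm; }

    Path placeOutput(String name) { return nm.getLocalPathForWrite(name); }
}

final class PullApiDemo {
    public static void main(String[] args) {
        ToyDirsHandler nm = new ToyDirsHandler();
        nm.dirAdded(Paths.get("/disk1"));
        ToyAuxService aux = new ToyAuxService(nm);
        System.out.println(aux.placeOutput("map_0.out"));

        // A disk is added and another fails; the aux service needs no notification.
        nm.dirAdded(Paths.get("/disk2"));
        nm.dirFailed(Paths.get("/disk1"));
        System.out.println(aux.placeOutput("map_1.out"));
    }
}
```

With something along these lines the stale window shrinks to a single call, at the cost of the aux service going through the NM for every lookup.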
> ShuffleHandler is not aware of disks that are added
> ---------------------------------------------------
>
> Key: YARN-7244
> URL: https://issues.apache.org/jira/browse/YARN-7244
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: YARN-7244.001.patch, YARN-7244.002.patch
>
>
> The ShuffleHandler permanently remembers the list of "good" disks on NM
> startup. If disks are later added to the node, map tasks will start using
> them, but the ShuffleHandler will not be aware of them. The end result is that
> the data cannot be shuffled from the node, leading to fetch failures and
> re-runs of the map tasks.