[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235212#comment-15235212 ]
Jason Lowe commented on YARN-4924:
----------------------------------
Thanks for updating the patch!
It may not be clear to others reading the code why we're removing the finished
apps keys. I'd recommend factoring out the few lines in loadApplicationState
into a separate function named something like cleanupDeprecatedFinishedApps so
it's very clear what's going on.
This will leak resources associated with the batch if iter.close throws:
{code}
private void removeKeysWithPrefix(String prefix) throws IOException {
  LeveldbIterator iter = null;
  WriteBatch batch = null;
  try {
    iter = new LeveldbIterator(db);
    iter.seek(bytes(prefix));
    batch = db.createWriteBatch();
    [...]
  } catch (DBException e) {
    throw new IOException(e);
  } finally {
    if (iter != null) {
      // If this close throws, batch.close() below is never reached.
      iter.close();
    }
    if (batch != null) {
      batch.close();
    }
  }
}
{code}
Something like this would handle it:
{code}
private void removeKeysWithPrefix(String prefix) throws IOException {
  LeveldbIterator iter = null;
  WriteBatch batch = null;
  iter = new LeveldbIterator(db);
  try {
    iter.seek(bytes(prefix));
    try {
      batch = db.createWriteBatch();
      [...]
    } catch (DBException e) {
      throw new IOException(e);
    } finally {
      // The batch is closed first, inside the inner block.
      if (batch != null) {
        batch.close();
      }
    }
  } finally {
    // The iterator is closed even if batch.close() threw.
    iter.close();
  }
}
{code}
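As an aside, the same close ordering can be had from try-with-resources, assuming both types implement Closeable. The sketch below is only an illustration of that behavior: Resource and CloseOrderDemo are hypothetical stand-ins, not the real LeveldbIterator or WriteBatch, so that it runs on its own. Resources are closed in reverse declaration order, and the first one is still closed when closing the second throws.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class CloseOrderDemo {
  // Hypothetical stand-in for a closeable resource; records when it is closed.
  static class Resource implements Closeable {
    private final String name;
    private final boolean failOnClose;
    private final List<String> log;

    Resource(String name, boolean failOnClose, List<String> log) {
      this.name = name;
      this.failOnClose = failOnClose;
      this.log = log;
    }

    @Override
    public void close() throws IOException {
      log.add("close " + name);
      if (failOnClose) {
        throw new IOException(name + " close failed");
      }
    }
  }

  // Opens "iter" then "batch", does some work, and returns the event log.
  static List<String> run(boolean iterCloseFails, boolean batchCloseFails) {
    List<String> log = new ArrayList<>();
    try (Resource iter = new Resource("iter", iterCloseFails, log);
         Resource batch = new Resource("batch", batchCloseFails, log)) {
      log.add("work");
    } catch (IOException e) {
      // A failure from a later close is attached as a suppressed exception.
      log.add("caught " + e.getMessage());
    }
    return log;
  }

  public static void main(String[] args) {
    System.out.println(run(false, true));
  }
}
```

With the batch close failing, the log still shows the iterator being closed afterwards, which is the property the nested try/finally version is after.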
Do we really want to have removeKeysWithPrefix log at the info level? We
normally don't log the removal of apps, containers, etc. I could see cases
where nodes have leaked many thousands of finished app events before YARN-4520
was fixed. I think we should make this a debug log and/or emit a single info
log stating that all keys with the specified prefix are being removed.
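A rough sketch of the logging shape I have in mind, not the actual patch: a TreeMap stands in for the leveldb store and java.util.logging stands in for the NM's logger, and the class name PrefixRemovalLogging is made up for the example. One info line announces the removal, and per-key messages only appear at a debug-equivalent level.

```java
import java.util.Iterator;
import java.util.TreeMap;
import java.util.logging.Level;
import java.util.logging.Logger;

class PrefixRemovalLogging {
  private static final Logger LOG =
      Logger.getLogger(PrefixRemovalLogging.class.getName());

  // Removes every key that starts with prefix from the sorted map
  // (a stand-in for the state store) and returns how many were removed.
  static int removeKeysWithPrefix(TreeMap<String, String> store,
      String prefix) {
    // Single info-level message, regardless of how many keys match.
    LOG.info("Removing all keys with prefix " + prefix);
    int removed = 0;
    // Keys sharing a prefix are contiguous in sorted order, starting at
    // the first key >= prefix, which mirrors seeking an iterator to it.
    Iterator<String> it = store.tailMap(prefix, true).keySet().iterator();
    while (it.hasNext()) {
      String key = it.next();
      if (!key.startsWith(prefix)) {
        break;
      }
      // Per-key detail only at the debug-equivalent level.
      if (LOG.isLoggable(Level.FINE)) {
        LOG.fine("Removing " + key);
      }
      it.remove();
      removed++;
    }
    return removed;
  }
}
```

On a node with thousands of leaked keys this produces one info line instead of thousands.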
> NM recovery race can lead to container not cleaned up
> -----------------------------------------------------
>
> Key: YARN-4924
> URL: https://issues.apache.org/jira/browse/YARN-4924
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 3.0.0, 2.7.2
> Reporter: Nathan Roberts
> Assignee: sandflee
> Attachments: YARN-4924.01.patch, YARN-4924.02.patch
>
>
> It's probably a small window but we observed a case where the NM crashed and
> then a container was not properly cleaned up during recovery.
> I will add details in first comment.