[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

Jason Lowe (JIRA) Mon, 11 Apr 2016 07:53:38 -0700

    [ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235212#comment-15235212 ]


Jason Lowe commented on YARN-4924:
----------------------------------

Thanks for updating the patch!

It may not be clear to others reading the code why we're removing the finished 
apps keys.  I'd recommend factoring those few lines out of loadApplicationState 
into a separate function, named something like cleanupDeprecatedFinishedApps, so 
it is very clear what's going on.

This will leak resources associated with the batch if iter.close throws:
{code}
  private void removeKeysWithPrefix(String prefix) throws IOException {
    LeveldbIterator iter = null;
    WriteBatch batch = null;
    try {
      iter = new LeveldbIterator(db);
      iter.seek(bytes(prefix));
      batch = db.createWriteBatch();
[...]
    } catch (DBException e) {
      throw new IOException(e);
    } finally {
      if (iter != null) {
        iter.close();
      }
      if (batch != null) {
        batch.close();
      }
    }
  }
{code}
Something like this would handle it:
{code}
  private void removeKeysWithPrefix(String prefix) throws IOException {
    WriteBatch batch = null;
    LeveldbIterator iter = new LeveldbIterator(db);
    try {
      iter.seek(bytes(prefix));
      try {
        batch = db.createWriteBatch();
[...]
      } catch (DBException e) {
        throw new IOException(e);
      } finally {
        // Close the batch first so it is released even if iter.close throws.
        if (batch != null) {
          batch.close();
        }
      }
    } finally {
      iter.close();
    }
  }
{code}
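
To make the difference concrete, here is a minimal stand-alone sketch. It does not use the real LeveldbIterator or WriteBatch: Tracked is a hypothetical stand-in resource that only records whether close() was called, and all the method names are made up for illustration. It compares the flat finally, the nested finally, and try-with-resources as a possible third option when the first resource's close throws.

```java
import java.io.IOException;

/** Stand-alone illustration; none of these types come from Hadoop or leveldb. */
final class CloseOrderDemo {

  /** Hypothetical resource that records whether close() was called. */
  static final class Tracked implements AutoCloseable {
    final String name;
    final boolean failOnClose;
    boolean closed;

    Tracked(String name, boolean failOnClose) {
      this.name = name;
      this.failOnClose = failOnClose;
    }

    @Override
    public void close() throws IOException {
      closed = true;
      if (failOnClose) {
        throw new IOException(name + " close failed");
      }
    }
  }

  /** Flat finally: if iter.close() throws, batch.close() is never reached. */
  static void flatFinally(Tracked iter, Tracked batch) throws IOException {
    try {
      // work with iter and batch elided
    } finally {
      iter.close();
      batch.close();
    }
  }

  /** Nested finally: batch is closed before iter.close() gets a chance to throw. */
  static void nestedFinally(Tracked iter, Tracked batch) throws IOException {
    try {
      try {
        // work with iter and batch elided
      } finally {
        batch.close();
      }
    } finally {
      iter.close();
    }
  }

  /** try-with-resources: closes b first, then i, and attempts both closes. */
  static void tryWithResources(Tracked iter, Tracked batch) throws IOException {
    try (Tracked i = iter; Tracked b = batch) {
      // work with i and b elided
    }
  }

  public static void main(String[] args) {
    Tracked iter = new Tracked("iter", true);
    Tracked batch = new Tracked("batch", false);
    try {
      flatFinally(iter, batch);
    } catch (IOException e) {
      System.out.println("flat finally: batch closed = " + batch.closed);
    }
  }
}
```

Whether try-with-resources is usable in the actual patch depends on the Java level of the target branches, so treat that variant as an option to consider rather than a requirement.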

Do we really want to have removeKeysWithPrefix log at the info level?  We 
normally don't log the removal of apps, containers, etc.  I could see cases 
where nodes have leaked many thousands of finished app events before YARN-4520 
was fixed.  I think we should make this a debug log, log a single info message 
stating that all keys with the specified prefix are being removed, or both.
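
As a rough sketch of the logging shape I have in mind, the following uses a TreeMap as a stand-in for the state store and java.util.logging in place of the NM's logger. The class name, the key names, and the int return value are all hypothetical and only there to keep the example self-contained.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Illustration only: a sorted map stands in for the leveldb state store. */
final class PrefixRemovalLoggingSketch {
  private static final Logger LOG =
      Logger.getLogger(PrefixRemovalLoggingSketch.class.getName());

  /** Removes every key starting with prefix and returns how many were removed. */
  static int removeKeysWithPrefix(TreeMap<String, String> store, String prefix) {
    int removed = 0;
    // Keys sharing a prefix are contiguous in sorted order, starting at the prefix.
    Iterator<Map.Entry<String, String>> it =
        store.tailMap(prefix, true).entrySet().iterator();
    while (it.hasNext()) {
      String key = it.next().getKey();
      if (!key.startsWith(prefix)) {
        break;
      }
      // Per-key detail only at debug (FINE) level to avoid flooding the log.
      if (LOG.isLoggable(Level.FINE)) {
        LOG.fine("Removing key " + key);
      }
      it.remove();
      removed++;
    }
    // One summary line at info level, however many keys were removed.
    LOG.info("Removed " + removed + " keys with prefix " + prefix);
    return removed;
  }

  public static void main(String[] args) {
    TreeMap<String, String> store = new TreeMap<>();
    store.put("example/finished/app_1", "");
    store.put("example/finished/app_2", "");
    store.put("example/other/key", "");
    removeKeysWithPrefix(store, "example/finished/");
  }
}
```

With thousands of leaked keys this still produces a single info line, and the per-key messages only appear when debug logging is enabled.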


> NM recovery race can lead to container not cleaned up
> -----------------------------------------------------
>
>                 Key: YARN-4924
>                 URL: https://issues.apache.org/jira/browse/YARN-4924
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.0.0, 2.7.2
>            Reporter: Nathan Roberts
>            Assignee: sandflee
>         Attachments: YARN-4924.01.patch, YARN-4924.02.patch
>
>
> It's probably a small window but we observed a case where the NM crashed and 
> then a container was not properly cleaned up during recovery.
> I will add details in first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
