[ 
https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599532#comment-14599532
 ] 

Jason Lowe commented on YARN-2902:
----------------------------------

bq.  But here the issue can be what if in between checks, localizer dies and 
PID is taken by some other process.
The NM is tracking containers.  If it can track containers then it can track 
localizers.  There may be issues with PID recycling, but they're not specific 
to tracking localizers.

bq. One option would be to add a status in heartbeat asking localizer to 
cleanup(stop its downloading threads) and once that is done, indicate NM to do 
the deletion in another heartbeat.
That would be a bit more consistent since the process creating the files is the 
process that cleans them when it aborts.  It also helps keep the disk clean if 
there ever were a rogue localizer that the NM doesn't know about.  Expecting 
the NM to clean up after a localizer whose activity it knows nothing about is 
going to be difficult.  However, we still have to handle the fallback case where 
the localizer simply crashes and something needs to clean up after it.  
Speaking of the localizer crashing, today the localizer can "go rogue" and stop 
heartbeating, and I don't think anything will detect this.  In that situation 
we also fail to clean up since we'll wait for a heartbeat that will never arrive.

So how about the following approach:
- ContainerLocalizer cleans up temporary files when processing the DIE command
- We give the localizer a significant amount of time (e.g.: a minute or two?) 
to clean up on its own after receiving a DIE command
- After the localizer exits (or if it does not exit after the grace period) 
then the NM cleans up the resources itself.  We aren't tracking the PID 
explicitly, but LocalizerRunner can take actions when the localizer exits 
(e.g.: either process the requested deletions directly or send an event to a 
subsystem that will).

With this approach rogue localizers will make some attempt to clean up after 
themselves (which they don't do today), and we won't wait any longer than 
necessary before the NM tries to clean up after well-behaved localizers or 
localizers that crash.
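
To make the proposed flow concrete, here is a rough Java sketch of the 
NM-side step: wait up to a grace period for the localizer to exit, then 
process the requested deletions either way.  This is only an illustration 
under assumptions, not code from any patch: LocalizerHandle, ResourceCleaner, 
cleanUpAfterDie and the example paths are hypothetical names, not existing 
YARN classes or APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

class LocalizerGracePeriodSketch {

  /** Hypothetical view of a localizer process; not an actual YARN interface. */
  interface LocalizerHandle {
    /** Blocks up to the timeout; returns true if the localizer has exited. */
    boolean waitForExit(long timeout, TimeUnit unit) throws InterruptedException;
  }

  /** Hypothetical stand-in for whatever NM subsystem performs the deletions. */
  interface ResourceCleaner {
    void delete(String path);
  }

  /**
   * Called after the DIE command has been sent. Waits up to graceMillis for
   * the localizer to exit, then deletes the requested paths whether or not
   * it exited. Returns true if the localizer exited within the grace period.
   */
  static boolean cleanUpAfterDie(LocalizerHandle localizer, long graceMillis,
      List<String> requestedDeletions, ResourceCleaner cleaner) {
    boolean exitedInTime = false;
    try {
      exitedInTime = localizer.waitForExit(graceMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      // Preserve the interrupt and fall through to the NM-side cleanup.
      Thread.currentThread().interrupt();
    }
    // Fallback cleanup: runs for well-behaved, crashed and hung localizers.
    for (String path : requestedDeletions) {
      cleaner.delete(path);
    }
    return exitedInTime;
  }

  public static void main(String[] args) {
    List<String> deleted = new ArrayList<>();
    // Hypothetical paths, for illustration only.
    List<String> pending =
        Arrays.asList("/tmp/example/rsrc_1_tmp", "/tmp/example/rsrc_2_tmp");
    // Fake localizer that reports it exited immediately.
    boolean exited =
        cleanUpAfterDie((timeout, unit) -> true, 1000L, pending, deleted::add);
    System.out.println("exited within grace period: " + exited);
    System.out.println("deleted by NM: " + deleted);
  }
}
```

In a real implementation waitForExit could be backed by 
java.lang.Process.waitFor(long, TimeUnit), and the cleaner could hand the 
paths to an existing deletion subsystem instead of deleting inline.  The 
sketch deliberately leaves open what to do with a localizer that is still 
running once the grace period expires.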

> Killing a container that is localizing can orphan resources in the 
> DOWNLOADING state
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
