[ 
https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950919#comment-14950919
 ] 

Jason Lowe commented on YARN-2902:
----------------------------------

Thanks for updating the patch, Varun!

Sorry, I'm still a little confused about why we need to complicate the localizer 
protocol to fix this issue.  Seems like this is a hack to help the NM figure 
out what's going on, but it should already know this stuff.  That prompted me 
to dig around for an alternative solution, and I think I found one.

The NM knows the local path where a resource is localized, since it tells the 
localizer where to put it in the download request.  Also each localizer has a 
LocalizerRunner thread that is tracking it, and it knows which resources were 
pending when the localizer process exits.  That's tracked in the {{scheduled}} 
map so the runner thread can unlock every pending resource to allow a 
subsequent localizer to try downloading it again.  Seems to me all we need to 
do is have the LocalizerRunner issue a delete of the local path and temporary 
download path for each resource that was pending at the time the localizer 
process died, since we know any pending resources when a localizer exits must 
have been orphaned.  Resources that were successfully localized are pulled out 
of the {{scheduled}} map, so the only things left should be the ones we need to 
process for cleanup.
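
As a rough sketch of that idea, and not the actual NM code: the class and 
method names below are hypothetical stand-ins for the runner's bookkeeping, 
and the "_tmp" suffix for the temporary download path is an assumption.  It 
only computes which paths would be handed off for deletion once the localizer 
process has exited.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the per-localizer bookkeeping described above.
class PendingResourceCleanup {
    // Assumed naming for the temporary download directory next to the local path.
    private static final String TMP_SUFFIX = "_tmp";

    // Resource key -> local path the NM told the localizer to download into.
    private final Map<String, String> scheduled = new LinkedHashMap<>();

    // Record a resource that has been handed to the localizer for download.
    void schedule(String resource, String localPath) {
        scheduled.put(resource, localPath);
    }

    // A successfully localized resource is pulled out of the scheduled map.
    void localized(String resource) {
        scheduled.remove(resource);
    }

    // Called after the localizer process has exited. Anything still scheduled
    // is treated as orphaned, so both its local path and its temporary
    // download path are returned for deletion.
    List<String> pathsToDeleteOnExit() {
        List<String> paths = new ArrayList<>();
        for (String localPath : scheduled.values()) {
            paths.add(localPath);
            paths.add(localPath + TMP_SUFFIX);
        }
        scheduled.clear();
        return paths;
    }
}
```

In the real NM the returned paths would presumably be passed to the deletion 
service rather than returned to a caller; that wiring is left out here.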

That seems like a much simpler implementation as it doesn't change any 
protocols and doesn't rely on the container localizer doing any cleanup.  The 
NM will automatically do the cleanup when the localizer exits.  We also don't 
need delayed deletion support in DeletionService, since we know the container 
localizer process is dead.

Maybe I'm missing something and that approach can't work.  If it can then that 
seems like a preferable solution for 2.7 as it will be a smaller, simpler patch.

> Killing a container that is localizing can orphan resources in the 
> DOWNLOADING state
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, 
> YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, 
> YARN-2902.07.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then 
> resources are left in the DOWNLOADING state.  If no other container comes 
> along and requests these resources they linger around with no reference 
> counts but aren't cleaned up during normal cache cleanup scans since it will 
> never delete resources in the DOWNLOADING state even if their reference count 
> is zero.



