[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950919#comment-14950919 ]
Jason Lowe commented on YARN-2902:
----------------------------------

Thanks for updating the patch, Varun!

Sorry, I'm still a little confused about why we need to complicate the localizer protocol to fix this issue. It seems like a hack to help the NM figure out what's going on, but the NM should already know this stuff. That prompted me to dig around for an alternative solution, and I think I found one.

The NM knows the local path where a resource is localized, since it tells the localizer where to put it in the download request. Also, each localizer has a LocalizerRunner thread tracking it, and that thread knows which resources were pending when the localizer process exits. That's tracked in the {{scheduled}} map so the runner thread can unlock every pending resource to allow a subsequent localizer to try downloading it again.

Seems to me all we need to do is have the LocalizerRunner issue a delete of the local path and the temporary download path for each resource that was pending at the time the localizer process died, since we know any resources still pending when a localizer exits must have been orphaned. Resources that were successfully localized are pulled out of the {{scheduled}} map, so the only things left should be the ones we need to process for cleanup.

That seems like a much simpler implementation, as it doesn't change any protocols and doesn't rely on the container localizer doing any cleanup. The NM will automatically do the cleanup when the localizer exits. We also don't need delayed deletion support in DeletionService, since we know the container localizer process is dead.

Maybe I'm missing something and that approach can't work. If it can, then that seems like a preferable solution for 2.7, as it will be a smaller, simpler patch.
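To make the proposal concrete, here is a minimal, self-contained Java sketch of the cleanup a runner thread could do once it sees the localizer process exit. It is only an illustration under assumptions: `OrphanCleanupSketch`, `PendingResource`, `tmpPath`, `schedule`, `markLocalized` and `cleanupAfterLocalizerExit` are hypothetical names, not the actual NodeManager classes or methods, and the sketch deletes files directly where the real NM would presumably hand the paths to DeletionService.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.stream.Stream;

/** Hypothetical stand-in for the runner-side bookkeeping; not real NM code. */
class OrphanCleanupSketch {

    /** One resource handed to a localizer (all fields are assumptions). */
    static final class PendingResource {
        final Path localPath;                    // destination chosen by the NM
        final Path tmpPath;                      // assumed temporary download path
        final Semaphore lock = new Semaphore(1); // stands in for the resource lock

        PendingResource(Path localPath, Path tmpPath) {
            this.localPath = localPath;
            this.tmpPath = tmpPath;
        }
    }

    /** Resources still pending; successful downloads are removed from here. */
    final Map<String, PendingResource> scheduled = new ConcurrentHashMap<>();

    /** Lock the resource and record it as pending for this localizer. */
    boolean schedule(String key, PendingResource r) {
        if (!r.lock.tryAcquire()) {
            return false; // another localizer is already downloading it
        }
        scheduled.put(key, r);
        return true;
    }

    /** Called when the localizer reports a successful download. */
    void markLocalized(String key) {
        PendingResource r = scheduled.remove(key);
        if (r != null) {
            r.lock.release();
        }
    }

    /** Called after the localizer process has exited: anything left is orphaned. */
    List<Path> cleanupAfterLocalizerExit() {
        List<Path> deleted = new ArrayList<>();
        Iterator<PendingResource> it = scheduled.values().iterator();
        while (it.hasNext()) {
            PendingResource r = it.next();
            for (Path p : Arrays.asList(r.localPath, r.tmpPath)) {
                if (deleteRecursively(p)) {
                    deleted.add(p);
                }
            }
            r.lock.release(); // let a later localizer retry the download
            it.remove();
        }
        return deleted;
    }

    /** Delete a file or directory tree; returns false if nothing existed. */
    static boolean deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return false;
        }
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order visits children before their parent directory.
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("nm-sketch");
        Path partial = Files.createDirectories(base.resolve("res1_tmp"));
        OrphanCleanupSketch runner = new OrphanCleanupSketch();
        runner.schedule("res1", new PendingResource(base.resolve("res1"), partial));
        // Simulate the localizer dying before res1 finished downloading.
        System.out.println("Deleted: " + runner.cleanupAfterLocalizerExit());
    }
}
```

Because the cleanup only runs after the localizer process is known to be dead, the sketch deletes immediately and needs no delay, which mirrors the point about not needing delayed deletion.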
> Killing a container that is localizing can orphan resources in the DOWNLOADING state
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch,
>                      YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch,
>                      YARN-2902.07.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then
> resources are left in the DOWNLOADING state. If no other container comes
> along and requests these resources they linger around with no reference
> counts but aren't cleaned up during normal cache cleanup scans since it will
> never delete resources in the DOWNLOADING state even if their reference count
> is zero.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)