[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14805279#comment-14805279 ]
Varun Saxena commented on YARN-2902:
------------------------------------

[~jlowe],

bq. As far as properly handling DIE so we actually stop downloading and problems canceling active transfers, can't we just have the localizer forcibly tear down the JVM? If we're being told to DIE then I assume we really don't care about pending transfers completing and just want to get out. If the NM is going to clean up after the localizer anyway, seems like we can drastically simplify DIE handling and just exit the JVM. That seems like a change that's targeted enough to be appropriate for 2.7 instead of adding localizer kill support, etc.

In the container localizer, when processing a heartbeat (HB) DIE response, we send another localizer status to the NM. Is that really required? What do you think?

I think that as soon as we get DIE, we can follow the current code and cancel pending tasks, though without waiting for them to complete (as is being done in the newly added code in the patch), and delete the paths reported in the last status. Then we just return from the loop for a graceful shutdown (after stopping the executors). Or are you suggesting a System exit?

From the NM side, we can have a deletion task that runs after some configured delay (same as right now). Unlike the code in the patch now, though, we will never cancel this deletion task. This way the localizer should quit quickly and the NM can clean up.

I will also change the behavior of the executor on deletion, i.e. I will ignore missing paths by default. I won't add a flag.
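To make the proposed DIE handling concrete, here is a minimal sketch of the idea: cancel pending downloads without waiting, delete the paths reported in the last status while ignoring missing ones, stop the executor, and return. This is not the code in the patch. The class and method names (DieHandlingSketch, handleDie, deleteIgnoringMissing) are hypothetical, and it only uses standard java.nio.file and java.util.concurrent calls rather than the actual ContainerLocalizer or DeletionService classes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical illustration only; not the actual ContainerLocalizer code.
final class DieHandlingSketch {

    /**
     * Deletes a single file or empty directory. A path that is already
     * missing is not an error; the method just returns false for it.
     */
    static boolean deleteIgnoringMissing(Path path) throws IOException {
        return Files.deleteIfExists(path);
    }

    /**
     * Assumed shape of the DIE handling: cancel every pending download
     * without waiting for it to finish, delete the paths reported in the
     * last status, stop the executor, and return so the caller can leave
     * its heartbeat loop.
     *
     * @return the number of paths that existed and were deleted
     */
    static int handleDie(ExecutorService downloads,
                         List<Future<?>> pending,
                         List<Path> reportedPaths) throws IOException {
        for (Future<?> task : pending) {
            // Interrupt the task if it is running; cancel() does not block.
            task.cancel(true);
        }
        int deleted = 0;
        for (Path path : reportedPaths) {
            if (deleteIgnoringMissing(path)) {
                deleted++;
            }
        }
        // Stop the executor without awaiting termination of its threads.
        downloads.shutdownNow();
        return deleted;
    }
}
```

Because the handler never waits on the cancelled tasks, a download that ignores interruption could still write files after the localizer returns. That is why the delayed deletion task on the NM side, which is never cancelled, remains the backstop in this proposal.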
> Killing a container that is localizing can orphan resources in the DOWNLOADING state
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>   Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)