Ashwin Shankar created YARN-4011:
------------------------------------

             Summary: Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk
                 Key: YARN-4011
                 URL: https://issues.apache.org/jira/browse/YARN-4011
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
    Affects Versions: 2.4.0
            Reporter: Ashwin Shankar

We observed jobs failing because tasks could not launch on nodes, with the error "java.io.IOException: No space left on device". On digging further, we found a rogue job that had filled up the disk. Specifically, it wrote a large number of map spill files (such as attempt_1432082376223_461647_m_000421_0_spill_10000.out) to nm-local-dir, which filled the disk. The job then failed or was killed, but its files in nm-local-dir were not cleaned up. The disk therefore remained full, causing subsequent jobs to fail.
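
For anyone trying to confirm the same symptom on a node, the following is a minimal diagnostic sketch, not part of YARN. It walks a directory tree and totals the size of leftover spill files per job, assuming the files follow the naming pattern of the example above. The function names (spill_bytes_by_job, report) and the job_<timestamp>_<id> label are hypothetical and chosen only for illustration.

```python
import os
import re
from collections import defaultdict

# Assumed naming pattern, modeled on the spill file quoted in this report:
# attempt_<clusterTimestamp>_<jobId>_<m|r>_<taskId>_<attemptId>_spill_<n>.out
SPILL_RE = re.compile(r"^attempt_(\d+)_(\d+)_[mr]_\d+_\d+_spill_\d+\.out$")


def spill_bytes_by_job(local_dir):
    """Return {job label: total bytes of spill files} found under local_dir."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(local_dir):
        for name in filenames:
            match = SPILL_RE.match(name)
            if match is None:
                continue  # not a spill file
            job = "job_{}_{}".format(match.group(1), match.group(2))
            try:
                totals[job] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # The file disappeared between listing and stat; skip it.
                continue
    return dict(totals)


def report(local_dir, top=10):
    """Return tab-separated 'job<TAB>bytes' lines, largest consumer first."""
    totals = spill_bytes_by_job(local_dir)
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]
    return ["{}\t{}".format(job, size) for job, size in ranked]
```

Calling report() on each directory configured as an nm-local-dir and printing the returned lines should show which job owns the bulk of the leftover spill data. The sketch only reads file metadata and deletes nothing; whether a listed job has already finished has to be checked separately.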