I am running Hadoop with a non-HDFS file system backend (Ceph), and I've noticed that some processes exit, or are killed, before the file system client has shut down properly (i.e. before FileSystem::close completes). Clean shutdowns matter to us right now because they release resources that, when left behind, cause fs timeouts that slow every other client down. We've adjusted the YARN timeout that controls the delay before SIGKILL is sent to containers, which resolves the problem for containers running map tasks, but there is one instance of an unclean shutdown that I'm having trouble tracking down.
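
For reference, the setting we bumped is, I believe, the NodeManager's SIGTERM-to-SIGKILL delay; the property name below is my reading of the YARN docs and the value is only illustrative, not a recommendation:

    <!-- yarn-site.xml: delay (ms) between SIGTERM and SIGKILL for containers -->
    <property>
      <name>yarn.nodemanager.sleep-delay-before-sigkill.ms</name>
      <value>30000</value>
    </property>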
Based on the file system trace of this unknown process, it appears to be the AppMaster or some other manager process. In particular, it examines all of the files related to the job (e.g. every teragen output file for each map task, such as /in-dir/_temporary/1/task_1413987694759_0002_m_000018/part-m-00018), and its very last set of operations removes a number of configuration files, jar files, and directories, finishing with the job staging directory itself (i.e. /tmp/hadoop-yarn/staging/hadoop/.staging/job_1413987694759_0002). So the first question is: based on this behavior, which process is this? (The full trace is here: http://pastebin.com/SVCfRfA4.) After that final job directory is removed, the fs trace cuts off, suggesting the process exited or was killed immediately. So the second question is: given which process this is (e.g. the app master), what might be causing the unclean shutdown, and is there a way to control it?

Thanks,
Noah
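
P.S. To frame the second question a bit: my working assumption (not something visible in the trace) is that Hadoop's FileSystem cache normally closes cached instances from a JVM shutdown hook ("fs.automatic.close", default true), so if the JVM is SIGKILLed before that hook runs, close() never completes on the Ceph client. As a minimal sketch of what I mean by a clean, synchronous close, here is a hypothetical standalone driver using only the public FileSystem API (this is not how the AM itself is written, just an illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ExplicitFsClose {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // ... normal file system work would happen here ...
            fs.exists(new Path("/tmp"));

            // Close every cached FileSystem instance before the JVM exits,
            // rather than relying on the fs.automatic.close shutdown hook,
            // which loses the race if the process is killed first.
            FileSystem.closeAll();
        }
    }

If the AppMaster relies on that shutdown hook and is then killed (or exits) before the hook finishes, that would explain the truncated trace, which is why I'm asking whether there is a knob to control it.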
