[ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Varun Vasudev updated YARN-4309:
--------------------------------
    Attachment: YARN-4309.001.patch

Uploaded an initial version of the patch.

Collecting the information only for failures is difficult; collecting it for all runs is easier. In particular, collecting it only for failures in secure mode is a lot harder and requires changes to container-executor. I've made generation of the additional debug information optional, with the default set to false.

The patch creates a copy of launch_container.sh, the output of ls and the output of "find -L . -maxdepth 5 -ls". There's no particular reason for maxdepth 5 - I'm happy to change it if someone feels some other value is more appropriate. The reason for using both find and ls is that ls shows the symlinks themselves, whereas find gives you the size of the file each symlink points to.

This version of the patch is for Linux only. If someone knows the changes needed for Windows, I'll add those in.

Just for information, for a mapreduce pi job, this is what was generated for the directory contents:
{code}
ls:
total 32
-rw-r--r-- 1 varun varun 129 Nov 26 19:47 container_tokens
-rwx------ 1 varun varun 702 Nov 26 19:47 default_container_executor_session.sh
-rwx------ 1 varun varun 756 Nov 26 19:47 default_container_executor.sh
lrwxrwxrwx 1 varun varun 113 Nov 26 19:47 job.jar -> /var/hadoop/hadoop-3-data/grid/local/usercache/varun/appcache/application_1448547413698_0001/filecache/10/job.jar
lrwxrwxrwx 1 varun varun 114 Nov 26 19:47 job.xml -> /var/hadoop/hadoop-3-data/grid2/local/usercache/varun/appcache/application_1448547413698_0001/filecache/13/job.xml
-rwx------ 1 varun varun 4941 Nov 26 19:47 launch_container.sh
drwx--x--- 2 varun varun 4096 Nov 26 19:47 tmp

find:
1079692 4 drwx--x--- 3 varun varun 4096 Nov 26 19:47 .
1074586 4 -rw-r--r-- 1 varun varun 16 Nov 26 19:47 ./.default_container_executor.sh.crc
1074581 8 -rwx------ 1 varun varun 4941 Nov 26 19:47 ./launch_container.sh
1049070 104 -r-x------ 1 varun varun 105105 Nov 26 19:47 ./job.xml
1873872 4 drwx------ 2 varun varun 4096 Nov 26 19:47 ./job.jar
1873870 272 -r-x------ 1 varun varun 275886 Nov 26 19:47 ./job.jar/job.jar
1079695 4 drwx--x--- 2 varun varun 4096 Nov 26 19:47 ./tmp
1074582 4 -rw-r--r-- 1 varun varun 48 Nov 26 19:47 ./.launch_container.sh.crc
1074580 4 -rw-r--r-- 1 varun varun 12 Nov 26 19:47 ./.container_tokens.crc
1074585 4 -rwx------ 1 varun varun 756 Nov 26 19:47 ./default_container_executor.sh
1074583 4 -rwx------ 1 varun varun 702 Nov 26 19:47 ./default_container_executor_session.sh
1074579 4 -rw-r--r-- 1 varun varun 129 Nov 26 19:47 ./container_tokens
1074584 4 -rw-r--r-- 1 varun varun 16 Nov 26 19:47 ./.default_container_executor_session.sh.crc
{code}

> Add debug information to application logs when a container fails
> ----------------------------------------------------------------
>
>                 Key: YARN-4309
>                 URL: https://issues.apache.org/jira/browse/YARN-4309
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: YARN-4309.001.patch
>
>
> Sometimes when a container fails, it can be pretty hard to figure out why it
> failed.
> My proposal is that if a container fails, we collect information about the
> container local dir and dump it into the container log dir. Ideally, I'd like
> to tar up the directory entirely, but I'm not sure of the security and space
> implications of such an approach. At the very least, we can list all the files
> in the container local dir, and dump the contents of launch_container.sh (into
> the container log dir).
> When log aggregation occurs, all this information will automatically get
> collected and make debugging such failures much easier.
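
For readers who want to try the collection step by hand, the three artifacts described in the comment (a copy of launch_container.sh, the ls output and the find output) can be approximated with a small shell function. This is only a sketch and is not taken from YARN-4309.001.patch: the function name collect_debug_info, its two arguments and the output file name directory.info are assumptions made for illustration, and "ls -l" is assumed because the sample output is in long format.

```shell
# Hypothetical helper, not the code from the patch.
# Usage: collect_debug_info <container_work_dir> <container_log_dir>
# The body runs in a subshell, so "cd" and the variables do not leak out.
collect_debug_info() (
  work_dir=$1
  log_dir=$2

  # Create the log dir and resolve it to an absolute path, because we
  # change directory below.
  mkdir -p "$log_dir" || exit 1
  log_dir=$(cd "$log_dir" && pwd) || exit 1

  # Keep a copy of the launch script, if there is one.
  if [ -f "$work_dir/launch_container.sh" ]; then
    cp "$work_dir/launch_container.sh" "$log_dir/launch_container.sh" || exit 1
  fi

  # Run both listings from inside the work dir so paths come out as "./...".
  cd "$work_dir" || exit 1
  {
    echo "ls:"
    ls -l                        # shows symlinks and where they point
    echo "find:"
    find -L . -maxdepth 5 -ls    # follows symlinks, so sizes are those of the targets
  } > "$log_dir/directory.info"  # assumed file name
)
```

Calling collect_debug_info with a container's working directory and its log directory would leave launch_container.sh and directory.info in the log directory, where log aggregation could pick them up as described in the issue.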
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)