[ 
https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-4309:
--------------------------------
    Attachment: YARN-4309.001.patch

Uploaded an initial version of the patch. Collecting the information only for 
failures is difficult; collecting it for every run is easier. In particular, 
collecting it only on failure in secure mode is much harder and requires 
changes to container-executor. I've made generation of the additional debug 
information optional, with the default set to false.

The patch saves a copy of launch_container.sh, the output of ls, and the 
output of "find -L . -maxdepth 5 -ls".

There's no particular reason for maxdepth 5 - I'm happy to change it if someone 
feels some other value is more appropriate. The reason for find and ls is that 
ls will output the symlinks whereas find gives you the size of the file pointed 
to by the symlink.
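
As a rough illustration of the collection step described above, here is a minimal shell sketch. The function name collect_debug_info, its two arguments, and the output file name directory.info are hypothetical and are not taken from the patch; the only parts that come from this comment are the copy of launch_container.sh, the ls listing, and the "find -L . -maxdepth 5 -ls" invocation. The long ls format is assumed because the sample listing shows symlink targets.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the patch): record the contents of a
# container working directory into a log directory.
collect_debug_info() {
  local work_dir="$1"   # container local dir (assumed argument)
  local log_dir="$2"    # container log dir (assumed argument)

  mkdir -p "$log_dir" || return 1

  # Keep a copy of the launch script, if one is present.
  if [ -f "$work_dir/launch_container.sh" ]; then
    cp "$work_dir/launch_container.sh" "$log_dir/launch_container.sh"
  fi

  # ls -l shows each symlink and its target; find -L follows symlinks and
  # reports the size of the file each one points to.
  (
    cd "$work_dir" || exit 1
    echo "ls:"
    ls -l
    echo "find:"
    find -L . -maxdepth 5 -ls
  ) > "$log_dir/directory.info" 2>&1
}
```

Calling collect_debug_info with a container local dir and a container log dir would leave both files in the log dir, where log aggregation could pick them up.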

This version of the patch is for Linux only. If someone knows the changes for 
Windows, I'll add those in.

Just for information, for a mapreduce pi job, this is what was generated for 
the directory contents:
{code}
ls:
total 32
-rw-r--r-- 1 varun varun  129 Nov 26 19:47 container_tokens
-rwx------ 1 varun varun  702 Nov 26 19:47 default_container_executor_session.sh
-rwx------ 1 varun varun  756 Nov 26 19:47 default_container_executor.sh
lrwxrwxrwx 1 varun varun  113 Nov 26 19:47 job.jar -> /var/hadoop/hadoop-3-data/grid/local/usercache/varun/appcache/application_1448547413698_0001/filecache/10/job.jar
lrwxrwxrwx 1 varun varun  114 Nov 26 19:47 job.xml -> /var/hadoop/hadoop-3-data/grid2/local/usercache/varun/appcache/application_1448547413698_0001/filecache/13/job.xml
-rwx------ 1 varun varun 4941 Nov 26 19:47 launch_container.sh
drwx--x--- 2 varun varun 4096 Nov 26 19:47 tmp
find:
1079692    4 drwx--x---   3 varun    varun        4096 Nov 26 19:47 .
1074586    4 -rw-r--r--   1 varun    varun          16 Nov 26 19:47 ./.default_container_executor.sh.crc
1074581    8 -rwx------   1 varun    varun        4941 Nov 26 19:47 ./launch_container.sh
1049070  104 -r-x------   1 varun    varun      105105 Nov 26 19:47 ./job.xml
1873872    4 drwx------   2 varun    varun        4096 Nov 26 19:47 ./job.jar
1873870  272 -r-x------   1 varun    varun      275886 Nov 26 19:47 ./job.jar/job.jar
1079695    4 drwx--x---   2 varun    varun        4096 Nov 26 19:47 ./tmp
1074582    4 -rw-r--r--   1 varun    varun          48 Nov 26 19:47 ./.launch_container.sh.crc
1074580    4 -rw-r--r--   1 varun    varun          12 Nov 26 19:47 ./.container_tokens.crc
1074585    4 -rwx------   1 varun    varun         756 Nov 26 19:47 ./default_container_executor.sh
1074583    4 -rwx------   1 varun    varun         702 Nov 26 19:47 ./default_container_executor_session.sh
1074579    4 -rw-r--r--   1 varun    varun         129 Nov 26 19:47 ./container_tokens
1074584    4 -rw-r--r--   1 varun    varun          16 Nov 26 19:47 ./.default_container_executor_session.sh.crc
{code}

> Add debug information to application logs when a container fails
> ----------------------------------------------------------------
>
>                 Key: YARN-4309
>                 URL: https://issues.apache.org/jira/browse/YARN-4309
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: YARN-4309.001.patch
>
>
> Sometimes when a container fails, it can be pretty hard to figure out why it 
> failed.
> My proposal is that if a container fails, we collect information about the 
> container local dir and dump it into the container log dir. Ideally, I'd like 
> to tar up the directory entirely, but I'm not sure of the security and space 
> implications of such an approach. At the very least, we can list all the files 
> in the container local dir, and dump the contents of launch_container.sh (into 
> the container log dir).
> When log aggregation occurs, all this information will automatically get 
> collected and make debugging such failures much easier.



