Wangda Tan commented on YARN-4309:

Hi [~vvasudev],

Thanks for working on this task; it's really useful for identifying container 
launch issues. Some questions/comments:
- Since the debug information fetch script (such as the copy script and the 
file listing) is at the end of launch_container.sh, is it possible that a 
container is killed before that script can be executed?
- Do you think it would be better to generate a separate script file that 
fetches debug information before launching user code? That way we can:
1. Guarantee it will be executed.
2. Avoid adding debug information to the normal launch_container.sh.
3. Ensure the return code of the launch script is not affected by the debug 
script.
- Is it possible to enable/disable this function while the NM is running?
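
A minimal sketch of the separate-script idea from the second point, to make 
the three goals concrete. This is not the NodeManager's actual code: the 
function names (collect_debug_info, run_container) and the output file name 
directory.info are hypothetical and chosen only for illustration. Debug 
collection runs first, its failures are swallowed, and the exit status 
reported is that of the user command alone.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; names and file layout are assumptions,
# not taken from the YARN-4309 patches.

# Collect debug information about the container working directory.
# Always returns 0 so that a failure here cannot change the launch result.
collect_debug_info() {
  local work_dir="$1" log_dir="$2"
  # Record a recursive listing of the container local dir.
  ls -lR "$work_dir" > "$log_dir/directory.info" 2>&1 || true
  # Keep a copy of the launch script, if one exists.
  if [ -f "$work_dir/launch_container.sh" ]; then
    cp "$work_dir/launch_container.sh" "$log_dir/launch_container.sh" || true
  fi
  return 0
}

# Run debug collection before the user command, then run the user command.
# The function's return status is the user command's exit status.
run_container() {
  local work_dir="$1" log_dir="$2"
  shift 2
  collect_debug_info "$work_dir" "$log_dir"
  "$@"
}
```

For example, run_container /path/to/workdir /path/to/logdir sh -c 'exit 7' 
would leave the listing and the script copy in the log directory and still 
return 7. Because collection happens before the user command starts, a later 
kill of the container does not prevent the information from being written.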


> Add debug information to application logs when a container fails
> ----------------------------------------------------------------
>                 Key: YARN-4309
>                 URL: https://issues.apache.org/jira/browse/YARN-4309
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: YARN-4309.001.patch, YARN-4309.002.patch, 
> YARN-4309.003.patch, YARN-4309.004.patch, YARN-4309.005.patch
> Sometimes when a container fails, it can be pretty hard to figure out why it 
> failed.
> My proposal is that if a container fails, we collect information about the 
> container local dir and dump it into the container log dir. Ideally, I'd like 
> to tar up the directory entirely, but I'm not sure of the security and space 
> implications of such an approach. At the very least, we can list all the 
> files in the container local dir, and dump the contents of 
> launch_container.sh (into the container log dir).
> When log aggregation occurs, all this information will automatically get 
> collected and make debugging such failures much easier.

This message was sent by Atlassian JIRA