[ 
https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15045879#comment-15045879
 ] 

Ivan Mitic commented on YARN-4309:
----------------------------------

Thanks [~vvasudev], the proposal and the patch look good overall. I have a few 
mainly minor comments below, mostly related to Windows side changes.

1. {code}
    public abstract void copyDebugInformation(Path src, Path dst)
        throws IOException;
    public abstract void listDebugInformation(Path output) throws IOException;
{code}
I think it would be helpful to document what the methods are supposed to do.

2. {code}
    @Override
    public void copyDebugInformation(Path src, Path dest)
        throws IOException {
      // no need to worry about permissions - in secure mode
      // WindowsSecureContainerExecutor will set permissions
      // to allow NM to read the file
      lineWithLenCheck(String.format("cp \"%s\" \"%s\"", src.toString(),
          dest.toString()));
      errorCheck();
    }
{code}
Do we want to remove the error check above, to be consistent with Linux, and to 
avoid failing due to a "logging" failure?
Also, {{cp}} command does not exist on Windows. Please use {{copy}} instead. 

3. Why do you have both "dir" and "dir /AL /S" on Windows? Can you please 
include an inline comment with rationale.

4. In copyDebugInformation() you are also doing a chmod() internally. Wondering 
this this command should be injected by the call site given that only the 
caller has context on what the destination is and whether special permission 
handling is needed. It might be possible to change the method to only accept 
src and copy the file to the current folder, in which case it might be fine to 
use chmod() given that there is an assumption on what the current folder is. 
Just a thought you make the call.

> Add debug information to application logs when a container fails
> ----------------------------------------------------------------
>
>                 Key: YARN-4309
>                 URL: https://issues.apache.org/jira/browse/YARN-4309
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: YARN-4309.001.patch, YARN-4309.002.patch, 
> YARN-4309.003.patch, YARN-4309.004.patch, YARN-4309.005.patch, 
> YARN-4309.006.patch, YARN-4309.007.patch, YARN-4309.008.patch
>
>
> Sometimes when a container fails, it can be pretty hard to figure out why it 
> failed.
> My proposal is that if a container fails, we collect information about the 
> container local dir and dump it into the container log dir. Ideally, I'd like 
> to tar up the directory entirely, but I'm not sure of the security and space 
> implications of such a approach. At the very least, we can list all the files 
> in the container local dir, and dump the contents of launch_container.sh(into 
> the container log dir).
> When log aggregation occurs, all this information will automatically get 
> collected and make debugging such failures much easier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to