[ https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902291#comment-13902291 ]

Ming Ma commented on YARN-221:
------------------------------

[Chris Trezzo|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ctrezzo], 
[Gera Shegalov|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=jira.shegalov] 
and I discussed this further. We would like to give some updates and get 
feedback from others. Similar to what Robert suggested originally, we need to 
provide a way for the AM to update the log aggregation policy when it stops the 
container.

One likely log aggregation policy for MRAppMaster is to aggregate the logs of 
all failed tasks and sample the logs of some successful tasks. What we found is 
that the container exit code isn't a reliable indication of whether an MR task 
finished successfully. That is because MRAppMaster calls stopContainer while the 
YarnChild JVM is exiting on its own; depending on the timing, you might get a 
non-zero exit code for a successful task. So specifying the log aggregation 
policy up front in the ContainerLaunchContext isn't enough.

The mechanism for the AM to pass the log aggregation policy to YARN needs to 
address several different scenarios.

1. Containers exit by themselves. DistributedShell belongs to this category.
2. The AM has to explicitly stop the containers. MR belongs to this category.
3. The AM might want to inform the NM to do on-demand log aggregation without 
stopping the container. This might be useful for some long-running applications.

To support #1, we have to specify the log aggregation policy as part of the 
startContainer call. Chris' patch handles that.
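
As a point of reference, the policy record itself could be as simple as an enum. 
The sketch below is illustrative only; the values are borrowed from the cases in 
the original issue description (don't aggregate, aggregate, aggregate with a 
lower priority) and may not match what's in Chris' patch.

{code:title=ContainerLogAggregationPolicy.java (sketch)|borderStyle=solid}
/**
 * Sketch only: one possible shape for the policy passed in the
 * startContainer/stopContainer calls. The values mirror the cases in the
 * issue description below and are not final API.
 */
@Public
@Stable
public enum ContainerLogAggregationPolicy {
  /** Skip log aggregation for this container entirely. */
  DO_NOT_AGGREGATE,
  /** Aggregate this container's logs. */
  AGGREGATE,
  /** Aggregate this container's logs, but at a lower priority. */
  AGGREGATE_LOW_PRIORITY
}
{code}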

To support #2, the AM has to indicate to the NM whether log aggregation is 
needed as part of the stopContainer call. The AM can use different types of 
policies, such as sampling successful tasks. For that, the AM will specify the 
log aggregation policy as part of StopContainerRequest.

{code:title=StopContainerRequest.java|borderStyle=solid}

...

  /**
   * Get the <code>ContainerLogAggregationPolicy</code> for the container.
   *
   * @return The <code>ContainerLogAggregationPolicy</code> for the container.
   */  
  @Public
  @Stable
  public abstract ContainerLogAggregationPolicy getLogAggregationPolicy();

  /**
   * Set the <code>ContainerLogAggregationPolicy</code> for the container.
   *
   * @param policy The <code>ContainerLogAggregationPolicy</code> for the 
container.
   */
  @Public
  @Stable
  public abstract void setLogAggregationPolicy(ContainerLogAggregationPolicy policy);
{code}
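
For example, an MRAppMaster-style policy (aggregate every failed task, sample 
some of the successful ones) could then be applied per container at stop time. 
The snippet below is only illustrative; it assumes the proposed setter above, 
the enum values from the earlier sketch, and a made-up 10% sampling rate.

{code:title=Hypothetical AM-side usage|borderStyle=solid}
import java.util.Random;

import org.apache.hadoop.yarn.api.protocolrecords.StopContainerRequest;

public class StopContainerPolicyExample {
  private static final Random RANDOM = new Random();

  /**
   * Apply an MRAppMaster-style policy: aggregate the logs of every failed task
   * and sample roughly 10% of successful ones. Uses the proposed setter above;
   * the enum values come from the earlier sketch and are not final API.
   */
  static void applyLogAggregationPolicy(StopContainerRequest request, boolean taskSucceeded) {
    if (!taskSucceeded || RANDOM.nextDouble() < 0.1) {
      request.setLogAggregationPolicy(ContainerLogAggregationPolicy.AGGREGATE);
    } else {
      request.setLogAggregationPolicy(ContainerLogAggregationPolicy.DO_NOT_AGGREGATE);
    }
  }
}
{code}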


Alternatively, we could define a new interface called ContainerStopContext to 
capture the log aggregation policy and any other information we might want to 
include later.

{code:title=StopContainerRequest.java|borderStyle=solid}

  @Public
  @Stable
  public abstract ContainerStopContext getContainerStopContext();

  @Public
  @Stable
  public abstract void setContainerStopContext(ContainerStopContext context);

{code}
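
For illustration, ContainerStopContext itself might start out as little more 
than a holder for the policy. The shape below is a sketch only; the log 
aggregation policy is the only field grounded in this discussion.

{code:title=ContainerStopContext.java (sketch)|borderStyle=solid}
/**
 * Sketch only: a minimal ContainerStopContext. Anything beyond the log
 * aggregation policy would be added as needed.
 */
@Public
@Stable
public abstract class ContainerStopContext {

  public abstract ContainerLogAggregationPolicy getLogAggregationPolicy();

  public abstract void setLogAggregationPolicy(ContainerLogAggregationPolicy policy);

  // Room for other stop-time information (e.g. diagnostics) later.
}
{code}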


To support #3, we need a new API such as updateContainer so that the AM can ask 
the NM to roll the container log, update the log aggregation policy, and so on.
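
A rough sketch of what such an updateContainer request could carry is below. The 
request name and fields are purely illustrative; they just mirror the two 
operations mentioned above (roll and aggregate the log, update the policy).

{code:title=UpdateContainerRequest.java (sketch)|borderStyle=solid}
/**
 * Illustrative only: an on-demand log aggregation request for long-running
 * containers. Nothing here is final API.
 */
@Public
@Unstable
public abstract class UpdateContainerRequest {

  /** Ask the NM to roll the container log and aggregate what has been written so far. */
  public abstract boolean getRollAndAggregateLogs();

  public abstract void setRollAndAggregateLogs(boolean rollAndAggregate);

  /** Update the policy applied when the container eventually stops. */
  public abstract ContainerLogAggregationPolicy getLogAggregationPolicy();

  public abstract void setLogAggregationPolicy(ContainerLogAggregationPolicy policy);
}
{code}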


> NM should provide a way for AM to tell it not to aggregate logs.
> ----------------------------------------------------------------
>
>                 Key: YARN-221
>                 URL: https://issues.apache.org/jira/browse/YARN-221
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Robert Joseph Evans
>            Assignee: Chris Trezzo
>         Attachments: YARN-221-trunk-v1.patch
>
>
> The NodeManager should provide a way for an AM to tell it that either the 
> logs should not be aggregated, that they should be aggregated with a high 
> priority, or that they should be aggregated but with a lower priority.  The 
> AM should be able to do this in the ContainerLaunch context to provide a 
> default value, but should also be able to update the value when the container 
> is released.
> This would allow for the NM to not aggregate logs in some cases, and avoid 
> connection to the NN at all.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
