[jira] [Comment Edited] (YARN-10743) Add a policy for not aggregating for containers which are killed because exceeding container log size limit.

Qi Zhu (Jira) Mon, 19 Apr 2021 20:03:05 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325432#comment-17325432
 ]


Qi Zhu edited comment on YARN-10743 at 4/20/21, 3:02 AM:
---------------------------------------------------------

Thanks [~ebadger] for the reply.

One case from our cluster:

If a container in a long-running Flink job is killed for producing an oversized 
log, we can already find the problem on the Flink side, and another container is 
launched to take over the work. If the problem happens again, the logs of the new 
container are aggregated as well (each log is actually larger than 100G, and it 
is always caused by user printing). In the long-running case the abnormal logs 
can add up to more than 1TB, which puts heavy pressure on HDFS.

I also agree with you that we want to check the logs in the normal case, but if 
we can add this as an option, I think it would be useful for long-running jobs 
when abnormal user printing happens.
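
The option proposed above could look roughly like the following. This is only a self-contained sketch, not the attached patch: `LogContext` is a stand-in for `ContainerLogContext`, and `ASSUMED_EXCESS_LOG_EXIT_CODE` is a hypothetical placeholder for whatever exit status YARN-10471 assigns to containers killed for exceeding the log size limit.

```java
// Sketch of an opt-in policy that skips aggregation only for containers
// killed because their logs exceeded the size limit.
final class SkipOversizedLogPolicySketch {

  // ASSUMPTION: placeholder value, not taken from the Hadoop source.
  static final int ASSUMED_EXCESS_LOG_EXIT_CODE = -1000;

  /** Minimal stand-in for ContainerLogContext; carries only the exit code. */
  static final class LogContext {
    private final int exitCode;

    LogContext(int exitCode) {
      this.exitCode = exitCode;
    }

    int getExitCode() {
      return exitCode;
    }
  }

  /** Aggregate logs for every container except those killed for excess logs. */
  static boolean shouldDoLogAggregation(LogContext logContext) {
    return logContext.getExitCode() != ASSUMED_EXCESS_LOG_EXIT_CODE;
  }

  public static void main(String[] args) {
    int[] exitCodes = {0, 1, ASSUMED_EXCESS_LOG_EXIT_CODE};
    for (int code : exitCodes) {
      System.out.println("exit code " + code + " -> aggregate: "
          + shouldDoLogAggregation(new LogContext(code)));
    }
  }
}
```

With such a policy, normal and ordinarily failed containers keep their aggregated logs, and only the oversized-log case is skipped.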

 

!image-2021-04-20-10-41-01-057.png|width=781,height=85!

{color:#de350b}Here the exit code of the log context is always 0{color}, so a 
policy such as the following does not actually take effect:
{code:java}
public class FailedContainerLogAggregationPolicy extends
    AbstractContainerLogAggregationPolicy {
  public boolean shouldDoLogAggregation(ContainerLogContext logContext) {
    int exitCode = logContext.getExitCode();
    return exitCode != 0 && exitCode != ExitCode.FORCE_KILLED.getExitCode()
        && exitCode != ExitCode.TERMINATED.getExitCode();
  }
}
{code}
Did I miss some code where these policies take effect?

cc [~ebadger] [~Jim_Brennan] [~epayne] 

What is your opinion on this code in 
AppLogAggregatorImpl#uploadLogsForContainers?
{code:java}
if (shouldUploadLogs(new ContainerLogContext(
    container.getContainerId(), containerType, 0))) {
  pendingContainerInThisCycle.add(container.getContainerId());
}
{code}
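
To make the concern concrete, here is a small standalone sketch of what happens if a failed-only predicate is only ever evaluated with a literal 0. It does not use the Hadoop classes: `failedOnlyPolicy` mirrors the condition quoted above, `shouldUploadWithHardCodedZero` is a hypothetical helper imitating this call site, and the values 137 and 143 are assumed stand-ins for `ExitCode.FORCE_KILLED` and `ExitCode.TERMINATED`. Whether other call sites pass the real exit code is exactly the open question.

```java
// Sketch: a failed-only policy evaluated with a hard-coded exit code of 0.
final class ExitCodeZeroSketch {

  // ASSUMPTION: stand-ins for ExitCode.FORCE_KILLED / ExitCode.TERMINATED,
  // following the usual 128 + signal number convention (SIGKILL, SIGTERM).
  static final int FORCE_KILLED = 137;
  static final int TERMINATED = 143;

  /** Same condition as the quoted FailedContainerLogAggregationPolicy. */
  static boolean failedOnlyPolicy(int exitCode) {
    return exitCode != 0 && exitCode != FORCE_KILLED
        && exitCode != TERMINATED;
  }

  /** Imitates a call site that ignores the real exit code and passes 0. */
  static boolean shouldUploadWithHardCodedZero(int realExitCode) {
    return failedOnlyPolicy(0);
  }

  public static void main(String[] args) {
    int[] realExitCodes = {0, 1, FORCE_KILLED, TERMINATED};
    for (int code : realExitCodes) {
      System.out.println("real exit code " + code
          + ": policy on real code = " + failedOnlyPolicy(code)
          + ", policy with hard-coded 0 = "
          + shouldUploadWithHardCodedZero(code));
    }
  }
}
```

In this sketch a container that really exited with code 1 is selected when the real code is used, but never when the literal 0 is passed.
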
Thanks.


was (Author: zhuqi):
Thanks [~ebadger] for the reply.

One case from our cluster:

If a container in a long-running Flink job is killed for producing an oversized 
log, we can already find the problem on the Flink side, and another container is 
launched to take over the work. If the problem happens again, the logs of the new 
container are aggregated as well (each log is actually larger than 100G, and it 
is always caused by user printing). In the long-running case the abnormal logs 
can add up to more than 1TB, which puts heavy pressure on HDFS.

I also agree with you that we want to check the logs in the normal case, but if 
we can add this as an option, I think it would be useful for long-running jobs 
when abnormal user printing happens.

 

Another problem about the code:

!image-2021-04-20-10-41-01-057.png|width=781,height=85!

{color:#de350b}Here the exit code of the log context is always 0{color}, so a 
policy such as the following does not actually take effect:
{code:java}
public class FailedContainerLogAggregationPolicy extends
    AbstractContainerLogAggregationPolicy {
  public boolean shouldDoLogAggregation(ContainerLogContext logContext) {
    int exitCode = logContext.getExitCode();
    return exitCode != 0 && exitCode != ExitCode.FORCE_KILLED.getExitCode()
        && exitCode != ExitCode.TERMINATED.getExitCode();
  }
}
{code}
I think it's wrong.

cc [~ebadger] [~Jim_Brennan] [~epayne] 

What is your opinion on this code in 
AppLogAggregatorImpl#uploadLogsForContainers?
{code:java}
if (shouldUploadLogs(new ContainerLogContext(
    container.getContainerId(), containerType, 0))) {
  pendingContainerInThisCycle.add(container.getContainerId());
}
{code}

> Add a policy for not aggregating for containers which are killed because 
> exceeding container log size limit.
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10743
>                 URL: https://issues.apache.org/jira/browse/YARN-10743
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Qi Zhu
>            Assignee: Qi Zhu
>            Priority: Major
>         Attachments: YARN-10743.001.patch, image-2021-04-20-10-41-01-057.png
>
>
> Since YARN-10471 added support for killing containers that exceed the 
> container log size limit, we should add a policy that skips log aggregation 
> for those containers, in order to reduce the pressure on HDFS, etc.
> cc [~epayne] [~Jim_Brennan] [~ebadger]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
