[
https://issues.apache.org/jira/browse/YARN-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325432#comment-17325432
]
Qi Zhu edited comment on YARN-10743 at 4/20/21, 3:02 AM:
---------------------------------------------------------
Thanks [~ebadger] for the reply.
One case in our cluster:
If a container of a long-running Flink job is killed because of an oversized
log, we can already find the problem on the Flink side, and another container
is launched to take over the work. If the problem happens again, the logs of
the new container are aggregated as well (in practice more than 100 GB of log,
always caused by user printing). In the long-running case the abnormal logs can
add up to more than 1 TB, which puts heavy pressure on HDFS.
I also agree with you that we want to check the logs in the normal case, but if
we can add this as an option, I think it is useful for long-running jobs when
abnormal user printing happens.
!image-2021-04-20-10-41-01-057.png|width=781,height=85!
{color:#de350b}Here the exit code of the log context is always 0{color}, so
policies such as the following never actually take effect:
{code:java}
public class FailedContainerLogAggregationPolicy extends
    AbstractContainerLogAggregationPolicy {
  public boolean shouldDoLogAggregation(ContainerLogContext logContext) {
    int exitCode = logContext.getExitCode();
    return exitCode != 0 && exitCode != ExitCode.FORCE_KILLED.getExitCode()
        && exitCode != ExitCode.TERMINATED.getExitCode();
  }
}
{code}
Did I miss some code where these policies take effect?
cc [~ebadger] [~Jim_Brennan] [~epayne]
What are your opinions about this code in
AppLogAggregatorImpl#uploadLogsForContainers?
{code:java}
if (shouldUploadLogs(new ContainerLogContext(
    container.getContainerId(), containerType, 0))) {
  pendingContainerInThisCycle.add(container.getContainerId());
}
{code}
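To make the concern concrete, here is a minimal standalone sketch, not the actual Hadoop code. LogContext, FailedOnlyPolicy and ExitCodeDemo are hypothetical stand-ins, and the numeric values used for FORCE_KILLED and TERMINATED are assumptions. It only illustrates that a predicate of the form quoted above can never return true when every context is built with a literal 0:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical driver comparing a hard-coded exit code with the real one. */
final class ExitCodeDemo {
  /** Builds every context with a literal 0, like the quoted call site. */
  static List<String> selectWithHardCodedZero(
      FailedOnlyPolicy policy, Map<String, Integer> exitCodes) {
    List<String> pending = new ArrayList<>();
    for (String id : exitCodes.keySet()) {
      if (policy.shouldDoLogAggregation(new LogContext(id, 0))) {
        pending.add(id);
      }
    }
    return pending;
  }

  /** Builds each context with the exit code recorded for that container. */
  static List<String> selectWithRealExitCode(
      FailedOnlyPolicy policy, Map<String, Integer> exitCodes) {
    List<String> pending = new ArrayList<>();
    for (Map.Entry<String, Integer> e : exitCodes.entrySet()) {
      if (policy.shouldDoLogAggregation(
          new LogContext(e.getKey(), e.getValue()))) {
        pending.add(e.getKey());
      }
    }
    return pending;
  }

  public static void main(String[] args) {
    Map<String, Integer> exitCodes = new LinkedHashMap<>();
    exitCodes.put("container_1", 0);    // finished normally
    exitCodes.put("container_2", 1);    // failed
    exitCodes.put("container_3", 137);  // force killed (assumed value)
    FailedOnlyPolicy policy = new FailedOnlyPolicy();
    System.out.println(selectWithHardCodedZero(policy, exitCodes)); // []
    System.out.println(selectWithRealExitCode(policy, exitCodes));  // [container_2]
  }
}

/** Hypothetical stand-in for ContainerLogContext; only the exit code matters. */
final class LogContext {
  private final String containerId;
  private final int exitCode;

  LogContext(String containerId, int exitCode) {
    this.containerId = containerId;
    this.exitCode = exitCode;
  }

  String getContainerId() {
    return containerId;
  }

  int getExitCode() {
    return exitCode;
  }
}

/** Same predicate shape as the quoted policy, with assumed exit code values. */
final class FailedOnlyPolicy {
  static final int FORCE_KILLED = 137; // assumption: 128 + SIGKILL
  static final int TERMINATED = 143;   // assumption: 128 + SIGTERM

  boolean shouldDoLogAggregation(LogContext ctx) {
    int exitCode = ctx.getExitCode();
    return exitCode != 0 && exitCode != FORCE_KILLED && exitCode != TERMINATED;
  }
}
```

If the real call site behaves like selectWithHardCodedZero, any policy that decides based on the exit code, including a new one for containers killed for exceeding the log size limit, would need the actual exit code to be passed in before it can have any effect.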
Thanks.
was (Author: zhuqi):
Thanks [~ebadger] for the reply.
One case in our cluster:
If a container of a long-running Flink job is killed because of an oversized
log, we can already find the problem on the Flink side, and another container
is launched to take over the work. If the problem happens again, the logs of
the new container are aggregated as well (in practice more than 100 GB of log,
always caused by user printing). In the long-running case the abnormal logs can
add up to more than 1 TB, which puts heavy pressure on HDFS.
I also agree with you that we want to check the logs in the normal case, but if
we can add this as an option, I think it is useful for long-running jobs when
abnormal user printing happens.
Another problem about the code:
!image-2021-04-20-10-41-01-057.png|width=781,height=85!
{color:#de350b}Here the exit code of the log context is always 0{color}, so
policies such as the following never actually take effect:
{code:java}
public class FailedContainerLogAggregationPolicy extends
    AbstractContainerLogAggregationPolicy {
  public boolean shouldDoLogAggregation(ContainerLogContext logContext) {
    int exitCode = logContext.getExitCode();
    return exitCode != 0 && exitCode != ExitCode.FORCE_KILLED.getExitCode()
        && exitCode != ExitCode.TERMINATED.getExitCode();
  }
}
{code}
I think it's wrong.
cc [~ebadger] [~Jim_Brennan] [~epayne]
What are your opinions about this code in
AppLogAggregatorImpl#uploadLogsForContainers?
{code:java}
if (shouldUploadLogs(new ContainerLogContext(
    container.getContainerId(), containerType, 0))) {
  pendingContainerInThisCycle.add(container.getContainerId());
}
{code}
> Add a policy for not aggregating for containers which are killed because
> exceeding container log size limit.
> ------------------------------------------------------------------------------------------------------------
>
> Key: YARN-10743
> URL: https://issues.apache.org/jira/browse/YARN-10743
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Major
> Attachments: YARN-10743.001.patch, image-2021-04-20-10-41-01-057.png
>
>
> YARN-10471 added support for killing containers that exceed the container
> log size limit. We should add a policy that skips log aggregation for those
> containers, to reduce the pressure on HDFS etc.
> cc [~epayne] [~Jim_Brennan] [~ebadger]