Hello Hadoop and YARN dev community!
I work as a Hadoop sysadmin on large Hadoop clusters. We run various kinds
of applications on YARN, including Spark on YARN.
Some Spark applications are causing trouble because they are Spark Streaming
jobs (long-running applications): they fill up all the disk space with the
container stderr and stdout files under:
=================================================================
/var/log/hadoop-yarn/application_*/container_*/(stderr|stdout)
=================================================================
When this happens, we currently purge the directory completely and restart
the NodeManager. This is not clean at all, and I would like to fix it properly.
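For reference, our current workaround is roughly the following (the log
directory layout and service name are the defaults on our HDP nodes; this is
a sketch of what we do, not a recommended procedure):

```shell
# purge_container_logs: delete the per-container stderr/stdout files
# that fill up the disk under the NodeManager local log directory.
purge_container_logs() {
    # $1 = NodeManager local log dir, e.g. /var/log/hadoop-yarn
    find "$1" -type f \
        \( -name stderr -o -name stdout \) \
        -path '*/application_*/container_*/*' -delete
}

# What we actually run on an affected node (commented out here):
# purge_container_logs /var/log/hadoop-yarn
# systemctl restart hadoop-yarn-nodemanager   # release the open file handles
```

The NodeManager restart is the part I would most like to get rid of.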
Regarding YARN configuration, the following has been set on the cluster side:
=================================================================
yarn.log-aggregation-enable: true
yarn.nodemanager.remote-app-log-dir: /app-logs
yarn.nodemanager.remote-app-log-dir-suffix: logs
yarn.nodemanager.log-aggregation.compression-type: gz
yarn.log-aggregation.retain-seconds: 432000
yarn.log-aggregation.retain-check-interval-seconds: -1
yarn.nodemanager.log-aggregation.policy.class:
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AllContainerLogAggregationPolicy
yarn.log-aggregation.file-formats: TFile
yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds: 3600
yarn.nodemanager.log-aggregation.num-log-files-per-app: 30
yarn.nodemanager.remote-app-log-dir-include-older: true
yarn.nodemanager.log.retain-seconds: 604800
yarn.nodemanager.delete.debug-delay-sec: 0
yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds.min: -1
=================================================================
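One aside on the values above: if I read yarn-default.xml correctly, a
retain-check-interval of -1 (or any non-positive value) means the deletion
check interval is computed as one tenth of the retention time, so with our
settings:

```shell
# Assumed default behaviour from yarn-default.xml: a non-positive
# retain-check-interval falls back to retain-seconds / 10.
retain_seconds=432000                     # 5 days
check_interval=$((retain_seconds / 10))   # -> 43200 s = 12 hours
echo "$check_interval"
```

So aggregated logs should be checked for deletion every 12 hours, which looks
fine to me; the real problem is the local files before aggregation.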
The log4j config is customized only in the following part:
=================================================================
# Appender for ResourceManager Application Summary Log
# Requires the following properties to be set
# - hadoop.log.dir (Hadoop Log directory)
# - yarn.server.resourcemanager.appsummary.log.file (resource manager app
summary log filename)
# - yarn.server.resourcemanager.appsummary.logger (resource manager app
summary log level and appender)
log4j.appender.RMSUMMARY=org.apache.log4j.RollingFileAppender
log4j.appender.RMSUMMARY.File=${yarn.log.dir}/${yarn.server.resourcemanager.appsummary.log.file}
log4j.appender.RMSUMMARY.MaxFileSize={{yarn_rm_summary_log_max_backup_size}}MB
log4j.appender.RMSUMMARY.MaxBackupIndex={{yarn_rm_summary_log_number_of_backup_files}}
log4j.appender.RMSUMMARY.layout=org.apache.log4j.PatternLayout
log4j.appender.RMSUMMARY.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
=================================================================
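To make the log4j question more concrete: what I had in mind (untested, the
appender name is my own) was shipping a custom log4j.properties to the
executors with --files, pointing spark.executor.extraJavaOptions at it with
-Dlog4j.configuration=file:log4j.properties, so the executors write to a
rolling file inside the container log directory instead of an ever-growing
stderr, something like:
=================================================================
# Hypothetical executor-side log4j.properties (untested sketch)
log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
# spark.yarn.app.container.log.dir is set by Spark inside YARN containers
log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.rolling.MaxFileSize=100MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
=================================================================
Is that the right direction, or does YARN still capture whatever the process
writes to its stderr regardless?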
Additional notes on the versions used: Hadoop v3.1.1.3.1 (HDP-based + internal
patches), Spark 3.2.2 (Apache + internal patches).
What can I do to prevent stderr and stdout from growing so large that they
can take down an entire cluster? Can those files be rotated while the
application is running? Am I missing something in the log4j configuration?
Also, are there specific settings our developers could apply so that the
application rotates its logs or generates less output? (I have not been able
to see the application code.)
Thanks for any help or advice.
Regards,
Pierre