Hello Hadoop and YARN dev community!
I work as a Hadoop sysadmin on large Hadoop clusters. We run various kinds
of applications on YARN, including Spark on YARN.
Some Spark applications are causing trouble because they are Spark Streaming
jobs (long-running applications): they fill up all the disk space with the
container stderr and stdout files under:
=================================================================
/var/log/hadoop-yarn/application_*/container_*/(stderr|stdout)
=================================================================
When this happens, we currently purge the directory completely and restart
the NodeManager. This is not clean at all, and I would like to fix it properly.
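For reference, our current workaround is roughly the following (the log
directory layout and service name are the defaults on our HDP nodes; this is
a sketch of what we do, not a recommended procedure):

```shell
# purge_container_logs: delete the per-container stderr/stdout files
# that fill up the disk under the NodeManager local log directory.
purge_container_logs() {
    # $1 = NodeManager local log dir, e.g. /var/log/hadoop-yarn
    find "$1" -type f \
        \( -name stderr -o -name stdout \) \
        -path '*/application_*/container_*/*' -delete
}

# What we actually run on an affected node (commented out here):
# purge_container_logs /var/log/hadoop-yarn
# systemctl restart hadoop-yarn-nodemanager   # release the open file handles
```

The NodeManager restart is the part I would most like to get rid of.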
Regarding YARN configuration, the following has been set on the cluster side:
=================================================================
yarn.log-aggregation-enable: true
yarn.nodemanager.remote-app-log-dir: /app-logs
yarn.nodemanager.remote-app-log-dir-suffix: logs
yarn.nodemanager.log-aggregation.compression-type: gz
yarn.log-aggregation.retain-seconds: 432000
yarn.log-aggregation.retain-check-interval-seconds: -1
yarn.nodemanager.log-aggregation.policy.class:
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AllContainerLogAggregationPolicy
yarn.log-aggregation.file-formats: TFile
yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds: 3600
yarn.nodemanager.log-aggregation.num-log-files-per-app: 30
yarn.nodemanager.remote-app-log-dir-include-older: true
yarn.nodemanager.log.retain-seconds: 604800
yarn.nodemanager.delete.debug-delay-sec: 0
yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds.min: -1
=================================================================
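One aside on the values above: if I read yarn-default.xml correctly, a
retain-check-interval of -1 (or any non-positive value) means the deletion
check interval is computed as one tenth of the retention time, so with our
settings:

```shell
# Assumed default behaviour from yarn-default.xml: a non-positive
# retain-check-interval falls back to retain-seconds / 10.
retain_seconds=432000                     # 5 days
check_interval=$((retain_seconds / 10))   # -> 43200 s = 12 hours
echo "$check_interval"
```

So aggregated logs should be checked for deletion every 12 hours, which looks
fine to me; the real problem is the local files before aggregation.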
The log4j config is customized only in the following part:
=================================================================
# Appender for ResourceManager Application Summary Log
# Requires the following properties to be set
# - hadoop.log.dir (Hadoop Log directory)
# - yarn.server.resourcemanager.appsummary.log.file (resource manager app
summary log filename)
# - yarn.server.resourcemanager.appsummary.logger (resource manager app
summary log level and appender)
log4j.appender.RMSUMMARY=org.apache.log4j.RollingFileAppender
log4j.appender.RMSUMMARY.File=${yarn.log.dir}/${yarn.server.resourcemanager.appsummary.log.file}
log4j.appender.RMSUMMARY.MaxFileSize={{yarn_rm_summary_log_max_backup_size}}MB
log4j.appender.RMSUMMARY.MaxBackupIndex={{yarn_rm_summary_log_number_of_backup_files}}
log4j.appender.RMSUMMARY.layout=org.apache.log4j.PatternLayout
log4j.appender.RMSUMMARY.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
=================================================================
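To make the log4j question more concrete: what I had in mind (untested, the
appender name is my own) was shipping a custom log4j.properties to the
executors with --files, pointing spark.executor.extraJavaOptions at it with
-Dlog4j.configuration=file:log4j.properties, so the executors write to a
rolling file inside the container log directory instead of an ever-growing
stderr, something like:
=================================================================
# Hypothetical executor-side log4j.properties (untested sketch)
log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
# spark.yarn.app.container.log.dir is set by Spark inside YARN containers
log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.rolling.MaxFileSize=100MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
=================================================================
Is that the right direction, or does YARN still capture whatever the process
writes to its stderr regardless?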
Additional notes on the versions used: Hadoop v3.1.1.3.1 (HDP-based + internal
patches), Spark 3.2.2 (Apache + internal patches).
What can I do to prevent stderr and stdout from growing so large that they
can take down an entire cluster? Can those files be rotated while the
application is running? Am I missing something in the log4j configuration?
Also, are there specific settings our developers could apply so that the
application rotates its logs or generates less output? (I have not been able
to see the application code.)
Thanks for any help or advice.
Regards,
Pierre