wy created YARN-11962:
-------------------------

             Summary: LogAggregationIndexedFileController.write() does not 
track uploaded files, breaking local cleanup and dedup in rolling log 
aggregation
                 Key: YARN-11962
                 URL: https://issues.apache.org/jira/browse/YARN-11962
             Project: Hadoop YARN
          Issue Type: Bug
          Components: log-aggregation, yarn
    Affects Versions: 3.4.3, 3.5.0, 3.3.6, 3.2.4, 3.1.4, 3.0.3, 2.9.2
         Environment: * OS: Ubuntu 24.04 (WSL2)
 * Java: OpenJDK 21
 * Hadoop: 3.4.2
 * Cluster: Single-node (localhost), pseudo-distributed mode
            Reporter: wy


h3. Problem

When YARN rolling log aggregation is enabled with the {{IndexedFormat}} file 
controller ({{{}LogAggregationIndexedFileController{}}}), two features are 
broken:
 # {*}Local log file cleanup{*}: Uploaded local log files are never deleted 
during rolling aggregation cycles, even when 
{{yarn.log-aggregation.enable-local-cleanup=true}} (the default).
 # {*}Upload deduplication{*}: The same log files are re-uploaded in every 
rolling cycle, causing HDFS storage waste proportional to {{{}(number_of_cycles 
× total_log_size){}}}.

Both features work correctly when using {{TFile}} format.
h3. Root Cause

{{LogAggregationIndexedFileController.write()}} (lines 360–424) never calls 
{{logValue.uploadedFiles.add(logFile)}} after successfully writing a log file 
to the aggregated output. In contrast, the TFile write path 
({{{}AggregatedLogFormat.LogValue.write(){}}}, line 287) correctly calls 
{{this.uploadedFiles.add(logFile)}} after each successful write.

Because {{uploadedFiles}} is never populated in the IndexedFormat path:
 * {{LogValue.getCurrentUpLoadedFilesPath()}} always returns an empty set
 * {{LogValue.getCurrentUpLoadedFileMeta()}} always returns an empty set

This cascades into {{{}AppLogAggregatorImpl.uploadLogsForContainers(){}}}:
{code:java}
Set<Path> uploadedFilePathsInThisCycle =
    aggregator.doContainerLogAggregation(...);
// Returns Sets.union(getCurrentUpLoadedFilesPath() /*empty*/,
//                    getObsoleteRetentionLogFiles() /*usually empty*/)
// = empty set

if (uploadedFilePathsInThisCycle.size() > 0) {   // always false
    // Local deletion logic is NEVER entered
    deletionTask = new FileDeletionTask(...);       // never created
}
{code}
And in {{{}ContainerLogAggregator.doContainerLogAggregation(){}}}:
{code:java}
this.uploadedFileMeta.addAll(
    logValue.getCurrentUpLoadedFileMeta());  // addAll(empty) → no change
// → alreadyUploadedLogFiles stays empty
// → dedup filter passes all files
// → same files re-uploaded every cycle
{code}
h3. Code Comparison

*TFile path (correct)* — {{{}AggregatedLogFormat.LogValue.write(){}}}:
{code:java}
for (File logFile : fileList) {
    // ... write bytes ...
    this.uploadedFiles.add(logFile);  // ← tracks uploaded file
}
{code}
*IndexedFormat path (buggy)* — 
{{{}LogAggregationIndexedFileController.write(){}}}:
{code:java}
for (File logFile : pendingUploadFiles) {
    // ... write bytes ...
    // ← missing: logValue.uploadedFiles.add(logFile)
    metas.add(meta);  // only IndexedFileLogMeta is tracked, not uploadedFiles
}
{code}
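The effect of the missing call can be illustrated with a small standalone model. This is only a sketch under simplifying assumptions: the class {{RollingAggregationSketch}}, its {{simulate}} method, and the {{Result}} type are hypothetical and are not YARN APIs; files are identified by name only and are assumed not to change between cycles. The flag {{trackUploadedFiles}} stands in for whether the write path records each file after writing it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of rolling log aggregation. None of these names are YARN APIs. */
class RollingAggregationSketch {

  /** Outcome of one simulated application run. */
  static final class Result {
    final int totalUploads;   // number of file writes to the aggregated output
    final int localFilesLeft; // files still on local disk after the last cycle

    Result(int totalUploads, int localFilesLeft) {
      this.totalUploads = totalUploads;
      this.localFilesLeft = localFilesLeft;
    }
  }

  /**
   * Runs the given number of rolling cycles over two static log files.
   * trackUploadedFiles models whether the write path records each file
   * it has written (the TFile behaviour described above) or not.
   */
  static Result simulate(boolean trackUploadedFiles, int cycles) {
    List<String> localFiles = new ArrayList<>(List.of("stderr", "stdout"));
    Set<String> alreadyUploaded = new HashSet<>();
    int totalUploads = 0;

    for (int cycle = 0; cycle < cycles; cycle++) {
      Set<String> uploadedThisCycle = new HashSet<>();
      for (String file : localFiles) {
        if (alreadyUploaded.contains(file)) {
          continue; // dedup filter: skip files recorded in earlier cycles
        }
        totalUploads++; // stands in for "write bytes"
        if (trackUploadedFiles) {
          uploadedThisCycle.add(file); // the step missing in the buggy path
        }
      }
      alreadyUploaded.addAll(uploadedThisCycle);
      // Local cleanup only happens when something was recorded this cycle.
      if (!uploadedThisCycle.isEmpty()) {
        localFiles.removeAll(uploadedThisCycle);
      }
    }
    return new Result(totalUploads, localFiles.size());
  }

  public static void main(String[] args) {
    Result tracked = simulate(true, 6);
    Result untracked = simulate(false, 6);
    System.out.println("tracked:   uploads=" + tracked.totalUploads
        + " localFilesLeft=" + tracked.localFilesLeft);
    System.out.println("untracked: uploads=" + untracked.totalUploads
        + " localFilesLeft=" + untracked.localFilesLeft);
  }
}
```

With six cycles and two static files, the tracked run performs 2 uploads and leaves no local files, while the untracked run performs 12 uploads (cycles × files) and leaves both files on disk, which mirrors the two symptoms described above.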
h3. Reproduction Steps

*Prerequisites:*
 * Hadoop 3.4.2 single-node cluster (HDFS + YARN)
 * TRACE logging for {{AppLogAggregatorImpl}} in 
{{{}$HADOOP_HOME/etc/hadoop/log4j.properties{}}}:
{noformat}
log4j.logger.org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl=TRACE
log4j.logger.org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor=DEBUG
{noformat}

*Step 1 — IndexedFormat test:*

1. Configure {{{}yarn-site.xml{}}}:
{code:xml}
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.file-formats</name>
  <value>IndexedFormat</value>
</property>
<property>
  <name>yarn.log-aggregation.file-controller.IndexedFormat.class</name>
  <value>org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController</value>
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>15</value>
</property>
{code}
2. Submit a DistributedShell app that writes to stderr for 90 seconds:
{code:bash}
#!/bin/bash
# test-app.sh: writes one line per second to stderr for 90 seconds
for i in $(seq 1 90); do echo "line_$i" >&2; sleep 1; done
{code}
{code:bash}
yarn jar hadoop-yarn-applications-distributedshell-*.jar \
  org.apache.hadoop.yarn.applications.distributedshell.Client \
  --jar hadoop-yarn-applications-distributedshell-*.jar \
  --shell_script test-app.sh --shell_args "90" \
  --num_containers 1 --container_memory 256 --master_memory 256 \
  --rolling_log_pattern "stderr"
{code}
3. After completion, check the NM log ({{$APP_ID}} is the application ID, {{$NM_LOG}} the path to the NodeManager log file):
{code:bash}
grep "$APP_ID" "$NM_LOG" | grep "Uploaded the following files"
# Expected: multiple TRACE entries (one per cycle per container)
# Actual:   0 entries

grep "$APP_ID" "$NM_LOG" | grep "Deleting path.*stderr"
# Expected: deletion events for stderr file during rolling cycles
# Actual:   0 events (only whole-directory deletion at app finish)
{code}
*Step 2 — TFile control (same config, only change format):*
 # Change {{yarn.log-aggregation.file-formats}} to {{{}TFile{}}}.
 # Repeat the same test.
 # The NM log now shows TRACE "Uploaded the following files" entries *and* 
"Deleting path" events for stderr during rolling cycles.

h3. Expected Behavior

When using IndexedFormat with rolling log aggregation and 
{{{}enable-local-cleanup=true{}}}:
 * After each rolling cycle, uploaded local log files should be deleted
 * Already-uploaded files should not be re-uploaded in subsequent cycles (dedup 
via {{alreadyUploadedLogFiles}} metadata tracking)

h3. Actual Behavior
 * No local log files are ever deleted during rolling cycles (only at app 
finish via {{{}doAppLogAggregationPostCleanUp{}}})
 * Same files are re-uploaded in every rolling cycle (no dedup)
 * Zero "Uploaded the following files" TRACE log entries (code path never 
entered)



