[
https://issues.apache.org/jira/browse/YARN-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744667#comment-17744667
]
ASF GitHub Bot commented on YARN-9877:
--------------------------------------
hadoop-yetus commented on PR #5784:
URL: https://github.com/apache/hadoop/pull/5784#issuecomment-1642271707
:confetti_ball: **+1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 38s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 44m 14s | | trunk passed |
| +1 :green_heart: | compile | 1m 3s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | compile | 1m 0s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | checkstyle | 0m 59s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 4s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 0s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 54s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 2m 1s | | trunk passed |
| +1 :green_heart: | shadedclient | 34m 21s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 51s | | the patch passed |
| +1 :green_heart: | compile | 0m 54s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javac | 0m 54s | | the patch passed |
| +1 :green_heart: | compile | 0m 50s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | javac | 0m 50s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 45s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 52s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 46s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 41s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 1m 55s | | the patch passed |
| +1 :green_heart: | shadedclient | 35m 9s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 100m 17s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. |
| | | 232m 55s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5784/7/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5784 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 5426d334c7ba 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / b9f9a2c9f778647df2b191cae5c9da15a77c0ee5 |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5784/7/testReport/ |
| Max. process+thread count | 933 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5784/7/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
This message was automatically generated.
> Intermittent TIME_OUT of LogAggregationReport
> ---------------------------------------------
>
> Key: YARN-9877
> URL: https://issues.apache.org/jira/browse/YARN-9877
> Project: Hadoop YARN
> Issue Type: Bug
> Components: log-aggregation, resourcemanager, yarn
> Affects Versions: 3.0.3, 3.3.0, 3.2.1, 3.1.3
> Reporter: Adam Antal
> Assignee: Adam Antal
> Priority: Major
> Labels: pull-request-available
> Attachments: YARN-9877.001.patch
>
>
> I noticed intermittent TIME_OUT results in some downstream
> log-aggregation-based tests.
> Steps to reproduce:
> - Run an MR job:
> {code}
> hadoop jar hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep -Dmapreduce.job.queuename=root.default -m 10 -r 10 -mt 5000 -rt 5000
> {code}
> - Suppose the AM requests more containers, but as soon as they are
> allocated the AM realizes it doesn't need them. Each such container's state
> changes are: ALLOCATED -> ACQUIRED -> RELEASED.
> Suppose these extra containers are allocated on a different node from the
> one running the other 21 (AM + 10 mapper + 10 reducer) containers.
> - All the containers finish successfully, and so does the app. The log
> aggregation status for the whole app seemingly gets stuck in the RUNNING
> state.
> - After a while the final log aggregation status for the app changes to
> TIME_OUT.
> Root cause:
> - As the unused containers go through their state transitions in the RM's
> internal representation, {{RMAppImpl$AppRunningOnNodeTransition}}'s
> transition function is called. This calls
> {{RMAppLogAggregation$addReportIfNecessary}}, which forcefully adds the
> "NOT_START" LogAggregationStatus associated with this NodeId for the app,
> even though the app has no running container on that node.
> - The node's LogAggregationStatus is never updated to "SUCCEEDED" by the
> NodeManager because it does not have any running container on it (note that
> the AM released them immediately after acquisition). The LogAggregationStatus
> remains NOT_START until the timeout is reached. After that point the RM
> aggregates the LogAggregationReports for all the nodes, and although all the
> containers have SUCCEEDED state, one particular node has NOT_START, so the
> final log aggregation status will be TIME_OUT.
> (I crawled the RM UI for the log aggregation statuses, and it was always
> NOT_START for this particular node).
> This situation is unlikely, but it has an estimated failure rate of ~0.8%,
> based on 1500 runs over a year on an unstressed cluster.
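The root cause described in the quoted issue can be illustrated with a small, self-contained sketch. This is a simplified model and not the actual ResourceManager code: the class name LogAggregationSketch, the method finalStatus, the node names, and the reduced Status enum are all hypothetical, and the roll-up rule is only an assumption chosen to reproduce the reported behaviour (one node left in NOT_START turns the app-level status into TIME_OUT once the timeout has passed).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of the reported behaviour; NOT the real RM implementation.
class LogAggregationSketch {

    // Reduced, illustrative subset of per-node log aggregation states.
    enum Status { NOT_START, RUNNING, SUCCEEDED, FAILED, TIME_OUT }

    // Hypothetical roll-up of per-node statuses into one app-level status.
    static Status finalStatus(Map<String, Status> perNode, boolean timedOut) {
        boolean anyFailed = false;
        boolean anyPending = false;
        for (Status s : perNode.values()) {
            if (s == Status.FAILED) {
                anyFailed = true;
            }
            if (s == Status.NOT_START || s == Status.RUNNING) {
                anyPending = true;
            }
        }
        // A node that never reports keeps the app pending until the timeout.
        if (anyPending) {
            return timedOut ? Status.TIME_OUT : Status.RUNNING;
        }
        return anyFailed ? Status.FAILED : Status.SUCCEEDED;
    }

    public static void main(String[] args) {
        Map<String, Status> perNode = new LinkedHashMap<>();
        perNode.put("node1", Status.SUCCEEDED);  // ran the 21 real containers
        perNode.put("node2", Status.NOT_START);  // only saw released containers

        System.out.println(finalStatus(perNode, false)); // RUNNING
        System.out.println(finalStatus(perNode, true));  // TIME_OUT

        // Without the spurious NOT_START entry the app would be SUCCEEDED.
        perNode.remove("node2");
        System.out.println(finalStatus(perNode, true));  // SUCCEEDED
    }
}
```

In this model, avoiding the NOT_START entry for a node that never ran a container of the app is enough to let the app reach SUCCEEDED; whether the actual patch takes that approach is not stated in this message.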
--
This message was sent by Atlassian Jira
(v8.20.10#820010)