showuon opened a new pull request #10301: URL: https://github.com/apache/kafka/pull/10301
I found the root cause of why the test `shouldNotViolateEosIfOneTaskFailsWithState` sometimes failed with an unexpected committed/uncommitted result. An unexpected rebalance occurs while messages are being committed, and that triggers the failover mechanism. The rebalance happens because we reduced the `max.poll.interval.ms` value for `shouldNotViolateEosIfOneTaskGetsFencedUsingIsolatedAppInstances`, which stalls a thread until `max.poll.interval.ms` is exceeded in order to trigger a rebalance. In the `withState` case there is more to handle (the state and additional topics), which explains why only `shouldNotViolateEosIfOneTaskFailsWithState` is flaky and the other tests are not.

I increased `max.poll.interval.ms` for the `withState` test to fix the flaky test.

I also made some enhancements:

1. Add the failure reason. Currently the failure message looks like `Expected: <[...]>, but: was: <[...]>`, which does not say whether the result is the committed or the uncommitted one, or whether it comes from before or after the injected error. We had to map the stack trace to find out.
2. The `fail()` in `uncaughtException` only fails the stream thread, not the test. Fixed.
3. Set an initial capacity for the `ArrayList` to avoid memory reallocation.
4. Improve the comments, and add the state view for each phase.

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
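
The second enhancement in the description (a `fail()` inside `uncaughtException` fails only the stream thread, not the test) comes down to general Java threading behaviour: an error thrown on a background thread never reaches the thread running the test. The following is a minimal sketch of one common way to surface such a failure, using plain `java.lang.Thread` rather than Kafka Streams. The class name `BackgroundFailureDemo` and the helper `runAndCapture` are hypothetical and are not taken from the PR, so this is not necessarily how the PR fixes it.

```java
import java.util.concurrent.atomic.AtomicReference;

public class BackgroundFailureDemo {

    // Hypothetical helper: runs the task on a separate thread and returns
    // whatever that thread threw, or null if it finished normally.
    static Throwable runAndCapture(Runnable task) throws InterruptedException {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(task, "worker");
        // The handler runs on the dying worker thread. Throwing from here
        // (for example via fail()) would not reach the caller, so the error
        // is recorded instead.
        worker.setUncaughtExceptionHandler((thread, error) -> failure.compareAndSet(null, error));
        worker.start();
        // join() guarantees the handler has finished before the result is read.
        worker.join();
        return failure.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Throwable captured = runAndCapture(() -> {
            throw new IllegalStateException("injected error");
        });
        // The check happens on the calling thread, where a test framework
        // would actually observe an assertion failure.
        if (captured != null) {
            System.out.println("worker failed: " + captured.getMessage());
        }
    }
}
```

In a test, the caller would assert on the captured `Throwable` after the worker has stopped, so the failure is reported by the test itself instead of disappearing with the background thread.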