[jira] [Commented] (KAFKA-16349) ShutdownableThread fails build by calling Exit with race condition

2024-03-25 Thread Greg Harris (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830590#comment-17830590
 ] 

Greg Harris commented on KAFKA-16349:
-

I added a tactical fix for the Exit class in my PR to resolve this bug. I'll 
pursue this refactor in a separate ticket KAFKA-16420.

> ShutdownableThread fails build by calling Exit with race condition
> --
>
> Key: KAFKA-16349
> URL: https://issues.apache.org/jira/browse/KAFKA-16349
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 3.8.0
>Reporter: Greg Harris
>Priority: Minor
>
> `ShutdownableThread` calls `Exit.exit()` when the thread's operation throws 
> FatalExitError. In normal operation, this calls System.exit, and exits the 
> process. In tests, the exit procedure is masked with Exit.setExitProcedure to 
> prevent tests that encounter a FatalExitError from crashing the test JVM.
> Masking of exit procedures is usually done in BeforeEach/AfterEach 
> annotations, with the exit procedures cleaned up immediately after the test 
> finishes. If the body of the test creates a ShutdownableThread that outlives 
> the test, such as by omitting `ShutdownableThread#awaitShutdown`, by having 
> `ShutdownableThread#awaitShutdown` be interrupted by a test timeout, or by a 
> race condition between `Exit.resetExitProcedure` and `Exit.exit`, then 
> System.exit() can be called erroneously.
>  
> {noformat}
> // First, in the test thread:
> Exit.setExitProcedure(...)
> try {
> new ShutdownableThread(...).start()
> } finally {
> Exit.resetExitProcedure()
> }
> // Second, in the ShutdownableThread:
> try {
> throw new FatalExitError(...)
> } catch (FatalExitError e) {
> Exit.exit(...) // Calls real System.exit()
> }{noformat}
>  
> This can be resolved by one of the following:
>  # Eliminate FatalExitError usages in code when setExitProcedure is in-use
>  # Eliminate the Exit.exit call from ShutdownableThread, and instead 
> propagate this error to another thread to handle without a race-condition
>  # Eliminate resetExitProcedure by refactoring Exit to be non-static
> FatalExitError is in use in a small number of places, but may be difficult to 
> eliminate:
>  * FinalizedFeatureChangeListener
>  * InterBrokerSendThread
>  * TopicBasedRemoteLogMetadataManager
> There are many other places where Exit is called from a background thread, 
> including some implementations of ShutdownableThread which don't use 
> FatalExitError.
> The effect of this bug is that the build is flaky, as race 
> conditions/timeouts in tests can cause the gradle executors to exit with 
> status code 1, which has happened 26 times in the last 28 days. I have not 
> yet been able to confirm this bug is happening in other tests, but I do have 
> a deterministic reproduction case with the exact same symptoms:
> {noformat}
> Gradle Test Run :core:test > Gradle Test Executor 38 > ShutdownableThreadTest 
> > testShutdownWhenTestTimesOut(boolean) > 
> "testShutdownWhenTestTimesOut(boolean).isInterruptible=true" SKIPPED
> FAILURE: Build failed with an exception.
> * What went wrong:
> Execution failed for task ':core:test'.
> > Process 'Gradle Test Executor 38' finished with non-zero exit value 1
>   This problem might be caused by incorrect test process configuration.
>   For more on test execution, please refer to 
> https://docs.gradle.org/8.6/userguide/java_testing.html#sec:test_execution in 
> the Gradle documentation.{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16349) ShutdownableThread fails build by calling Exit with race condition

2024-03-22 Thread Greg Harris (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830010#comment-17830010
 ] 

Greg Harris commented on KAFKA-16349:
-

I did some more exploration here.
 # Some of the "exit 1" failures I included in my count earlier are caused by 
https://issues.apache.org/jira/browse/KAFKA-15343 and so was an overestimate. 
This particular failure doesn't happen all that often.
 # There are more places affected than just those using FatalExitError.
 ## I found the ReplicaVerificationTool calls Exit from 
ReplicaBuffer#verifyChecksum, which is within a ShutdownableThread#doWork 
implementation.
 ## When a test uses org.junit.rules.Timeout, it runs the test in a separate 
thread. Fortunately, most of these usages are in streams which doesn't use Exit.
 # There are _many_ tests of classes that use Exit without mocking, that happen 
to deterministically never test those code branches (e.g. testing the 
CommandDefaultOptions subclasses).
 ## The signatures of classes don't make it clear to developers that they 
sometimes call Exit
 ## The branches which call Exit are less-tested, possibly because they require 
additional mocking, and are not visible from the signature.
 # Some usages of Exit have an extremely deep call-stack, and are therefore 
difficult to refactor. Here are some possible approaches, which can be applied 
on a case-by-case basis:
 ## Passing an instance down the call-stack
 ## Throwing a checked exception up the call-stack
 ## Throwing an unchecked exception up the call-stack (e.g. FatalExitError)
 ## Changing the return type from `void` to `int` or T to Optional and use 
c-like error handling

I think that it would be good to move away from the pattern of hiding a call to 
Exit deep in the call stack, so the self-documenting function parameter and 
checked exceptions are good options to recommend. In particular, I think I like 
using exceptions to propagate the need to exit up the call-stack, giving an 
opportunity for finally clauses to run. Then the top-level of each thread/entry 
point can handle the exception and call out. The unchecked exception has the 
chance to not be handled further up the call stack, and the c-like error 
handling isn't idiomatic Java.

Because of the number of calls to Exit and the breadth of changes needed, it 
isn't reasonable to perform this migration in one step, so both the old and new 
solution will need to coexist. I prototyped a "SafeExit" class that I feel good 
about promoting as the replacement for Exit. It behaves similar to the Time 
class, and can be used by some classes while others still use Exit.
 # I'll propose SafeExit in a PR soon, with a test that demonstrates how it 
should be used, along with a small migration guide
 # I'll do some more research to try and figure out which call-sites are most 
likely to be causing the failed builds
 # Myself and other contributors can address call-sites for Exit over time, 
prioritizing the places most likely to have race conditions
 # After all call-sites for Exit are removed, we can remove the class entirely

> ShutdownableThread fails build by calling Exit with race condition
> --
>
> Key: KAFKA-16349
> URL: https://issues.apache.org/jira/browse/KAFKA-16349
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 3.8.0
>Reporter: Greg Harris
>Priority: Minor
>
> `ShutdownableThread` calls `Exit.exit()` when the thread's operation throws 
> FatalExitError. In normal operation, this calls System.exit, and exits the 
> process. In tests, the exit procedure is masked with Exit.setExitProcedure to 
> prevent tests that encounter a FatalExitError from crashing the test JVM.
> Masking of exit procedures is usually done in BeforeEach/AfterEach 
> annotations, with the exit procedures cleaned up immediately after the test 
> finishes. If the body of the test creates a ShutdownableThread that outlives 
> the test, such as by omitting `ShutdownableThread#awaitShutdown`, by having 
> `ShutdownableThread#awaitShutdown` be interrupted by a test timeout, or by a 
> race condition between `Exit.resetExitProcedure` and `Exit.exit`, then 
> System.exit() can be called erroneously.
>  
> {noformat}
> // First, in the test thread:
> Exit.setExitProcedure(...)
> try {
> new ShutdownableThread(...).start()
> } finally {
> Exit.resetExitProcedure()
> }
> // Second, in the ShutdownableThread:
> try {
> throw new FatalExitError(...)
> } catch (FatalExitError e) {
> Exit.exit(...) // Calls real System.exit()
> }{noformat}
>  
> This can be resolved by one of the following:
>  # Eliminate FatalExitError usages in code when setExitProcedure is in-use
>  # Eliminate the Exit.exit call from ShutdownableThread, and instead 
> propagate this error to 

[jira] [Commented] (KAFKA-16349) ShutdownableThread fails build by calling Exit with race condition

2024-03-06 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824207#comment-17824207
 ] 

Ismael Juma commented on KAFKA-16349:
-

Good catch.

> ShutdownableThread fails build by calling Exit with race condition
> --
>
> Key: KAFKA-16349
> URL: https://issues.apache.org/jira/browse/KAFKA-16349
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 3.8.0
>Reporter: Greg Harris
>Priority: Minor
>
> `ShutdownableThread` calls `Exit.exit()` when the thread's operation throws 
> FatalExitError. In normal operation, this calls System.exit, and exits the 
> process. In tests, the exit procedure is masked with Exit.setExitProcedure to 
> prevent tests that encounter a FatalExitError from crashing the test JVM.
> Masking of exit procedures is usually done in BeforeEach/AfterEach 
> annotations, with the exit procedures cleaned up immediately after the test 
> finishes. If the body of the test creates a ShutdownableThread that outlives 
> the test, such as by omitting `ShutdownableThread#awaitShutdown`, by having 
> `ShutdownableThread#awaitShutdown` be interrupted by a test timeout, or by a 
> race condition between `Exit.resetExitProcedure` and `Exit.exit`, then 
> System.exit() can be called erroneously.
>  
> {noformat}
> // First, in the test thread:
> Exit.setExitProcedure(...)
> try {
> new ShutdownableThread(...).start()
> } finally {
> Exit.resetExitProcedure()
> }
> // Second, in the ShutdownableThread:
> try {
> throw new FatalExitError(...)
> } catch (FatalExitError e) {
> Exit.exit(...) // Calls real System.exit()
> }{noformat}
>  
> This can be resolved by one of the following:
>  # Eliminate FatalExitError usages in code when setExitProcedure is in-use
>  # Eliminate the Exit.exit call from ShutdownableThread, and instead 
> propagate this error to another thread to handle without a race-condition
>  # Eliminate resetExitProcedure by refactoring Exit to be non-static
> FatalExitError is in use in a small number of places, but may be difficult to 
> eliminate:
>  * FinalizedFeatureChangeListener
>  * InterBrokerSendThread
>  * TopicBasedRemoteLogMetadataManager
> It appears that every other use of Exit.setExitProcedure/Exit.exit is done on 
> a single thread, so ShutdownableThread is the only place where this race 
> condition is present.
> The effect of this bug is that the build is flaky, as race 
> conditions/timeouts in tests can cause the gradle executors to exit with 
> status code 1, which has happened 26 times in the last 28 days. I have not 
> yet been able to confirm this bug is happening in other tests, but I do have 
> a deterministic reproduction case with the exact same symptoms:
> {noformat}
> Gradle Test Run :core:test > Gradle Test Executor 38 > ShutdownableThreadTest 
> > testShutdownWhenTestTimesOut(boolean) > 
> "testShutdownWhenTestTimesOut(boolean).isInterruptible=true" SKIPPED
> FAILURE: Build failed with an exception.
> * What went wrong:
> Execution failed for task ':core:test'.
> > Process 'Gradle Test Executor 38' finished with non-zero exit value 1
>   This problem might be caused by incorrect test process configuration.
>   For more on test execution, please refer to 
> https://docs.gradle.org/8.6/userguide/java_testing.html#sec:test_execution in 
> the Gradle documentation.{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)