Hi team,

We are seeing a failure with nested asynchronous entry processor calls: the nested (inner) invocations appear to complete, but the parent (outer) entry processor never finds out that they are done.
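To make the structure concrete, here is a stripped-down sketch of the pattern. The class names, cache name, and keys are made up for illustration and do not match the repo exactly; the real test is in the repo linked below.

```java
import javax.cache.processor.EntryProcessorException;
import javax.cache.processor.MutableEntry;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.lang.IgniteFuture;

// Outer entry processor: runs on a sys-stripe thread of the node owning the outer key.
public class OuterProcessor implements CacheEntryProcessor<Integer, String, Void> {
    @Override
    public Void process(MutableEntry<Integer, String> entry, Object... args)
            throws EntryProcessorException {
        Ignite ignite = Ignition.localIgnite();
        IgniteCache<Integer, String> innerCache = ignite.cache("innerCache");

        // Invoke inner entry processors on keys that may live on a different node,
        // then block the sys-stripe thread until they finish.
        IgniteFuture<Void> fut1 = innerCache.invokeAsync(1, new InnerProcessor());
        IgniteFuture<Void> fut2 = innerCache.invokeAsync(2, new InnerProcessor());
        fut1.get();
        fut2.get(); // one of these get() calls never returns
        return null;
    }
}

// Inner entry processor: appears to complete on the remote node,
// but the outer future above is never completed.
class InnerProcessor implements CacheEntryProcessor<Integer, String, Void> {
    @Override
    public Void process(MutableEntry<Integer, String> entry, Object... args)
            throws EntryProcessorException {
        entry.setValue("updated");
        return null;
    }
}
```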
In the sample test setup:

1. A test runner invokes an "outer" entry processor twice, asynchronously.
2. Each of the two outer invocations asynchronously invokes two "inner" entry processors.
3. All four inner invocations appear to complete.
4. One outer entry processor is stuck waiting (via IgniteFuture.get()) for the last inner entry processor to complete.

The program hangs, prints thread dumps, and logs the following error:

"Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-6, threadName=sys-stripe-6-#7%server1%, blockedFor=13s]"

I've uploaded the test in a GitHub repo called "ignite-blocked-thread-sample": https://github.com/Philosobyte/ignite-blocked-thread-sample/blob/main/src/test/java/com/philosobyte/igniteblockedthreadsample/BlockedThreadTest.java

While this sample uses asynchronous invocations, a version of the test with synchronous invocations runs into the same issue. It appears to happen only when there is more than one server node and the nested invocation lands on a different server node than the outer invocation. A debug breakpoint shows that, on the server executing the inner entry processor, the sys-stripe thread pool is idle, while on the server executing the outer entry processor, one sys-stripe thread is blocked on IgniteFuture.get().

We originally suspected a deadlock, but the error still occurs even when we set the thread pool sizes to a large number. It almost looks like a failure to communicate that a future should be completed. Nested compute tasks, such as `IgniteCallable`s and `IgniteClosure`s executing on the public thread pool, do not run into this problem. So we worked around the issue by wrapping our entry processors inside `IgniteCallable`s. That alone was enough to resolve it; of course, at higher scale we would run into deadlocks from exhausting the public pool, so we also followed the advice in the "thread pools tuning" documentation page about using a custom thread pool. (A rough sketch of the workaround is in the P.S. below.)

Regardless, we would like to make sure we understand the issue. Can anyone think of an explanation? Is this expected behavior, or possibly a bug?

Thank you,
Raymond
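P.S. For anyone curious, here is roughly what the workaround looks like. The executor name, pool size, cache name, and keys are illustrative, `InnerProcessor` is the same inner entry processor as in the sketch above, and the custom executor follows the "thread pools tuning" documentation page.

```java
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.ExecutorConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.lang.IgniteFuture;

public class Workaround {

    // Register a custom executor on every server node so the nested work
    // does not compete with (or starve) the public pool at higher scale.
    static IgniteConfiguration serverConfiguration() {
        return new IgniteConfiguration()
            .setExecutorConfiguration(new ExecutorConfiguration("nestedInvokePool").setSize(16));
    }

    // Called from inside the outer entry processor instead of invokeAsync():
    // the inner invoke() now runs inside an IgniteCallable on the custom pool,
    // so no sys-stripe thread blocks on the nested future.
    static IgniteFuture<Void> invokeInnerViaCallable(int innerKey) {
        return Ignition.localIgnite()
            .compute()
            .withExecutor("nestedInvokePool")
            .callAsync(() -> {
                IgniteCache<Integer, String> innerCache =
                    Ignition.localIgnite().cache("innerCache");
                innerCache.invoke(innerKey, new InnerProcessor());
                return null;
            });
    }
}
```

As far as we can tell, this is enough because the nested cache operation no longer runs on (or blocks) a sys-stripe thread, which matches our observation that nested compute tasks are unaffected.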