Looks like a deadlock to me. Processor 1 invokes processor 2 and blocks on a future, so stripe thread T is blocked. Processor 2 then invokes processor 3, which needs stripe thread T, and waits forever.
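Roughly, the problematic shape looks like this (a minimal sketch, not the code from your repo; the cache name, key arithmetic, and class names are all invented):

import javax.cache.processor.MutableEntry;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.resources.IgniteInstanceResource;

public class NestedInvokeSketch {
    /** Inner processor: a trivial local update that returns quickly. */
    public static class InnerProcessor implements CacheEntryProcessor<Integer, Integer, Void> {
        @Override
        public Void process(MutableEntry<Integer, Integer> entry, Object... args) {
            entry.setValue(entry.exists() ? entry.getValue() + 1 : 1);
            return null;
        }
    }

    /**
     * Outer processor: runs on a sys-stripe thread and then parks that thread
     * on the future of a nested invocation. If anything needed to complete
     * that future (e.g. a further nested invocation, or the response handling
     * for it) maps back to this stripe, the get() never returns.
     */
    public static class OuterProcessor implements CacheEntryProcessor<Integer, Integer, Void> {
        @IgniteInstanceResource
        private transient Ignite ignite;

        @Override
        public Void process(MutableEntry<Integer, Integer> entry, Object... args) {
            IgniteCache<Integer, Integer> cache = ignite.cache("someCache");

            // Anti-pattern: synchronous wait on a striped pool thread.
            cache.invokeAsync(entry.getKey() + 1, new InnerProcessor()).get();

            return null;
        }
    }
}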
General advice: do not block striped pool threads. Avoid I/O, waiting on futures, etc. An entry processor is supposed to perform a local computation and return quickly.

On Wed, Oct 30, 2024 at 7:05 AM Raymond Liu <philosob...@gmail.com> wrote:

> Hi team,
>
> We see a failure with nested asynchronous entry processor calls where the
> nested calls appear to complete, but the parent entry processor is not aware
> that the child entry processor is done.
>
> In this sample test setup:
> 1. A test runner invokes an "outer" entry processor twice asynchronously.
> 2. The two outer entry processor invocations each invoke two inner entry
>    processors asynchronously.
> 3. All four inner entry processor invocations appear to complete.
> 4. One outer entry processor is stuck waiting (via IgniteFuture.get()) for
>    the last inner entry processor to complete.
>
> The program hangs, prints thread dumps, and prints the following error:
> "Blocked system-critical thread has been detected. This can lead to
> cluster-wide undefined behaviour [workerName=sys-stripe-6,
> threadName=sys-stripe-6-#7%server1%, blockedFor=13s]"
>
> I've uploaded this in a GitHub repo called "ignite-blocked-thread-sample":
> https://github.com/Philosobyte/ignite-blocked-thread-sample/blob/main/src/test/java/com/philosobyte/igniteblockedthreadsample/BlockedThreadTest.java
>
> While this sample uses asynchronous invocations, a version of this test with
> synchronous invocations runs into the same issue.
>
> This appears to happen only when there is more than one server node and the
> nested invocation goes to a different server node than the outer invocation.
> A debug breakpoint shows that, on the server executing the inner entry
> processor, the sys-stripe thread pool is idle; but on the server executing
> the outer entry processor, there is one sys-stripe thread waiting on
> IgniteFuture.get().
>
> We originally suspected a deadlock, but this error also occurs when we try
> setting thread pool sizes to a large number. This almost appears to be a
> failure to communicate that a future should be complete.
>
> Nested compute tasks, such as `IgniteCallable`s and `IgniteClosure`s
> executing on the public thread pool, do not run into this problem. So, we
> worked around this issue by wrapping our entry processors inside
> `IgniteCallable`s. That alone was already enough to resolve the issue; of
> course, at higher scale, we'd run into deadlocks, so we also followed the
> advice in the "thread pools tuning" documentation page about using a custom
> thread pool.
>
> Regardless, we would like to make sure we understand the issue. Can anyone
> think of an explanation? Is this expected behavior or possibly a bug?
>
> Thank you,
> Raymond
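For reference, the workaround you describe (doing the blocking orchestration in a compute task on a pool other than the striped pool) would look roughly like the sketch below. The pool name, pool size, cache name, and class names are invented, and InnerProcessor is the one from the sketch above:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.configuration.ExecutorConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.lang.IgniteCallable;
import org.apache.ignite.lang.IgniteFuture;
import org.apache.ignite.resources.IgniteInstanceResource;

public class CustomPoolSketch {
    /** Register a dedicated executor so nested tasks cannot starve the public pool. */
    public static IgniteConfiguration configure() {
        return new IgniteConfiguration()
            .setExecutorConfiguration(new ExecutorConfiguration("nestedPool").setSize(16));
    }

    /**
     * The "outer" work as a callable: it may wait on futures because it runs
     * on the custom pool, not on a sys-stripe thread.
     */
    public static class OuterTask implements IgniteCallable<Void> {
        @IgniteInstanceResource
        private transient Ignite ignite;

        private final int key;

        public OuterTask(int key) {
            this.key = key;
        }

        @Override
        public Void call() {
            IgniteCache<Integer, Integer> cache = ignite.cache("someCache");

            // Nested entry processor invocations; waiting here does not block
            // the striped pool, so cache message processing can make progress.
            cache.invokeAsync(key + 1, new NestedInvokeSketch.InnerProcessor()).get();
            cache.invokeAsync(key + 2, new NestedInvokeSketch.InnerProcessor()).get();

            return null;
        }
    }

    /** Submit the outer work to the custom pool instead of running it as an entry processor. */
    public static IgniteFuture<Void> submit(Ignite ignite, int key) {
        return ignite.compute().withExecutor("nestedPool").callAsync(new OuterTask(key));
    }
}

The point is simply that any code which waits on an IgniteFuture runs on a pool you control rather than on the striped pool.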