Looks like a deadlock to me.

1. Processor 1 invokes processor 2 and blocks on a future, so stripe thread T
   is blocked.
2. Processor 2 invokes processor 3, which needs stripe thread T and waits
   forever.

General advice - do not block striped pool threads. Avoid IO, waiting for
futures, etc. An entry processor is supposed to perform a local computation
and return quickly.
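
To illustrate the pattern, here is a minimal sketch using only
java.util.concurrent. It is not Ignite's striped pool implementation: I am
assuming each stripe behaves like a single-threaded executor, and the class
and method names (StripeDeadlockSketch, nestedCallCompletes, run) are made up
for this example. It also collapses the chain to two tasks. A 500 ms timeout
stands in for "waits forever" so the program terminates.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StripeDeadlockSketch {
    // Assumption: two stripes, each modelled as one single-threaded executor.
    static final int STRIPES = 2;

    // Runs an "outer" task on one stripe that blocks on an "inner" task
    // submitted to another (or the same) stripe. Returns true if the inner
    // task finished before the timeout.
    static boolean nestedCallCompletes(ExecutorService[] stripes,
                                       int outerStripe,
                                       int innerStripe) throws Exception {
        Future<Boolean> outer = stripes[outerStripe].submit(() -> {
            Future<String> inner = stripes[innerStripe].submit(() -> "done");
            try {
                // Blocking here occupies the outer stripe's only thread.
                inner.get(500, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                // The inner task never started: its stripe thread is the one
                // that is blocked waiting for it.
                inner.cancel(true);
                return false;
            }
        });
        return outer.get();
    }

    // Creates fresh stripes, runs one scenario, and shuts the stripes down.
    public static boolean run(int outerStripe, int innerStripe) throws Exception {
        ExecutorService[] stripes = new ExecutorService[STRIPES];
        for (int i = 0; i < STRIPES; i++) {
            stripes[i] = Executors.newSingleThreadExecutor();
        }
        try {
            return nestedCallCompletes(stripes, outerStripe, innerStripe);
        } finally {
            for (ExecutorService stripe : stripes) {
                stripe.shutdownNow();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("inner on a different stripe completes: " + run(0, 1));
        System.out.println("inner on the same stripe completes: " + run(0, 0));
    }
}
```

In this sketch the nested call finishes when it lands on a different stripe
and times out when it needs the stripe that the waiting task occupies. Adding
more stripes only lowers the chance of a collision; it does not remove it.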

On Wed, Oct 30, 2024 at 7:05 AM Raymond Liu <philosob...@gmail.com> wrote:

> Hi team,
>
> We see a failure with nested asynchronous entry processor calls where the 
> nested calls appear to complete, but
> the parent entry processor is not aware the child entry processor is done.
>
> In this sample test setup...
> 1. a test runner invokes an "outer" entry processor twice asynchronously.
> 2. The two outer entry processor invocations each invoke two inner entry 
> processors asynchronously.
> 3. All four inner entry processor invocations appear to complete.
> 4. One outer entry processor is stuck waiting (via IgniteFuture.get()) for 
> the last inner entry processor to complete.
>    The program hangs, prints thread dumps, and prints the following error:
>    "Blocked system-critical thread has been detected. This can lead to 
> cluster-wide undefined behaviour [workerName=sys-stripe-6, 
> threadName=sys-stripe-6-#7%server1%, blockedFor=13s]"
>
> I've uploaded this in a Github repo called "ignite-blocked-thread-sample": 
> https://github.com/Philosobyte/ignite-blocked-thread-sample/blob/main/src/test/java/com/philosobyte/igniteblockedthreadsample/BlockedThreadTest.java
>
> While this sample uses asynchronous invocations, a version of this test with 
> synchronous invocations runs into the same
> issue.
>
> This appears to happen only when there is more than one server node and the 
> nested invocation goes to a different
> server node than the outer invocation. A debug breakpoint shows that, on the 
> server executing the inner entry processor,
> the sys-stripe thread pool is idle; but on the server executing the outer 
> entry processor, there is one sys-stripe
> thread waiting on IgniteFuture.get().
>
> We originally suspected a deadlock, but this error also occurs when we try 
> setting thread pool sizes to a large number.
> This almost appears to be a failure to communicate that a future should be 
> complete.
>
> Nested compute tasks, such as `IgniteCallable`s and `IgniteClosure`s 
> executing on the public thread pool, do not run
> into this problem. So, we worked around this issue by wrapping our entry 
> processors inside `IgniteCallable`s. That alone
> was already enough to resolve the issue; of course, at higher scale, we'd run 
> into deadlocks, so we also followed the
> advice in the "thread pools tuning" documentation page about using a custom 
> thread pool.
>
> Regardless, we would like to make sure we understand the issue. Can anyone 
> think of an explanation? Is this expected
> behavior or possibly a bug?
>
> Thank you,
> Raymond
>
>
