Hi team,

We are seeing a failure with nested asynchronous entry processor calls: the nested (inner) invocations appear to complete, but the parent (outer) entry processor never finds out that they are done.
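To make the structure concrete, here is a stripped-down sketch of the pattern. The class names, cache name, and keys are made up for illustration and do not match the repo exactly; the real test is in the repo linked below.

```java
import javax.cache.processor.EntryProcessorException;
import javax.cache.processor.MutableEntry;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.lang.IgniteFuture;

// Outer entry processor: runs on a sys-stripe thread of the node owning the outer key.
public class OuterProcessor implements CacheEntryProcessor<Integer, String, Void> {
    @Override
    public Void process(MutableEntry<Integer, String> entry, Object... args)
            throws EntryProcessorException {
        Ignite ignite = Ignition.localIgnite();
        IgniteCache<Integer, String> innerCache = ignite.cache("innerCache");

        // Invoke inner entry processors on keys that may live on a different node,
        // then block the sys-stripe thread until they finish.
        IgniteFuture<Void> fut1 = innerCache.invokeAsync(1, new InnerProcessor());
        IgniteFuture<Void> fut2 = innerCache.invokeAsync(2, new InnerProcessor());
        fut1.get();
        fut2.get(); // one of these get() calls never returns
        return null;
    }
}

// Inner entry processor: appears to complete on the remote node,
// but the outer future above is never completed.
class InnerProcessor implements CacheEntryProcessor<Integer, String, Void> {
    @Override
    public Void process(MutableEntry<Integer, String> entry, Object... args)
            throws EntryProcessorException {
        entry.setValue("updated");
        return null;
    }
}
```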
In the sample test setup:

1. A test runner invokes an "outer" entry processor twice, asynchronously.
2. Each of the two outer invocations asynchronously invokes two "inner" entry processors.
3. All four inner invocations appear to complete.
4. One outer entry processor is stuck waiting (via IgniteFuture.get()) for the last inner entry processor to complete.

The program hangs, prints thread dumps, and logs the following error:

"Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-6, threadName=sys-stripe-6-#7%server1%, blockedFor=13s]"

I've uploaded the test in a GitHub repo called "ignite-blocked-thread-sample": https://github.com/Philosobyte/ignite-blocked-thread-sample/blob/main/src/test/java/com/philosobyte/igniteblockedthreadsample/BlockedThreadTest.java

While this sample uses asynchronous invocations, a version of the test with synchronous invocations runs into the same issue. It appears to happen only when there is more than one server node and the nested invocation lands on a different server node than the outer invocation. A debug breakpoint shows that, on the server executing the inner entry processor, the sys-stripe thread pool is idle, while on the server executing the outer entry processor, one sys-stripe thread is blocked on IgniteFuture.get().

We originally suspected a deadlock, but the error still occurs even when we set the thread pool sizes to a large number. It almost looks like a failure to communicate that a future should be completed. Nested compute tasks, such as `IgniteCallable`s and `IgniteClosure`s executing on the public thread pool, do not run into this problem. So we worked around the issue by wrapping our entry processors inside `IgniteCallable`s. That alone was enough to resolve it; of course, at higher scale we would run into deadlocks from exhausting the public pool, so we also followed the advice in the "thread pools tuning" documentation page about using a custom thread pool. (A rough sketch of the workaround is in the P.S. below.)

Regardless, we would like to make sure we understand the issue. Can anyone think of an explanation? Is this expected behavior, or possibly a bug?

Thank you,
Raymond
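P.S. For anyone curious, here is roughly what the workaround looks like. The executor name, pool size, cache name, and keys are illustrative, `InnerProcessor` is the same inner entry processor as in the sketch above, and the custom executor follows the "thread pools tuning" documentation page.

```java
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.ExecutorConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.lang.IgniteFuture;

public class Workaround {

    // Register a custom executor on every server node so the nested work
    // does not compete with (or starve) the public pool at higher scale.
    static IgniteConfiguration serverConfiguration() {
        return new IgniteConfiguration()
            .setExecutorConfiguration(new ExecutorConfiguration("nestedInvokePool").setSize(16));
    }

    // Called from inside the outer entry processor instead of invokeAsync():
    // the inner invoke() now runs inside an IgniteCallable on the custom pool,
    // so no sys-stripe thread blocks on the nested future.
    static IgniteFuture<Void> invokeInnerViaCallable(int innerKey) {
        return Ignition.localIgnite()
            .compute()
            .withExecutor("nestedInvokePool")
            .callAsync(() -> {
                IgniteCache<Integer, String> innerCache =
                    Ignition.localIgnite().cache("innerCache");
                innerCache.invoke(innerKey, new InnerProcessor());
                return null;
            });
    }
}
```

As far as we can tell, this is enough because the nested cache operation no longer runs on (or blocks) a sys-stripe thread, which matches our observation that nested compute tasks are unaffected.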