Hm, the scenario you described doesn't sound accurate here. I guess it's my
fault for not including logs. I will paste them below. The logs show all
four *inner* entry processors complete without being blocked - we see four
"Entered InnerEntryProcessor" and four "Exiting InnerEntryProcessor" lines.
And yet, one of the OuterEntryProcessors does not seem to get notified that
the inner entry processors have all completed. This is indicated by the
presence of "OuterEntryProcessor with key outer1 waiting for inner future
1" without a corresponding log stating "OuterEntryProcessor with key outer1
done waiting for all inner futures".
-------
2024-10-30T23:27:55.056-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
               : Starting test
2024-10-30T23:27:55.057-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
               : Invoking OuterEntryProcessor with key outer0
2024-10-30T23:27:55.077-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [mmunicationSpi%]
o.a.i.s.c.t.TcpCommunicationSpi          : Accepted incoming communication
connection [locAddr=/127.0.0.1:47101, rmtAddr=/127.0.0.1:61560]
2024-10-30T23:27:55.077-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [mmunicationSpi%]
o.a.i.s.c.t.TcpCommunicationSpi          : Established outgoing
communication connection [locAddr=/127.0.0.1:61560, rmtAddr=/127.0.0.1:47101
]
2024-10-30T23:27:55.081-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
               : Invoked OuterEntryProcessor with key outer0
2024-10-30T23:27:55.081-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
               : Invoking OuterEntryProcessor with key outer1
2024-10-30T23:27:55.082-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
               : Invoked OuterEntryProcessor with key outer1
2024-10-30T23:27:55.082-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
               : Done invoking OuterEntryProcessors. Waiting for them to
complete...
2024-10-30T23:27:55.082-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
               : Test Runner waiting for outer future 0
2024-10-30T23:27:55.093-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
               : Entered OuterEntryProcessor with key outer1. Sleeping
before further actions...
2024-10-30T23:27:55.093-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
               : Entered OuterEntryProcessor with key outer0. Sleeping
before further actions...
2024-10-30T23:27:55.195-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
               : Invoking with inner key: outer1-inner0
2024-10-30T23:27:55.195-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
               : Invoking with inner key: outer0-inner0
2024-10-30T23:27:55.209-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
               : Invoked with inner key: outer1-inner0
2024-10-30T23:27:55.209-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
               : Invoked with inner key: outer0-inner0
2024-10-30T23:27:55.209-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
               : Invoking with inner key: outer1-inner1
2024-10-30T23:27:55.209-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
               : Invoking with inner key: outer0-inner1
2024-10-30T23:27:55.209-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
               : Invoked with inner key: outer0-inner1
2024-10-30T23:27:55.209-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
               : Invoked with inner key: outer1-inner1
2024-10-30T23:27:55.209-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
               : OuterEntryProcessor with key outer1 waiting for inner
future 0
2024-10-30T23:27:55.209-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
               : OuterEntryProcessor with key outer0 waiting for inner
future 0
2024-10-30T23:27:55.209-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [e-1-#2%server1%] c.p.i.BlockedThreadTest
               : Entered InnerEntryProcessor with inner key: outer0-inner0
2024-10-30T23:27:55.209-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [e-1-#2%server1%] c.p.i.BlockedThreadTest
               : Exiting InnerEntryProcessor with inner key: outer0-inner0
2024-10-30T23:27:55.209-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [e-0-#1%server1%] c.p.i.BlockedThreadTest
               : Entered InnerEntryProcessor with inner key: outer0-inner1
2024-10-30T23:27:55.209-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [e-0-#1%server1%] c.p.i.BlockedThreadTest
               : Exiting InnerEntryProcessor with inner key: outer0-inner1
2024-10-30T23:27:55.210-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [6-#298%server3%] c.p.i.BlockedThreadTest
               : Entered InnerEntryProcessor with inner key: outer1-inner1
2024-10-30T23:27:55.210-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [6-#298%server3%] c.p.i.BlockedThreadTest
               : Exiting InnerEntryProcessor with inner key: outer1-inner1
2024-10-30T23:27:55.215-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
               : OuterEntryProcessor with key outer0 waiting for inner
future 1
2024-10-30T23:27:55.216-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
               : OuterEntryProcessor with key outer0 done waiting for all
inner futures
2024-10-30T23:27:55.216-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
               : Entered InnerEntryProcessor with inner key: outer1-inner0
2024-10-30T23:27:55.216-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
               : Exiting InnerEntryProcessor with inner key: outer1-inner0
2024-10-30T23:27:55.216-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
               : Test Runner waiting for outer future 1
2024-10-30T23:27:55.217-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
               : OuterEntryProcessor with key outer1 waiting for inner
future 1
2024-10-30T23:27:59.621-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [-notifier-timer]
o.a.i.i.p.c.GridUpdateNotifier           : Your version is up to date.
2024-10-30T23:28:01.283-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [-notifier-timer]
o.a.i.i.p.c.GridUpdateNotifier           : Your version is up to date.
2024-10-30T23:28:02.877-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [-notifier-timer]
o.a.i.i.p.c.GridUpdateNotifier           : Your version is up to date.
2024-10-30T23:28:04.399-04:00  INFO 17056 ---
[ignite-blocked-thread-sample] [-notifier-timer]
o.a.i.i.p.c.GridUpdateNotifier           : Your version is up to date.
2024-10-30T23:28:08.508-04:00 ERROR 17056 ---
[ignite-blocked-thread-sample] [%-#118%server1%] o.a.i.i.u.t.G
               : Blocked system-critical thread has been detected. This can
lead to cluster-wide undefined behaviour [workerName=sys-stripe-6,
threadName=sys-stripe-6-#7%server1%, blockedFor=13s]
-------

Thanks for the advice - we are probably using entry processors in a way
originally intended for compute jobs with transactions, and we will
probably move towards that direction eventually. But to learn from this and
prevent future issues, I'd like to understand exactly what the problem is,
and with these logs, I can't understand how it's a deadlock.

- Raymond


On Wed, Oct 30, 2024 at 7:33 AM Pavel Tupitsyn <ptupit...@apache.org> wrote:

> Looks like a deadlock to me.
>
> Processor 1 invokes processor 2 and blocks on a future, stripe thread T is
> blocked
> Processor 2 invokes processor 3, which needs stripe thread T and waits
> forever
>
> General advice - do not block striped pool threads. Avoid IO, waiting for
> futures, etc. An entry processor is supposed to perform a local computation
> and return quickly.
>
> On Wed, Oct 30, 2024 at 7:05 AM Raymond Liu <philosob...@gmail.com> wrote:
>
>> Hi team,
>>
>> We see a failure with nested asynchronous entry processor calls where the 
>> nested calls appear to complete, but
>> the parent entry processor is not aware the child entry processor is done.
>>
>> In this sample test setup...
>> 1. a test runner invokes an "outer" entry processor twice asynchronously.
>> 2. The two outer entry processor invocations each invoke two inner entry 
>> processors asynchronously.
>> 3. All four inner entry processor invocations appear to complete.
>> 4. One outer entry processor is stuck waiting (via IgniteFuture.get()) for 
>> the last inner entry processor to complete.
>>    The program hangs, prints thread dumps, and prints the following error:
>>    "Blocked system-critical thread has been detected. This can lead to 
>> cluster-wide undefined behaviour [workerName=sys-stripe-6, 
>> threadName=sys-stripe-6-#7%server1%, blockedFor=13s]"
>>
>> I've uploaded this in a Github repo called "ignite-blocked-thread-sample": 
>> https://github.com/Philosobyte/ignite-blocked-thread-sample/blob/main/src/test/java/com/philosobyte/igniteblockedthreadsample/BlockedThreadTest.java
>>
>> While this sample uses asynchronous invocations, a version of this test with 
>> synchronous invocations runs into the same
>> issue.
>>
>> This appears to happen only when there is more than one server node and the 
>> nested invocation goes to a different
>> server node than the outer invocation. A debug breakpoint shows that, on the 
>> server executing the inner entry processor,
>> the sys-stripe thread pool is idle; but on the server executing the outer 
>> entry processor, there is one sys-stripe
>> thread pool waiting on IgniteFuture.get().
>>
>> We originally suspected a deadlock, but this error also occurs when we try 
>> setting thread pool sizes to a large number.
>> This almost appears to be a failure to communicate that a future should be 
>> complete.
>>
>> Nested compute tasks, such as `IgniteCallable`s and `IgniteClosure`s 
>> executing on the public thread pool, do not run
>> into this problem. So, we worked around this issue by wrapping our entry 
>> processors inside `IgniteCallable`s. That alone
>> was already enough to resolve the issue; of course, at higher scale, we'd 
>> run into deadlocks, so we also followed the
>> advice in the "thread pools tuning" documentation page about using a custom 
>> thread pool.
>>
>> Regardless, we would like to make sure we understand the issue. Can anyone 
>> think of an explanation? Is this expected
>> behavior or possibly a bug?
>>
>> Thank you,
>> Raymond
>>
>>

Reply via email to