> both invocations use sys-stripe-6

Ignite needs sys-stripe-6 to complete the future for the invocation, but
the thread is blocked waiting on that future.

On Thu, Oct 31, 2024 at 3:05 PM Raymond Liu <philosob...@gmail.com> wrote:

> Each node has 32 sys-stripe threads. With 3 servers and 1 client, there
> are a total of 128 sys-stripe threads. 127 out of 128 of those sys-stripe
> threads are parked on StripeConcurrentQueue.take(). The remaining one,
> sys-stripe-6 on server node 1, is parked on IgniteFutureImpl.get().
> Specifically, that future.get is the OuterEntryProcessor on server1 for
> outer key "outer1" waiting for the InnerEntryProcessor on server3 for inner
> key "outer1-inner1" to complete. Logs show that particular inner invocation
> completed on sys-stripe-6 on server3.
>
> Here is the thread state for sys-stripe-6 on server1:
> Thread [name="sys-stripe-6-#7%server1%", id=40, state=WAITING,
> blockCnt=29, waitCnt=6]
>         at java.base@17.0.11/jdk.internal.misc.Unsafe.park(Native Method)
>         at java.base@17.0.11
> /java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
>         at
> app//o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:181)
>         at
> app//o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:144)
>         at
> app//o.a.i.i.util.future.IgniteFutureImpl.get(IgniteFutureImpl.java:152)
>         at
> app//com.philosobyte.igniteblockedthreadsample.BlockedThreadTest$OuterEntryProcessor.process(BlockedThreadTest.java:105)
>         at
> app//com.philosobyte.igniteblockedthreadsample.BlockedThreadTest$OuterEntryProcessor.process(BlockedThreadTest.java:75)
>         at
> app//o.a.i.i.processors.cache.EntryProcessorResourceInjectorProxy.process(EntryProcessorResourceInjectorProxy.java:68)
>         at
> app//o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.runEntryProcessor(GridCacheMapEntry.java:6334)
>         at
> app//o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5650)
>         at
> app//o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5437)
>         at
> app//o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:4320)
>         at
> app//o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.access$5700(BPlusTree.java:4214)
>         at
> app//o.a.i.i.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2292)
>         at
> app//o.a.i.i.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:2182)
>         at
> app//o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1692)
>         at
> app//o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1675)
>         at
> app//o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:424)
>         at
> app//o.a.i.i.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:1903)
>         at
> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2536)
>         at
> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:1996)
>         at
> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1815)
>         at
> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1688)
>         at
> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3179)
>         at
> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$200(GridDhtAtomicCache.java:147)
>         at
> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:270)
>         at
> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:265)
>         at
> app//o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1164)
>         at
> app//o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:605)
>         at
> app//o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:406)
>         at
> app//o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:324)
>         at
> app//o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:112)
>         at
> app//o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:314)
>         at
> app//o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1906)
>         at
> app//o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1527)
>         at
> app//o.a.i.i.managers.communication.GridIoManager.access$5300(GridIoManager.java:242)
>         at
> app//o.a.i.i.managers.communication.GridIoManager$9.execute(GridIoManager.java:1420)
>         at
> app//o.a.i.i.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
>         at
> app//o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:637)
>         at app//o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125)
>         at java.base@17.0.11/java.lang.Thread.run(Thread.java:840)
>
>     Locked synchronizers:
>
> PoolProcessor printed a log showing the contents of the queue:
> 2024-10-31T08:31:33.330-04:00  WARN 34704 ---
> [ignite-blocked-thread-sample] [er-#70%server1%] o.a.i.i.p.p.PoolProcessor
>                : >>> Possible starvation in striped pool.
>     Thread name: sys-stripe-6-#7%server1%
>     Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE,
> topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false,
> msg=GridNearAtomicUpdateResponse [nodeId=null, futId=3, errs=null,
> ret=GridCacheReturn [v=null, cacheObj=null, success=false, invokeRes=true,
> loc=true, cacheId=100355670], remapTopVer=null, nearUpdates=null,
> partId=294, mapping=ArrayList [], nodeLeft=false, super=GridCacheIdMessage
> [cacheId=100355670, super=GridCacheMessage [msgId=42, depInfo=null,
> lastAffChangedTopVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0],
> err=null, skipPrepare=false]]]]]]
>     Deadlock: false
>     Completed: 0
>
>
>
> and here is the thread state for sys-stripe-6 on server 3, which finished
> executing the inner entry processor. The other 126 sys-stripe thread states
> look like this, as well:
> Thread [name="sys-stripe-6-#300%server3%", id=343, state=WAITING,
> blockCnt=17, waitCnt=2]
>         at java.base@17.0.11/jdk.internal.misc.Unsafe.park(Native Method)
>         at java.base@17.0.11
> /java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
>         at
> app//o.a.i.i.util.StripedExecutor$StripeConcurrentQueue.take(StripedExecutor.java:819)
>         at
> app//o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:623)
>         at app//o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125)
>         at java.base@17.0.11/java.lang.Thread.run(Thread.java:840)
>
> This is where I stopped and wrote my first e-mail. I figured that, to
> debug further, I should get some tips about how processor completions are
> normally communicated across the network from the called server to the
> calling server.
>
> It does appear to matter that both invocations use sys-stripe-6 (albeit
> one on server1 and one on server3). The other cross-server invocations
> complete OK, e.g. outer0-inner0 is invoked from sys-stripe-5 on server2 and
> completes on sys-stripe-1 on server1.
> I have attached text logs including the full thread dump, in case that is
> interesting.
>
> - Raymond
>
> On Thu, Oct 31, 2024 at 2:11 AM Pavel Tupitsyn <ptupit...@apache.org>
> wrote:
>
>> Have you tried to get a thread dump to understand which code is blocking
>> the sys-stripe thread?
>>
>> On Thu, Oct 31, 2024 at 5:56 AM Raymond Liu <philosob...@gmail.com>
>> wrote:
>>
>>> Hm, the scenario you described doesn't sound accurate here. I guess it's
>>> my fault for not including logs. I will paste them below. The logs show all
>>> four *inner* entry processors complete without being blocked - we see
>>> four "Entered InnerEntryProcessor" and four "Exiting InnerEntryProcessor"
>>> lines. And yet, one of the OuterEntryProcessors does not seem to get
>>> notified that the inner entry processors have all completed. This is
>>> indicated by the presence of "OuterEntryProcessor with key outer1 waiting
>>> for inner future 1" without a corresponding log stating
>>> "OuterEntryProcessor with key outer1 done waiting for all inner futures".
>>> -------
>>> 2024-10-30T23:27:55.056-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>>>                : Starting test
>>> 2024-10-30T23:27:55.057-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>>>                : Invoking OuterEntryProcessor with key outer0
>>> 2024-10-30T23:27:55.077-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [mmunicationSpi%]
>>> o.a.i.s.c.t.TcpCommunicationSpi          : Accepted incoming communication
>>> connection [locAddr=/127.0.0.1:47101, rmtAddr=/127.0.0.1:61560]
>>> 2024-10-30T23:27:55.077-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [mmunicationSpi%]
>>> o.a.i.s.c.t.TcpCommunicationSpi          : Established outgoing
>>> communication connection [locAddr=/127.0.0.1:61560, rmtAddr=/
>>> 127.0.0.1:47101]
>>> 2024-10-30T23:27:55.081-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>>>                : Invoked OuterEntryProcessor with key outer0
>>> 2024-10-30T23:27:55.081-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>>>                : Invoking OuterEntryProcessor with key outer1
>>> 2024-10-30T23:27:55.082-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>>>                : Invoked OuterEntryProcessor with key outer1
>>> 2024-10-30T23:27:55.082-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>>>                : Done invoking OuterEntryProcessors. Waiting for them to
>>> complete...
>>> 2024-10-30T23:27:55.082-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>>>                : Test Runner waiting for outer future 0
>>> 2024-10-30T23:27:55.093-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>>>                : Entered OuterEntryProcessor with key outer1. Sleeping
>>> before further actions...
>>> 2024-10-30T23:27:55.093-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>>>                : Entered OuterEntryProcessor with key outer0. Sleeping
>>> before further actions...
>>> 2024-10-30T23:27:55.195-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>>>                : Invoking with inner key: outer1-inner0
>>> 2024-10-30T23:27:55.195-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>>>                : Invoking with inner key: outer0-inner0
>>> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>>>                : Invoked with inner key: outer1-inner0
>>> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>>>                : Invoked with inner key: outer0-inner0
>>> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>>>                : Invoking with inner key: outer1-inner1
>>> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>>>                : Invoking with inner key: outer0-inner1
>>> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>>>                : Invoked with inner key: outer0-inner1
>>> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>>>                : Invoked with inner key: outer1-inner1
>>> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>>>                : OuterEntryProcessor with key outer1 waiting for inner
>>> future 0
>>> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>>>                : OuterEntryProcessor with key outer0 waiting for inner
>>> future 0
>>> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [e-1-#2%server1%] c.p.i.BlockedThreadTest
>>>                : Entered InnerEntryProcessor with inner key: outer0-inner0
>>> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [e-1-#2%server1%] c.p.i.BlockedThreadTest
>>>                : Exiting InnerEntryProcessor with inner key: outer0-inner0
>>> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [e-0-#1%server1%] c.p.i.BlockedThreadTest
>>>                : Entered InnerEntryProcessor with inner key: outer0-inner1
>>> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [e-0-#1%server1%] c.p.i.BlockedThreadTest
>>>                : Exiting InnerEntryProcessor with inner key: outer0-inner1
>>> 2024-10-30T23:27:55.210-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [6-#298%server3%] c.p.i.BlockedThreadTest
>>>                : Entered InnerEntryProcessor with inner key: outer1-inner1
>>> 2024-10-30T23:27:55.210-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [6-#298%server3%] c.p.i.BlockedThreadTest
>>>                : Exiting InnerEntryProcessor with inner key: outer1-inner1
>>> 2024-10-30T23:27:55.215-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>>>                : OuterEntryProcessor with key outer0 waiting for inner
>>> future 1
>>> 2024-10-30T23:27:55.216-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>>>                : OuterEntryProcessor with key outer0 done waiting for all
>>> inner futures
>>> 2024-10-30T23:27:55.216-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>>>                : Entered InnerEntryProcessor with inner key: outer1-inner0
>>> 2024-10-30T23:27:55.216-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>>>                : Exiting InnerEntryProcessor with inner key: outer1-inner0
>>> 2024-10-30T23:27:55.216-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>>>                : Test Runner waiting for outer future 1
>>> 2024-10-30T23:27:55.217-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>>>                : OuterEntryProcessor with key outer1 waiting for inner
>>> future 1
>>> 2024-10-30T23:27:59.621-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [-notifier-timer]
>>> o.a.i.i.p.c.GridUpdateNotifier           : Your version is up to date.
>>> 2024-10-30T23:28:01.283-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [-notifier-timer]
>>> o.a.i.i.p.c.GridUpdateNotifier           : Your version is up to date.
>>> 2024-10-30T23:28:02.877-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [-notifier-timer]
>>> o.a.i.i.p.c.GridUpdateNotifier           : Your version is up to date.
>>> 2024-10-30T23:28:04.399-04:00  INFO 17056 ---
>>> [ignite-blocked-thread-sample] [-notifier-timer]
>>> o.a.i.i.p.c.GridUpdateNotifier           : Your version is up to date.
>>> 2024-10-30T23:28:08.508-04:00 ERROR 17056 ---
>>> [ignite-blocked-thread-sample] [%-#118%server1%] o.a.i.i.u.t.G
>>>                : Blocked system-critical thread has been detected. This can
>>> lead to cluster-wide undefined behaviour [workerName=sys-stripe-6,
>>> threadName=sys-stripe-6-#7%server1%, blockedFor=13s]
>>> -------
>>>
>>> Thanks for the advice - we are probably using entry processors in a way
>>> originally intended for compute jobs with transactions, and we will
>>> probably move towards that direction eventually. But to learn from this and
>>> prevent future issues, I'd like to understand exactly what the problem is,
>>> and with these logs, I can't understand how it's a deadlock.
>>>
>>> - Raymond
>>>
>>>
>>> On Wed, Oct 30, 2024 at 7:33 AM Pavel Tupitsyn <ptupit...@apache.org>
>>> wrote:
>>>
>>>> Looks like a deadlock to me.
>>>>
>>>> Processor 1 invokes processor 2 and blocks on a future, stripe thread T
>>>> is blocked
>>>> Processor 2 invokes processor 3, which needs stripe thread T and waits
>>>> forever
>>>>
>>>> General advice - do not block striped pool threads. Avoid IO, waiting
>>>> for futures, etc. An entry processor is supposed to perform a local
>>>> computation and return quickly.
>>>>
>>>> On Wed, Oct 30, 2024 at 7:05 AM Raymond Liu <philosob...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi team,
>>>>>
>>>>> We see a failure with nested asynchronous entry processor calls where the 
>>>>> nested calls appear to complete, but
>>>>> the parent entry processor is not aware the child entry processor is done.
>>>>>
>>>>> In this sample test setup...
>>>>> 1. a test runner invokes an "outer" entry processor twice asynchronously.
>>>>> 2. The two outer entry processor invocations each invoke two inner entry 
>>>>> processors asynchronously.
>>>>> 3. All four inner entry processor invocations appear to complete.
>>>>> 4. One outer entry processor is stuck waiting (via IgniteFuture.get()) 
>>>>> for the last inner entry processor to complete.
>>>>>    The program hangs, prints thread dumps, and prints the following error:
>>>>>    "Blocked system-critical thread has been detected. This can lead to 
>>>>> cluster-wide undefined behaviour [workerName=sys-stripe-6, 
>>>>> threadName=sys-stripe-6-#7%server1%, blockedFor=13s]"
>>>>>
>>>>> I've uploaded this in a Github repo called 
>>>>> "ignite-blocked-thread-sample": 
>>>>> https://github.com/Philosobyte/ignite-blocked-thread-sample/blob/main/src/test/java/com/philosobyte/igniteblockedthreadsample/BlockedThreadTest.java
>>>>>
>>>>> While this sample uses asynchronous invocations, a version of this test 
>>>>> with synchronous invocations runs into the same
>>>>> issue.
>>>>>
>>>>> This appears to happen only when there is more than one server node and 
>>>>> the nested invocation goes to a different
>>>>> server node than the outer invocation. A debug breakpoint shows that, on 
>>>>> the server executing the inner entry processor,
>>>>> the sys-stripe thread pool is idle; but on the server executing the outer 
>>>>> entry processor, there is one sys-stripe
>>>>> thread pool waiting on IgniteFuture.get().
>>>>>
>>>>> We originally suspected a deadlock, but this error also occurs when we 
>>>>> try setting thread pool sizes to a large number.
>>>>> This almost appears to be a failure to communicate that a future should 
>>>>> be complete.
>>>>>
>>>>> Nested compute tasks, such as `IgniteCallable`s and `IgniteClosure`s 
>>>>> executing on the public thread pool, do not run
>>>>> into this problem. So, we worked around this issue by wrapping our entry 
>>>>> processors inside `IgniteCallable`s. That alone
>>>>> was already enough to resolve the issue; of course, at higher scale, we'd 
>>>>> run into deadlocks, so we also followed the
>>>>> advice in the "thread pools tuning" documentation page about using a 
>>>>> custom thread pool.
>>>>>
>>>>> Regardless, we would like to make sure we understand the issue. Can 
>>>>> anyone think of an explanation? Is this expected
>>>>> behavior or possibly a bug?
>>>>>
>>>>> Thank you,
>>>>> Raymond
>>>>>
>>>>>

Reply via email to