I see. It sounds like this would be a rough sequence of events for a happy path successful case: 1. OuterEntryProcessor, running on sys-stripe-5 on server2, invokes InnerEntryProcessor 2. sys-stripe-5 on server2 waits for the above future 3. InnerEntryProcessor runs on sys-stripe-1 on server1 4. server1 communicates the result to server2 5. server2 receives the message and tells sys-stripe-1 on server2 to process it 6. sys-stripe-1 on server2 completes the future that sys-stripe-5 on server2 was waiting for.
This design is not problematic when stripe threads never execute anything blocking, as you advised. Thank you for making that clear! - Raymond On Thu, Oct 31, 2024 at 9:11 AM Pavel Tupitsyn <ptupit...@apache.org> wrote: > > both invocations use sys-stripe-6 > > Ignite needs sys-stripe-6 to complete the future for the invocation, but > the thread is blocked waiting on that future. > > On Thu, Oct 31, 2024 at 3:05 PM Raymond Liu <philosob...@gmail.com> wrote: > >> Each node has 32 sys-stripe threads. With 3 servers and 1 client, there >> are a total of 128 sys-stripe threads. 127 out of 128 of those sys-stripe >> threads are parked on StripeConcurrentQueue.take(). The remaining one, >> sys-stripe-6 on server node 1, is parked on IgniteFutureImpl.get(). >> Specifically, that future.get is the OuterEntryProcessor on server1 for >> outer key "outer1" waiting for the InnerEntryProcessor on server3 for inner >> key "outer1-inner1" to complete. Logs show that particular inner invocation >> completed on sys-stripe-6 on server3. >> >> Here is the thread state for sys-stripe-6 on server1: >> Thread [name="sys-stripe-6-#7%server1%", id=40, state=WAITING, >> blockCnt=29, waitCnt=6] >> at java.base@17.0.11/jdk.internal.misc.Unsafe.park(Native Method) >> at java.base@17.0.11 >> /java.util.concurrent.locks.LockSupport.park(LockSupport.java:341) >> at >> app//o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:181) >> at >> app//o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:144) >> at >> app//o.a.i.i.util.future.IgniteFutureImpl.get(IgniteFutureImpl.java:152) >> at >> app//com.philosobyte.igniteblockedthreadsample.BlockedThreadTest$OuterEntryProcessor.process(BlockedThreadTest.java:105) >> at >> app//com.philosobyte.igniteblockedthreadsample.BlockedThreadTest$OuterEntryProcessor.process(BlockedThreadTest.java:75) >> at >> app//o.a.i.i.processors.cache.EntryProcessorResourceInjectorProxy.process(EntryProcessorResourceInjectorProxy.java:68) >> at >> app//o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.runEntryProcessor(GridCacheMapEntry.java:6334) >> at >> app//o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5650) >> at >> app//o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5437) >> at >> app//o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:4320) >> at >> app//o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.access$5700(BPlusTree.java:4214) >> at >> app//o.a.i.i.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2292) >> at >> app//o.a.i.i.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:2182) >> at >> app//o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1692) >> at >> app//o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1675) >> at >> app//o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:424) >> at >> app//o.a.i.i.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:1903) >> at >> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2536) >> at >> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:1996) >> at >> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1815) >> at >> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1688) >> at >> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3179) >> at >> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$200(GridDhtAtomicCache.java:147) >> at >> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:270) >> at >> app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:265) >> at >> app//o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1164) >> at >> app//o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:605) >> at >> app//o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:406) >> at >> app//o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:324) >> at >> app//o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:112) >> at >> app//o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:314) >> at >> app//o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1906) >> at >> app//o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1527) >> at >> app//o.a.i.i.managers.communication.GridIoManager.access$5300(GridIoManager.java:242) >> at >> app//o.a.i.i.managers.communication.GridIoManager$9.execute(GridIoManager.java:1420) >> at >> app//o.a.i.i.managers.communication.TraceRunnable.run(TraceRunnable.java:55) >> at >> app//o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:637) >> at app//o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125) >> at java.base@17.0.11/java.lang.Thread.run(Thread.java:840) >> >> Locked synchronizers: >> >> PoolProcessor printed a log showing the contents of the queue: >> 2024-10-31T08:31:33.330-04:00 WARN 34704 --- >> [ignite-blocked-thread-sample] [er-#70%server1%] o.a.i.i.p.p.PoolProcessor >> : >>> Possible starvation in striped pool. >> Thread name: sys-stripe-6-#7%server1% >> Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, >> topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, >> msg=GridNearAtomicUpdateResponse [nodeId=null, futId=3, errs=null, >> ret=GridCacheReturn [v=null, cacheObj=null, success=false, invokeRes=true, >> loc=true, cacheId=100355670], remapTopVer=null, nearUpdates=null, >> partId=294, mapping=ArrayList [], nodeLeft=false, super=GridCacheIdMessage >> [cacheId=100355670, super=GridCacheMessage [msgId=42, depInfo=null, >> lastAffChangedTopVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], >> err=null, skipPrepare=false]]]]]] >> Deadlock: false >> Completed: 0 >> >> >> >> and here is the thread state for sys-stripe-6 on server 3, which finished >> executing the inner entry processor. The other 126 sys-stripe thread states >> look like this, as well: >> Thread [name="sys-stripe-6-#300%server3%", id=343, state=WAITING, >> blockCnt=17, waitCnt=2] >> at java.base@17.0.11/jdk.internal.misc.Unsafe.park(Native Method) >> at java.base@17.0.11 >> /java.util.concurrent.locks.LockSupport.park(LockSupport.java:341) >> at >> app//o.a.i.i.util.StripedExecutor$StripeConcurrentQueue.take(StripedExecutor.java:819) >> at >> app//o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:623) >> at app//o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125) >> at java.base@17.0.11/java.lang.Thread.run(Thread.java:840) >> >> This is where I stopped and wrote my first e-mail. I figured that, to >> debug further, I should get some tips about how processor completions are >> normally communicated across the network from the called server to the >> calling server. >> >> It does appear to matter that both invocations use sys-stripe-6 (albeit >> one on server1 and one on server3). The other cross-server invocations >> complete OK, e.g. outer0-inner0 is invoked from sys-stripe-5 on server2 and >> completes on sys-stripe-1 on server1. >> I have attached text logs including the full thread dump, in case that is >> interesting. >> >> - Raymond >> >> On Thu, Oct 31, 2024 at 2:11 AM Pavel Tupitsyn <ptupit...@apache.org> >> wrote: >> >>> Have you tried to get a thread dump to understand which code is blocking >>> the sys-stripe thread? >>> >>> On Thu, Oct 31, 2024 at 5:56 AM Raymond Liu <philosob...@gmail.com> >>> wrote: >>> >>>> Hm, the scenario you described doesn't sound accurate here. I guess >>>> it's my fault for not including logs. I will paste them below. The logs >>>> show all four *inner* entry processors complete without being blocked >>>> - we see four "Entered InnerEntryProcessor" and four "Exiting >>>> InnerEntryProcessor" lines. And yet, one of the OuterEntryProcessors does >>>> not seem to get notified that the inner entry processors have all >>>> completed. This is indicated by the presence of "OuterEntryProcessor with >>>> key outer1 waiting for inner future 1" without a corresponding log stating >>>> "OuterEntryProcessor with key outer1 done waiting for all inner futures". >>>> ------- >>>> 2024-10-30T23:27:55.056-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>>> : Starting test >>>> 2024-10-30T23:27:55.057-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>>> : Invoking OuterEntryProcessor with key outer0 >>>> 2024-10-30T23:27:55.077-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [mmunicationSpi%] >>>> o.a.i.s.c.t.TcpCommunicationSpi : Accepted incoming communication >>>> connection [locAddr=/127.0.0.1:47101, rmtAddr=/127.0.0.1:61560] >>>> 2024-10-30T23:27:55.077-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [mmunicationSpi%] >>>> o.a.i.s.c.t.TcpCommunicationSpi : Established outgoing >>>> communication connection [locAddr=/127.0.0.1:61560, rmtAddr=/ >>>> 127.0.0.1:47101] >>>> 2024-10-30T23:27:55.081-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>>> : Invoked OuterEntryProcessor with key outer0 >>>> 2024-10-30T23:27:55.081-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>>> : Invoking OuterEntryProcessor with key outer1 >>>> 2024-10-30T23:27:55.082-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>>> : Invoked OuterEntryProcessor with key outer1 >>>> 2024-10-30T23:27:55.082-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>>> : Done invoking OuterEntryProcessors. Waiting for them to >>>> complete... >>>> 2024-10-30T23:27:55.082-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>>> : Test Runner waiting for outer future 0 >>>> 2024-10-30T23:27:55.093-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>>> : Entered OuterEntryProcessor with key outer1. Sleeping >>>> before further actions... >>>> 2024-10-30T23:27:55.093-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>>> : Entered OuterEntryProcessor with key outer0. Sleeping >>>> before further actions... >>>> 2024-10-30T23:27:55.195-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>>> : Invoking with inner key: outer1-inner0 >>>> 2024-10-30T23:27:55.195-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>>> : Invoking with inner key: outer0-inner0 >>>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>>> : Invoked with inner key: outer1-inner0 >>>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>>> : Invoked with inner key: outer0-inner0 >>>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>>> : Invoking with inner key: outer1-inner1 >>>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>>> : Invoking with inner key: outer0-inner1 >>>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>>> : Invoked with inner key: outer0-inner1 >>>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>>> : Invoked with inner key: outer1-inner1 >>>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>>> : OuterEntryProcessor with key outer1 waiting for inner >>>> future 0 >>>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>>> : OuterEntryProcessor with key outer0 waiting for inner >>>> future 0 >>>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [e-1-#2%server1%] c.p.i.BlockedThreadTest >>>> : Entered InnerEntryProcessor with inner key: outer0-inner0 >>>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [e-1-#2%server1%] c.p.i.BlockedThreadTest >>>> : Exiting InnerEntryProcessor with inner key: outer0-inner0 >>>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [e-0-#1%server1%] c.p.i.BlockedThreadTest >>>> : Entered InnerEntryProcessor with inner key: outer0-inner1 >>>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [e-0-#1%server1%] c.p.i.BlockedThreadTest >>>> : Exiting InnerEntryProcessor with inner key: outer0-inner1 >>>> 2024-10-30T23:27:55.210-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [6-#298%server3%] c.p.i.BlockedThreadTest >>>> : Entered InnerEntryProcessor with inner key: outer1-inner1 >>>> 2024-10-30T23:27:55.210-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [6-#298%server3%] c.p.i.BlockedThreadTest >>>> : Exiting InnerEntryProcessor with inner key: outer1-inner1 >>>> 2024-10-30T23:27:55.215-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>>> : OuterEntryProcessor with key outer0 waiting for inner >>>> future 1 >>>> 2024-10-30T23:27:55.216-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>>> : OuterEntryProcessor with key outer0 done waiting for all >>>> inner futures >>>> 2024-10-30T23:27:55.216-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>>> : Entered InnerEntryProcessor with inner key: outer1-inner0 >>>> 2024-10-30T23:27:55.216-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>>> : Exiting InnerEntryProcessor with inner key: outer1-inner0 >>>> 2024-10-30T23:27:55.216-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>>> : Test Runner waiting for outer future 1 >>>> 2024-10-30T23:27:55.217-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>>> : OuterEntryProcessor with key outer1 waiting for inner >>>> future 1 >>>> 2024-10-30T23:27:59.621-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [-notifier-timer] >>>> o.a.i.i.p.c.GridUpdateNotifier : Your version is up to date. >>>> 2024-10-30T23:28:01.283-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [-notifier-timer] >>>> o.a.i.i.p.c.GridUpdateNotifier : Your version is up to date. >>>> 2024-10-30T23:28:02.877-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [-notifier-timer] >>>> o.a.i.i.p.c.GridUpdateNotifier : Your version is up to date. >>>> 2024-10-30T23:28:04.399-04:00 INFO 17056 --- >>>> [ignite-blocked-thread-sample] [-notifier-timer] >>>> o.a.i.i.p.c.GridUpdateNotifier : Your version is up to date. >>>> 2024-10-30T23:28:08.508-04:00 ERROR 17056 --- >>>> [ignite-blocked-thread-sample] [%-#118%server1%] o.a.i.i.u.t.G >>>> : Blocked system-critical thread has been detected. This can >>>> lead to cluster-wide undefined behaviour [workerName=sys-stripe-6, >>>> threadName=sys-stripe-6-#7%server1%, blockedFor=13s] >>>> ------- >>>> >>>> Thanks for the advice - we are probably using entry processors in a way >>>> originally intended for compute jobs with transactions, and we will >>>> probably move towards that direction eventually. But to learn from this and >>>> prevent future issues, I'd like to understand exactly what the problem is, >>>> and with these logs, I can't understand how it's a deadlock. >>>> >>>> - Raymond >>>> >>>> >>>> On Wed, Oct 30, 2024 at 7:33 AM Pavel Tupitsyn <ptupit...@apache.org> >>>> wrote: >>>> >>>>> Looks like a deadlock to me. >>>>> >>>>> Processor 1 invokes processor 2 and blocks on a future, stripe thread >>>>> T is blocked >>>>> Processor 2 invokes processor 3, which needs stripe thread T and waits >>>>> forever >>>>> >>>>> General advice - do not block striped pool threads. Avoid IO, waiting >>>>> for futures, etc. An entry processor is supposed to perform a local >>>>> computation and return quickly. >>>>> >>>>> On Wed, Oct 30, 2024 at 7:05 AM Raymond Liu <philosob...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi team, >>>>>> >>>>>> We see a failure with nested asynchronous entry processor calls where >>>>>> the nested calls appear to complete, but >>>>>> the parent entry processor is not aware the child entry processor is >>>>>> done. >>>>>> >>>>>> In this sample test setup... >>>>>> 1. a test runner invokes an "outer" entry processor twice asynchronously. >>>>>> 2. The two outer entry processor invocations each invoke two inner entry >>>>>> processors asynchronously. >>>>>> 3. All four inner entry processor invocations appear to complete. >>>>>> 4. One outer entry processor is stuck waiting (via IgniteFuture.get()) >>>>>> for the last inner entry processor to complete. >>>>>> The program hangs, prints thread dumps, and prints the following >>>>>> error: >>>>>> "Blocked system-critical thread has been detected. This can lead to >>>>>> cluster-wide undefined behaviour [workerName=sys-stripe-6, >>>>>> threadName=sys-stripe-6-#7%server1%, blockedFor=13s]" >>>>>> >>>>>> I've uploaded this in a Github repo called >>>>>> "ignite-blocked-thread-sample": >>>>>> https://github.com/Philosobyte/ignite-blocked-thread-sample/blob/main/src/test/java/com/philosobyte/igniteblockedthreadsample/BlockedThreadTest.java >>>>>> >>>>>> While this sample uses asynchronous invocations, a version of this test >>>>>> with synchronous invocations runs into the same >>>>>> issue. >>>>>> >>>>>> This appears to happen only when there is more than one server node and >>>>>> the nested invocation goes to a different >>>>>> server node than the outer invocation. A debug breakpoint shows that, on >>>>>> the server executing the inner entry processor, >>>>>> the sys-stripe thread pool is idle; but on the server executing the >>>>>> outer entry processor, there is one sys-stripe >>>>>> thread pool waiting on IgniteFuture.get(). >>>>>> >>>>>> We originally suspected a deadlock, but this error also occurs when we >>>>>> try setting thread pool sizes to a large number. >>>>>> This almost appears to be a failure to communicate that a future should >>>>>> be complete. >>>>>> >>>>>> Nested compute tasks, such as `IgniteCallable`s and `IgniteClosure`s >>>>>> executing on the public thread pool, do not run >>>>>> into this problem. So, we worked around this issue by wrapping our entry >>>>>> processors inside `IgniteCallable`s. That alone >>>>>> was already enough to resolve the issue; of course, at higher scale, >>>>>> we'd run into deadlocks, so we also followed the >>>>>> advice in the "thread pools tuning" documentation page about using a >>>>>> custom thread pool. >>>>>> >>>>>> Regardless, we would like to make sure we understand the issue. Can >>>>>> anyone think of an explanation? Is this expected >>>>>> behavior or possibly a bug? >>>>>> >>>>>> Thank you, >>>>>> Raymond >>>>>> >>>>>>