> both invocations use sys-stripe-6 Ignite needs sys-stripe-6 to complete the future for the invocation, but the thread is blocked waiting on that future.
On Thu, Oct 31, 2024 at 3:05 PM Raymond Liu <philosob...@gmail.com> wrote: > Each node has 32 sys-stripe threads. With 3 servers and 1 client, there > are a total of 128 sys-stripe threads. 127 out of 128 of those sys-stripe > threads are parked on StripeConcurrentQueue.take(). The remaining one, > sys-stripe-6 on server node 1, is parked on IgniteFutureImpl.get(). > Specifically, that future.get is the OuterEntryProcessor on server1 for > outer key "outer1" waiting for the InnerEntryProcessor on server3 for inner > key "outer1-inner1" to complete. Logs show that particular inner invocation > completed on sys-stripe-6 on server3. > > Here is the thread state for sys-stripe-6 on server1: > Thread [name="sys-stripe-6-#7%server1%", id=40, state=WAITING, > blockCnt=29, waitCnt=6] > at java.base@17.0.11/jdk.internal.misc.Unsafe.park(Native Method) > at java.base@17.0.11 > /java.util.concurrent.locks.LockSupport.park(LockSupport.java:341) > at > app//o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:181) > at > app//o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:144) > at > app//o.a.i.i.util.future.IgniteFutureImpl.get(IgniteFutureImpl.java:152) > at > app//com.philosobyte.igniteblockedthreadsample.BlockedThreadTest$OuterEntryProcessor.process(BlockedThreadTest.java:105) > at > app//com.philosobyte.igniteblockedthreadsample.BlockedThreadTest$OuterEntryProcessor.process(BlockedThreadTest.java:75) > at > app//o.a.i.i.processors.cache.EntryProcessorResourceInjectorProxy.process(EntryProcessorResourceInjectorProxy.java:68) > at > app//o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.runEntryProcessor(GridCacheMapEntry.java:6334) > at > app//o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5650) > at > app//o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5437) > at > app//o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:4320) > at > app//o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.access$5700(BPlusTree.java:4214) > at > app//o.a.i.i.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2292) > at > app//o.a.i.i.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:2182) > at > app//o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1692) > at > app//o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1675) > at > app//o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:424) > at > app//o.a.i.i.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:1903) > at > app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2536) > at > app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:1996) > at > app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1815) > at > app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1688) > at > app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3179) > at > app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$200(GridDhtAtomicCache.java:147) > at > app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:270) > at > app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:265) > at > app//o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1164) > at > app//o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:605) > at > app//o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:406) > at > app//o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:324) > at > app//o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:112) > at > app//o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:314) > at > app//o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1906) > at > app//o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1527) > at > app//o.a.i.i.managers.communication.GridIoManager.access$5300(GridIoManager.java:242) > at > app//o.a.i.i.managers.communication.GridIoManager$9.execute(GridIoManager.java:1420) > at > app//o.a.i.i.managers.communication.TraceRunnable.run(TraceRunnable.java:55) > at > app//o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:637) > at app//o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125) > at java.base@17.0.11/java.lang.Thread.run(Thread.java:840) > > Locked synchronizers: > > PoolProcessor printed a log showing the contents of the queue: > 2024-10-31T08:31:33.330-04:00 WARN 34704 --- > [ignite-blocked-thread-sample] [er-#70%server1%] o.a.i.i.p.p.PoolProcessor > : >>> Possible starvation in striped pool. > Thread name: sys-stripe-6-#7%server1% > Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, > topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, > msg=GridNearAtomicUpdateResponse [nodeId=null, futId=3, errs=null, > ret=GridCacheReturn [v=null, cacheObj=null, success=false, invokeRes=true, > loc=true, cacheId=100355670], remapTopVer=null, nearUpdates=null, > partId=294, mapping=ArrayList [], nodeLeft=false, super=GridCacheIdMessage > [cacheId=100355670, super=GridCacheMessage [msgId=42, depInfo=null, > lastAffChangedTopVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], > err=null, skipPrepare=false]]]]]] > Deadlock: false > Completed: 0 > > > > and here is the thread state for sys-stripe-6 on server 3, which finished > executing the inner entry processor. The other 126 sys-stripe thread states > look like this, as well: > Thread [name="sys-stripe-6-#300%server3%", id=343, state=WAITING, > blockCnt=17, waitCnt=2] > at java.base@17.0.11/jdk.internal.misc.Unsafe.park(Native Method) > at java.base@17.0.11 > /java.util.concurrent.locks.LockSupport.park(LockSupport.java:341) > at > app//o.a.i.i.util.StripedExecutor$StripeConcurrentQueue.take(StripedExecutor.java:819) > at > app//o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:623) > at app//o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125) > at java.base@17.0.11/java.lang.Thread.run(Thread.java:840) > > This is where I stopped and wrote my first e-mail. I figured that, to > debug further, I should get some tips about how processor completions are > normally communicated across the network from the called server to the > calling server. > > It does appear to matter that both invocations use sys-stripe-6 (albeit > one on server1 and one on server3). The other cross-server invocations > complete OK, e.g. outer0-inner0 is invoked from sys-stripe-5 on server2 and > completes on sys-stripe-1 on server1. > I have attached text logs including the full thread dump, in case that is > interesting. > > - Raymond > > On Thu, Oct 31, 2024 at 2:11 AM Pavel Tupitsyn <ptupit...@apache.org> > wrote: > >> Have you tried to get a thread dump to understand which code is blocking >> the sys-stripe thread? >> >> On Thu, Oct 31, 2024 at 5:56 AM Raymond Liu <philosob...@gmail.com> >> wrote: >> >>> Hm, the scenario you described doesn't sound accurate here. I guess it's >>> my fault for not including logs. I will paste them below. The logs show all >>> four *inner* entry processors complete without being blocked - we see >>> four "Entered InnerEntryProcessor" and four "Exiting InnerEntryProcessor" >>> lines. And yet, one of the OuterEntryProcessors does not seem to get >>> notified that the inner entry processors have all completed. This is >>> indicated by the presence of "OuterEntryProcessor with key outer1 waiting >>> for inner future 1" without a corresponding log stating >>> "OuterEntryProcessor with key outer1 done waiting for all inner futures". >>> ------- >>> 2024-10-30T23:27:55.056-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>> : Starting test >>> 2024-10-30T23:27:55.057-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>> : Invoking OuterEntryProcessor with key outer0 >>> 2024-10-30T23:27:55.077-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [mmunicationSpi%] >>> o.a.i.s.c.t.TcpCommunicationSpi : Accepted incoming communication >>> connection [locAddr=/127.0.0.1:47101, rmtAddr=/127.0.0.1:61560] >>> 2024-10-30T23:27:55.077-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [mmunicationSpi%] >>> o.a.i.s.c.t.TcpCommunicationSpi : Established outgoing >>> communication connection [locAddr=/127.0.0.1:61560, rmtAddr=/ >>> 127.0.0.1:47101] >>> 2024-10-30T23:27:55.081-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>> : Invoked OuterEntryProcessor with key outer0 >>> 2024-10-30T23:27:55.081-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>> : Invoking OuterEntryProcessor with key outer1 >>> 2024-10-30T23:27:55.082-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>> : Invoked OuterEntryProcessor with key outer1 >>> 2024-10-30T23:27:55.082-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>> : Done invoking OuterEntryProcessors. Waiting for them to >>> complete... >>> 2024-10-30T23:27:55.082-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>> : Test Runner waiting for outer future 0 >>> 2024-10-30T23:27:55.093-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>> : Entered OuterEntryProcessor with key outer1. Sleeping >>> before further actions... >>> 2024-10-30T23:27:55.093-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>> : Entered OuterEntryProcessor with key outer0. Sleeping >>> before further actions... >>> 2024-10-30T23:27:55.195-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>> : Invoking with inner key: outer1-inner0 >>> 2024-10-30T23:27:55.195-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>> : Invoking with inner key: outer0-inner0 >>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>> : Invoked with inner key: outer1-inner0 >>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>> : Invoked with inner key: outer0-inner0 >>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>> : Invoking with inner key: outer1-inner1 >>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>> : Invoking with inner key: outer0-inner1 >>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>> : Invoked with inner key: outer0-inner1 >>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>> : Invoked with inner key: outer1-inner1 >>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>> : OuterEntryProcessor with key outer1 waiting for inner >>> future 0 >>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>> : OuterEntryProcessor with key outer0 waiting for inner >>> future 0 >>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [e-1-#2%server1%] c.p.i.BlockedThreadTest >>> : Entered InnerEntryProcessor with inner key: outer0-inner0 >>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [e-1-#2%server1%] c.p.i.BlockedThreadTest >>> : Exiting InnerEntryProcessor with inner key: outer0-inner0 >>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [e-0-#1%server1%] c.p.i.BlockedThreadTest >>> : Entered InnerEntryProcessor with inner key: outer0-inner1 >>> 2024-10-30T23:27:55.209-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [e-0-#1%server1%] c.p.i.BlockedThreadTest >>> : Exiting InnerEntryProcessor with inner key: outer0-inner1 >>> 2024-10-30T23:27:55.210-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [6-#298%server3%] c.p.i.BlockedThreadTest >>> : Entered InnerEntryProcessor with inner key: outer1-inner1 >>> 2024-10-30T23:27:55.210-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [6-#298%server3%] c.p.i.BlockedThreadTest >>> : Exiting InnerEntryProcessor with inner key: outer1-inner1 >>> 2024-10-30T23:27:55.215-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>> : OuterEntryProcessor with key outer0 waiting for inner >>> future 1 >>> 2024-10-30T23:27:55.216-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>> : OuterEntryProcessor with key outer0 done waiting for all >>> inner futures >>> 2024-10-30T23:27:55.216-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>> : Entered InnerEntryProcessor with inner key: outer1-inner0 >>> 2024-10-30T23:27:55.216-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest >>> : Exiting InnerEntryProcessor with inner key: outer1-inner0 >>> 2024-10-30T23:27:55.216-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest >>> : Test Runner waiting for outer future 1 >>> 2024-10-30T23:27:55.217-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest >>> : OuterEntryProcessor with key outer1 waiting for inner >>> future 1 >>> 2024-10-30T23:27:59.621-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [-notifier-timer] >>> o.a.i.i.p.c.GridUpdateNotifier : Your version is up to date. >>> 2024-10-30T23:28:01.283-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [-notifier-timer] >>> o.a.i.i.p.c.GridUpdateNotifier : Your version is up to date. >>> 2024-10-30T23:28:02.877-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [-notifier-timer] >>> o.a.i.i.p.c.GridUpdateNotifier : Your version is up to date. >>> 2024-10-30T23:28:04.399-04:00 INFO 17056 --- >>> [ignite-blocked-thread-sample] [-notifier-timer] >>> o.a.i.i.p.c.GridUpdateNotifier : Your version is up to date. >>> 2024-10-30T23:28:08.508-04:00 ERROR 17056 --- >>> [ignite-blocked-thread-sample] [%-#118%server1%] o.a.i.i.u.t.G >>> : Blocked system-critical thread has been detected. This can >>> lead to cluster-wide undefined behaviour [workerName=sys-stripe-6, >>> threadName=sys-stripe-6-#7%server1%, blockedFor=13s] >>> ------- >>> >>> Thanks for the advice - we are probably using entry processors in a way >>> originally intended for compute jobs with transactions, and we will >>> probably move towards that direction eventually. But to learn from this and >>> prevent future issues, I'd like to understand exactly what the problem is, >>> and with these logs, I can't understand how it's a deadlock. >>> >>> - Raymond >>> >>> >>> On Wed, Oct 30, 2024 at 7:33 AM Pavel Tupitsyn <ptupit...@apache.org> >>> wrote: >>> >>>> Looks like a deadlock to me. >>>> >>>> Processor 1 invokes processor 2 and blocks on a future, stripe thread T >>>> is blocked >>>> Processor 2 invokes processor 3, which needs stripe thread T and waits >>>> forever >>>> >>>> General advice - do not block striped pool threads. Avoid IO, waiting >>>> for futures, etc. An entry processor is supposed to perform a local >>>> computation and return quickly. >>>> >>>> On Wed, Oct 30, 2024 at 7:05 AM Raymond Liu <philosob...@gmail.com> >>>> wrote: >>>> >>>>> Hi team, >>>>> >>>>> We see a failure with nested asynchronous entry processor calls where the >>>>> nested calls appear to complete, but >>>>> the parent entry processor is not aware the child entry processor is done. >>>>> >>>>> In this sample test setup... >>>>> 1. a test runner invokes an "outer" entry processor twice asynchronously. >>>>> 2. The two outer entry processor invocations each invoke two inner entry >>>>> processors asynchronously. >>>>> 3. All four inner entry processor invocations appear to complete. >>>>> 4. One outer entry processor is stuck waiting (via IgniteFuture.get()) >>>>> for the last inner entry processor to complete. >>>>> The program hangs, prints thread dumps, and prints the following error: >>>>> "Blocked system-critical thread has been detected. This can lead to >>>>> cluster-wide undefined behaviour [workerName=sys-stripe-6, >>>>> threadName=sys-stripe-6-#7%server1%, blockedFor=13s]" >>>>> >>>>> I've uploaded this in a Github repo called >>>>> "ignite-blocked-thread-sample": >>>>> https://github.com/Philosobyte/ignite-blocked-thread-sample/blob/main/src/test/java/com/philosobyte/igniteblockedthreadsample/BlockedThreadTest.java >>>>> >>>>> While this sample uses asynchronous invocations, a version of this test >>>>> with synchronous invocations runs into the same >>>>> issue. >>>>> >>>>> This appears to happen only when there is more than one server node and >>>>> the nested invocation goes to a different >>>>> server node than the outer invocation. A debug breakpoint shows that, on >>>>> the server executing the inner entry processor, >>>>> the sys-stripe thread pool is idle; but on the server executing the outer >>>>> entry processor, there is one sys-stripe >>>>> thread pool waiting on IgniteFuture.get(). >>>>> >>>>> We originally suspected a deadlock, but this error also occurs when we >>>>> try setting thread pool sizes to a large number. >>>>> This almost appears to be a failure to communicate that a future should >>>>> be complete. >>>>> >>>>> Nested compute tasks, such as `IgniteCallable`s and `IgniteClosure`s >>>>> executing on the public thread pool, do not run >>>>> into this problem. So, we worked around this issue by wrapping our entry >>>>> processors inside `IgniteCallable`s. That alone >>>>> was already enough to resolve the issue; of course, at higher scale, we'd >>>>> run into deadlocks, so we also followed the >>>>> advice in the "thread pools tuning" documentation page about using a >>>>> custom thread pool. >>>>> >>>>> Regardless, we would like to make sure we understand the issue. Can >>>>> anyone think of an explanation? Is this expected >>>>> behavior or possibly a bug? >>>>> >>>>> Thank you, >>>>> Raymond >>>>> >>>>>