There are a few issues here (the blocked thread, the communication errors), but I suspect the key one is the JVM pause:
[2020-07-03T18:17:21,793][WARN ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM pause: 10133 milliseconds.

This is usually due to garbage collection, but there are a number of other possibilities, such as slow I/O. I suggest you start with the recommendations on the GC tuning documentation page:

https://apacheignite.readme.io/docs/jvm-and-system-tuning

Regards,
Stephen

> On 4 Jul 2020, at 12:44, Kamlesh Joshi <[email protected]> wrote:
>
> Hi Team,
>
> We have encountered the following defect in our PROD environment, after which all traffic was halted for around 10 minutes. We recently upgraded our cluster to Ignite 2.7.6 from 2.6.0.
> Is this related to any existing open defect in this version? Has anyone observed the same defect before?
>
> Any help or pointers around this will be appreciated.
>
> [2020-07-03T18:17:11,613][ERROR][sys-stripe-36-#37%CustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=480s]
> [2020-07-03T18:17:11,613][WARN ][sys-stripe-36-#37%CustomerCC%][G] Thread [name="exchange-worker-#344%CustomerCC%", id=391, state=TIMED_WAITING, blockCnt=1, waitCnt=2049782]
> Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6bf9f3a4, ownerName=null, ownerId=-1]
>
> [2020-07-03T18:17:11,620][ERROR][sys-stripe-36-#37%CustomerCC%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=CustomerCC, finished=false, heartbeatTs=1593780431612]]]
> org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=CustomerCC, finished=false, heartbeatTs=1593780431612]
>     at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831) [ignite-core-2.7.6.jar:2.7.6]
>     at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) [ignite-core-2.7.6.jar:2.7.6]
>     at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) [ignite-core-2.7.6.jar:2.7.6]
>     at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) [ignite-core-2.7.6.jar:2.7.6]
>     at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:513) [ignite-core-2.7.6.jar:2.7.6]
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.6.jar:2.7.6]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
> [2020-07-03T18:17:11,625][WARN ][sys-stripe-36-#37%CustomerCC%][FailureProcessor] No deadlocked threads detected.
> [2020-07-03T18:17:21,790][INFO ][tcp-disco-sock-reader-#201%CustomerCC%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/xx.xx.xx.xx:46416, rmtPort=46416
> [2020-07-03T18:17:21,793][WARN ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM pause: 10133 milliseconds.
> [2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-31-#295%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:11764, writeTimeout=2000]
> [2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-57-#321%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:38500, writeTimeout=2000]
> [2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-5-#269%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:41442, writeTimeout=2000]
> [2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-53-#317%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:44178, writeTimeout=2000]
> [2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:11884, writeTimeout=2000]
> [2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:39044, writeTimeout=2000]
> [2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-53-#317%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:48756, writeTimeout=2000]
> [2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:42190, writeTimeout=2000]
>
> Thanks and Regards,
> Kamlesh Joshi
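One rough way to tell whether a pause like the 10133 ms one above came from garbage collection or from something else (slow I/O, swapping, CPU starvation) is to compare the pause with the GC time the JVM itself reports for the same window. The sketch below uses only the standard `java.lang.management` API. It is not Ignite's own pause detector; it just applies the common technique of a thread that sleeps briefly and measures how late it wakes up. The class name `PauseCheck`, the 50 ms interval, the 500 ms threshold and the "GC covers at least half the pause" rule are all illustrative assumptions.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Ignite: detects long pauses by "oversleeping"
// and reports how much of each pause the JVM attributes to garbage collection.
class PauseCheck {

    // Cumulative GC time in ms, summed over all collectors. getCollectionTime()
    // returns -1 when a collector does not report it, so such values are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Rough rule of thumb (an assumption): if GC time covers at least half of the
    // pause, suspect GC first; otherwise look at I/O, swapping or CPU starvation.
    static String classify(long pauseMillis, long gcMillis) {
        return gcMillis * 2 >= pauseMillis ? "mostly GC" : "mostly non-GC";
    }

    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 50;   // requested sleep per iteration (assumed value)
        final long thresholdMs = 500; // lateness treated as a long pause (assumed value)
        for (int i = 0; i < 20; i++) { // ~1 s demo; a real detector would loop in a daemon thread
            long gcBefore = totalGcMillis();
            long start = System.nanoTime();
            Thread.sleep(intervalMs);
            // How much later than requested did this thread wake up?
            long lateMs = (System.nanoTime() - start) / 1_000_000 - intervalMs;
            if (lateMs > thresholdMs) {
                long gcMs = totalGcMillis() - gcBefore;
                System.out.printf("Pause of %d ms, %d ms of it in GC: %s%n",
                        lateMs, gcMs, classify(lateMs, gcMs));
            }
        }
        System.out.println("Done; cumulative GC time " + totalGcMillis() + " ms");
    }
}
```

Treat the split as a hint only: with concurrent collectors the reported collection time can include work that did not stop application threads. On a JDK 8 JVM such as the 1.8.0_151 in the log, GC logging (for example `-XX:+PrintGCDetails` and `-XX:+PrintGCApplicationStoppedTime`) gives a more precise picture.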
