Sorry for reusing the topic of http://apache-ignite-users.70518.x6.nabble.com/TcpDiscoverySpi-worker-thread-failed-with-assertion-error-td14554.html. I am hitting exactly the same issue, although it was supposedly fixed in https://issues.apache.org/jira/browse/IGNITE-5562.
We run more than six clusters, each with four or six nodes, and have never seen this issue on any other cluster. It has occurred only on this node, twice in the past three weeks; the node crashed suddenly each time. The logs show almost no load, and the other nodes in the cluster keep working fine. Could someone give me any feedback on how to avoid this issue?
----------------------------
>>> +----------------------------------------------------------------------+
>>> Ignite ver. 2.7.0#20181130-sha1:256ae4012cb143b4855b598b740a6f3499ead4db
>>> +----------------------------------------------------------------------+
>>> OS name: Linux 2.6.32-754.23.1.el6.x86_64 amd64
>>> CPU(s): 6
>>> Heap: 22.0GB
>>> VM name: [email protected]
>>> Ignite instance name: prod-ignite-18w.an.xx.xxx.com_47600_prod-ignite-19w.an.xx.xxx.com_47600
>>> Local node [ID=831BB843-E190-48B1-B828-BBC9A4407B47, order=2, clientMode=false]
>>> Local node addresses: [prod-ignite-19w.xx.xxx.com/xx.xxx.xxx.xxx, /127.0.0.1]
>>> Local ports: TCP:10800 TCP:11211 TCP:47100 TCP:47600
--------------------------
[2020-03-04 03:12:09,649][INFO ][grid-timeout-worker-#23%prod-ignite-18w.xx.xxx.com_47600_prod-ignite-19w.xx.xxx.com_47600%][tan_47600] Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=831bb843, name=prod-ignite-18w.xx.xxx.com_47600_prod-ignite-19w.xx.xxx.com_47600, uptime=5 days, 04:06:39.768]
    ^-- H/N/C [hosts=6, nodes=6, CPUs=20]
    ^-- CPU [cur=0.47%, avg=1.11%, GC=0%]
    ^-- PageMemory [pages=585588]
    ^-- Heap [used=10009MB, free=55.57%, comm=22528MB]
    ^-- Off-heap [used=2300MB, free=81.58%, comm=2742MB]
    ^--   sysMemPlc region [used=0MB, free=99.17%, comm=40MB]
    ^--   CACHE_NODE_xG_Region region [used=2300MB, free=81.28%, comm=2662MB]
    ^--   TxLog region [used=0MB, free=100%, comm=40MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=32, qSize=0]
[2020-03-04 03:12:09,649][INFO ][grid-timeout-worker-#23%prod-ignite-18w.xx.xxx.com_47600_prod-ignite-19w.xx.xxx.com_47600%][tan_47600] FreeList [name=prod-ignite-18w.xx.xxx.com_47600_prod-ignite-19w.xx.xxx.com_47600, buckets=256, dataPages=1, reusePages=0]
[2020-03-04 03:12:09,649][INFO ][grid-timeout-worker-#23%prod-ignite-18w.xx.xxx.com_47600_prod-ignite-19w.xx.xxx.com_47600%][tan_47600] FreeList [name=prod-ignite-18w.xx.xxx.com_47600_prod-ignite-19w.xx.xxx.com_47600, buckets=256, dataPages=365506, reusePages=2926]
[2020-03-04 03:12:48,906][ERROR][tcp-disco-msg-worker-#2%prod-ignite-18w.xx.xxx.com_47600_prod-ignite-19w.xx.xxx.com_47600%][TcpDiscoverySpi] TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
java.lang.AssertionError: -2977
    at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryStatistics.onMessageSent(TcpDiscoveryStatistics.java:317)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.sendMessageAcrossRing(ServerImpl.java:3301)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMetricsUpdateMessage(ServerImpl.java:5305)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2828)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2611)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7188)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[2020-03-04 03:12:48,936][INFO ][node-stop-thread][GridTcpRestProtocol] Command protocol successfully stopped: TCP binary
[2020-03-04 03:12:48,938][ERROR][tcp-disco-msg-worker-#2%prod-ignite-18w.xx.xxx.com_47600_prod-ignite-19w.xx.xxx.com_47600%][root] Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.AssertionError: -2977]]
java.lang.AssertionError: -2977
    at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryStatistics.onMessageSent(TcpDiscoveryStatistics.java:317)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.sendMessageAcrossRing(ServerImpl.java:3301)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMetricsUpdateMessage(ServerImpl.java:5305)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2828)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2611)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7188)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[2020-03-04 03:12:48,938][INFO ][node-stop-thread][GridServiceProcessor] Shutting down distributed service [name=cacheQueryService, execId8=65a391f9]
[2020-03-04 03:12:48,941][WARN ][tcp-disco-msg-worker-#2%prod-ignite-18w.xx.xxx.com_47600_prod-ignite-19w.xx.xxx.com_47600%][FailureProcessor] No deadlocked threads detected.
[2020-03-04 03:12:52,315][WARN ][jvm-pause-detector-worker][tan_47600] Possible too long JVM pause: 3334 milliseconds.
[2020-03-04 03:12:52,342][WARN ][tcp-disco-msg-worker-#2%prod-ignite-18w.xx.xxx.com_47600_prod-ignite-19w.xx.xxx.com_47600%][FailureProcessor] Thread dump at 2020/03/04 03:12:52 PST
Thread [name="node-stop-thread", id=59122, state=RUNNABLE, blockCnt=4, waitCnt=5]
    at o.a.i.i.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:464)
    at o.a.i.i.processors.cache.GridCachePartitionExchangeManager.onKernalStop0(GridCachePartitionExchangeManager.java:777)
    at o.a.i.i.processors.cache.GridCacheSharedManagerAdapter.onKernalStop(GridCacheSharedManagerAdapter.java:120)
    at o.a.i.i.processors.cache.GridCacheProcessor.onKernalStop(GridCacheProcessor.java:1114)
    at o.a.i.i.IgniteKernal.stop0(IgniteKernal.java:2280)
    at o.a.i.i.IgniteKernal.stop(IgniteKernal.java:2228)
    at o.a.i.i.IgnitionEx$IgniteNamedInstance.stop0(IgnitionEx.java:2612)
    - locked o.a.i.i.IgnitionEx$IgniteNamedInstance@624fb95
    at o.a.i.i.IgnitionEx$IgniteNamedInstance.stop(IgnitionEx.java:2575)
    at o.a.i.i.IgnitionEx.stop(IgnitionEx.java:379)
    at o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker$1.run(ServerImpl.java:2719)
    at java.lang.Thread.run(Thread.java:748)
------------------------

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
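One detail worth noting: the crash is a java.lang.AssertionError, and the JVM only evaluates Java assert statements when assertions are enabled via -ea/-enableassertions. I am assuming here that our launch script passes -ea (it is not visible in the logs above, so check your own startup options). If it does, a possible stop-gap until we can upgrade to a build containing the fix is to disable assertions for the Ignite packages only. A sketch of such a launch-script change, with hypothetical variable names (JVM_OPTS, IGNITE_LIBS, CONFIG_FILE) that you would adapt to your own ignite.sh or service wrapper:

```shell
# Assert checks in Java are no-ops unless -ea is passed, so narrowing the
# assertion scope keeps TcpDiscoveryStatistics.onMessageSent from throwing
# and taking the discovery worker (and with it the whole node) down.
# "-da:org.apache.ignite..." disables assertions in that package and all
# of its subpackages while leaving any other -ea flags in effect.
JVM_OPTS="${JVM_OPTS} -da:org.apache.ignite..."

exec "${JAVA_HOME}/bin/java" ${JVM_OPTS} \
    -cp "${IGNITE_LIBS}" \
    org.apache.ignite.startup.cmdline.CommandLineStartup "${CONFIG_FILE}"
```

This only hides the symptom: the negative statistics value (-2977, logged right around a 3334 ms JVM pause, which makes a pause- or clock-related negative time delta look plausible as the trigger) would still be computed, it just would no longer terminate the node. Upgrading to a release that actually contains the fix remains the proper remedy.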
