Thanks, Anton, for the information. I have re-summarized the issue below and added more details, including both server and client logs from the time of the incident.

[Cluster configuration]
Windows Azure VM scale set
Windows Server 2016 10.0 amd64 VM x 40 nodes
VM information: Java(TM) SE Runtime Environment 1.8.0_162-b12 Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 25.162-b12
1 Ignite server and 2 Ignite clients per node

Full topology:
2018/05/09 17:53:30.564 [INFO ][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=120, servers=40, clients=80, CPUs=640, heap=560.0GB]

[Ignite cache configuration]
.NET Ignite 2.3
ignite-config.xml <http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-config.xml>

    var lifecycleHandler = new LifecycleHandler();
    IgniteConfiguration igniteConfiguration = new IgniteConfiguration()
    {
        SpringConfigUrl = "ignite-config.xml",
        ClientMode = clientMode,
        JvmOptions = jvmOptions,
        LifecycleHandlers = new[] { lifecycleHandler },
        BinaryConfiguration = binaryConfiguration
    };
    m_ignite = Ignition.Start(igniteConfiguration);
    m_ignite.Stopping += async (sender, args) =>
    {
        Console.WriteLine(">>> Ignite node stopping ...");  // <= No console log has been printed
    };
    m_ignite.Stopped += async (sender, args) =>
    {
        Console.WriteLine(">>> Ignite node stopped.");  // <= No console log has been printed
    };

    CacheConfiguration cacheConfig = new CacheConfiguration(cacheName, queryEntity)
    {
        SqlSchema = "PUBLIC",
        Backups = 2,
        DataRegionName = "Default_Region",
        CopyOnRead = false,
        EvictionPolicy = new LruEvictionPolicy
        {
            MaxSize = 100000,
            MaxMemorySize = 2L * 1024 * 1024 * 1024  // 2 GB (long literal avoids int overflow)
        }
    };

Key type: string
Value type: BinaryObject
One partitioned cache with up to 1 million entries
Average cache entry size is between 2 kbytes and 6 kbytes+

[Problem]
Ignite server processes have been dropping out of the topology one by one over time.

2018/05/09 17:53:30.564 [INFO ][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=120, servers=40, clients=80, CPUs=640, heap=560.0GB]
...
2018/05/10 08:20:44.254 [INFO ][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=123, servers=37, clients=80, CPUs=640, heap=530.0GB]
...
2018/05/10 11:29:43.461 [INFO ][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=128, servers=32, clients=80, CPUs=640, heap=480.0GB]
...
2018/05/10 19:30:08.519 [INFO ][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=139, servers=21, clients=80, CPUs=640, heap=370.0GB]

We have now lost 19 of the 40 Ignite servers from the topology.
It looks like the Ignite .NET server process is frozen whenever a server drops out of the topology.
ignite-jstack-node26.txt <http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-jstack-node26.txt>

[JVM Options]
SERVER
    "-Duser.timezone=UTC",
    "-DIGNITE_QUIET=false",
    "-Djava.net.preferIPv4Stack=true",
    "-Djava.awt.headless=true",
    "-Xms10g",
    "-Xmx10g",
    "-XX:+AlwaysPreTouch",
    "-XX:+UseG1GC",
    "-XX:+ScavengeBeforeFullGC",
    "-XX:+DisableExplicitGC"

CLIENT
    "-Duser.timezone=UTC",
    "-DIGNITE_QUIET=false",
    "-Djava.net.preferIPv4Stack=true",
    "-Djava.awt.headless=true",
    "-Xms2g",
    "-Xmx2g",
    "-XX:+AlwaysPreTouch",
    "-XX:+UseG1GC",
    "-XX:+ScavengeBeforeFullGC",
    "-XX:+DisableExplicitGC"

[Logs]
SERVER
2018/05/10 18:49:56.066 [INFO ][grid-timeout-worker-#39][IgniteKernal]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=8a2ce76e, uptime=23:20:53.173]
    ^-- H/N/C [hosts=40, nodes=104, CPUs=640]
    ^-- CPU [cur=2.13%, avg=1.23%, GC=0%]
    ^-- PageMemory [pages=37787]
    ^-- Heap [used=2571MB, free=74.89%, comm=10240MB]
    ^-- Non heap [used=77MB, free=-1%, comm=80MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
    ^-- Outbound messages queue [size=13]
2018/05/10 18:49:56.343 [INFO ][grid-timeout-worker-#39][IgniteKernal] FreeList [name=null, buckets=256, dataPages=35238, reusePages=12]

/PROCESS FROZEN HERE AT 2018/05/10 18:50 !!!
The .NET Ignite server process is still alive, but metrics have stopped, and there have been no other logs and no CPU usage since./

ignite-server-node26.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-server-node26.zip>

CLIENT
2018/05/10 18:50:29.383 [WARN ][Thread-2561][IgniteH2Indexing] Failed to send message [node=TcpDiscoveryNode [id=8a2ce76e-1bf2-4259-8592-81c11af9064f, addrs=[10.0.0.23, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, CDNode00000Q.hlbdeyzzwm2ujgdsre0nhzw3sg.dx.internal.cloudapp.net/10.0.0.23:47500], discPort=47500, order=69, intOrder=69, lastExchangeTime=1525894136807, loc=false, ver=2.3.0#19700101-sha1:00000000, isClient=false], msg=GridQueryCancelRequest [qryReqId=5208317], errMsg=Failed to send message (node left topology): TcpDiscoveryNode [id=8a2ce76e-1bf2-4259-8592-81c11af9064f, addrs=[10.0.0.23, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, CDNode00000Q.hlbdeyzzwm2ujgdsre0nhzw3sg.dx.internal.cloudapp.net/10.0.0.23:47500], discPort=47500, order=69, intOrder=69, lastExchangeTime=1525894136807, loc=false, ver=2.3.0#19700101-sha1:00000000, isClient=false]]

ignite-client-node08.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-client-node08.zip>

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
