Possibly there is a race. I've created a ticket for this [1].

[1] https://issues.apache.org/jira/browse/IGNITE-8751
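In the meantime, one thing worth knowing: segmentationPolicy=RESTART_JVM never got a chance to fire here. The discovery worker's death was routed to the failure handler (StopNodeOrHaltFailureHandler, the default), which halted the JVM immediately, as the log below shows. A rough, untested sketch of switching to a handler that restarts the process instead; note this only works when the node is started via ignite.sh, which provides the external restart loop:

<!-- Inside the existing grid.cfg bean: restart the whole process on
     critical failures such as SYSTEM_WORKER_TERMINATION, instead of
     halting the JVM. -->
<property name="failureHandler">
    <bean class="org.apache.ignite.failure.RestartProcessFailureHandler"/>
</property>

I haven't verified this against the race in [1], so treat it as a sketch.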
On Fri, Jun 8, 2018 at 4:56 PM, Andrey Mashenkov
<andrey.mashen...@gmail.com> wrote:

> Hi,
>
> It looks like the node was segmented due to a long JVM pause.
> There are two "long JVM pause" messages and a suspiciously long
> checkpoint:
>
> Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
> pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
> len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
> fsync=965250ms, total=978172ms]
>
> Do you use HDDs? Are you sure no swapping is happening?
> Do you have enough free space on disk?
> Is there any other heavy process on the server that may take too much CPU
> time, so that the JVM gets descheduled from the CPU for too long?
>
> It also looks weird that the JVM was not restarted. We have to check that
> case.
>
> On Fri, Jun 8, 2018 at 12:32 PM, Ray <ray...@cisco.com> wrote:
>
>> I set up a six-node Ignite cluster to test performance and stability.
>> Here's my setup:
>>
>> <bean id="grid.cfg"
>>       class="org.apache.ignite.configuration.IgniteConfiguration">
>>     <property name="segmentationPolicy" value="RESTART_JVM"/>
>>     <property name="peerClassLoadingEnabled" value="true"/>
>>     <property name="failureDetectionTimeout" value="60000"/>
>>     <property name="dataStorageConfiguration">
>>         <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>             <property name="defaultDataRegionConfiguration">
>>                 <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>                     <property name="name" value="default_Region"/>
>>                     <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
>>                     <property name="maxSize" value="#{500L * 1024 * 1024 * 1024}"/>
>>                     <property name="persistenceEnabled" value="true"/>
>>                     <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>
>>                 </bean>
>>             </property>
>>             <property name="walMode" value="BACKGROUND"/>
>>             <property name="walFlushFrequency" value="5000"/>
>>             <property name="checkpointFrequency" value="600000"/>
>>             <property name="writeThrottlingEnabled" value="true"/>
>>         </bean>
>>     </property>
>> </bean>
>>
>> And I used this command to start the Ignite node:
>>
>> ./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC
>> -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch
>> -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps
>> -J-XX:+PrintAdaptiveSizePolicy -J-Xloggc:/ignitegc-$(date
>> +%Y_%m_%d-%H_%M).log config/persistent-config.xml
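By the way, an fsync as slow as the one in your checkpoint log often means the WAL and the checkpoint data pages share one overloaded device. If that's the case here, a sketch of splitting them across disks (the mount points /data and /wal are made up for the example):

<!-- Inside dataStorageConfiguration: keep data pages and WAL on different
     physical disks so checkpoint fsync and WAL writes don't compete for I/O. -->
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
    <property name="storagePath" value="/data/ignite/persistence"/>
    <property name="walPath" value="/wal/ignite/wal"/>
    <property name="walArchivePath" value="/wal/ignite/wal-archive"/>
    ...
</bean>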
>> One of the nodes just dropped out of the topology. Here's the log for the
>> last three minutes before the node went down.
>>
>> [08:39:58,982][INFO][grid-timeout-worker-#119][IgniteKernal]
>> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>>     ^-- Node [id=8333aa56, uptime=02:34:01.948]
>>     ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
>>     ^-- CPU [cur=41%, avg=33.18%, GC=0%]
>>     ^-- PageMemory [pages=8912687]
>>     ^-- Heap [used=8942MB, free=72.05%, comm=32000MB]
>>     ^-- Non heap [used=70MB, free=95.35%, comm=73MB]
>>     ^-- Outbound messages queue [size=0]
>>     ^-- Public thread pool [active=0, idle=0, qSize=0]
>>     ^-- System thread pool [active=0, idle=6, qSize=0]
>> [08:40:51,945][INFO][db-checkpoint-thread-#178][GridCacheDatabaseSharedManager]
>> Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
>> pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
>> len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
>> fsync=965250ms, total=978172ms]
>> [08:40:53,086][INFO][db-checkpoint-thread-#178][GridCacheDatabaseSharedManager]
>> Checkpoint started [checkpointId=14d929ac-1b5c-4df2-a71f-002d5eb41f14,
>> startPtr=FileWALPointer [idx=2242, fileOff=65211837, len=15981],
>> checkpointLockWait=0ms, checkpointLockHoldTime=39ms,
>> walCpRecordFsyncDuration=720ms, pages=2110545, reason='timeout']
>> [08:40:57,793][INFO][data-streamer-stripe-1-#58][PageMemoryImpl]
>> Throttling is applied to page modifications [percentOfPartTime=0.22,
>> markDirty=7192 pages/sec, checkpointWrite=2450 pages/sec,
>> estIdealMarkDirty=139543 pages/sec, curDirty=0.00, maxDirty=0.17,
>> avgParkTime=1732784 ns, pages: (total=2110545, evicted=0, written=875069,
>> synced=0, cpBufUsed=92, cpBufTotal=518215)]
>> [08:40:58,991][INFO][grid-timeout-worker-#119][IgniteKernal]
>> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>>     ^-- Node [id=8333aa56, uptime=02:35:01.957]
>>     ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
>>     ^-- CPU [cur=9.3%, avg=33%, GC=0%]
>>     ^-- PageMemory [pages=8920631]
>>     ^-- Heap [used=13262MB, free=58.55%, comm=32000MB]
>>     ^-- Non heap [used=70MB, free=95.34%, comm=73MB]
>>     ^-- Outbound messages queue [size=0]
>>     ^-- Public thread pool [active=0, idle=0, qSize=0]
>>     ^-- System thread pool [active=0, idle=6, qSize=0]
>> [08:41:29,050][WARNING][jvm-pause-detector-worker][] Possible too long JVM
>> pause: 22667 milliseconds.
>> [08:41:29,050][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Finished
>> serving remote node connection [rmtAddr=/10.29.41.23:32815, rmtPort=32815]
>> [08:41:29,052][INFO][tcp-disco-sock-reader-#13][TcpDiscoverySpi] Finished
>> serving remote node connection [rmtAddr=/10.29.41.25:46515, rmtPort=46515]
>> [08:41:30,063][INFO][data-streamer-stripe-3-#60][PageMemoryImpl]
>> Throttling is applied to page modifications [percentOfPartTime=0.49,
>> markDirty=26945 pages/sec, checkpointWrite=2612 pages/sec,
>> estIdealMarkDirty=210815 pages/sec, curDirty=0.00, maxDirty=0.34,
>> avgParkTime=1024456 ns, pages: (total=2110545, evicted=0, written=1861330,
>> synced=0, cpBufUsed=8657, cpBufTotal=518215)]
>> [08:42:42,276][WARNING][jvm-pause-detector-worker][] Possible too long JVM
>> pause: 67967 milliseconds.
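A note on the "Checkpoint finished" record above: fsync=965250ms for
pages=2050858 means that, assuming the default 4 KiB page size, roughly
2050858 * 4 KiB ≈ 7.8 GiB took about 965 seconds to sync, i.e. under
10 MB/s. Even a single healthy HDD should sustain several times that, so
the disk itself, or something else hammering it (swapping, another
process), looks like the bottleneck.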
>> [08:42:42,277][INFO][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Local node
>> seems to be disconnected from topology (failure detection timeout is
>> reached) [failureDetectionTimeout=60000, connCheckFreq=20000]
>> [08:42:42,280][INFO][tcp-disco-sock-reader-#10][TcpDiscoverySpi] Finished
>> serving remote node connection [rmtAddr=/10.29.42.49:36509, rmtPort=36509]
>> [08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
>> accepted incoming connection [rmtAddr=/10.29.42.45, rmtPort=38712]
>> [08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
>> spawning a new thread for connection [rmtAddr=/10.29.42.45, rmtPort=38712]
>> [08:42:42,287][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Started
>> serving remote node connection [rmtAddr=/10.29.42.45:38712, rmtPort=38712]
>> [08:42:42,289][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Node is
>> out of topology (probably, due to short-time network problems).
>> [08:42:42,290][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Finished
>> serving remote node connection [rmtAddr=/10.29.42.45:38712, rmtPort=38712]
>> [08:42:42,290][WARNING][disco-event-worker-#161][GridDiscoveryManager]
>> Local node SEGMENTED: TcpDiscoveryNode [id=8333aa56-8bf4-4558-a387-809b1d2e2e5b,
>> addrs=[10.29.42.44, 127.0.0.1], sockAddrs=[sap-datanode1/10.29.42.44:49500,
>> /127.0.0.1:49500], discPort=49500, order=1, intOrder=1,
>> lastExchangeTime=1528447362286, loc=true, ver=2.5.0#20180523-sha1:86e110c7,
>> isClient=false]
>> [08:42:42,294][SEVERE][tcp-disco-srvr-#2][] Critical system error detected.
>> Will be handled accordingly to configured handler [hnd=class
>> o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext
>> [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException:
>> Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]
>> java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated
>> unexpectedly.
>>         at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5686)
>>         at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
>> [08:42:42,294][SEVERE][tcp-disco-srvr-#2][] JVM will be halted immediately
>> due to the failure: [failureCtx=FailureContext
>> [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException:
>> Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]
>>
>> As you can see from the log, the JVM paused for 67967 milliseconds, so
>> that is the reason this node went down.
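Also, a 68-second pause that leaves no trace in the GC log (see below) is
usually not garbage collection at all: the pause detector equally catches
cases where the whole process is stalled by the OS, e.g. swapping or heavy
page-cache writeback. Adding -XX:+PrintGCApplicationStoppedTime (and, if
needed, -XX:+PrintSafepointStatistics) to the JVM options would show whether
the JVM was even reaching safepoints in that window. Such a stall would also
explain why nothing at all was logged during those minutes.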
>> But when I looked at the GC log, here's the last record:
>>
>> 2018-06-08T08:38:56.654+0000: 9199.215: [GC pause (G1 Evacuation Pause)
>> (young) 9199.215: [G1Ergonomics (CSet Construction) start choosing CSet,
>> _pending_cards: 32828, predicted base time: 18.85 ms, remaining time:
>> 181.15 ms, target pause time: 200.00 ms]
>> 9199.215: [G1Ergonomics (CSet Construction) add young regions to CSet,
>> eden: 2350 regions, survivors: 50 regions, predicted young region time:
>> 79.31 ms]
>> 9199.215: [G1Ergonomics (CSet Construction) finish choosing CSet, eden:
>> 2350 regions, survivors: 50 regions, old: 0 regions, predicted pause time:
>> 98.16 ms, target pause time: 200.00 ms]
>> , 0.0417411 secs]
>>    [Parallel Time: 32.8 ms, GC Workers: 38]
>>       [GC Worker Start (ms): Min: 9199215.4, Avg: 9199215.5, Max: 9199215.7, Diff: 0.3]
>>       [Ext Root Scanning (ms): Min: 0.4, Avg: 0.6, Max: 1.0, Diff: 0.6, Sum: 21.8]
>>       [Update RS (ms): Min: 2.1, Avg: 4.2, Max: 5.3, Diff: 3.3, Sum: 158.3]
>>          [Processed Buffers: Min: 1, Avg: 6.0, Max: 23, Diff: 22, Sum: 228]
>>       [Scan RS (ms): Min: 0.1, Avg: 0.5, Max: 0.9, Diff: 0.8, Sum: 18.6]
>>       [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
>>       [Object Copy (ms): Min: 25.9, Avg: 26.8, Max: 28.5, Diff: 2.6, Sum: 1017.5]
>>       [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
>>          [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 38]
>>       [GC Worker Other (ms): Min: 0.0, Avg: 0.2, Max: 0.5, Diff: 0.4, Sum: 8.9]
>>       [GC Worker Total (ms): Min: 32.0, Avg: 32.3, Max: 32.5, Diff: 0.6, Sum: 1225.7]
>>       [GC Worker End (ms): Min: 9199247.6, Avg: 9199247.8, Max: 9199248.0, Diff: 0.4]
>>    [Code Root Fixup: 0.2 ms]
>>    [Code Root Purge: 0.0 ms]
>>    [Clear CT: 2.7 ms]
>>    [Other: 6.1 ms]
>>       [Choose CSet: 0.0 ms]
>>       [Ref Proc: 0.6 ms]
>>       [Ref Enq: 0.0 ms]
>>       [Redirty Cards: 0.4 ms]
>>       [Humongous Register: 0.1 ms]
>>       [Humongous Reclaim: 0.0 ms]
>>       [Free CSet: 4.4 ms]
>>    [Eden: 18.4G(18.4G)->0.0B(18.5G) Survivors: 400.0M->232.0M Heap:
>> 21.6G(31.2G)->3110.7M(31.2G)]
>>  [Times: user=1.28 sys=0.01, real=0.04 secs]
>> Heap
>>  garbage-first heap   total 32768000K, used 14908229K [0x00007f9b38000000,
>> 0x00007f9b38807d00, 0x00007fa308000000)
>>   region size 8192K, 1458 young (11943936K), 29 survivors (237568K)
>>  Metaspace       used 43388K, capacity 44288K, committed 44492K, reserved 1089536K
>>   class space    used 4241K, capacity 4487K, committed 4556K, reserved 1048576K
>>
>> Everything looks normal for this GC, but it happened at 8:38 and the JVM
>> freeze happened at around 8:40.
>> This picture shows what the Java heap looked like in the last minutes
>> before the node went down (please add four hours to match the log time,
>> because of the different timezone):
>> Screen_Shot_2018-06-08_at_17.png
>> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/Screen_Shot_2018-06-08_at_17.png>
>>
>> So my questions are: why did this node go down?
>> What was the JVM doing in the last minutes before the node went down?
>> Why aren't there any logs for those last minutes?
>> And why isn't segmentationPolicy=RESTART_JVM working?
>>
>> --
>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>
> --
> Best regards,
> Andrey V. Mashenkov

--
Best regards,
Andrey V. Mashenkov