Possibly there is a race. I've created a ticket for it [1].

[1] https://issues.apache.org/jira/browse/IGNITE-8751
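
Until that is sorted out, a possible workaround (just a sketch, assuming the node
is started via ignite.sh and the paths match your setup) is to run the node under
a small supervisor loop, or a systemd unit with Restart=always, so the process
comes back even when the JVM is halted instead of restarted:

#!/bin/bash
# naive supervisor: bring the node back whenever the JVM exits or is halted
while true; do
  ./ignite.sh config/persistent-config.xml
  echo "Ignite exited with code $?; restarting in 10 seconds..."
  sleep 10
done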

On Fri, Jun 8, 2018 at 4:56 PM, Andrey Mashenkov <andrey.mashen...@gmail.com
> wrote:

> Hi,
>
> Looks like the node was segmented due to a long JVM pause.
> There are 2 "long JVM pause" messages and a suspiciously long
> checkpoint:
>
> Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
> pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
> len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
> fsync=965250ms, total=978172ms]
>
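> Rough math here (assuming the default 4 KB page size): 2,050,858 pages is
> roughly 8 GB of data, and fsync alone took ~965 seconds, which is well under
> 10 MB/s of effective flush throughput. That usually points at a slow or
> overloaded disk rather than at Ignite itself.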
>
> Do you use an HDD? Are you sure no swapping is happening?
> Do you have enough free space on disk?
> Is there any other heavy process on the server that may take too much CPU
> time, so that the JVM was descheduled from the CPU for too long?
>
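> A few quick checks you could run on that node (a rough sketch; the paths are
> placeholders, adjust them to wherever your Ignite work directory and WAL live):
>
> # disk utilization and latency on the device hosting the Ignite work dir
> iostat -x 1
>
> # swapping activity (si/so columns should stay at 0) and free memory
> vmstat 1
> free -m
>
> # free space on the persistence/WAL partition
> df -h /path/to/ignite/work
>
> # rough synchronous-write test on the same disk (~400 MB, remove the file after)
> dd if=/dev/zero of=/path/to/ignite/work/ddtest bs=4k count=100000 oflag=dsync
> rm /path/to/ignite/work/ddtest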
>
> It looks weird that the JVM was not restarted. We have to check such a case.
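>
> Also, since the 68-second pause does not show up in the GC log at all, it may
> be worth logging safepoint and application-stopped time next time, so that
> non-GC pauses become visible too. A sketch on top of the JDK 8 flags you are
> already passing (keep your existing -J options as well):
>
> ./ignite.sh -J-XX:+PrintGCApplicationStoppedTime \
>             -J-XX:+PrintSafepointStatistics \
>             -J-XX:PrintSafepointStatisticsCount=1 \
>             config/persistent-config.xml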
>
> On Fri, Jun 8, 2018 at 12:32 PM, Ray <ray...@cisco.com> wrote:
>
>> I set up a six-node Ignite cluster to test performance and stability.
>> Here's my setup.
>>     <bean id="grid.cfg"
>> class="org.apache.ignite.configuration.IgniteConfiguration">
>>         <property name="segmentationPolicy" value="RESTART_JVM"/>
>>         <property name="peerClassLoadingEnabled" value="true"/>
>>         <property name="failureDetectionTimeout" value="60000"/>
>>         <property name="dataStorageConfiguration">
>>             <bean
>> class="org.apache.ignite.configuration.DataStorageConfiguration">
>>             <property name="defaultDataRegionConfiguration">
>>                 <bean
>> class="org.apache.ignite.configuration.DataRegionConfiguration">
>>                     <property name="name" value="default_Region"/>
>>                     <property name="initialSize" value="#{100L * 1024 *
>> 1024
>> * 1024}"/>
>>                     <property name="maxSize" value="#{500L * 1024 * 1024 *
>> 1024}"/>
>>                     <property name="persistenceEnabled" value="true"/>
>>                     <property name="checkpointPageBufferSize"
>> value="#{2L *
>> 1024 * 1024 * 1024}"/>
>>                 </bean>
>>             </property>
>>             <property name="walMode" value="BACKGROUND"/>
>>             <property name="walFlushFrequency" value="5000"/>
>>             <property name="checkpointFrequency" value="600000"/>
>>       <property name="writeThrottlingEnabled" value="true"/>
>>
>> And I used this command to start the Ignite node.
>> ./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC \
>>   -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch \
>>   -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps \
>>   -J-XX:+PrintAdaptiveSizePolicy \
>>   -J-Xloggc:/ignitegc-$(date +%Y_%m_%d-%H_%M).log \
>>   config/persistent-config.xml
>>
>> One of the nodes just dropped from the topology. Here's the log for the
>> last three minutes before this node went down.
>> [08:39:58,982][INFO][grid-timeout-worker-#119][IgniteKernal]
>> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>>     ^-- Node [id=8333aa56, uptime=02:34:01.948]
>>     ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
>>     ^-- CPU [cur=41%, avg=33.18%, GC=0%]
>>     ^-- PageMemory [pages=8912687]
>>     ^-- Heap [used=8942MB, free=72.05%, comm=32000MB]
>>     ^-- Non heap [used=70MB, free=95.35%, comm=73MB]
>>     ^-- Outbound messages queue [size=0]
>>     ^-- Public thread pool [active=0, idle=0, qSize=0]
>>     ^-- System thread pool [active=0, idle=6, qSize=0]
>> [08:40:51,945][INFO][db-checkpoint-thread-#178][GridCacheDatabaseSharedManager]
>> Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
>> pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
>> len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
>> fsync=965250ms, total=978172ms]
>> [08:40:53,086][INFO][db-checkpoint-thread-#178][GridCacheDatabaseSharedManager]
>> Checkpoint started [checkpointId=14d929ac-1b5c-4df2-a71f-002d5eb41f14,
>> startPtr=FileWALPointer [idx=2242, fileOff=65211837, len=15981],
>> checkpointLockWait=0ms, checkpointLockHoldTime=39ms,
>> walCpRecordFsyncDuration=720ms, pages=2110545, reason='timeout']
>> [08:40:57,793][INFO][data-streamer-stripe-1-#58][PageMemoryImpl]
>> Throttling is applied to page modifications [percentOfPartTime=0.22, markDirty=7192
>> pages/sec, checkpointWrite=2450 pages/sec, estIdealMarkDirty=139543
>> pages/sec, curDirty=0.00, maxDirty=0.17, avgParkTime=1732784 ns, pages:
>> (total=2110545, evicted=0, written=875069, synced=0, cpBufUsed=92,
>> cpBufTotal=518215)]
>> [08:40:58,991][INFO][grid-timeout-worker-#119][IgniteKernal]
>> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>>     ^-- Node [id=8333aa56, uptime=02:35:01.957]
>>     ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
>>     ^-- CPU [cur=9.3%, avg=33%, GC=0%]
>>     ^-- PageMemory [pages=8920631]
>>     ^-- Heap [used=13262MB, free=58.55%, comm=32000MB]
>>     ^-- Non heap [used=70MB, free=95.34%, comm=73MB]
>>     ^-- Outbound messages queue [size=0]
>>     ^-- Public thread pool [active=0, idle=0, qSize=0]
>>     ^-- System thread pool [active=0, idle=6, qSize=0]
>> [08:41:29,050][WARNING][jvm-pause-detector-worker][] Possible too long JVM
>> pause: 22667 milliseconds.
>> [08:41:29,050][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Finished
>> serving remote node connection [rmtAddr=/10.29.41.23:32815, rmtPort=32815
>> [08:41:29,052][INFO][tcp-disco-sock-reader-#13][TcpDiscoverySpi] Finished
>> serving remote node connection [rmtAddr=/10.29.41.25:46515, rmtPort=46515
>> [08:41:30,063][INFO][data-streamer-stripe-3-#60][PageMemoryImpl]
>> Throttling is applied to page modifications [percentOfPartTime=0.49, markDirty=26945
>> pages/sec, checkpointWrite=2612 pages/sec, estIdealMarkDirty=210815
>> pages/sec, curDirty=0.00, maxDirty=0.34, avgParkTime=1024456 ns, pages:
>> (total=2110545, evicted=0, written=1861330, synced=0, cpBufUsed=8657,
>> cpBufTotal=518215)]
>> [08:42:42,276][WARNING][jvm-pause-detector-worker][] Possible too long JVM
>> pause: 67967 milliseconds.
>> [08:42:42,277][INFO][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Local node
>> seems to be disconnected from topology (failure detection timeout is
>> reached) [failureDetectionTimeout=60000, connCheckFreq=20000]
>> [08:42:42,280][INFO][tcp-disco-sock-reader-#10][TcpDiscoverySpi] Finished
>> serving remote node connection [rmtAddr=/10.29.42.49:36509, rmtPort=36509
>> [08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
>> accepted incoming connection [rmtAddr=/10.29.42.45, rmtPort=38712]
>> [08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
>> spawning a new thread for connection [rmtAddr=/10.29.42.45,
>> rmtPort=38712]
>> [08:42:42,287][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Started
>> serving remote node connection [rmtAddr=/10.29.42.45:38712,
>> rmtPort=38712]
>> [08:42:42,289][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Node is
>> out of topology (probably, due to short-time network problems).
>> [08:42:42,290][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Finished
>> serving remote node connection [rmtAddr=/10.29.42.45:38712, rmtPort=38712
>> [08:42:42,290][WARNING][disco-event-worker-#161][GridDiscoveryManager]
>> Local node SEGMENTED: TcpDiscoveryNode [id=8333aa56-8bf4-4558-a387-809b1d2e2e5b,
>> addrs=[10.29.42.44, 127.0.0.1], sockAddrs=[sap-datanode1/10.29.42.44:49500,
>> /127.0.0.1:49500], discPort=49500, order=1, intOrder=1,
>> lastExchangeTime=1528447362286, loc=true, ver=2.5.0#20180523-sha1:86e110c7,
>> isClient=false]
>> [08:42:42,294][SEVERE][tcp-disco-srvr-#2][] Critical system error detected.
>> Will be handled accordingly to configured handler [hnd=class
>> o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext
>> [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread
>> tcp-disco-srvr-#2 is terminated unexpectedly.]]
>> java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.
>>         at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5686)
>>         at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
>> [08:42:42,294][SEVERE][tcp-disco-srvr-#2][] JVM will be halted immediately
>> due to the failure: [failureCtx=FailureContext
>> [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread
>> tcp-disco-srvr-#2 is terminated unexpectedly.]]
>>
>> As you can see from the log, the JVM paused for 67967 milliseconds, so that
>> is the reason why this node went down.
>> But I looked at the GC log; here is the last GC entry.
>>
>> 2018-06-08T08:38:56.654+0000: 9199.215: [GC pause (G1 Evacuation Pause) (young)
>>  9199.215: [G1Ergonomics (CSet Construction) start choosing CSet,
>> _pending_cards: 32828, predicted base time: 18.85 ms, remaining time: 181.15 ms,
>> target pause time: 200.00 ms]
>>  9199.215: [G1Ergonomics (CSet Construction) add young regions to CSet,
>> eden: 2350 regions, survivors: 50 regions, predicted young region time: 79.31 ms]
>>  9199.215: [G1Ergonomics (CSet Construction) finish choosing CSet,
>> eden: 2350 regions, survivors: 50 regions, old: 0 regions,
>> predicted pause time: 98.16 ms, target pause time: 200.00 ms]
>> , 0.0417411 secs]
>>    [Parallel Time: 32.8 ms, GC Workers: 38]
>>       [GC Worker Start (ms): Min: 9199215.4, Avg: 9199215.5, Max: 9199215.7, Diff: 0.3]
>>       [Ext Root Scanning (ms): Min: 0.4, Avg: 0.6, Max: 1.0, Diff: 0.6, Sum: 21.8]
>>       [Update RS (ms): Min: 2.1, Avg: 4.2, Max: 5.3, Diff: 3.3, Sum: 158.3]
>>          [Processed Buffers: Min: 1, Avg: 6.0, Max: 23, Diff: 22, Sum: 228]
>>       [Scan RS (ms): Min: 0.1, Avg: 0.5, Max: 0.9, Diff: 0.8, Sum: 18.6]
>>       [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
>>       [Object Copy (ms): Min: 25.9, Avg: 26.8, Max: 28.5, Diff: 2.6, Sum: 1017.5]
>>       [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
>>          [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 38]
>>       [GC Worker Other (ms): Min: 0.0, Avg: 0.2, Max: 0.5, Diff: 0.4, Sum: 8.9]
>>       [GC Worker Total (ms): Min: 32.0, Avg: 32.3, Max: 32.5, Diff: 0.6, Sum: 1225.7]
>>       [GC Worker End (ms): Min: 9199247.6, Avg: 9199247.8, Max: 9199248.0, Diff: 0.4]
>>    [Code Root Fixup: 0.2 ms]
>>    [Code Root Purge: 0.0 ms]
>>    [Clear CT: 2.7 ms]
>>    [Other: 6.1 ms]
>>       [Choose CSet: 0.0 ms]
>>       [Ref Proc: 0.6 ms]
>>       [Ref Enq: 0.0 ms]
>>       [Redirty Cards: 0.4 ms]
>>       [Humongous Register: 0.1 ms]
>>       [Humongous Reclaim: 0.0 ms]
>>       [Free CSet: 4.4 ms]
>>    [Eden: 18.4G(18.4G)->0.0B(18.5G) Survivors: 400.0M->232.0M
>>     Heap: 21.6G(31.2G)->3110.7M(31.2G)]
>>  [Times: user=1.28 sys=0.01, real=0.04 secs]
>> Heap
>>  garbage-first heap   total 32768000K, used 14908229K
>>   [0x00007f9b38000000, 0x00007f9b38807d00, 0x00007fa308000000)
>>   region size 8192K, 1458 young (11943936K), 29 survivors (237568K)
>>  Metaspace       used 43388K, capacity 44288K, committed 44492K, reserved 1089536K
>>   class space    used 4241K, capacity 4487K, committed 4556K, reserved 1048576K
>>
>>
>> Everything looks normal for this GC, but this GC happened at 8:38 and the
>> JVM freeze happened at around 8:40.
>> And this picture shows what the Java heap looked like in the last minutes
>> before the node went down. (Please add four hours to match the log time,
>> because of the different timezone.)
>> Screen_Shot_2018-06-08_at_17.png
>> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/
>> Screen_Shot_2018-06-08_at_17.png>
>>
>> So my questions are: why did this node go down?
>> What was the JVM doing in the last minutes before the node went down?
>> Why aren't there any logs for those last minutes?
>> And why isn't the segmentationPolicy RESTART_JVM working?
>>
>>
>>
>> --
>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>>
>
>
>
> --
> Best regards,
> Andrey V. Mashenkov
>



-- 
Best regards,
Andrey V. Mashenkov
