Re: Node pause for no obvious reason

2018-06-14 Thread dkarachentsev
Hi,

Check system logs for that time, maybe there was some system freeze and add
more information in GC logs, for example safepoints:
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime.

Thanks!
-Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Node pause for no obvious reason

2018-06-08 Thread Ray
I didn't have the dstat logs.
But I think these charts is the same as dstats logs.
CPU usage
CPU.png   
Memory usage
Memory.png
  
Swap
swap.png
  
Server load
load.png
  
Disk
disk.png
  

We're using bare metal servers for Ignite, each server has 768Gb memory and
56 core CPU.
One node per server.
I don't think hardware is the bottleneck here.

Will enable DEBUG log help in this case?
In my understanding, if JVM is in pause state Ignite can't produce any logs
even DEBUG level log is enabled, right?



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Node pause for no obvious reason

2018-06-08 Thread Andrey Mashenkov
Checkpoint can block other threads only if there another checkpoint should
be started.
But seems checkpoint is not the cause.

It is hard to tell what was going on server.  May be you have any dstat
logs?

How many nodes to you start per machine?
Do you use docker or any VM?

We have bad experience of running multiple ignite nodes on same physical
machine (docker and VM) under high load due to concurrency for resources.
You can try to either switch to 1 ignite instance per machine or map nodes
JVM to CPU or reduce ignite thread pool sizes or tune VM nodes quotas.

On Fri, Jun 8, 2018 at 7:01 PM, Ray  wrote:

> Hi,
>
> Please see the GC log and the picture I attached.
> Looks like the GC is not taking a very long time.
>
> Yes, the checkpoint is taking a long time to finish.
> Could it be the checkpoint thread has something to do with the node crash?
> In my understanding, the checkpoint will not block other threads, right?
>
> Yes, I'm using HDD and there's plenty of free space on the disc.
> I disabled swapping using sysctl –w vm.swappiness=0 as this document says
> https://apacheignite.readme.io/docs/durable-memory-tuning#
> section-adjust-swappiness-settings/
> The CPU usage is normal when the node goes down.
>
> What's weird about this is that the other 5 nodes have the same
> configuration and under the same heavy write circumstance and they didn't
> go
> down.
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>



-- 
Best regards,
Andrey V. Mashenkov


Re: Node pause for no obvious reason

2018-06-08 Thread Ray
Hi, 

Please see the GC log and the picture I attached.
Looks like the GC is not taking a very long time.

Yes, the checkpoint is taking a long time to finish.
Could it be the checkpoint thread has something to do with the node crash?
In my understanding, the checkpoint will not block other threads, right?

Yes, I'm using HDD and there's plenty of free space on the disc.
I disabled swapping using sysctl –w vm.swappiness=0 as this document says
https://apacheignite.readme.io/docs/durable-memory-tuning#section-adjust-swappiness-settings/
The CPU usage is normal when the node goes down.

What's weird about this is that the other 5 nodes have the same
configuration and under the same heavy write circumstance and they didn't go
down.




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Node pause for no obvious reason

2018-06-08 Thread Andrey Mashenkov
Possibly there is a race.
I've created a ticket for this [1]


[1] https://issues.apache.org/jira/browse/IGNITE-8751

On Fri, Jun 8, 2018 at 4:56 PM, Andrey Mashenkov  wrote:

> Hi,
>
> Looks like node was segmented due to long JVM pause.
> There are 2 "long JVM pause" messages in long an suspicious long
> checkpoint:
>
> Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
> pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
> len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
> fsync=965250ms, total=978172ms]
>
>
> Do you use HDD? Are you sure there is no swapping happens?
> Do you have enough free space on disk?
> Is there any other heavy process on server that may took too much CPU time
> and JVM was descheduled from CPU for too long?
>
>
> It looks weird JVM was not restarted. We've to check such case.
>
> On Fri, Jun 8, 2018 at 12:32 PM, Ray  wrote:
>
>> I setup a six node Ignite cluster to test the performance and stability.
>> Here's my setup.
>> > class="org.apache.ignite.configuration.IgniteConfiguration">
>> 
>> 
>> 
>> 
>> > class="org.apache.ignite.configuration.DataStorageConfiguration">
>> 
>> > class="org.apache.ignite.configuration.DataRegionConfiguration">
>> 
>> 
>> 
>> 
>> > value="#{2L *
>> 1024 * 1024 * 1024}"/>
>> 
>> 
>> 
>> 
>> 
>>   
>>
>> And I used this command to start the Ignite node.
>> ./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC
>> -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch
>> -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps
>> -J-XX:+PrintAdaptiveSizePolicy -J-Xloggc:/ignitegc-$(date
>> +%Y_%m_%d-%H_%M).log  config/persistent-config.xml
>>
>> One of the node just dropped from the topology. Here's the log for last
>> three minutes before this node going down.
>> [08:39:58,982][INFO][grid-timeout-worker-#119][IgniteKernal]
>> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>> ^-- Node [id=8333aa56, uptime=02:34:01.948]
>> ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
>> ^-- CPU [cur=41%, avg=33.18%, GC=0%]
>> ^-- PageMemory [pages=8912687]
>> ^-- Heap [used=8942MB, free=72.05%, comm=32000MB]
>> ^-- Non heap [used=70MB, free=95.35%, comm=73MB]
>> ^-- Outbound messages queue [size=0]
>> ^-- Public thread pool [active=0, idle=0, qSize=0]
>> ^-- System thread pool [active=0, idle=6, qSize=0]
>> [08:40:51,945][INFO][db-checkpoint-thread-#178][GridCacheDat
>> abaseSharedManager]
>> Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
>> pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
>> len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
>> fsync=965250ms, total=978172ms]
>> [08:40:53,086][INFO][db-checkpoint-thread-#178][GridCacheDat
>> abaseSharedManager]
>> Checkpoint started [checkpointId=14d929ac-1b5c-4df2-a71f-002d5eb41f14,
>> startPtr=FileWALPointer [idx=2242, fileOff=65211837, len=15981],
>> checkpointLockWait=0ms, checkpointLockHoldTime=39ms,
>> walCpRecordFsyncDuration=720ms, pages=2110545, reason='timeout']
>> [08:40:57,793][INFO][data-streamer-stripe-1-#58][PageMemoryImpl]
>> Throttling
>> is applied to page modifications [percentOfPartTime=0.22, markDirty=7192
>> pages/sec, checkpointWrite=2450 pages/sec, estIdealMarkDirty=139543
>> pages/sec, curDirty=0.00, maxDirty=0.17, avgParkTime=1732784 ns, pages:
>> (total=2110545, evicted=0, written=875069, synced=0, cpBufUsed=92,
>> cpBufTotal=518215)]
>> [08:40:58,991][INFO][grid-timeout-worker-#119][IgniteKernal]
>> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>> ^-- Node [id=8333aa56, uptime=02:35:01.957]
>> ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
>> ^-- CPU [cur=9.3%, avg=33%, GC=0%]
>> ^-- PageMemory [pages=8920631]
>> ^-- Heap [used=13262MB, free=58.55%, comm=32000MB]
>> ^-- Non heap [used=70MB, free=95.34%, comm=73MB]
>> ^-- Outbound messages queue [size=0]
>> ^-- Public thread pool [active=0, idle=0, qSize=0]
>> ^-- System thread pool [active=0, idle=6, qSize=0]
>> [08:41:29,050][WARNING][jvm-pause-detector-worker][] Possible too long
>> JVM
>> pause: 22667 milliseconds.
>> [08:41:29,050][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Finished
>> serving remote node connection [rmtAddr=/10.29.41.23:32815, rmtPort=32815
>> [08:41:29,052][INFO][tcp-disco-sock-reader-#13][TcpDiscoverySpi] Finished
>> serving remote node connection [rmtAddr=/10.29.41.25:46515, rmtPort=46515
>> [08:41:30,063][INFO][data-streamer-stripe-3-#60][PageMemoryImpl]
>> Throttling
>> is applied to page modifications [percentOfPartTime=0.49, markDirty=26945
>> pages/sec, checkpointWrite=2612 pages/sec, estIdealMarkDirty=210815
>> pages/sec,

Re: Node pause for no obvious reason

2018-06-08 Thread Andrey Mashenkov
Hi,

Looks like node was segmented due to long JVM pause.
There are 2 "long JVM pause" messages in long an suspicious long checkpoint:

Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
fsync=965250ms, total=978172ms]


Do you use HDD? Are you sure there is no swapping happens?
Do you have enough free space on disk?
Is there any other heavy process on server that may took too much CPU time
and JVM was descheduled from CPU for too long?


It looks weird JVM was not restarted. We've to check such case.

On Fri, Jun 8, 2018 at 12:32 PM, Ray  wrote:

> I setup a six node Ignite cluster to test the performance and stability.
> Here's my setup.
>  class="org.apache.ignite.configuration.IgniteConfiguration">
> 
> 
> 
> 
>  class="org.apache.ignite.configuration.DataStorageConfiguration">
> 
>  class="org.apache.ignite.configuration.DataRegionConfiguration">
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>   
>
> And I used this command to start the Ignite node.
> ./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC
> -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch
> -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps
> -J-XX:+PrintAdaptiveSizePolicy -J-Xloggc:/ignitegc-$(date
> +%Y_%m_%d-%H_%M).log  config/persistent-config.xml
>
> One of the node just dropped from the topology. Here's the log for last
> three minutes before this node going down.
> [08:39:58,982][INFO][grid-timeout-worker-#119][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
> ^-- Node [id=8333aa56, uptime=02:34:01.948]
> ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
> ^-- CPU [cur=41%, avg=33.18%, GC=0%]
> ^-- PageMemory [pages=8912687]
> ^-- Heap [used=8942MB, free=72.05%, comm=32000MB]
> ^-- Non heap [used=70MB, free=95.35%, comm=73MB]
> ^-- Outbound messages queue [size=0]
> ^-- Public thread pool [active=0, idle=0, qSize=0]
> ^-- System thread pool [active=0, idle=6, qSize=0]
> [08:40:51,945][INFO][db-checkpoint-thread-#178][
> GridCacheDatabaseSharedManager]
> Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
> pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
> len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
> fsync=965250ms, total=978172ms]
> [08:40:53,086][INFO][db-checkpoint-thread-#178][
> GridCacheDatabaseSharedManager]
> Checkpoint started [checkpointId=14d929ac-1b5c-4df2-a71f-002d5eb41f14,
> startPtr=FileWALPointer [idx=2242, fileOff=65211837, len=15981],
> checkpointLockWait=0ms, checkpointLockHoldTime=39ms,
> walCpRecordFsyncDuration=720ms, pages=2110545, reason='timeout']
> [08:40:57,793][INFO][data-streamer-stripe-1-#58][PageMemoryImpl]
> Throttling
> is applied to page modifications [percentOfPartTime=0.22, markDirty=7192
> pages/sec, checkpointWrite=2450 pages/sec, estIdealMarkDirty=139543
> pages/sec, curDirty=0.00, maxDirty=0.17, avgParkTime=1732784 ns, pages:
> (total=2110545, evicted=0, written=875069, synced=0, cpBufUsed=92,
> cpBufTotal=518215)]
> [08:40:58,991][INFO][grid-timeout-worker-#119][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
> ^-- Node [id=8333aa56, uptime=02:35:01.957]
> ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
> ^-- CPU [cur=9.3%, avg=33%, GC=0%]
> ^-- PageMemory [pages=8920631]
> ^-- Heap [used=13262MB, free=58.55%, comm=32000MB]
> ^-- Non heap [used=70MB, free=95.34%, comm=73MB]
> ^-- Outbound messages queue [size=0]
> ^-- Public thread pool [active=0, idle=0, qSize=0]
> ^-- System thread pool [active=0, idle=6, qSize=0]
> [08:41:29,050][WARNING][jvm-pause-detector-worker][] Possible too long JVM
> pause: 22667 milliseconds.
> [08:41:29,050][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Finished
> serving remote node connection [rmtAddr=/10.29.41.23:32815, rmtPort=32815
> [08:41:29,052][INFO][tcp-disco-sock-reader-#13][TcpDiscoverySpi] Finished
> serving remote node connection [rmtAddr=/10.29.41.25:46515, rmtPort=46515
> [08:41:30,063][INFO][data-streamer-stripe-3-#60][PageMemoryImpl]
> Throttling
> is applied to page modifications [percentOfPartTime=0.49, markDirty=26945
> pages/sec, checkpointWrite=2612 pages/sec, estIdealMarkDirty=210815
> pages/sec, curDirty=0.00, maxDirty=0.34, avgParkTime=1024456 ns, pages:
> (total=2110545, evicted=0, written=1861330, synced=0, cpBufUsed=8657,
> cpBufTotal=518215)]
> [08:42:42,276][WARNING][jvm-pause-detector-worker][] Possible too long JVM
> pause: 67967 milliseconds.
> [08:42:42,277][INFO][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Local node
> se

Node pause for no obvious reason

2018-06-08 Thread Ray
I setup a six node Ignite cluster to test the performance and stability.
Here's my setup.


















  

And I used this command to start the Ignite node.
./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC
-J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch
-J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps
-J-XX:+PrintAdaptiveSizePolicy -J-Xloggc:/ignitegc-$(date
+%Y_%m_%d-%H_%M).log  config/persistent-config.xml

One of the node just dropped from the topology. Here's the log for last
three minutes before this node going down.
[08:39:58,982][INFO][grid-timeout-worker-#119][IgniteKernal]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=8333aa56, uptime=02:34:01.948]
^-- H/N/C [hosts=9, nodes=16, CPUs=552]
^-- CPU [cur=41%, avg=33.18%, GC=0%]
^-- PageMemory [pages=8912687]
^-- Heap [used=8942MB, free=72.05%, comm=32000MB]
^-- Non heap [used=70MB, free=95.35%, comm=73MB]
^-- Outbound messages queue [size=0]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=6, qSize=0]
[08:40:51,945][INFO][db-checkpoint-thread-#178][GridCacheDatabaseSharedManager]
Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
fsync=965250ms, total=978172ms]
[08:40:53,086][INFO][db-checkpoint-thread-#178][GridCacheDatabaseSharedManager]
Checkpoint started [checkpointId=14d929ac-1b5c-4df2-a71f-002d5eb41f14,
startPtr=FileWALPointer [idx=2242, fileOff=65211837, len=15981],
checkpointLockWait=0ms, checkpointLockHoldTime=39ms,
walCpRecordFsyncDuration=720ms, pages=2110545, reason='timeout']
[08:40:57,793][INFO][data-streamer-stripe-1-#58][PageMemoryImpl] Throttling
is applied to page modifications [percentOfPartTime=0.22, markDirty=7192
pages/sec, checkpointWrite=2450 pages/sec, estIdealMarkDirty=139543
pages/sec, curDirty=0.00, maxDirty=0.17, avgParkTime=1732784 ns, pages:
(total=2110545, evicted=0, written=875069, synced=0, cpBufUsed=92,
cpBufTotal=518215)]
[08:40:58,991][INFO][grid-timeout-worker-#119][IgniteKernal]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=8333aa56, uptime=02:35:01.957]
^-- H/N/C [hosts=9, nodes=16, CPUs=552]
^-- CPU [cur=9.3%, avg=33%, GC=0%]
^-- PageMemory [pages=8920631]
^-- Heap [used=13262MB, free=58.55%, comm=32000MB]
^-- Non heap [used=70MB, free=95.34%, comm=73MB]
^-- Outbound messages queue [size=0]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=6, qSize=0]
[08:41:29,050][WARNING][jvm-pause-detector-worker][] Possible too long JVM
pause: 22667 milliseconds.
[08:41:29,050][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Finished
serving remote node connection [rmtAddr=/10.29.41.23:32815, rmtPort=32815
[08:41:29,052][INFO][tcp-disco-sock-reader-#13][TcpDiscoverySpi] Finished
serving remote node connection [rmtAddr=/10.29.41.25:46515, rmtPort=46515
[08:41:30,063][INFO][data-streamer-stripe-3-#60][PageMemoryImpl] Throttling
is applied to page modifications [percentOfPartTime=0.49, markDirty=26945
pages/sec, checkpointWrite=2612 pages/sec, estIdealMarkDirty=210815
pages/sec, curDirty=0.00, maxDirty=0.34, avgParkTime=1024456 ns, pages:
(total=2110545, evicted=0, written=1861330, synced=0, cpBufUsed=8657,
cpBufTotal=518215)]
[08:42:42,276][WARNING][jvm-pause-detector-worker][] Possible too long JVM
pause: 67967 milliseconds.
[08:42:42,277][INFO][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Local node
seems to be disconnected from topology (failure detection timeout is
reached) [failureDetectionTimeout=6, connCheckFreq=2]
[08:42:42,280][INFO][tcp-disco-sock-reader-#10][TcpDiscoverySpi] Finished
serving remote node connection [rmtAddr=/10.29.42.49:36509, rmtPort=36509
[08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
accepted incoming connection [rmtAddr=/10.29.42.45, rmtPort=38712]
[08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
spawning a new thread for connection [rmtAddr=/10.29.42.45, rmtPort=38712]
[08:42:42,287][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Started
serving remote node connection [rmtAddr=/10.29.42.45:38712, rmtPort=38712]
[08:42:42,289][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Node is
out of topology (probably, due to short-time network problems).
[08:42:42,290][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Finished
serving remote node connection [rmtAddr=/10.29.42.45:38712, rmtPort=38712
[08:42:42,290][WARNING][disco-event-worker-#161][GridDiscoveryManager] Local
node S