Re: node down after Caught unhandled exception in NIO worker thread (restart the node) log

2019-12-02 Thread Ilya Kasnacheev
Hello!

Well, you have an awful lot of pool starvation messages in your server
logs. This may suggest problems with network or the load.

You also hit a long GC:
[2019-11-22T21:19:45,964][WARN ][jvm-pause-detector-worker][IgniteKernal]
Possible too long JVM pause: 15252 milliseconds.

Which was accompanied by a lot of connection re-establishment:

[2019-11-22T21:19:49,142][WARN
][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Communication SPI
session write timed out (consider increasing 'socketWriteTimeout'
configuration property) [remoteAddr=/192.168.199.122:9508,
writeTimeout=2000]
[2019-11-22T21:19:49,352][INFO
][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming
communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/
192.168.199.116:43306]
[2019-11-22T21:19:49,353][INFO
][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming
communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/
192.168.199.107:44274]
[2019-11-22T21:19:49,353][INFO
][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming
communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/
192.168.199.57:5130]
[2019-11-22T21:19:49,353][INFO
][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming
communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/
192.168.199.112:5766]
[2019-11-22T21:19:49,353][INFO
][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming
communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/
192.168.199.114:64992]
[2019-11-22T21:19:49,353][INFO
][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming
communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/
192.168.199.108:28644]
[2019-11-22T21:19:49,353][INFO
][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming
communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/
192.168.199.124:47200]
[2019-11-22T21:19:49,140][INFO
][grid-nio-worker-tcp-comm-1-#201][TcpCommunicationSpi] Established
outgoing communication connection [locAddr=/192.168.199.60:52484, rmtAddr=/
192.168.199.65:47100]
[2019-11-22T21:19:49,124][INFO
][grid-nio-worker-tcp-comm-0-#200][TcpCommunicationSpi] Accepted incoming
communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/
192.168.199.106:13940]

It is recommended to adjust your timeouts (or heap sizes) so that long GC
does not cause all communication connections to be severed.

Regards,
-- 
Ilya Kasnacheev


пн, 25 нояб. 2019 г. в 16:57, ihalilaltun :

> Hi Igniters,
>
> We had a strange node-down incident after getting following log (we've been
> using ignite in production for almost 1 year and we're getting this error
> for the first time)
>
> [2019-11-22T21:19:54,222][INFO
> ][grid-nio-worker-tcp-comm-3-#203][TcpCommunicationSpi] Established
> outgoing
> communication connection [locAddr=/192.168.199.60:43720,
> rmtAddr=/192.168.199.222:47100]
>
> [2019-11-22T21:19:54,230][ERROR][grid-nio-worker-tcp-comm-0-#200][TcpCommunicationSpi]
> Caught unhandled exception in NIO worker thread (restart the node).
> java.nio.channels.CancelledKeyException: null
> at
> sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
> ~[?:1.8.0_201]
> at
> sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
> ~[?:1.8.0_201]
> at
>
> java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:204)
> ~[?:1.8.0_201]
> at
>
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1997)
> ~[ignite-core-2.7.6.jar:2.7.6]
> at
>
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
> [ignite-core-2.7.6.jar:2.7.6]
> at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
> [ignite-core-2.7.6.jar:2.7.6]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
>
> [2019-11-22T21:19:54,343][ERROR][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi]
> Failed to process selector key [ses=GridSelectorNioSessionImpl
> [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=2,
> bytesRcvd=617063277634, bytesSent=8878293076427, bytesRcvd0=107695,
> bytesSent0=727192, select=true, super=GridWorker
> [name=grid-nio-worker-tcp-comm-2, igniteInstanceName=null, finished=false,
> heartbeatTs=1574457593322, hashCode=1772114147, interrupted=false,
> runner=grid-nio-worker-tcp-comm-2-#202]]],
> writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
> readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
> inRecovery=null, outRecovery=null, super=GridNioSessionImpl
> [locAddr=/192.168.199.60:47100, rmtAddr=/192.168.199.68:62054,
> createTime=1574457593307, closeTime=0, bytesSent=38, bytesRcvd=42,
> bytesSent0=38, bytesRcvd0=42, sndSchedTime=1574457593322,
> lastSndTime=1574457593322, lastRcvTime=1574457593322, 

node down after Caught unhandled exception in NIO worker thread (restart the node) log

2019-11-25 Thread ihalilaltun
Hi Igniters,

We had a strange node-down incident after getting following log (we've been
using ignite in production for almost 1 year and we're getting this error
for the first time)

[2019-11-22T21:19:54,222][INFO
][grid-nio-worker-tcp-comm-3-#203][TcpCommunicationSpi] Established outgoing
communication connection [locAddr=/192.168.199.60:43720,
rmtAddr=/192.168.199.222:47100]
[2019-11-22T21:19:54,230][ERROR][grid-nio-worker-tcp-comm-0-#200][TcpCommunicationSpi]
Caught unhandled exception in NIO worker thread (restart the node).
java.nio.channels.CancelledKeyException: null
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
~[?:1.8.0_201]
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
~[?:1.8.0_201]
at
java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:204)
~[?:1.8.0_201]
at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1997)
~[ignite-core-2.7.6.jar:2.7.6]
at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
[ignite-core-2.7.6.jar:2.7.6]
at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
[ignite-core-2.7.6.jar:2.7.6]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
[2019-11-22T21:19:54,343][ERROR][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi]
Failed to process selector key [ses=GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=2,
bytesRcvd=617063277634, bytesSent=8878293076427, bytesRcvd0=107695,
bytesSent0=727192, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-2, igniteInstanceName=null, finished=false,
heartbeatTs=1574457593322, hashCode=1772114147, interrupted=false,
runner=grid-nio-worker-tcp-comm-2-#202]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=null, outRecovery=null, super=GridNioSessionImpl
[locAddr=/192.168.199.60:47100, rmtAddr=/192.168.199.68:62054,
createTime=1574457593307, closeTime=0, bytesSent=38, bytesRcvd=42,
bytesSent0=38, bytesRcvd0=42, sndSchedTime=1574457593322,
lastSndTime=1574457593322, lastRcvTime=1574457593322, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@2f6039d0, directMode=true],
GridConnectionBytesVerifyFilter], accepted=true, markedForClose=false]]]
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_201]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
~[?:1.8.0_201]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
~[?:1.8.0_201]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_201]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
~[?:1.8.0_201]
at
org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1282)
~[ignite-core-2.7.6.jar:2.7.6]
at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2386)
[ignite-core-2.7.6.jar:2.7.6]
at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2153)
[ignite-core-2.7.6.jar:2.7.6]
at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
[ignite-core-2.7.6.jar:2.7.6]
at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
[ignite-core-2.7.6.jar:2.7.6]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
[2019-11-22T21:19:54,624][WARN
][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Closing NIO session
because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException,
msg=Connection reset by peer]


- there was no network issues in any way
- here is the node logs;
ignite.zip
  
- here is the client node;
proto-20191122.log
  

any thougs?

Regards




-
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/