I see messages in the log that are not from Ignite, but which also suggest that there is some network issue: ======== 2018-06-09 10:19:59 [a9fc3882] severe [native] Exception in controller: receiveExact() ... error reading, 70014, End of file found. Retrying every 10 seconds. 2018-06-09 10:19:59 [abff8882] warning [native] Instrumentation channel disconnected: server did not reply to ping request ========
In any case, given that there were no messages in the logs for ~35 minutes, it is unlikely that it is an Ignite issue – Ignite would at least print metrics or errors if it was running and getting any CPU time. Stan From: naresh.goty Sent: 9 июня 2018 г. 18:16 To: [email protected] Subject: Re: Ignite Node failure - Node out of topology (SEGMENTED) We are still seeing the NODE SEGMENTATION issue happening to one of the node in our production even after JVM option is enabled ( -Djava.net.preferIPv4Stack=true). We don't see any activity reported in the logs for a period of ~30min after node failed. The below logs are from the failed node, and it can be observed that till this timestamp (9:43:25) node was up and running, and after that there are no messages till 10:19:59 in the catalina logs. Thugh node was failed at 09:44, the node segmentation was reported at 10:19AM Right before the node failure, the metrics clearly stated that CPU and memory utilization are very low. Long GC pauses are also not an issue. ================================================= Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info INFO: FreeList [name=delivery, buckets=256, dataPages=10239, reusePages=6228] Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278] Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info INFO: Metrics for local node (to disable set 'metricsLogFrequency' to 0) ^-- Node [id=9bbe0362, name=delivery, uptime=56:24:59.907] ^-- H/N/C [hosts=9, nodes=9, CPUs=18] ^-- CPU [cur=1.67%, avg=2.71%, GC=0%] ^-- PageMemory [pages=43003] ^-- Heap [used=605MB, free=85.11%, comm=4065MB] ^-- Non heap [used=204MB, free=-1%, comm=213MB] ^-- Public thread pool [active=0, idle=0, qSize=0] ^-- System thread pool [active=0, idle=6, qSize=0] ^-- Outbound messages queue [size=0] Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info INFO: FreeList [name=delivery, buckets=256, dataPages=10239, reusePages=6228] Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278] 2018-06-09 10:19:59 [a9fc3882] warning [java ] ... last message repeated 1 time ... 2018-06-09 10:19:59 [a9fc3882] severe [native] Exception in controller: receiveExact() ... error reading, 70014, End of file found. Retrying every 10 seconds. 2018-06-09 10:19:59 [abff8882] warning [native] Instrumentation channel disconnected: server did not reply to ping request Jun 09, 2018 10:19:59 AM org.apache.ignite.logger.java.JavaLogger error SEVERE: Failed to send message: class o.a.i.IgniteCheckedException: Failed to send message (connection was closed): GridSelectorNioSessionImpl [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=0, bytesRcvd=24290335228, bytesSent=51080178562, bytesRcvd0=25800, bytesSent0=183862, select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-0, igniteInstanceName=delivery, finished=false, hashCode=1045175627, interrupted=false, runner=grid-nio-worker-tcp-comm-0-#25%delivery%]]], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], inRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0, rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096, nodeLeft=false, node=TcpDiscoveryNode [id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22, intOrder=15, lastExchangeTime=1528334301578, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false, connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false], outRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0, rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096, nodeLeft=false, node=TcpDiscoveryNode [id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22, intOrder=15, lastExchangeTime=1528334301578, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false, connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false], super=GridNioSessionImpl [locAddr=/10.40.173.21:50320, rmtAddr=/10.40.173.14:47100, createTime=1528334304232, closeTime=1528537460337, bytesSent=13308284327, bytesRcvd=12937081481, bytesSent0=52340, bytesRcvd0=7812, sndSchedTime=1528334304232, lastSndTime=1528537460337, lastRcvTime=1528537460337, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser@7d2bc944, directMode=true], GridConnectionBytesVerifyFilter], accepted=false]] ==================================================== >From one of the active node, it reported that other node (above) was failed immediately at 09:44 (as reported in the logs below) Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: Metrics for local node (to disable set 'metricsLogFrequency' to 0) ^-- Node [id=b392f3c6, name=delivery, uptime=56:18:06.778] ^-- H/N/C [hosts=9, nodes=9, CPUs=18] ^-- CPU [cur=9.17%, avg=9%, GC=0%] ^-- PageMemory [pages=41038] ^-- Heap [used=3167MB, free=22.42%, comm=4082MB] ^-- Non heap [used=202MB, free=-1%, comm=210MB] ^-- Public thread pool [active=0, idle=0, qSize=0] ^-- System thread pool [active=0, idle=5, qSize=0] ^-- Outbound messages queue [size=0] Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: FreeList [name=delivery, buckets=256, dataPages=9096, reusePages=4659] Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=240] Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger warning WARNING: Node FAILED: TcpDiscoveryNode [id=9bbe0362-00e0-4f16-abf0-52de5c62c51d, addrs=[10.40.173.21, 127.0.0.1], sockAddrs=[/10.40.173.21:47500, /127.0.0.1:47500], discPort=47500, order=28, intOrder=18, lastExchangeTime=1528334738846, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false] Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info INFO: Topology snapshot [ver=40, servers=8, clients=0, CPUs=16, heap=26.0GB] Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info INFO: Started exchange init [topVer=AffinityTopologyVersion [topVer=40, minorTopVer=0], crd=false, evt=NODE_FAILED, evtNode=9bbe0362-00e0-4f16-abf0-52de5c62c51d, customEvt=null, allowMerge=true] Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info INFO: Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=40, minorTopVer=0], waitTime=0ms, futInfo=NA] Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info INFO: Finished exchange init [topVer=AffinityTopologyVersion [topVer=40, minorTopVer=0], crd=false] Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info INFO: Received full message, will finish exchange [node=9165f32c-9765-49d7-8856-5b77b0bded6d, resVer=AffinityTopologyVersion [topVer=40, minorTopVer=0]] Jun 09, 2018 9:44:22 AM org.apache.ignite.logger.java.JavaLogger info INFO: Finish exchange future [startVer=AffinityTopologyVersion [topVer=40, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=40, minorTopVer=0], err=null] Jun 09, 2018 9:44:22 AM org.apache.ignite.logger.java.JavaLogger info ================================================================ Any advice in this issue would be appreciated? -Naresh -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/
