Hi there,

We are using Ignite 1.5.0 and we are experiencing a strange issue where one
node stalls the other nodes in the cluster. We are
using CacheMode.REPLICATED caches to keep the data on-heap on several nodes
to improve latency.
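
For context, the application just does plain gets against the local copy of
these caches; the cache and key names below are made up, but the access
pattern is roughly this:

        Ignite ignite = Ignition.ignite();
        // Every node holds a full copy of a REPLICATED cache, so gets are served locally.
        IgniteCache<String, Object> cache = ignite.cache("someReplicatedCache");
        Object value = cache.get("someKey");
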
In one of the latest upgrades someone introduced a bug in the system that
could cause one node to consume too much memory and start having GC issues.
Sh*t happens :).
The problem, however, is that when this node starts to crawl due to heavy
GC activity, it starts spitting out logs like these:

|Failed to process selector key (will close): GridSelectorNioSessionImpl
[selectorIdx=0, queueSize=24, writeBuf=java.nio.DirectByteBuffer[pos=0
lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768
cap=32768], recovery=GridNioRecoveryDescriptor [acked=284144, resendCnt=0,
rcvCnt=284230, reserved=true, lastAck=284224, nodeLeft=false,
node=TcpDiscoveryNode [id=1109a421-ec72-4534-99c4-df5d7e4f6136,
addrs=[x.y.z4, 127.0.0.1], sockAddrs=[machine.env/x.y.z4:3808, /x.y.z:3808,
/127.0.0.1:3808], discPort=3808, order=32, intOrder=17,
lastExchangeTime=1464854224616, loc=false,
ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false], connected=true,
connectCnt=0, queueLimit=5120], super=GridNioSessionImpl
[locAddr=/x.y.z3:47100, rmtAddr=/x.y.z4:33450, createTime=1464854224707,
closeTime=0, bytesSent=10838655, bytesRcvd=221207982,
sndSchedTime=1464854575433, lastSndTime=1464854575433,
lastRcvTime=1464854575433, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@63f44e30, directMode=true],
GridConnectionBytesVerifyFilter], accepted=true]]
WARN |2016-06-02T08:03:02,575||TcpCommunicationSpi|Closing NIO session
because of unhandled exception [cls=class
o.a.i.i.util.nio.GridNioException, msg=Conexión reinicializada por la
máquina remota]

("Conexión reinicializada por la máquina remota" is the OS's Spanish locale
message for "Connection reset by peer".) Meanwhile, the other nodes in the
cluster start producing logs like the following, and access to the
replicated caches slows down or pauses almost completely:

GridCachePartitionExchangeManager|Failed to send partitions full message
[node=TcpDiscoveryNode [id=913ea465-ed45-4ec9-a4b7-d2c5f9c57a2e, a....
TcpDiscoverySpi|Failed to ping node (status check will be initiated): ....
GridDiscoveryManager|Node FAILED: TcpDiscoveryNode [id=...

That the node with the GC issues stops working is "normal", even if
undesired. What really worries us is that it causes the other nodes in the
cluster to stop being able to use the replicated caches, so a single node
can bring down the whole cluster.

If we stop the offending node, the other nodes go back to normal behaviour
and work as fast as ever.

We are going to fix the application bug, of course, but is there any
configuration setting we can tweak so that a bug on one machine does not
bring the whole cluster to a halt?
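
To make the question concrete, this is the kind of knob we had in mind; the
values below are invented and we honestly don't know whether these are the
right settings, so please correct us:

        IgniteConfiguration cfg = new IgniteConfiguration();
        // Detect (and drop) an unresponsive node faster? The value is just a guess.
        cfg.setFailureDetectionTimeout(5000);

        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        // Give up on socket writes to a node stuck in GC instead of blocking? Also a guess.
        commSpi.setSocketWriteTimeout(2000);
        cfg.setCommunicationSpi(commSpi);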

Ignite is configured to use TcpDiscoverySpi with a TcpDiscoveryVmIpFinder
holding a static list of addresses (currently 11 nodes per set); a sketch of
the wiring is below, after the cache settings.
Each node has 29 caches configured like this:
        // Full copy of the data on every node, so reads are served locally.
        cacheConfiguration.setCacheMode(CacheMode.REPLICATED);
        // Return the stored on-heap instance directly, without copying on each read.
        cacheConfiguration.setCopyOnRead(false);
        // Expire entries lazily on access instead of with the eager TTL cleanup.
        cacheConfiguration.setEagerTtl(false);
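
And this is roughly how discovery is wired up (the hosts and the variable
name are placeholders for our real setup):

        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Arrays.asList(
                "node01.example:3808", "node02.example:3808" /* ...11 nodes in total */));

        TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
        discoverySpi.setIpFinder(ipFinder);

        igniteConfiguration.setDiscoverySpi(discoverySpi);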

Thanks,
D.

PS: Yes, we'll have to try the latest Ignite version, but we wanted to know
first whether there is any configuration setting that might help, before
having to migrate and restart the whole testing process.
