Hi Daniel,

Actually, a failure of one machine shouldn't lead to a shutdown of the whole cluster 
unless your application code was executed on the other nodes as well and killed 
them through long GC pauses or for some other reason.

My first suggestion is to tune garbage collection appropriately:
https://apacheignite.readme.io/v1.6/docs/jvm-and-system-tuning#jvm-tuning-for-clusters-with-on_heap-caches

and track the GC logs so you can adjust the settings if needed:
https://apacheignite.readme.io/v1.6/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
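
As a rough starting point (the heap sizes and GC log path below are placeholders, 
and the exact recommendations on the linked pages take precedence), the JVM options 
for a node with on-heap caches could look like this:

    -server
    -Xms8g -Xmx8g
    -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC
    -XX:+UseCMSInitiatingOccupancyOnly
    -XX:CMSInitiatingOccupancyFraction=60
    -XX:+DisableExplicitGC
    -Xloggc:/path/to/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCTimeStamps
    -XX:+PrintGCApplicationStoppedTime

The -Xloggc and -XX:+PrintGC* options only produce the detailed GC statistics; they 
don't change GC behaviour, so they are safe to keep enabled while you investigate.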

If the issue still happens, please share the GC logs and the logs from all the nodes 
with us. We will probably be able to pinpoint the problem on your side.

—
Denis

> On Jun 2, 2016, at 11:21 AM, Daniel López <d.lope...@gmail.com> wrote:
> 
> Hi there,
> 
> We are using Ignite 1.5.0 and we are experiencing a strange issue where one 
> node stalls the other nodes in the cluster. We are using CacheMode.REPLICATED 
> caches to store data on heap on several nodes to improve latency.
> In one of the latest upgrades someone introduced a bug in the system that 
> could cause one node to consume too much memory and start having GC issues. 
> Sh*t happens :).
> The problem, however, is that when this node starts to crawl due to heavy GC 
> usage, it starts spitting these logs:
> 
> |Failed to process selector key (will close): GridSelectorNioSessionImpl 
> [selectorIdx=0, queueSize=24, writeBuf=java.nio.DirectByteBuffer[pos=0 
> lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 
> cap=32768], recovery=GridNioRecoveryDescriptor [acked=284144, resendCnt=0, 
> rcvCnt=284230, reserved=true, lastAck=284224, nodeLeft=false, 
> node=TcpDiscoveryNode [id=1109a421-ec72-4534-99c4-df5d7e4f6136, 
> addrs=[x.y.z4, 127.0.0.1], sockAddrs=[machine.env/x.y.z4:3808, /x.y.z:3808, 
> /127.0.0.1:3808], discPort=3808, order=32, 
> intOrder=17, lastExchangeTime=1464854224616, loc=false, 
> ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false], connected=true, 
> connectCnt=0, queueLimit=5120], super=GridNioSessionImpl 
> [locAddr=/x.y.z3:47100, rmtAddr=/x.y.z4:33450, createTime=1464854224707, 
> closeTime=0, bytesSent=10838655, bytesRcvd=221207982, 
> sndSchedTime=1464854575433, lastSndTime=1464854575433, 
> lastRcvTime=1464854575433, readsPaused=false, 
> filterChain=FilterChain[filters=[GridNioCodecFilter 
> [parser=o.a.i.i.util.nio.GridDirectParser@63f44e30, directMode=true], 
> GridConnectionBytesVerifyFilter], accepted=true]]
> WARN |2016-06-02T08:03:02,575||TcpCommunicationSpi|Closing NIO session 
> because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, 
> msg=Conexión reinicializada por la máquina remota] (the Spanish message is 
> "Connection reset by peer")
> 
> And the other nodes in the cluster start to produce the logs below, while 
> access to the caches slows down or pauses significantly:
> 
> GridCachePartitionExchangeManager|Failed to send partitions full message 
> [node=TcpDiscoveryNode [id=913ea465-ed45-4ec9-a4b7-d2c5f9c57a2e, a....
> TcpDiscoverySpi|Failed to ping node (status check will be initiated): .... 
> GridDiscoveryManager|Node FAILED: TcpDiscoveryNode [id=...
> 
> That the node with the GC issues stops working is "normal", even if 
> undesired, but what really worries us is that it causes the other nodes in 
> the cluster to lose access to the replicated caches, so one node can 
> bring down the whole cluster.
> 
> If we stop the offending node, the others go back to normal behaviour and work 
> as fast as always.
> 
> We are going to solve the application bug, of course, but is there any 
> configuration setting that we can tweak so one bug in one machine does not 
> bring the whole cluster to a halt?
> 
> Ignite is configured to use TcpDiscoverySpi with TcpDiscoveryVmIpFinder and 
> a list of addresses (11 nodes per set currently).
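> 
> Roughly, the discovery part of our IgniteConfiguration looks like the sketch 
> below (the hosts and port range are placeholders, not our real addresses):
> 
>         // Classes: org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi,
>         // org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder,
>         // java.util.Arrays
>         TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
>         TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
>         // Static list of all node addresses (11 per set in our case).
>         ipFinder.setAddresses(Arrays.asList("host1:3800..3810", "host2:3800..3810"));
>         discoverySpi.setIpFinder(ipFinder);
>         igniteConfiguration.setDiscoverySpi(discoverySpi);
> 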
> Each node has 29 caches configured like this:
>         cacheConfiguration.setCacheMode(CacheMode.REPLICATED); // full copy of the data on every node
>         cacheConfiguration.setCopyOnRead(false);                // return the stored on-heap value without copying
>         cacheConfiguration.setEagerTtl(false);                  // expired entries are removed lazily, on access
> 
> Thanks,
> D.
> 
> PS: Yes, we'll have to try with the latest Ignite version, but we wanted to know 
> if there is any configuration setting that might help first, before having to 
> migrate and restart the whole testing process.
