Hi all, I am evaluating Ignite 2.7 failover scenarios. We are testing 3
different scenarios:
1. Swap rebalance - kill a node, then add a new node
2. Scale up - add a new node
3. Scale down - kill a node
I have a cluster with 30 nodes, with a huge dataset of 450 million items.
Test 1
In scenario 1:
I started node 31 and killed node 1. Node 31 was not in the baseline
topology, but it uses the same XML file as the other nodes, so the cluster
detected it. I then used control.sh --baseline to remove node 1, which was
offline, and to add node 31, which was outside the original baseline
topology. This step worked fine.
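Spelled out, the baseline swap above looks roughly like the sketch below. The consistent IDs are placeholders, not the real ones from my cluster, and the `CONTROL` variable is my own addition so the script can be dry-run: by default it only prints the commands instead of calling control.sh. control.sh may also ask for confirmation before changing the baseline.

```shell
#!/bin/sh
# Sketch only: swap an offline node out of the baseline and a new node in.
# CONTROL is a hypothetical knob; by default it just echoes the commands (dry run).
# Set CONTROL=/path/to/ignite/bin/control.sh to run against a real cluster.
CONTROL="${CONTROL:-echo control.sh}"

swap_baseline() {
    old_id="$1"   # consistent ID of the node that was killed (placeholder)
    new_id="$2"   # consistent ID of the node that replaces it (placeholder)

    $CONTROL --baseline                    # show the current baseline and node states
    $CONTROL --baseline remove "$old_id"   # the node has to be offline at this point
    $CONTROL --baseline add "$new_id"      # the node has to have joined the cluster already
}

swap_baseline node1-consistent-id node31-consistent-id
```

Scaling up or down would be the same thing with only the `add` or only the `remove` step.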
In scenario 2:
I started node 1 and added it back to the cluster via the steps above. 3
other nodes in the cluster then crashed. The reason could be that I did not
remove the old work directory on node 1. Anyway, the errors I got from the
crashed servers are:
```
java.lang.NullPointerException
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.cacheGroupAddedOnExchange(GridDhtPartitionsExchangeFuture.java:492)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1598)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1590)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1206)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onReassignmentEnforced(CacheAffinitySharedManager.java:1590)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onServerLeftWithExchangeMergeProtocol(CacheAffinitySharedManager.java:1546)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.finishExchangeOnCoordinator(GridDhtPartitionsExchangeFuture.java:3239)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:3191)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onBecomeCoordinator(GridDhtPartitionsExchangeFuture.java:4559)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$3500(GridDhtPartitionsExchangeFuture.java:139)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4331)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4320)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:385)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:355)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4320)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4316)
    at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6816)
    at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
```
and
```
class org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Failed to send message (node left topology): TcpDiscoveryNode [id=c6cd8563-ca40-4563-8dc0-4626c0c8111e, addrs=[100.74.26.173, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, someip:47500], discPort=47500, order=12, intOrder=12, lastExchangeTime=1581324395969, loc=false, ver=2.7.0#20181201-sha1:256ae401, isClient=false]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3270)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1656)
    at org.apache.ignite.internal.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1766)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.sendOrderedMessage(GridCacheIoManager.java:1231)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:845)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:387)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:418)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:408)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613)
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:127)
    at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2768)
    at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1529)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:127)
    at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1498)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
[12:36:51] Ignite node stopped OK [uptime=1 day, 18:50:16.868]
```
Test 2
I ran the test again.
Scenario 1: I removed node 1 and added node 31. This seemed OK.
Scenario 2: I added node 1 back after *removing all data files on node 1*.
All seemed fine.
Scenario 3: I tried to remove node 31. 2 nodes went down, and I encountered a
new error:
```
Locked synchronizers:
java.util.concurrent.ThreadPoolExecutor$Worker@4819bf4
Thread [name="checkpoint-runner-#50", id=74, state=WAITING, blockCnt=28, waitCnt=5449360]
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
    at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
    at o.a.i.i.util.future.GridFutureAdapter.getUninterruptibly(GridFutureAdapter.java:146)
    at o.a.i.i.processors.cache.persistence.file.AsyncFileIO.write(AsyncFileIO.java:146)
    at o.a.i.i.processors.cache.persistence.file.AbstractFileIO$5.run(AbstractFileIO.java:118)
    at o.a.i.i.processors.cache.persistence.file.AbstractFileIO.fully(AbstractFileIO.java:54)
    at o.a.i.i.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:116)
    at o.a.i.i.processors.cache.persistence.file.FilePageStore.write(FilePageStore.java:565)
    at o.a.i.i.processors.cache.persistence.file.FilePageStoreManager.writeInternal(FilePageStoreManager.java:483)
    at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$WriteCheckpointPages.writePages(GridCacheDatabaseSharedManager.java:4207)
    at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$WriteCheckpointPages.run(GridCacheDatabaseSharedManager.java:4101)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
```
This was followed by many more messages about locked synchronizers, and then:
```
class org.apache.ignite.IgniteCheckedException: Failed to cache rebalanced entry (will stop rebalancing) [local=TcpDiscoveryNode [id=be1978ef-b5c7-4118-b17a-36a65ef1fff6, addrs=[100.74.26.131, 127.0.0.1], sockAddrs=[someip:47500, /127.0.0.1:47500], discPort=47500, order=41, intOrder=36, lastExchangeTime=1581563518767, loc=true, ver=2.7.0#20181201-sha1:256ae401, isClient=false], node=86b79c0e-e3df-45c9-9a6b-ab5607a41253, key=KeyCacheObjectImpl [part=372, val=user3974044929057811550, hasValBytes=true], part=372]
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:951)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:772)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:387)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:418)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:408)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613)
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:127)
    at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2768)
    at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1529)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:127)
    at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1498)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.internal.NodeStoppingException: Operation has been cancelled (node is stopping).
    at org.apache.ignite.internal.processors.query.GridQueryProcessor.store(GridQueryProcessor.java:1861)
    at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.store(GridCacheQueryManager.java:404)
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishUpdate(IgniteCacheOffheapManagerImpl.java:2633)
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1646)
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1621)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1935)
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:428)
    at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:4248)
    at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:3391)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:902)
    ... 17 more
```
Configurations
30 servers, Ignite 2.7, no clients connected. The XML config file is
attached: ignite-sql.xml
<http://apache-ignite-users.70518.x6.nabble.com/file/t2779/ignite-sql.xml>
I didn't set the rebalance mode, so it should be the default, ASYNC, and the
partition loss policy should be the default, IGNORE.
My questions are:
In general, what steps should we follow to scale the cluster up or down, or
to remove nodes? Is kill -9 <pid> the right way to kill a node? Do you just run
`control.sh --baseline add <nodeconsistentid>` to add new nodes that are not
in the original baseline topology? What about re-adding nodes that were
previously killed? Do we need to remove any files? How long does it take for
the nodes to synchronise? How do we know when a rebalance has completed?
Sorry for the many questions. I am new to Ignite, and any help is appreciated!