Hi all, I am evaluating Ignite 2.7 failover behaviour. I am testing 3
different scenarios:
1. Swap rebalance - kill a node, then add a new node
2. Scale up - add a new node
3. Scale down - kill a node

I have a 30-node cluster holding a dataset of 450 million items.

Test 1

In scenario 1:
I started node 31 and killed node 1. Node 31 was not in the baseline
topology, but it uses the same XML config file, so the cluster discovered
it. I then used `control.sh --baseline remove` to remove node 1 (which was
offline) and `control.sh --baseline add` to add node 31 (which was outside
the original baseline topology). This step works fine.
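
For reference, the swap I ran can be sketched roughly as below. This is only a sketch: `CONTROL_SH`, `DRY_RUN`, the `run` helper and the consistent IDs `node1` / `node31` are placeholders I made up for this mail, not the real values from my cluster. With `DRY_RUN=1` (the default here) the commands are printed instead of executed.

```shell
#!/usr/bin/env bash
# Sketch of the scenario 1 baseline swap (placeholders, not my real IDs).
CONTROL_SH="${CONTROL_SH:-./control.sh}"   # assumed path to Ignite's control.sh
DRY_RUN="${DRY_RUN:-1}"                    # 1 = print commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Replace an offline baseline node with a new node that has already joined.
swap_baseline_node() {
    local old_id="$1" new_id="$2"
    run "$CONTROL_SH" --baseline                    # show the current baseline
    run "$CONTROL_SH" --baseline remove "$old_id"   # old node must be offline
    run "$CONTROL_SH" --baseline add "$new_id"      # new node must be online
}
```

Calling `swap_baseline_node node1 node31` in dry-run mode prints the three `control.sh` invocations in the order I ran them.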

In scenario 2:
I started node 1 and added it back to the cluster via the steps above, then
suddenly 3 other nodes in the cluster crashed. The reason could be that I
did not remove the old work directory on node 1. Anyway, the errors I got
from the crashed servers are:

```
java.lang.NullPointerException
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.cacheGroupAddedOnExchange(GridDhtPartitionsExchangeFuture.java:492)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1598)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1590)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1206)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onReassignmentEnforced(CacheAffinitySharedManager.java:1590)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onServerLeftWithExchangeMergeProtocol(CacheAffinitySharedManager.java:1546)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.finishExchangeOnCoordinator(GridDhtPartitionsExchangeFuture.java:3239)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:3191)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onBecomeCoordinator(GridDhtPartitionsExchangeFuture.java:4559)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$3500(GridDhtPartitionsExchangeFuture.java:139)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4331)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4320)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:385)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:355)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4320)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4316)
    at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6816)
    at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
```

and

```
class org.apache.ignite.internal.cluster.ClusterTopologyCheckedException:
Failed to send message (node left topology): TcpDiscoveryNode
[id=c6cd8563-ca40-4563-8dc0-4626c0c8111e,
addrs=[100.74.26.173, 127.0.0.1], sockAddrs=[/127.0.0.1:47500,
someip:47500],
discPort=47500, order=12, intOrder=12, lastExchangeTime=1581324395969,
loc=false, ver=2.7.0#20181201-sha1:256ae401, isClient=false]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3270)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1656)
    at org.apache.ignite.internal.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1766)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.sendOrderedMessage(GridCacheIoManager.java:1231)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:845)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:387)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:418)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:408)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613)
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:127)
    at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2768)
    at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1529)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:127)
    at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1498)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
[12:36:51] Ignite node stopped OK [uptime=1 day, 18:50:16.868]
```

Test 2

I started the test again.

Scenario 1: I removed node 1 and added node 31; this seemed OK.

Scenario 2: I added node 1 back after *removing all data files on node 1*;
everything seemed fine.
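
To make "removing all data files" concrete, here is a rough sketch of the cleanup, run while node 1 was stopped. It assumes the default persistence layout under the work directory (`db`, `db/wal`, `db/wal/archive`); `IGNITE_WORK`, `DRY_RUN`, the `run` helper and the per-node folder name `node1` are placeholders I made up for this mail, and a config with custom storage or WAL paths would need different directories.

```shell
#!/usr/bin/env bash
# Sketch: wipe a STOPPED node's persistence files before it rejoins.
# Assumes the default storage layout; all names below are placeholders.
IGNITE_WORK="${IGNITE_WORK:-/opt/ignite/work}"   # assumed work directory
DRY_RUN="${DRY_RUN:-1}"                          # 1 = print commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove the page store, WAL and WAL archive folders of one node.
clean_node_storage() {
    local node_dir="$1" sub
    for sub in db db/wal db/wal/archive; do
        run rm -rf "$IGNITE_WORK/$sub/$node_dir"
    done
}
```

In dry-run mode `clean_node_storage node1` only lists the three `rm -rf` commands; after the cleanup I restarted node 1 and added it to the baseline with `control.sh --baseline add` as before.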

Scenario 3: I tried to remove node 31, 2 nodes went down, and I encountered
a new error:

```
Locked synchronizers:
        java.util.concurrent.ThreadPoolExecutor$Worker@4819bf4
Thread [name="checkpoint-runner-#50", id=74, state=WAITING, blockCnt=28, waitCnt=5449360]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
        at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
        at o.a.i.i.util.future.GridFutureAdapter.getUninterruptibly(GridFutureAdapter.java:146)
        at o.a.i.i.processors.cache.persistence.file.AsyncFileIO.write(AsyncFileIO.java:146)
        at o.a.i.i.processors.cache.persistence.file.AbstractFileIO$5.run(AbstractFileIO.java:118)
        at o.a.i.i.processors.cache.persistence.file.AbstractFileIO.fully(AbstractFileIO.java:54)
        at o.a.i.i.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:116)
        at o.a.i.i.processors.cache.persistence.file.FilePageStore.write(FilePageStore.java:565)
        at o.a.i.i.processors.cache.persistence.file.FilePageStoreManager.writeInternal(FilePageStoreManager.java:483)
        at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$WriteCheckpointPages.writePages(GridCacheDatabaseSharedManager.java:4207)
        at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$WriteCheckpointPages.run(GridCacheDatabaseSharedManager.java:4101)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
```

And a whole lot more messages about locked synchronizers, followed by:

```
class org.apache.ignite.IgniteCheckedException: Failed to cache rebalanced
entry (will stop rebalancing) [local=TcpDiscoveryNode
[id=be1978ef-b5c7-4118-b17a-36a65ef1fff6, addrs=[100.74.26.131, 127.0.0.1],
sockAddrs=[someip:47500, /127.0.0.1:47500], discPort=47500, order=41,
intOrder=36, lastExchangeTime=1581563518767, loc=true,
ver=2.7.0#20181201-sha1:256ae401, isClient=false],
node=86b79c0e-e3df-45c9-9a6b-ab5607a41253, key=KeyCacheObjectImpl [part=372,
val=user3974044929057811550, hasValBytes=true], part=372]
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:951)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:772)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:387)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:418)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:408)
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613)
        at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
        at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:127)
        at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2768)
        at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1529)
        at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:127)
        at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1498)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.internal.NodeStoppingException: Operation has been cancelled (node is stopping).
        at org.apache.ignite.internal.processors.query.GridQueryProcessor.store(GridQueryProcessor.java:1861)
        at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.store(GridCacheQueryManager.java:404)
        at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishUpdate(IgniteCacheOffheapManagerImpl.java:2633)
        at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1646)
        at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1621)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1935)
        at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:428)
        at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:4248)
        at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:3391)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:902)
        ... 17 more
```

Configurations

30 servers, Ignite 2.7, no clients connected. The XML config file is
attached: ignite-sql.xml
<http://apache-ignite-users.70518.x6.nabble.com/file/t2779/ignite-sql.xml>
I didn't define a rebalance mode, so it should default to ASYNC, and the
partition loss policy should default to IGNORE.

My questions are:

In general, what are the steps to follow to scale the cluster up or down, or
to remove nodes? Is `kill -9 <pid>` the right way to kill a node? Do you just
run `control.sh --baseline add <nodeconsistentid>` to add new nodes that are
not in the original baseline topology? How about re-adding nodes that were
previously killed? Do we need to remove any files? How long does it take for
the nodes to synchronise? How do we know when a rebalance has completed?

Sorry for my many questions, I am new to Ignite and any help is appreciated!




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
