Hello!

It is better to send SIGTERM (plain kill, without -9) so that the node can at least shut down gracefully and close its file descriptors.
Otherwise, the first error looks like a one-off bug, and the "Operation has been cancelled (node is stopping)." messages are self-explanatory and normal. Unfortunately, we would need to look at all logs from all nodes to understand why your grid was stalling.

Regards,
-- 
Ilya Kasnacheev

Thu, Feb 13, 2020 at 10:10, wentat <wentat.wo...@rakuten.com>:

> Hi all,
>
> I am evaluating Ignite 2.7 failover scenarios. We are testing 3 different scenarios:
> 1. Swap rebalance - kill a node, then add a new node in
> 2. Scale up - add a new node in
> 3. Scale down - kill a node
>
> I have a cluster with 30 nodes, with a huge dataset of 450 million items.
>
> Test 1
>
> In scenario 1:
> I started node 31 and killed node 1. Node 31 was not in the baseline topology, but the nodes share the same XML file, so the cluster detected it. I then used control.sh --baseline to remove node 1, which was offline, and to add node 31, which was outside of the original topology. This step worked fine.
>
> In scenario 2:
> I started node 1 and added it back to the cluster via the steps above, then suddenly 3 other nodes in the cluster crashed. The reason could be that I did not remove the old work directory on node 1.
> Anyway, the results I got from the crashed servers are:
>
> ```
> java.lang.NullPointerException
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.cacheGroupAddedOnExchange(GridDhtPartitionsExchangeFuture.java:492)
>     at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1598)
>     at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1590)
>     at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1206)
>     at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onReassignmentEnforced(CacheAffinitySharedManager.java:1590)
>     at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onServerLeftWithExchangeMergeProtocol(CacheAffinitySharedManager.java:1546)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.finishExchangeOnCoordinator(GridDhtPartitionsExchangeFuture.java:3239)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:3191)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onBecomeCoordinator(GridDhtPartitionsExchangeFuture.java:4559)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$3500(GridDhtPartitionsExchangeFuture.java:139)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4331)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4320)
>     at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:385)
>     at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:355)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4320)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4316)
>     at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6816)
>     at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> ```
>
> and
>
> ```
> class org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Failed to send message (node left topology): TcpDiscoveryNode [id=c6cd8563-ca40-4563-8dc0-4626c0c8111e, addrs=[100.74.26.173, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, someip:47500], discPort=47500, order=12, intOrder=12, lastExchangeTime=1581324395969, loc=false, ver=2.7.0#20181201-sha1:256ae401, isClient=false]
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3270)
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1656)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1766)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.sendOrderedMessage(GridCacheIoManager.java:1231)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:845)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:387)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:418)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:408)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:127)
>     at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2768)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1529)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:127)
>     at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1498)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> [12:36:51] Ignite node stopped OK [uptime=1 day, 18:50:16.868]
> ```
>
> Test 2
>
> I started the test again.
>
> Scenario 1: I removed node 1 and added node 31. This seems OK.
>
> Scenario 2: I added node 1 after *removing all data files in node 1*. All seems to be fine.
>
> Scenario 3: I tried to remove the 31st node. 2 nodes went down and I encountered a new error:
>
> ```
> Locked synchronizers:
>     java.util.concurrent.ThreadPoolExecutor$Worker@4819bf4
> Thread [name="checkpoint-runner-#50", id=74, state=WAITING, blockCnt=28, waitCnt=5449360]
>     at sun.misc.Unsafe.park(Native Method)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
>     at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
>     at o.a.i.i.util.future.GridFutureAdapter.getUninterruptibly(GridFutureAdapter.java:146)
>     at o.a.i.i.processors.cache.persistence.file.AsyncFileIO.write(AsyncFileIO.java:146)
>     at o.a.i.i.processors.cache.persistence.file.AbstractFileIO$5.run(AbstractFileIO.java:118)
>     at o.a.i.i.processors.cache.persistence.file.AbstractFileIO.fully(AbstractFileIO.java:54)
>     at o.a.i.i.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:116)
>     at o.a.i.i.processors.cache.persistence.file.FilePageStore.write(FilePageStore.java:565)
>     at o.a.i.i.processors.cache.persistence.file.FilePageStoreManager.writeInternal(FilePageStoreManager.java:483)
>     at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$WriteCheckpointPages.writePages(GridCacheDatabaseSharedManager.java:4207)
>     at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$WriteCheckpointPages.run(GridCacheDatabaseSharedManager.java:4101)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> ```
>
> And a whole lot more messages about locked synchronizers, followed by:
>
> ```
> class org.apache.ignite.IgniteCheckedException: Failed to cache rebalanced entry (will stop rebalancing) [local=TcpDiscoveryNode [id=be1978ef-b5c7-4118-b17a-36a65ef1fff6, addrs=[100.74.26.131, 127.0.0.1], sockAddrs=[someip:47500, /127.0.0.1:47500], discPort=47500, order=41, intOrder=36, lastExchangeTime=1581563518767, loc=true, ver=2.7.0#20181201-sha1:256ae401, isClient=false], node=86b79c0e-e3df-45c9-9a6b-ab5607a41253, key=KeyCacheObjectImpl [part=372, val=user3974044929057811550, hasValBytes=true], part=372]
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:951)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:772)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:387)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:418)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:408)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:127)
>     at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2768)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1529)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:127)
>     at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1498)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: class org.apache.ignite.internal.NodeStoppingException: Operation has been cancelled (node is stopping).
>     at org.apache.ignite.internal.processors.query.GridQueryProcessor.store(GridQueryProcessor.java:1861)
>     at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.store(GridCacheQueryManager.java:404)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishUpdate(IgniteCacheOffheapManagerImpl.java:2633)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1646)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1621)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1935)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:428)
>     at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:4248)
>     at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:3391)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:902)
>     ... 17 more
> ```
>
> Configurations
>
> 30 servers, Ignite 2.7, no client connected. Attached is the XML config file: ignite-sql.xml
> <http://apache-ignite-users.70518.x6.nabble.com/file/t2779/ignite-sql.xml>
>
> I didn't define a rebalance mode, so it should be ASYNC, and the partition loss policy should be IGNORE.
>
> My questions are:
>
> In general, what are the steps to follow to scale the cluster up or down, or to remove nodes? Is kill -9 <pid> the right way to kill a node? Do you just run `control.sh --baseline add <nodeconsistentid>` to add new nodes that are not in the original baseline topology? How about re-adding nodes that were previously killed? Do we need to remove any files? How long does it take for the nodes to synchronise? How do we know when a rebalance is completed?
>
> Sorry for my many questions. I am new to Ignite and any help is appreciated!
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/