Hello!

It is better to send SIGTERM (plain kill, without -9) so that the node can at
least close its file descriptors gracefully.
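
As a rough sketch of what I mean (the function name `stop_gracefully` and the
60-second default timeout are my own placeholders, not anything that ships
with Ignite), a stop script could send TERM and then wait for the process to
exit instead of reaching for -9 right away:

```shell
#!/usr/bin/env bash
# Sketch only: stop a process with SIGTERM and wait for it to exit.
# stop_gracefully and the 60-second default are illustrative assumptions.
stop_gracefully() {
    local pid="$1" timeout="${2:-60}" waited=0

    # Send SIGTERM; if the signal cannot be delivered, treat the process as gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    # kill -0 sends no signal, it only checks that the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "PID $pid still running after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

You would call it as `stop_gracefully <pid>` and only consider anything
harsher if it returns non-zero, after looking at why the node did not stop.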

Other than that, the first error looks like a one-off bug, and the "Operation
has been cancelled (node is stopping)." messages are self-descriptive and
normal.

Unfortunately, we would need to look at the full logs from all nodes to
understand why your grid was stalling.

Regards,
-- 
Ilya Kasnacheev


Thu, 13 Feb 2020 at 10:10, wentat <wentat.wo...@rakuten.com>:

> Hi all, I am evaluating Ignite 2.7 failover scenarios. We are testing 3
> different scenarios:
> 1. Swap rebalance - kill a node, then add a new node in
> 2. Scale up - add a new node in
> 3. Scale down - kill a node
>
> I have a cluster with 30 nodes, with a huge dataset of 450 million items.
>
> Test 1
>
> In scenario 1:
> I started node 31 and killed node 1. Node 31 was not in the baseline
> topology, but it shares the same XML file, so the cluster detected it. I then
> used control.sh --baseline to remove node 1, which was offline, and to add
> node 31, which was outside the original topology. This step works fine.
>
> In scenario 2:
> I started node 1 and added it back to the cluster using the steps above, and
> then 3 other nodes in the cluster suddenly crashed. The reason could be that
> I did not remove the old work directory on node 1. Anyway, the results I got
> from the crashed servers are:
>
> ```
> java.lang.NullPointerException
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.cacheGroupAddedOnExchange(GridDhtPartitionsExchangeFuture.java:492)
>     at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1598)
>     at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1590)
>     at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1206)
>     at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onReassignmentEnforced(CacheAffinitySharedManager.java:1590)
>     at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onServerLeftWithExchangeMergeProtocol(CacheAffinitySharedManager.java:1546)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.finishExchangeOnCoordinator(GridDhtPartitionsExchangeFuture.java:3239)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:3191)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onBecomeCoordinator(GridDhtPartitionsExchangeFuture.java:4559)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$3500(GridDhtPartitionsExchangeFuture.java:139)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4331)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4320)
>     at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:385)
>     at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:355)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4320)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4316)
>     at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6816)
>     at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> ```
>
> and
>
> ```
> class org.apache.ignite.internal.cluster.ClusterTopologyCheckedException:
> Failed to send message (node left topology): TcpDiscoveryNode
> [id=c6cd8563-ca40-4563-8dc0-4626c0c8111e,
> addrs=[100.74.26.173, 127.0.0.1], sockAddrs=[/127.0.0.1:47500,
> someip:47500],
> discPort=47500, order=12, intOrder=12, lastExchangeTime=1581324395969,
> loc=false, ver=2.7.0#20181201-sha1:256ae401, isClient=false]
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3270)
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
>     at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1656)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1766)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.sendOrderedMessage(GridCacheIoManager.java:1231)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:845)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:387)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:418)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:408)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:127)
>     at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2768)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1529)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:127)
>     at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1498)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> [12:36:51] Ignite node stopped OK [uptime=1 day, 18:50:16.868]
> ```
>
> Test 2
>
> I started the test again.
>
> Scenario 1: I removed node 1 and added node 31. This seems OK.
>
> Scenario 2: I added node 1 after *removing all data files on node 1*. All
> seems to be fine.
>
> Scenario 3: I tried to remove node 31. 2 nodes went down and I encountered a
> new error:
>
> ```
> Locked synchronizers:
>         java.util.concurrent.ThreadPoolExecutor$Worker@4819bf4
> Thread [name="checkpoint-runner-#50", id=74, state=WAITING, blockCnt=28, waitCnt=5449360]
>         at sun.misc.Unsafe.park(Native Method)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
>         at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
>         at o.a.i.i.util.future.GridFutureAdapter.getUninterruptibly(GridFutureAdapter.java:146)
>         at o.a.i.i.processors.cache.persistence.file.AsyncFileIO.write(AsyncFileIO.java:146)
>         at o.a.i.i.processors.cache.persistence.file.AbstractFileIO$5.run(AbstractFileIO.java:118)
>         at o.a.i.i.processors.cache.persistence.file.AbstractFileIO.fully(AbstractFileIO.java:54)
>         at o.a.i.i.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:116)
>         at o.a.i.i.processors.cache.persistence.file.FilePageStore.write(FilePageStore.java:565)
>         at o.a.i.i.processors.cache.persistence.file.FilePageStoreManager.writeInternal(FilePageStoreManager.java:483)
>         at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$WriteCheckpointPages.writePages(GridCacheDatabaseSharedManager.java:4207)
>         at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$WriteCheckpointPages.run(GridCacheDatabaseSharedManager.java:4101)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
> ```
>
> And a whole lot more messages about locked synchronizers, followed by:
>
> ```
> class org.apache.ignite.IgniteCheckedException: Failed to cache rebalanced
> entry (will stop rebalancing) [local=TcpDiscoveryNode
> [id=be1978ef-b5c7-4118-b17a-36a65ef1fff6, addrs=[100.74.26.131, 127.0.0.1],
> sockAddrs=[someip:47500, /127.0.0.1:47500], discPort=47500, order=41,
> intOrder=36, lastExchangeTime=1581563518767, loc=true,
> ver=2.7.0#20181201-sha1:256ae401, isClient=false],
> node=86b79c0e-e3df-45c9-9a6b-ab5607a41253, key=KeyCacheObjectImpl
> [part=372,
> val=user3974044929057811550, hasValBytes=true], part=372]
>         at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:951)
>         at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:772)
>         at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:387)
>         at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:418)
>         at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:408)
>         at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
>         at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
>         at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
>         at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613)
>         at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
>         at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:127)
>         at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2768)
>         at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1529)
>         at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:127)
>         at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1498)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: class org.apache.ignite.internal.NodeStoppingException: Operation has been cancelled (node is stopping).
>         at org.apache.ignite.internal.processors.query.GridQueryProcessor.store(GridQueryProcessor.java:1861)
>         at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.store(GridCacheQueryManager.java:404)
>         at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishUpdate(IgniteCacheOffheapManagerImpl.java:2633)
>         at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1646)
>         at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1621)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1935)
>         at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:428)
>         at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:4248)
>         at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:3391)
>         at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:902)
>         ... 17 more
> ```
>
> Configurations
>
> 30 servers, Ignite 2.7, no client connected, attached is the XML config
> file:  ignite-sql.xml
> <http://apache-ignite-users.70518.x6.nabble.com/file/t2779/ignite-sql.xml>
>
> I didn't define a rebalance mode, so it should default to ASYNC, and the
> partition loss policy should be IGNORE.
>
> My questions are:
>
> In general, what steps should we follow to scale the cluster up or down, or
> to remove nodes? Is kill -9 <pid> the right way to kill a node? Do you just
> run `control.sh --baseline add <nodeconsistentid>` to add new nodes that are
> not in the original baseline topology? What about re-adding nodes that were
> previously killed? Do we need to remove any files? How long does it take for
> the nodes to synchronise? How do we know when a rebalance has completed?
>
> Sorry for my many questions, I am new to Ignite and any help is
> appreciated!
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>
