RE: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)
Hi,

Yep, that’s a bug. Daemon nodes (like the ones ignitevisorcmd starts) seem to break baseline topology processing. Filed https://issues.apache.org/jira/browse/IGNITE-8774.

Stan

From: szj
Sent: June 9, 2018, 1:27
To: user@ignite.apache.org
Subject: Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)

No, I definitely started with 2.5. I only took the trouble to try it later with 2.4 to see that this problem did not exist there. 2.4 works fine in the very same scenario.

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)
No, I definitely started with 2.5. I only took the trouble to try it later with 2.4 to see that this problem did not exist there. 2.4 works fine in the very same scenario.
Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)
Hi, szj.

Could it be that you are running two different versions of Ignite? You mentioned that you used 2.4. All nodes in a cluster should run the same Ignite version.

On Thu, Jun 7, 2018 at 8:36 PM, szj wrote:
> Hi
>
> I'm afraid I wiped Ignite off my servers already as this behaviour was a blocker to me. I only needed a key value store able to replicate across several datacenters across the globe (my use case involves very few writes) and I'm now evaluating another product already.
>
> I strongly suggest you try to reproduce it with the exact steps I listed in this thread if you didn't try it already. It's dead simple to me - there are 2 nodes running, both in the baseline, you connect ignitevisorcmd.sh then shut down one node. Nothing else starts nor stops in the meantime, you just try to start up the shut down node again. In my tests it was 100% clear that ignitevisorcmd.sh being connected was the culprit of the 2.5 cluster being confused. It did not happen in 2.4.
>
> Good luck :-)
Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)
Hi

I'm afraid I wiped Ignite off my servers already as this behaviour was a blocker to me. I only needed a key-value store able to replicate across several datacenters around the globe (my use case involves very few writes) and I'm already evaluating another product.

I strongly suggest you try to reproduce it with the exact steps I listed in this thread if you haven't tried it already. It's dead simple to me - there are 2 nodes running, both in the baseline, you connect ignitevisorcmd.sh, then shut down one node. Nothing else starts or stops in the meantime; you just try to start the shut-down node again. In my tests it was 100% clear that ignitevisorcmd.sh being connected was what confused the 2.5 cluster. It did not happen in 2.4.

Good luck :-)
Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)
Hi,

What baseline topology does ./control.sh print? Is it possible that a node that is not in the baseline started before the baseline node did?

On Thu, Jun 7, 2018 at 9:54 AM, szj wrote:
> Well, it definitely does work in 2.4. Please notice that there needs to be ignitevisorcmd.sh involved to trigger this bug (I didn't try with other clients though). Here's what is printed by Java on the console:
>
> [...]
Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)
Well, it definitely does work in 2.4. Please notice that there needs to be ignitevisorcmd.sh involved to trigger this bug (I didn't try with other clients though). Here's what is printed by Java on the console:

[09:28:33]
[09:28:33] To start Console Management & Monitoring run ignitevisorcmd.{sh|bat}
[09:28:33]
[09:28:33] Ignite node started OK (id=ae8697ad)
[09:28:33] Topology snapshot [ver=33, servers=2, clients=0, CPUs=4, offheap=2.1GB, heap=2.0GB]
[09:28:33] ^-- Node [id=AE8697AD-6421-4C0C-96FE-FC29ED9B6DCA, clusterState=ACTIVE]
[09:28:33] ^-- Baseline [id=7, size=2, online=2, offline=0]
[09:28:33] Data Regions Configured:
[09:28:33] ^-- default [initSize=256.0 MiB, maxSize=1.4 GiB, persistenceEnabled=true]
[09:29:25] Ignite node stopped OK [uptime=00:00:51.837]
[09:29:35]__
[09:29:35] / _/ ___/ |/ / _/_ __/ __/
[09:29:35] _/ // (7 7// / / / / _/
[09:29:35] /___/\___/_/|_/___/ /_/ /___/
[09:29:35]
[09:29:35] ver. 2.5.0#20180523-sha1:86e110c7
[09:29:35] 2018 Copyright(C) Apache Software Foundation
[09:29:35]
[09:29:35] Ignite documentation: http://ignite.apache.org
[09:29:35]
[09:29:35] Quiet mode.
[09:29:35] ^-- Logging to file '/usr/share/apache-ignite/work/log/ignite-d484e6c6.0.log'
[09:29:35] ^-- Logging by 'JavaLogger [quiet=true, config=null]'
[09:29:35] ^-- To see **FULL** console log here add -DIGNITE_QUIET=false or "-v" to ignite.{sh|bat}
[09:29:35]
[09:29:35] OS: Linux 2.6.32-696.18.7.el6.x86_64 amd64
[09:29:35] VM information: OpenJDK Runtime Environment 1.8.0_121-b13 Oracle Corporation OpenJDK 64-Bit Server VM 25.121-b13
[09:29:35] Configured plugins:
[09:29:35] ^-- None
[09:29:35]
[09:29:35] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0]]
[09:29:35] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[09:29:35] Security status [authentication=off, tls/ssl=off]
[09:29:36,435][SEVERE][tcp-disco-msg-worker-#2][TcpDiscoverySpi] TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
class org.apache.ignite.IgniteException: Node with BaselineTopology cannot join mixed cluster running in compatibility mode
    at org.apache.ignite.internal.processors.cluster.GridClusterStateProcessor.onGridDataReceived(GridClusterStateProcessor.java:714)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$5.onExchange(GridDiscoveryManager.java:883)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.onExchange(TcpDiscoverySpi.java:1939)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddedMessage(ServerImpl.java:4354)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2744)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2536)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6775)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2621)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[09:29:36,437][SEVERE][tcp-disco-msg-worker-#2][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.IgniteException: Node with BaselineTopology cannot join mixed cluster running in compatibility mode]]
class org.apache.ignite.IgniteException: Node with BaselineTopology cannot join mixed cluster running in compatibility mode
    at org.apache.ignite.internal.processors.cluster.GridClusterStateProcessor.onGridDataReceived(GridClusterStateProcessor.java:714)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$5.onExchange(GridDiscoveryManager.java:883)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.onExchange(TcpDiscoverySpi.java:1939)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddedMessage(ServerImpl.java:4354)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2744)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2536)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6775)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2621)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[09:29:36,438][SEVERE][tcp-disco-msg-worker-#2][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION,
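For anyone comparing console output like this across restarts, the topology and baseline counters can be pulled out with a small parser. This is a minimal Python sketch, not part of Ignite: the function name `parse_status` and both regular expressions are assumptions modelled only on the "Topology snapshot" and "Baseline" lines shown in this message.

```python
import re
from typing import Dict, Optional

# Patterns modelled on the two status lines in the console output above.
TOPOLOGY_RE = re.compile(r"Topology snapshot \[ver=(\d+), servers=(\d+), clients=(\d+)")
BASELINE_RE = re.compile(r"Baseline \[id=(\d+), size=(\d+), online=(\d+), offline=(\d+)\]")


def _last_match(pattern, keys, text: str) -> Optional[Dict[str, int]]:
    """Return the integer fields of the last match of pattern in text, or None."""
    matches = pattern.findall(text)
    if not matches:
        return None
    return dict(zip(keys, (int(value) for value in matches[-1])))


def parse_status(console_text: str) -> Dict[str, Optional[Dict[str, int]]]:
    """Extract the most recent topology and baseline counters from console output."""
    return {
        "topology": _last_match(TOPOLOGY_RE, ("ver", "servers", "clients"), console_text),
        "baseline": _last_match(BASELINE_RE, ("id", "size", "online", "offline"), console_text),
    }
```

Only the last match of each pattern is reported, so feeding in a whole console capture gives the state just before the node stopped.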
Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)
It's hard to guess what happened on your side without seeing error logs. Ignite 2.5 passed QA cycles. Share the logs.

--
Denis

On Tue, Jun 5, 2018 at 4:39 PM, szj wrote:
> I wiped Ignite 2.5 and tried 2.4. On a 2-node cluster I could restart each node back and forth without hindrance. I could even consider using 2.4 but it lacks the authentication feature and also the rpm is built with all contents world-writable which makes you wonder about the overall security of the solution (or the lack of it, really).
Re: Ignite 2.5 nodes do not rejoin the cluster after restart
That is not possible. The cluster was stripped down to 2 nodes and when ignitevisorcmd.sh is not connected I can stop and start cluster nodes freely. As soon as ignitevisorcmd.sh is connected to the grid on either of the 2 nodes at the time you stop one cluster node, the stopped cluster node fails to start with "Node with BaselineTopology cannot join mixed cluster running in compatibility mode".

I would be very surprised if devs could not reproduce it with:

1. Set up a 2-node cluster with the simplest config possible. Persistence may need to be enabled (I had it on) and consistentID hard-coded in the config (that's what I did but it probably doesn't matter).
2. Make sure the cluster is active and both nodes are ONLINE.
3. Create an SQL table, which will create an underlying cache - this may not really be needed either, but that is what I did.
4. Try stopping/starting the cluster nodes (one at a time) with systemctl (or kill the processes manually if you prefer or have an old system with no systemd). This should work.
5. Now start ignitevisorcmd.sh on either node, connect it and make sure it can see both cluster nodes with "top".
6. Try restarting either of the 2 cluster nodes while ignitevisorcmd.sh is connected (same as in 4.). You should get the lovely error I did.
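The restart-and-check part of steps 4 and 6 can be scripted. What follows is a minimal Python sketch, assuming the node runs as a systemd unit. The unit name `apache-ignite@default-config.xml` in the usage note is a hypothetical example, and `restart_node` and `join_failed` are illustrative helper names, not Ignite APIs. The error string is the one quoted in this message.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# Error text quoted in this thread for a node that fails to rejoin.
JOIN_ERROR = "Node with BaselineTopology cannot join mixed cluster running in compatibility mode"


def restart_commands(unit: str) -> List[List[str]]:
    """Commands that stop and then start one node's systemd unit."""
    return [["systemctl", "stop", unit], ["systemctl", "start", unit]]


def restart_node(unit: str, run: Optional[Callable[[Sequence[str]], object]] = None) -> None:
    """Run the stop/start commands; `run` can be swapped out for a dry run."""
    if run is None:
        run = lambda cmd: subprocess.run(cmd, check=True)
    for cmd in restart_commands(unit):
        run(cmd)


def join_failed(log_text: str) -> bool:
    """True if the node log contains the rejoin failure reported in this thread."""
    return JOIN_ERROR in log_text
```

A run might look like `restart_node("apache-ignite@default-config.xml")` followed, after a short wait, by `join_failed(open(log_path).read())`, where `log_path` points at the node's log file. The wait matters because `systemctl start` can return before the JVM has hit the failure.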
Re: Ignite 2.5 nodes do not rejoin the cluster after restart
Hi,

Is it possible that some nodes outside the baseline were started and the node with the baseline is able to discover them?

On Wed, Jun 6, 2018 at 9:48 AM, szj wrote:
> I also tested an upgrade of the PoC 2-node cluster running Ignite 2.4 to 2.5. Both nodes shut down, upgraded, started on node1, started on node2, cluster looking healthy with both nodes ONLINE. Then I shut down one of the nodes with "kill -k -al" using batch ignitevisorcmd.sh. Trying to start it brings back the good old
>
> [...]

--
Best regards,
Andrey V. Mashenkov
Re: Ignite 2.5 nodes do not rejoin the cluster after restart
I also tested an upgrade of the PoC 2-node cluster running Ignite 2.4 to 2.5. Both nodes shut down, upgraded, started on node1, started on node2, cluster looking healthy with both nodes ONLINE. Then I shut down one of the nodes with "kill -k -al" using batch ignitevisorcmd.sh. Trying to start it brings back the good old

class org.apache.ignite.IgniteException: Node with BaselineTopology cannot join mixed cluster running in compatibility mode
    at org.apache.ignite.internal.processors.cluster.GridClusterStateProcessor.onGridDataReceived(GridClusterStateProcessor.java:714)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$5.onExchange(GridDiscoveryManager.java:883)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.onExchange(TcpDiscoverySpi.java:1939)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddedMessage(ServerImpl.java:4354)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2744)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2536)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6775)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2621)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

Amazingly, when I kicked the node out of the baseline, started it (then it does start), added it back to the baseline and killed the Java process and ignite.sh with the Linux kill command (as mentioned, I had to try it on a system without systemd), the node DID start (!?).

That made me think that it has something to do with ignitevisorcmd.sh. I then started ignitevisorcmd.sh on node1 and connected it, killed Ignite (by killing the process) on node2 and bang! - it would not start again, with the "mixed cluster running in compatibility mode" garbage.

So my conclusion is that if you restart a node while ignitevisorcmd.sh is connected to the mesh on any node (be that the restarted one or any other), then you will get the "Node with BaselineTopology cannot join mixed cluster running in compatibility mode" error and your node won't start. My knowledge of Ignite is poor but I think it must have something to do with ignitevisor being a kind of node too. But in that case, would any connected client node cause the same problem? I didn't try - didn't get that far.
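The remove-and-re-add sequence described in this message can be expressed with the baseline subcommands of control.sh, which is one way to do it (the message does not say which tool was used). Below is a minimal Python sketch that only builds the command lines and runs nothing. It assumes the `--baseline`, `--baseline remove` and `--baseline add` subcommands of control.sh; the consistent ID `node2` in the usage note is a placeholder and the helper name is illustrative.

```python
from typing import List


def baseline_readd_commands(consistent_id: str, control_sh: str = "./control.sh") -> List[List[str]]:
    """Build control.sh invocations for removing a node from the baseline and re-adding it.

    The node itself has to be started between the "remove" and "add" commands;
    that step is outside this helper.
    """
    return [
        [control_sh, "--baseline", "remove", consistent_id],  # take the stopped node out
        [control_sh, "--baseline", "add", consistent_id],     # re-add it once it is running
        [control_sh, "--baseline"],                           # print the resulting baseline
    ]
```

For example, `baseline_readd_commands("node2")` returns three argument lists that can be passed one at a time to `subprocess.run`, with the node start in between the first two.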
Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)
I wiped Ignite 2.5 and tried 2.4. On a 2-node cluster I could restart each node back and forth without hindrance. I could even consider using 2.4 but it lacks the authentication feature, and the rpm is built with all contents world-writable, which makes you wonder about the overall security of the solution (or the lack of it, really).