RE: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)

2018-06-11 Thread Stanislav Lukyanov
Hi,

Yep, that’s a bug. Daemon nodes (like the ones ignitevisorcmd starts) seem to 
break baseline topology processing.
Filed https://issues.apache.org/jira/browse/IGNITE-8774.

Stan

From: szj
Sent: June 9, 2018, 1:27
To: user@ignite.apache.org
Subject: Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)

No, I definitely started with 2.5. I only took the trouble to try it later
with 2.4 to see that this problem did not exist there. 2.4 works fine in the
very same scenario.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/



Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)

2018-06-08 Thread szj
No, I definitely started with 2.5. I only took the trouble to try it later
with 2.4 to see that this problem did not exist there. 2.4 works fine in the
very same scenario.





Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)

2018-06-08 Thread Eduard Shangareev
Hi, szj.

Could it be that you are running two different versions of Ignite? You
mentioned that you used 2.4.

All Ignite nodes in a cluster should run the same version.





Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)

2018-06-07 Thread szj
Hi

I'm afraid I wiped Ignite off my servers already as this behaviour was a
blocker to me. I only needed a key value store able to replicate across
several datacenters across the globe (my use case involves very few writes)
and I'm now evaluating another product already.

I strongly suggest you try to reproduce it with the exact steps I listed in
this thread if you haven't already. It's dead simple to me: there are 2 nodes
running, both in the baseline; you connect ignitevisorcmd.sh, then shut down
one node. Nothing else starts or stops in the meantime; you just try to start
the stopped node again. In my tests it was 100% clear that having
ignitevisorcmd.sh connected was what confused the 2.5 cluster. It did not
happen in 2.4.

Good luck :-)





Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)

2018-06-07 Thread Andrey Mashenkov
Hi,

What baseline topology does ./control.sh print?
Is it possible that a node outside the baseline started before the baseline
node did?


Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)

2018-06-07 Thread szj
Well, it definitely does work in 2.4. Please notice that there needs to be
ignitevisorcmd.sh involved to trigger this bug (I didn't try with other
clients though). Here's what is printed by Java on the console:

[09:28:33]
[09:28:33] To start Console Management & Monitoring run ignitevisorcmd.{sh|bat}
[09:28:33]
[09:28:33] Ignite node started OK (id=ae8697ad)
[09:28:33] Topology snapshot [ver=33, servers=2, clients=0, CPUs=4, offheap=2.1GB, heap=2.0GB]
[09:28:33]   ^-- Node [id=AE8697AD-6421-4C0C-96FE-FC29ED9B6DCA, clusterState=ACTIVE]
[09:28:33]   ^-- Baseline [id=7, size=2, online=2, offline=0]
[09:28:33] Data Regions Configured:
[09:28:33]   ^-- default [initSize=256.0 MiB, maxSize=1.4 GiB, persistenceEnabled=true]
[09:29:25] Ignite node stopped OK [uptime=00:00:51.837]
[09:29:35]    __________  ________________
[09:29:35]   /  _/ ___/ |/ /  _/_  __/ __/
[09:29:35]  _/ // (7 7    // /  / / / _/
[09:29:35] /___/\___/_/|_/___/ /_/ /___/
[09:29:35]
[09:29:35] ver. 2.5.0#20180523-sha1:86e110c7
[09:29:35] 2018 Copyright(C) Apache Software Foundation
[09:29:35]
[09:29:35] Ignite documentation: http://ignite.apache.org
[09:29:35]
[09:29:35] Quiet mode.
[09:29:35]   ^-- Logging to file '/usr/share/apache-ignite/work/log/ignite-d484e6c6.0.log'
[09:29:35]   ^-- Logging by 'JavaLogger [quiet=true, config=null]'
[09:29:35]   ^-- To see **FULL** console log here add -DIGNITE_QUIET=false or "-v" to ignite.{sh|bat}
[09:29:35]
[09:29:35] OS: Linux 2.6.32-696.18.7.el6.x86_64 amd64
[09:29:35] VM information: OpenJDK Runtime Environment 1.8.0_121-b13 Oracle Corporation OpenJDK 64-Bit Server VM 25.121-b13
[09:29:35] Configured plugins:
[09:29:35]   ^-- None
[09:29:35]
[09:29:35] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0]]
[09:29:35] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[09:29:35] Security status [authentication=off, tls/ssl=off]
[09:29:36,435][SEVERE][tcp-disco-msg-worker-#2][TcpDiscoverySpi] TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
class org.apache.ignite.IgniteException: Node with BaselineTopology cannot join mixed cluster running in compatibility mode
    at org.apache.ignite.internal.processors.cluster.GridClusterStateProcessor.onGridDataReceived(GridClusterStateProcessor.java:714)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$5.onExchange(GridDiscoveryManager.java:883)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.onExchange(TcpDiscoverySpi.java:1939)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddedMessage(ServerImpl.java:4354)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2744)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2536)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6775)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2621)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[09:29:36,437][SEVERE][tcp-disco-msg-worker-#2][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.IgniteException: Node with BaselineTopology cannot join mixed cluster running in compatibility mode]]
class org.apache.ignite.IgniteException: Node with BaselineTopology cannot join mixed cluster running in compatibility mode
    at org.apache.ignite.internal.processors.cluster.GridClusterStateProcessor.onGridDataReceived(GridClusterStateProcessor.java:714)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$5.onExchange(GridDiscoveryManager.java:883)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.onExchange(TcpDiscoverySpi.java:1939)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddedMessage(ServerImpl.java:4354)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2744)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2536)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6775)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2621)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[09:29:36,438][SEVERE][tcp-disco-msg-worker-#2][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, 
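
The "Topology snapshot" and "Baseline" lines in the console output above carry the node counts relevant to this report. A minimal sketch for pulling them out of a log, assuming the line formats shown in this thread (the function name is hypothetical, it is not part of Ignite, and the formats may differ in other versions):

```python
import re

def parse_bracket_fields(log_text, label):
    """Return the key=value pairs of the first '<label> [k=v, ...]' entry.

    Values are kept as strings. Returns None if the label is not found.
    Wrapped lines are re-joined first, since mail clients break long lines.
    """
    flat = " ".join(log_text.split())
    match = re.search(re.escape(label) + r" \[([^\]]*)\]", flat)
    if match is None:
        return None
    fields = {}
    for item in match.group(1).split(","):
        key, _, value = item.strip().partition("=")
        fields[key] = value
    return fields

# Sample lines copied from the console output in this thread.
sample = (
    "[09:28:33] Topology snapshot [ver=33, servers=2, clients=0, CPUs=4,\n"
    "offheap=2.1GB, heap=2.0GB]\n"
    "[09:28:33]   ^-- Baseline [id=7, size=2, online=2, offline=0]\n"
)
topology = parse_bracket_fields(sample, "Topology snapshot")
baseline = parse_bracket_fields(sample, "Baseline")
```

Comparing `topology["servers"]` with `baseline["online"]` before and after a restart gives a quick check of whether the restarted node actually rejoined.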

Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)

2018-06-06 Thread Denis Magda
It's hard to guess what happened on your side without seeing error logs.
Ignite 2.5 passed QA cycles.

Share the logs.

--
Denis



Re: Ignite 2.5 nodes do not rejoin the cluster after restart

2018-06-06 Thread szj
That is not possible. The cluster was stripped down to 2 nodes and when
ignitevisorcmd.sh is not connected I can stop and start cluster nodes
freely. As soon as ignitevisorcmd.sh is connected to the grid on any of the
2 nodes at the time you stop one cluster node, that makes the stopped
cluster node fail to start with "Node with BaselineTopology cannot join
mixed cluster running in compatibility mode". I would be very surprised if
devs could not reproduce it with:

1. Set up a 2-node cluster with the simplest config possible. Persistence
may need to be enabled (I had it on) and consistentID hard-coded in the
config (that's what I did but probably doesn't matter).
2. Make sure the cluster is active, 2 nodes are ONLINE.
3. Create an SQL table which will create an underlying cache - may also not
be needed really but that is what I did.
4. Try stopping/starting the cluster nodes (one at a time) with systemctl (or
kill the processes manually if you prefer or have an old system with no
systemd). This should work.
5. Now start ignitevisorcmd.sh on either node, connect it and make sure it
can see both cluster nodes with "top".
6. Try restarting any of the 2 cluster nodes while ignitevisorcmd.sh is
connected (same as in 4.). You should get the lovely error I did.
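
To check the outcome of step 6 mechanically, here is a minimal sketch that scans a node's console or log text for the failure message quoted above. The helper name is hypothetical and it is not part of Ignite; it only assumes the message text stays as reported in this thread.

```python
# Hypothetical helper, not part of Ignite: looks for the join failure
# reported in this thread inside captured console or log text.
FAILURE_MSG = ("Node with BaselineTopology cannot join mixed cluster "
               "running in compatibility mode")

def has_baseline_join_failure(log_text):
    """Return True if the text contains the failure message.

    The message may be wrapped across lines or prefixed with '> ' quote
    markers, so quote markers are stripped and whitespace runs are
    collapsed to single spaces before searching.
    """
    unquoted = (line.lstrip("> ") for line in log_text.splitlines())
    normalized = " ".join(" ".join(unquoted).split())
    return FAILURE_MSG in normalized

# Example inputs: one wrapped the way it appears in this thread, one healthy.
wrapped = ("class org.apache.ignite.IgniteException: Node with "
           "BaselineTopology cannot\n"
           "join mixed cluster running in compatibility mode\n")
healthy = "[09:28:33] Ignite node started OK (id=ae8697ad)\n"
```

Here `has_baseline_join_failure(wrapped)` is true and `has_baseline_join_failure(healthy)` is false, so the check can be run against the log of the restarted node after each attempt.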





Re: Ignite 2.5 nodes do not rejoin the cluster after restart

2018-06-06 Thread Andrey Mashenkov
Hi,

Is it possible that nodes outside the baseline have been started and the
baseline node is able to discover them?




-- 
Best regards,
Andrey V. Mashenkov


Re: Ignite 2.5 nodes do not rejoin the cluster after restart

2018-06-06 Thread szj
I also tested an upgrade of the PoC 2-node cluster running Ignite 2.4 to 2.5.
Both nodes shut down, upgraded, started on node1, started on node2, cluster
looking healthy with both nodes ONLINE. Then I shut down one of the nodes
with "kill -k -al" using batch ignitevisorcmd.sh. Trying to start it brings
back the good old

class org.apache.ignite.IgniteException: Node with BaselineTopology cannot join mixed cluster running in compatibility mode
    at org.apache.ignite.internal.processors.cluster.GridClusterStateProcessor.onGridDataReceived(GridClusterStateProcessor.java:714)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$5.onExchange(GridDiscoveryManager.java:883)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.onExchange(TcpDiscoverySpi.java:1939)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeAddedMessage(ServerImpl.java:4354)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2744)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2536)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6775)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2621)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)


Amazingly when I kicked the node out of the baseline, started it (then it
does start), added back to the baseline and killed the Java process and
ignite.sh with the Linux kill command (as mentioned I had to try it on a
system without systemd) the node DID start (!?).

That made me think that it has something to do with ignitevisorcmd.sh.
What I did then was start ignitevisorcmd.sh on node1 and connect it, then
kill Ignite (by killing the process) on node2, and bang! - it would not
start again, failing with the "mixed cluster running in compatibility mode"
garbage.

So my conclusion is that if you restart a node when ignitevisorcmd.sh is
connected to the mesh on any node (be that the restarted one or any other),
then you will get the "Node with BaselineTopology cannot join mixed cluster
running in compatibility mode" error and your node won't start. My knowledge
of Ignite is poor but I think it must have something to do with ignitevisor
being a kind of a node too. But in that case would any client node connected
cause the same problem? I didn't try - didn't get that far.





Re: Ignite 2.5 nodes do not rejoin the cluster after restart (works on 2.4)

2018-06-05 Thread szj
I wiped Ignite 2.5 and tried 2.4. On a 2-node cluster I could restart each
node back and forth without hindrance. I could even consider using 2.4 but
it lacks the authentication feature and also the rpm is built with all
contents world-writable, which makes you wonder about the overall security of
the solution (or the lack of it, really).


