Hi, we are running a cluster of two Ignite 1.9 servers on EC2. The
EC2 instances are r4.large, i.e. 16GB of memory each. We use Amazon S3-based
discovery for both servers and clients.

We have another EC2 instance (r4.large, 16GB) where our app service runs and
where the Ignite clients live. Five Ignite clients run there: the app runs in
Docker containers with `network_mode: host`, and there are five containers
with the app running, each embedding one Ignite client.

We also set the socket timeout of the `TcpDiscoverySpi` to 30 seconds (the
value recommended for EC2) on both servers and clients.
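For reference, our discovery configuration looks roughly like the following sketch (the bucket name is a placeholder and AWS credentials handling is omitted):

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.s3.TcpDiscoveryS3IpFinder;

public class DiscoveryConfig {
    public static IgniteConfiguration config(boolean clientMode) {
        // S3-based IP finder: servers and clients register in the same bucket.
        TcpDiscoveryS3IpFinder ipFinder = new TcpDiscoveryS3IpFinder();
        ipFinder.setBucketName("our-discovery-bucket"); // placeholder name
        // ipFinder.setAwsCredentials(...) omitted here.

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        discoSpi.setIpFinder(ipFinder);
        // Socket timeout raised to 30 s, as recommended for EC2.
        discoSpi.setSocketTimeout(30_000);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(clientMode);
        cfg.setDiscoverySpi(discoSpi);
        return cfg;
    }
}
```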

The problem is that after some period of time we get a `Local node failed`
error, and the cluster then looks unstable: it keeps reporting new topology
versions in a loop, with the version number increasing constantly, i.e. a
cascading failure.

```
2017-10-08 17:04:14.520  WARN 6 --- [tcp-client-disco-msg-worker-#4%st%]
o.a.i.spi.discovery.tcp.TcpDiscoverySpi  : Local node was dropped from
cluster due to network problems, will try to reconnect with new id after
10000ms (reconnect delay can be changed using
IGNITE_DISCO_FAILED_CLIENT_RECONNECT_DELAY system property)
[newId=85e37c0f-fd44-430f-9247-06f783589523,
prevId=48e71e9f-7548-460b-9320-2155be8a30a4, locNode=TcpDiscoveryNode
[id=48e71e9f-7548-460b-9320-2155be8a30a4, addrs=[0:0:0:0:0:0:0:1%lo,
127.0.0.1, 172.17.0.1, 172.31.29.171],
sockAddrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:0,
/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:0], discPort=0,
order=138, intOrder=0, lastExchangeTime=1507193821071, loc=true,
ver=1.9.0#20170302-sha1:a8169d0a, isClient=true],
nodeInitiatedFail=e5897e87-65e8-4bf8-947e-7b3f244c3458,
msg=TcpCommunicationSpi failed to establish connection to node
[rmtNode=TcpDiscoveryNode [id=48e71e9f-7548-460b-9320-2155be8a30a4,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.31.29.171],
sockAddrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:0,
/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:0], discPort=0,
order=138, intOrder=74, lastExchangeTime=1507392564555, loc=false,
ver=1.9.0#20170302-sha1:a8169d0a, isClient=true], errs=class
o.a.i.IgniteCheckedException: Failed to connect to node (is node still
alive?). Make sure that each ComputeTask and cache Transaction has a timeout
set in order to prevent parties from waiting forever in case of network
issues [nodeId=48e71e9f-7548-460b-9320-2155be8a30a4,
addrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:47103,
ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:47103,
/0:0:0:0:0:0:0:1%lo:47103, /127.0.0.1:47103]], connectErrs=[class
o.a.i.IgniteCheckedException: Failed to connect to address:
ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:47103, class
o.a.i.IgniteCheckedException: Failed to connect to address:
ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:47103, class
o.a.i.IgniteCheckedException: Failed to connect to address:
/0:0:0:0:0:0:0:1%lo:47103, class o.a.i.IgniteCheckedException: Failed to
connect to address: /127.0.0.1:47103]]]
 
2017-10-08 17:04:24.888  WARN 6 --- [tcp-client-disco-msg-worker-#4%st%]
o.a.i.spi.discovery.tcp.TcpDiscoverySpi  : Client node was reconnected after
it was already considered failed by the server topology (this could happen
after all servers restarted or due to a long network outage between the
client and servers). All continuous queries and remote event listeners
created by this client will be unsubscribed, consider listening to
EVT_CLIENT_NODE_RECONNECTED event to restore them.


2017-10-08 17:04:24.981  INFO 6 --- [disco-event-worker-#23%st%]
o.a.i.i.m.d.GridDiscoveryManager         : Client node reconnected to
topology: TcpDiscoveryNode [id=85e37c0f-fd44-430f-9247-06f783589523,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.31.29.171],
sockAddrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:0,
/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:0], discPort=0,
order=188, intOrder=0, lastExchangeTime=1507193821071, loc=true,
ver=1.9.0#20170302-sha1:a8169d0a, isClient=true]
2017-10-08 17:04:24.988  INFO 6 --- [disco-event-worker-#23%st%]
o.a.i.i.m.d.GridDiscoveryManager         : Topology snapshot [ver=188,
servers=2, clients=8, CPUs=12, heap=17.0GB]
2017-10-08 17:04:47.264  WARN 6 --- [tcp-client-disco-msg-worker-#4%st%]
o.a.i.spi.discovery.tcp.TcpDiscoverySpi  : Received EVT_NODE_FAILED event
with warning [nodeInitiatedEvt=TcpDiscoveryNode
[id=28db9f51-f3a3-42d2-b241-520de1124d77, addrs=[0:0:0:0:0:0:0:1%lo,
127.0.0.1, 172.17.0.1, 172.31.22.48],
sockAddrs=[ip-172-31-22-48.us-west-2.compute.internal/172.31.22.48:47500,
ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:47500,
/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=1,
intOrder=1, lastExchangeTime=1507482264715, loc=false,
ver=1.9.0#20170302-sha1:a8169d0a, isClient=false], msg=TcpCommunicationSpi
failed to establish connection to node [rmtNode=TcpDiscoveryNode
[id=691db97e-1bb0-49d9-aa8c-a5c6114e4842, addrs=[0:0:0:0:0:0:0:1%lo,
127.0.0.1, 172.17.0.1, 172.31.29.171],
sockAddrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:0,
/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:0], discPort=0,
order=186, intOrder=98, lastExchangeTime=1507466513487, loc=false,
ver=1.9.0#20170302-sha1:a8169d0a, isClient=true], errs=class
o.a.i.IgniteCheckedException: Failed to connect to node (is node still
alive?). Make sure that each ComputeTask and cache Transaction has a timeout
set in order to prevent parties from waiting forever in case of network
issues [nodeId=691db97e-1bb0-49d9-aa8c-a5c6114e4842,
addrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:47104,
ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:47104,
/0:0:0:0:0:0:0:1%lo:47104, /127.0.0.1:47104]], connectErrs=[class
o.a.i.IgniteCheckedException: Failed to connect to address:
ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:47104, class
o.a.i.IgniteCheckedException: Failed to connect to address:
ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:47104, class
o.a.i.IgniteCheckedException: Failed to connect to address:
/0:0:0:0:0:0:0:1%lo:47104, class o.a.i.IgniteCheckedException: Failed to
connect to address: /127.0.0.1:47104]]]
2017-10-08 17:04:47.274  WARN 6 --- [disco-event-worker-#23%st%]
o.a.i.i.m.d.GridDiscoveryManager         : Node FAILED: TcpDiscoveryNode
[id=691db97e-1bb0-49d9-aa8c-a5c6114e4842, addrs=[0:0:0:0:0:0:0:1%lo,
127.0.0.1, 172.17.0.1, 172.31.29.171],
sockAddrs=[ip-172-17-0-1.us-west-2.compute.internal/172.17.0.1:0,
/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
ip-172-31-29-171.us-west-2.compute.internal/172.31.29.171:0], discPort=0,
order=186, intOrder=98, lastExchangeTime=1507482264827, loc=false,
ver=1.9.0#20170302-sha1:a8169d0a, isClient=true]
2017-10-08 17:04:47.278  INFO 6 --- [disco-event-worker-#23%st%]
o.a.i.i.m.d.GridDiscoveryManager         : Topology snapshot [ver=189,
servers=2, clients=7, CPUs=12, heap=17.0GB]
 ...
```

What could be the cause of this "Local node was dropped from cluster due to
network problems" error (and why does the cluster seem unstable after it
happens), and what are the strategies to resolve it?
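As a side note, the second warning in the log above suggests listening to `EVT_CLIENT_NODE_RECONNECTED` to restore continuous queries and remote event listeners. A minimal sketch of what we have in mind (the `restoreSubscriptions` helper is our own hypothetical method):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.events.Event;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

public class ReconnectListener {
    public static void register(Ignite ignite) {
        ignite.events().localListen((IgnitePredicate<Event>) evt -> {
            // Re-create continuous queries and remote event listeners here,
            // since they are unsubscribed when the client rejoins the
            // topology with a new node id.
            restoreSubscriptions(ignite);
            return true; // keep listening for future reconnects
        }, EventType.EVT_CLIENT_NODE_RECONNECTED);
    }

    private static void restoreSubscriptions(Ignite ignite) {
        // Hypothetical helper: re-register whatever the app had subscribed.
    }
}
```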

One thing we plan to do is create another EC2 instance and split the Ignite
clients between two EC2 instances, but it would be good to know the root
cause of the problem anyway, as this split will not necessarily help.




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
