Ok, we identified the root cause. It was not specifically related to Ignite
itself, but rather to the security settings (EC2 security group): we only
had inbound port 47100 open on the EC2 instance. But as you can see from the
original message, the error is about the nodes running on ports 47103 and
47104, in fact all ports except 47100.

There is `TcpCommunicationSpi`
(https://apacheignite.readme.io/v1.9/docs/network-config#section-configuration),
which defines `setLocalPort` (defaults to 47100) and `setLocalPortRange`
(defaults to 100). My assumption is that because we are running multiple
services on the same machine, every Ignite client gets its own port,
starting from 47100 and going up to 47200 (or 47199?) (see
`setLocalPortRange` above). So as we run several of them, only one gets
port 47100, the others get 47101 and 47102 (we currently have at most 3
running on the same machine), and so on.

And they connect to the server node, which is listening on port 47500 (which
is open in the security group).
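And a minimal sketch of the discovery side, assuming `TcpDiscoverySpi` with a
static IP finder; the server address below is just a placeholder, and 47500 is
the default discovery port:

```java
import java.util.Collections;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class ClientDiscoveryExample {
    public static void main(String[] args) {
        // Placeholder address of the server node; 47500 is the default discovery port.
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Collections.singletonList("10.0.0.10:47500"));

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        discoSpi.setIpFinder(ipFinder);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(true);
        cfg.setDiscoverySpi(discoSpi);

        Ignition.start(cfg);
    }
}
```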

So during cluster startup everything works fine.

But then, because ports 47101 and above were not open on our app side, the
server could not reach back to any client apart from the one running on port
47100.

This is my theory (but at least opening those ports fixed the problem).

Of course, there is still an open question: why does the client node start to
fail only when there is load? I would expect there to be a periodic heartbeat,
so shouldn't the server have tried to reach the client nodes (the ones
listening on ports 47101-...) almost immediately after the cluster started?

But we only start seeing the error after a couple of hours, once the system is
in use.

Could you please comment on this?

Thank you.


