OK, we identified the root cause. It was not specifically related to Ignite, but rather to the security settings (the EC2 security group): we only had inbound port 47100 open to the EC2 instance. But as you can see from the original message, the errors are about the nodes running on ports 47103 and 47104, that is, all of them except 47100.
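For anyone hitting the same symptom, the fix on our side was to open the whole communication port range inbound, not just 47100. Roughly this, using the AWS CLI (the security group ID and CIDR below are placeholders, not our real values):

```shell
# Open the Ignite TCP communication port range inbound on the clients' security group.
# sg-0123456789abcdef0 and 10.0.0.0/16 are placeholders for your group / server subnet.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 47100-47199 \
  --cidr 10.0.0.0/16
```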
There is `TcpCommunicationSpi` (https://apacheignite.readme.io/v1.9/docs/network-config#section-configuration), which defines `setLocalPort` (defaults to 47100) and `setLocalPortRange` (defaults to 100). My assumption is that because we are running multiple services on the same machine, every Ignite client gets its own port starting from 47100, up to 47200 (or 47199? see `setLocalPortRange` above). So with several of them running, only one gets port 47100; the others get 47101, 47102, and so on (we currently have a maximum of 3 on the same machine). They connect to the server node, which is listening on port 47500 (which is open in the security group), so during cluster start-up everything works fine. But because ports 47101 and above were not open on our app side, the server could not reach back to any client apart from the one running on port 47100. That is my theory, but at least opening those ports fixed the problem.

Of course, there is still an open question: why does the client node start to fail only under load? I would expect there to be a periodic heartbeat, so the server should try to reach the client nodes (the ones listening on ports 47101 and above) almost immediately after the cluster starts. Yet we only start seeing the error after a couple of hours, once the system is in use. Could you please comment on this? Thank you.

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
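P.S. For concreteness, a minimal sketch of the client-side communication settings described above; the values shown are just the documented defaults, not necessarily our exact configuration:

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class CommSpiConfig {
    public static IgniteConfiguration clientConfig() {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setLocalPort(47100);     // first port the node tries to bind for communication
        commSpi.setLocalPortRange(100);  // if busy, subsequent ports in this range are tried

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(true);         // client node, as in our setup
        cfg.setCommunicationSpi(commSpi);
        return cfg;
    }
}
```

The practical consequence is that the server may connect back to a client on any port in that range, so the whole range needs to be reachable, not only 47100.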