Hello!

Most likely it’s the same connectivity issue that led to the lost partitions.

Do you have the full command output and some server logs?



> On 15 Apr 2021, at 21:15, facundo.maldonado <maldonadofacu...@gmail.com> 
> wrote:
> 
> I'm testing a 32-node cluster with a partitioned cache with one backup.
> If 2 of them crash (not if, when), I get the lost partitions problem.
> 
> Now I ssh to one of the nodes and execute *control.sh --baseline*.
> From every node other than the one marked as "coordinator" (?) I get this
> output:
> 
> --------------------------------------------------------------------------------
> Failed to execute baseline command='collect'
> Failed to communicate with grid nodes (maximum count of retries reached).
> Connection to cluster failed. Failed to communicate with grid nodes (maximum
> count of retries reached).
> 
> OK, I went to every node and did the same until I found the 'coordinator'.
> Once I got the failed nodes back online, I executed:
> *control.sh --cache reset_lost_partitions mycache*
> 
> To my surprise, I'm getting 
> --------------------------------------------------------------------------------
> Connection to cluster failed. Failed to communicate with grid nodes (maximum
> count of retries reached).
> 
> So I started looking again for the nodes where that command actually works.
> 
> I'm sure I'm doing something wrong. Could someone help me?
