What version are you on? This looks like a typical case where there was a
problem with streaming (hanging, etc.). Do you have access to the logs? If
so, look for streaming errors. Streaming errors are typically related to
timeouts, so you should review your Cassandra
streaming_socket_timeout_in_ms setting and the kernel tcp_keepalive settings.
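
A quick way to check those three things is something like the sketch below
(log and yaml paths are assumptions, adjust them to your installation):

```shell
# Hypothetical paths; adjust to your installation.
LOG=/var/log/cassandra/system.log
YAML=/etc/cassandra/cassandra.yaml

# Look for failed or timed-out streaming sessions in the node's log
grep -E 'StreamSession|StreamResultFuture|SocketTimeoutException' "$LOG" | tail -n 20

# Current kernel keepalive settings (values are in seconds)
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

# Streaming socket timeout configured in cassandra.yaml (in milliseconds)
grep streaming_socket_timeout_in_ms "$YAML"
```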

If you're on 2.2+ you can resume a failed bootstrap with nodetool bootstrap
resume. Several streaming hang bugs were also fixed recently, so I'd advise
upgrading to the latest release of your particular series for more robust
streaming.

Is there any reason why you didn't use the replace procedure
(-Dreplace_address) to replace the node with the same tokens? That would
have been a bit faster than the remove + bootstrap procedure.
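
For reference, the replace procedure is just a JVM flag set on the
replacement node before its first start; a minimal sketch, assuming the
dead node's address was 10.0.0.12 (a hypothetical example value):

```shell
# In cassandra-env.sh on the replacement node, before the first start.
# Cassandra will stream the dead node's ranges and take over its tokens
# instead of bootstrapping with new ones.
JVM_OPTS="$JVM_OPTS -Dreplace_address=10.0.0.12"
```

Remember to remove the flag after the replacement completes, since it only
applies to the initial startup.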

2016-08-15 15:37 GMT-03:00 Jérôme Mainaud <jer...@mainaud.com>:

> Hello,
>
> A client of mine has problems when adding a node to the cluster.
> After 4 days, the node is still in joining mode, it doesn't have the same
> level of load as the others, and there seems to be no streaming from or to
> the new node.
>
> This node has a history.
>
>    1. At the beginning, it was a seed in the cluster.
>    2. Ops detected that clients had problems with it.
>    3. They tried to reset it but failed. In the process they launched
>    several repair and rebuild operations on the node.
>    4. Then they asked me to help them.
>    5. We stopped the node,
>    6. removed it from the list of seeds (more precisely, it was replaced
>    by another node),
>    7. removed it from the cluster (I chose not to use decommission since
>    the node's data was compromised),
>    8. deleted all files from the data, commitlog and saved caches
>    directories.
>    9. after the leaving process ended, it was started as a fresh new node
>    and began auto-bootstrap.
>
>
> As I don’t have direct access to the cluster I don't have much
> information, but I will tomorrow (logs and the results of some commands).
> And I can ask the operators for any required information.
>
> Does anyone have an idea of what could have happened and what I should
> investigate first?
> What would you do to unblock the situation?
>
> Context: The cluster consists of two DC, each with 15 nodes. Average load
> is around 3 TB per node. The joining node froze a little after 2 TB.
>
> Thank you for your help.
> Cheers,
>
>
> --
> Jérôme Mainaud
> jer...@mainaud.com
>
