> On Jun 13, 2019, at 6:29 AM, Oleksandr Shulgin <oleksandr.shul...@zalando.de> 
> wrote:
> 
>> On Thu, Jun 13, 2019 at 3:16 PM Jeff Jirsa <jji...@gmail.com> wrote:
> 
>> On Jun 13, 2019, at 2:52 AM, Oleksandr Shulgin 
>> <oleksandr.shul...@zalando.de> wrote:
>> On Wed, Jun 12, 2019 at 4:02 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>> To avoid violating consistency guarantees, you have to repair the replicas 
>>>> while the lost node is down
>>> 
>>> How do you suggest triggering it?  Potentially replicas of the primary 
>>> range for the down node are all over the local DC, so I would go with 
>>> triggering a full cluster repair with Cassandra Reaper.  But isn't it going 
>>> to fail because of the down node?  
>> I'm not sure there's an easy and obvious path here - this is something TLP 
>> may want to enhance Reaper to help with. 
>> 
>> You have to specify the ranges with -st/-et, and you have to tell it to 
>> ignore the down host with -hosts. With vnodes you’re right that this may be 
>> lots and lots of ranges all over the ring.
>> 
>> There's a patch proposed (maybe committed in 4.0) that makes this a non-issue 
>> by allowing bootstrap to stream one repaired set and all of the unrepaired 
>> replica data (which is probably very small if you're running IR regularly), 
>> which accomplishes the same thing.
> 
> Ouch, it really hurts to learn this. :(
>>> It is also documented (I believe) that one should repair the node after it 
>>> finishes the "replace address" procedure.  So should one repair before and 
>>> after?
>> You do not need to repair after the bootstrap if you repair before. If the 
>> docs say that, they’re wrong. The joining host gets writes during bootstrap 
>> and consistency levels are altered during bootstrap to account for the 
>> joining host.
> 
> This is what I had in mind (what makes replacement different from actual 
> bootstrap of a new node):

Bootstrapping a new node does not require repairs at all.

Replacing a node requires repairs only to guarantee consistency - to avoid 
violating quorum - because replacement streams each range from just one replica.

Think this way:

Hosts 1, 2, and 3 are in a replica set.
You write value A to some key.
It lands on hosts 1 and 3; host 2 was being restarted or something.
Host 2 comes back up.
Host 3 fails.

If you replace 3 with 3':
3' may stream from host 1, and now you've got a quorum of replicas with A.
3' may stream from host 2, and now you've got a quorum of replicas without A. 
This is illegal.
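The scenario above can be sketched as a toy simulation (plain Python, not Cassandra code; the host numbers and the single key are illustrative):

```python
# Toy sketch of why streaming from a single replica during a dead-node
# replacement can violate quorum guarantees. Hosts and data are made up.

replicas = {1: {"k": "A"}, 2: {}, 3: {"k": "A"}}  # write A landed on 1 and 3 only

def quorum_read(live, key):
    """Read from the first 2 of the given live replicas (a quorum of 3)."""
    return [replicas[h].get(key) for h in live[:2]]

def replace(source):
    """Host 3 fails; replacement node 4 (i.e. 3') streams from ONE replica."""
    replicas[4] = dict(replicas[source])

replace(2)  # unlucky case: streamed from host 2, which missed the write

# Hosts 2 and 3' now form a quorum that has never seen A -> stale quorum:
assert quorum_read([2, 4], "k") == [None, None]
# Repairing BEFORE the replacement would have put A on host 2 first,
# so any streaming source would then have carried A.
```

The simulation only shows the unlucky branch; if `replace(1)` had run instead, the quorum would contain A, which is why this is "a statistics game" rather than a guaranteed failure.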

This is just a statistics game - do you have hosts missing writes? If so, are 
hints delivering them when those hosts come back? What’s the cost of violating 
consistency in that second scenario to you? 

If you’re running something where correctness really really really matters, you 
must repair first. If you’re actually running a truly eventual consistency use 
case and reading stale writes is fine, you probably won’t ever notice.  

In any case these docs are weird and wrong - joining nodes get writes in all 
versions of Cassandra for the past few years (at least 2.0+), so the docs 
really need to be fixed.
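The subrange repair mentioned above (`-st`/`-et` to bound a token range, `-hosts` to skip the down node) might look like this sketch; the tokens, IP addresses, and keyspace name are placeholders, not values from the thread:

```shell
# Hypothetical subrange repair that excludes a dead node.
# -st/-et bound a single token range; -hosts names only the live replicas,
# so the down host is never asked to participate.
KEYSPACE="my_keyspace"              # placeholder keyspace
LIVE_HOSTS="10.0.0.1,10.0.0.2"      # surviving replicas only (placeholders)
CMD="nodetool repair -full -st -9223372036854775808 -et -3074457345618258603 -hosts $LIVE_HOSTS $KEYSPACE"
echo "$CMD"   # with vnodes you would repeat this for every affected range
```

As noted in the thread, with vnodes the down node's ranges can be scattered all over the ring, so in practice this means generating and running one such command per range.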

> http://cassandra.apache.org/doc/latest/operating/topo_changes.html?highlight=replace%20address#replacing-a-dead-node
>  
> Note
> If any of the following cases apply, you MUST run repair to make the replaced 
> node consistent again, since it missed ongoing writes during/prior to 
> bootstrapping. The replacement timeframe refers to the period from when the 
> node initially dies to when a new node completes the replacement process.
> 
> The node is down for longer than max_hint_window_in_ms before being replaced.
> You are replacing using the same IP address as the dead node and replacement 
> takes longer than max_hint_window_in_ms.
> 
> I would imagine that any production size instance would take way longer to 
> replace than the default max hint window (which is 3 hours, AFAIK).  Didn't 
> remember the same IP restriction, but at least this I would also expect to be 
> the most common setup.
> 
> --
> Alex
> 
