[ 
https://issues.apache.org/jira/browse/CASSANDRA-7489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7489:
--------------------------------

    Description: 
We will need a few things in place to get this right, but it should be possible 
to track, live, the current health of a single range across the cluster. Suppose 
we force an owning node to be the coordinator for every update (so if a 
non-smart client sends a mutation to a non-owning node, that node simply proxies 
it on to an owning node to coordinate the update; the overhead of this should 
tend towards zero as smart clients become the norm, and smart clients scale up 
to cope with huge clusters). Each owner can then maintain the oldest timestamp 
of any update it has coordinated that has not been acknowledged by every owning 
node it was propagated to. The minimum of these values across a range's owners 
is the lower bound from which we need to either repair or retain tombstones. 
With vnode file segregation we can mark an entire vnode range as repaired up to 
the most recently determined healthy lower bound.
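
As a rough illustration of the bookkeeping each owner would need (a minimal 
sketch only; RangeHealthTracker and its methods are hypothetical names, not an 
existing Cassandra API, and it ignores details such as timestamp collisions):

    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical sketch: tracks, for one owned range, the updates this node has
    // coordinated that are not yet acknowledged by every owning node.
    class RangeHealthTracker
    {
        private final ConcurrentSkipListMap<Long, AtomicInteger> pendingAcks =
            new ConcurrentSkipListMap<>();

        void onCoordinate(long timestamp, int acksRequired)
        {
            pendingAcks.put(timestamp, new AtomicInteger(acksRequired));
        }

        void onAck(long timestamp)
        {
            AtomicInteger remaining = pendingAcks.get(timestamp);
            if (remaining != null && remaining.decrementAndGet() == 0)
                pendingAcks.remove(timestamp);
        }

        // Oldest timestamp this owner coordinated that is still unacknowledged;
        // Long.MAX_VALUE means everything we coordinated is fully replicated.
        long localLowerBound()
        {
            Map.Entry<Long, AtomicInteger> oldest = pendingAcks.firstEntry();
            return oldest == null ? Long.MAX_VALUE : oldest.getKey();
        }
    }

The healthy lower bound for the range across the cluster is then just the 
minimum of localLowerBound() reported by each of its owners.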

There are some subtleties here, but it means tombstones can potentially be 
cleared only minutes after they are generated, instead of days or weeks. It also 
means repairs can be even more incremental, operating only over the ranges and 
time periods we know to be potentially out of sync.

This will most likely need RAMP transactions in place, so that atomic batch 
mutations are not serialized on non-owning nodes. Having owning nodes coordinate 
updates is what gives us robustness in the case of a single node failure: all 
ranges owned by the failed node are simply considered to have a lower bound of 
-Inf. Without this, a single node being down would result in the entire cluster 
being considered out of sync.
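
To make the failure case concrete (again only a sketch, with hypothetical 
names): the bound for a range collapses to -Inf as soon as any of its owners is 
unreachable, while other ranges are unaffected.

    // Sketch: fold the per-owner bounds for one range into its repair lower
    // bound, using Long.MIN_VALUE to stand in for -Inf.
    static long rangeLowerBound(long[] perOwnerBounds, boolean anyOwnerDown)
    {
        if (anyOwnerDown)
            return Long.MIN_VALUE; // a down owner taints only the ranges it owns
        long min = Long.MAX_VALUE;
        for (long bound : perOwnerBounds)
            min = Math.min(min, bound);
        return min;
    }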

We will still need a short grace period within which clients may supply their 
own timestamps, and we would have to outright reject any update that arrives 
with a timestamp close to that window expiring. But that window could safely be 
just minutes.
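
The rejection itself could be as simple as the following (hypothetical helper; 
the window and margin parameters are placeholders, not proposed defaults):

    // Sketch: reject a client-supplied timestamp that falls too close to the end
    // of the (now much shorter) grace window. All values are illustrative.
    static void validateClientTimestamp(long mutationTsMicros, long graceWindowMicros,
                                        long safetyMarginMicros)
    {
        long nowMicros = System.currentTimeMillis() * 1000L;
        long oldestAccepted = nowMicros - graceWindowMicros + safetyMarginMicros;
        if (mutationTsMicros < oldestAccepted)
            throw new IllegalArgumentException("timestamp too close to grace window expiry");
    }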


> Track lower bound necessary for a repair, live, without actually repairing
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7489
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7489
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benedict
>              Labels: performance, repair



--
This message was sent by Atlassian JIRA
(v6.2#6252)
