Hello,

We've had Cassandra running in a single production data center for several
months now and have begun detailed planning to add data center fault tolerance.

Our requirements do not appear to be met out of the box by Cassandra. I'd
like to share the solution we're planning and hear from others considering
similar problems.

We require the following:

1. Two data centers
One is primary, the other a hot standby to be used when the primary fails. Of
course Cassandra has no such bias, but as will be seen below this distinction
becomes important when considering application latency.

2. No more than 3 copies of data total
We are storing blob-like objects, and cost per unit of usable storage is
closely scrutinized against other solutions, so we want to keep the
replication factor low. Two copies will be held in the primary DC and one in
the secondary DC, with the corresponding ratio of machines in each DC.

3. Immediate consistency
Reads must always see the most recent acknowledged write; eventual
consistency is not sufficient for our application.

4. No waiting on remote data center
The application front-end runs in the primary data center and expects that 
operations using a local coordinator node will not suffer a response time 
determined by the WAN. Hence we cannot require a response from the node in the 
secondary data center to achieve quorum.

5. Ability to operate with a single working node per key, if necessary
In desperate situations involving data center failure, or combinations of
node and data center failures, we wish to be able to operate temporarily
with just a single working node per token.


Existing Cassandra solutions offer combinations of the above, but it is not
at all clear how to achieve all five without custom work.
Normal quorum with N=3 tolerates only a single down node regardless of
topology, and if the down node is in the primary DC, achieving quorum
requires synchronous operations over the WAN.
NetworkTopologyStrategy is nice, but requiring quorum among the 2 nodes in
the primary DC means no tolerance of even a single node failure there.
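
To make the arithmetic concrete, here is a small standalone sketch (plain
Java; quorumOf mirrors Cassandra's quorum formula of RF/2 + 1, and the
scenario is our 2-primary/1-secondary replica layout):

// Quorum arithmetic for the two out-of-the-box options above.
public final class QuorumMath {
    // Cassandra computes a quorum as floor(RF / 2) + 1.
    static int quorumOf(int replicas) {
        return replicas / 2 + 1;
    }

    public static void main(String[] args) {
        // Plain QUORUM with RF=3: any 2 of the 3 replicas must respond.
        // If one of the two primary-DC replicas is down, the only way to
        // collect 2 responses is to include the secondary-DC replica,
        // i.e. a synchronous round trip over the WAN.
        System.out.println("QUORUM, RF=3 -> " + quorumOf(3));             // 2

        // Quorum restricted to the primary DC (2 replicas there):
        // 2 of 2 replicas must respond, so a single node failure in the
        // primary DC makes the quorum unreachable.
        System.out.println("quorum of 2 local replicas -> " + quorumOf(2)); // 2
    }
}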
If we're overlooking something I'd love to know.


Hence the following proposal for a new replication strategy we're calling 
SubQuorum.

In short, SubQuorum allows administratively marking some nodes as exempt
from participating in quorum. Because all nodes agree on the exemption
status, consistency is still guaranteed: quorum is simply achieved amongst
the remaining nodes. We gain tremendous flexibility to deal with node and DC
failures. Exempt nodes, if up, still receive mutation messages as usual.

For example: if a primary DC node fails, we can mark its remote counterpart
exempt from quorum, allowing continued operation without a synchronous call
over the WAN.

Or another example: if the primary DC fails, we mark all primary DC nodes
exempt and move the entire application to the secondary DC, where it runs as
usual but with just the one copy.



The implementation is trivial and consists of two pieces:

1. Exempt node management. The list of exempt nodes is broadcast out of
band. In our case we're leveraging puppet and an admin server.
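
As a rough illustration of the consuming side, here is a minimal sketch of a
loader for such a list (hypothetical: the file path argument and the
one-host-per-line format are assumptions; in our deployment puppet writes
the file):

import java.io.IOException;
import java.net.InetAddress;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical loader for the exemption list: one host per line in a file
// that configuration management pushes to every node.
public final class ExemptNodes {
    public static Set<InetAddress> load(String path) throws IOException {
        Set<InetAddress> exempt = new HashSet<InetAddress>();
        for (String line : Files.readAllLines(Paths.get(path))) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#"))
                continue; // skip blanks and comments
            exempt.add(InetAddress.getByName(line));
        }
        return exempt;
    }
}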

2. We've written an implementation of AbstractReplicationStrategy that
returns a custom QuorumResponseHandler and IWriteResponseHandler. These
simply wait for quorum amongst the non-exempt nodes.
This requires a small change to the AbstractReplicationStrategy interface to
pass the endpoints to getQuorumResponseHandler and getWriteResponseHandler,
but otherwise the changes are contained in the plugin.
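
To illustrate the wait logic, here is a standalone sketch of our reading of
it (not the actual patch; the real code implements IWriteResponseHandler,
and the class below is hypothetical):

import java.net.InetAddress;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of the SubQuorum write path: block until a majority
// of the *non-exempt* replicas have acknowledged, ignoring acks from
// exempt replicas (which still receive the mutation as usual).
public final class SubQuorumWriteHandler {
    private final Set<InetAddress> exempt; // agreed-upon exemption list
    private final CountDownLatch latch;

    public SubQuorumWriteHandler(Iterable<InetAddress> replicas,
                                 Set<InetAddress> exempt) {
        this.exempt = exempt;
        int nonExempt = 0;
        for (InetAddress replica : replicas)
            if (!exempt.contains(replica))
                nonExempt++;
        // Quorum over the non-exempt replicas only: floor(n / 2) + 1.
        this.latch = new CountDownLatch(nonExempt / 2 + 1);
    }

    // Called once per replica acknowledgement.
    public void response(InetAddress from) {
        if (!exempt.contains(from))
            latch.countDown();
    }

    public void waitForQuorum(long timeoutMillis)
            throws TimeoutException, InterruptedException {
        if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS))
            throw new TimeoutException("sub-quorum not reached");
    }
}

Under this reading, the second example above works out as follows: with both
primary DC replicas exempt, the non-exempt set is just the single secondary
replica, so the quorum is 1/2 + 1 = 1 and the application proceeds on one
copy.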


There is more analysis I can share if anyone is interested. But at this point 
I'd like to get feedback.

Thanks,
Wayne Lewis
