>>>At t8 the request would not start, as the CL level of nodes is not
available, so the write would not be written to node X. The client would get
an UnavailableException. In response it should connect to a new coordinator and
try again.
[Naren] There may be (and most likely there will be) a window when CL will
be satisfied while the write will still fail because the node is actually down.
There are a lot of possible scenarios here. I believe Milind is talking about
some extreme but likely cases.
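A minimal sketch of the retry-against-another-coordinator pattern being
described (the coordinator list and the Write callback are placeholders for
whatever client library is in use; only the Thrift UnavailableException /
TimedOutException types are assumed):

import java.util.List;
import org.apache.cassandra.thrift.TimedOutException;
import org.apache.cassandra.thrift.UnavailableException;

// Try the operation against each known coordinator in turn until one of
// them accepts it at the requested consistency level.
public final class FailoverWrite {

    /** The actual write (e.g. a Thrift insert at QUORUM), supplied by the caller. */
    public interface Write {
        void execute(String coordinatorHost) throws Exception;
    }

    public static void writeWithFailover(List<String> coordinators, Write write)
            throws Exception {
        Exception last = null;
        for (String node : coordinators) {
            try {
                write.execute(node);   // connect to 'node' and do the QUORUM write
                return;                // success, stop here
            } catch (UnavailableException ue) {
                // Coordinator could not see CL live replicas: the write was
                // never started, so it is safe to retry on another node.
                last = ue;
            } catch (TimedOutException te) {
                // Fewer than CL replicas acknowledged in time: the write may
                // already be applied somewhere, so retries must be idempotent.
                last = te;
            }
        }
        throw last;                    // every coordinator failed
    }
}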



On Sat, Apr 23, 2011 at 7:31 PM, aaron morton <aa...@thelastpickle.com>wrote:

> Have not read the whole thing, just the time line. A couple of issues...
>
> At t8 the request would not start, as the CL level of nodes is not
> available, so the write would not be written to node X. The client would get
> an UnavailableException. In response it should connect to a new coordinator
> and try again.
>
> At t12, if RR is enabled for the request, the read is sent to all UP
> endpoints for the key. Once CL requests have returned (including the data /
> non-digest request) the responses are reconciled and a synchronous (with
> respect to the read request) RR round is initiated.
>
> Once all the requests have responded they are compared again and an async RR
> process is kicked off. So it seems that in a worst-case scenario two rounds
> of RR are possible: one to make sure the correct data is returned for the
> request, and another to make sure that all UP replicas agree, as it may not
> be the case that all UP replicas were involved in completing the request.
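The resolution in both RR rounds is timestamp based, "most recent write wins".
Very roughly, as an illustration of the idea only (not the actual Cassandra
code):

import java.util.ArrayList;
import java.util.List;

// Illustration only: the shape of timestamp-based read repair. Replica
// responses are compared, the newest value wins, and replicas that returned
// an older value are sent the winning value as a repair write.
final class ReadRepairSketch {

    static final class Response {
        final String replica;
        final byte[] value;
        final long timestamp;   // the client-supplied write timestamp
        Response(String replica, byte[] value, long timestamp) {
            this.replica = replica;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    static Response resolveAndRepair(List<Response> responses) {
        // Most recent timestamp wins, not the most popular value.
        Response winner = responses.get(0);
        for (Response r : responses)
            if (r.timestamp > winner.timestamp)
                winner = r;

        // Any replica that returned something older gets the winning value
        // pushed back to it (the repair write itself is left out of the sketch).
        List<String> needRepair = new ArrayList<String>();
        for (Response r : responses)
            if (r.timestamp < winner.timestamp)
                needRepair.add(r.replica);

        return winner;
    }
}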
>
> So as written, at t8 the write would have failed and would not be stored on
> any nodes, so the write at t7 would not be lost.
>
> I think the crux of this example is the failure mode at t8. I'm assuming
> Alice is connected to node X:
>
> 1) If X is disconnected before the write starts, it will not start any
> write that requires QUORUM CL. The write fails with an Unavailable error.
> 2) If X disconnects from the network *after* sending the write messages,
> and all messages are successfully actioned (including a local write), the
> request will fail with a TimedOutException as < CL nodes will respond.
> 3) If X disconnects from the cluster after sending the messages, and the
> messages it sends are lost but the local write succeeds, the request will
> fail with a TimedOutException as < CL nodes will respond.
>
> In all these cases the request is considered to have failed. The client
> should connect to another node and try again. In the case of timeout the
> operation was not completed to the CL level you asked for. In the case of
> unavailable the operation was not started.
>
> It can look like the RR conflict resolution is a little naive here, but
> it's less simple when you consider another scenario. The write at t8 failed
> at QUORUM, and in your deployment the client cannot connect to another node
> in the cluster, so your code drops the CL down to ONE and gets the write
> done. You are happy that any nodes in Alice's partition see her write, and
> that those in Ben's partition see his. When things get back to normal you
> want the most recent write to be what clients consistently see, not the most
> popular value. The Consistency section here
> http://wiki.apache.org/cassandra/ArchitectureOverview says the same: it's
> the most recent value.
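That dropped-to-ONE write is just the normal write issued again at a lower CL;
something like this sketch (the Insert callback is a placeholder for a normal
Thrift insert):

import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.UnavailableException;

// Sketch of the "drop the CL and get the write done" fallback described above.
final class ConsistencyFallback {

    /** Stands in for a normal insert issued at the given consistency level. */
    interface Insert {
        void run(ConsistencyLevel cl) throws Exception;
    }

    static void writeQuorumThenOne(Insert insert) throws Exception {
        try {
            insert.run(ConsistencyLevel.QUORUM);   // preferred: a quorum of replicas
        } catch (UnavailableException ue) {
            // Not enough replicas visible for QUORUM in this partition, so
            // accept weaker consistency and write at ONE. Timestamps still
            // decide which write wins when the partitions heal.
            insert.run(ConsistencyLevel.ONE);
        }
    }
}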
>
> I tend to think of Consistency as all clients getting the same response to
> the same query.
>
> Not sure if I've made things clearer, feel free to poke holes in my logic
> :)
>
> Hope that helps.
> Aaron
>
>
> On 23 Apr 2011, at 09:02, Edward Capriolo wrote:
>
> On Fri, Apr 22, 2011 at 4:31 PM, Milind Parikh <milindpar...@gmail.com>
> wrote:
>
> Is there a chance of getting manual conflict resolution in Cassandra?
>
> Please see attachment for why this is important in some cases.
>
>
> Regards
>
> Milind
>
>
>
>
> I think about this often. LDAP servers like SunOne have pluggable
> conflict resolution. I could see the read-repair algorithm being
> pluggable.
>
>
>
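On the pluggable read-repair idea, the hook might look something like the
sketch below (purely hypothetical, nothing like it exists in Cassandra today;
the fixed behaviour is highest timestamp wins):

import java.util.List;

// Purely hypothetical sketch of a pluggable conflict-resolution hook.
interface ConflictResolver {

    /** One replica's answer for the column being repaired. */
    final class Candidate {
        final String replica;
        final byte[] value;
        final long timestamp;
        Candidate(String replica, byte[] value, long timestamp) {
            this.replica = replica;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /**
     * Given every replica's candidate value, return the one to keep and to
     * push back to out-of-date replicas during read repair. The default
     * (current Cassandra behaviour) would simply pick the highest timestamp.
     */
    Candidate resolve(List<Candidate> candidates);
}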


-- 
Narendra Sharma
Solution Architect
*http://www.persistentsys.com*
*http://narendrasharma.blogspot.com/*
