> In this case - N1 will be identified as a discrepancy and the change will
> be discarded via read repair

Brilliant. This does sound correct :) One more related question - how are
read repairs protected against a quorum write that is in progress? For
example, say we have nodes A, B, and C, and client C1 intends to write
K = X at QUORUM (= 2 nodes), say on A and B. Meanwhile, just after C1
finishes writing on A and before it writes on B, client C2 reads with
QUORUM. Does that trigger a read repair that races with C1?

Also, when a client reads with QUORUM, does it read all nodes (A, B, C in
this case) or just any quorum, reading more nodes only if it cannot figure
out a consistent value? What is the process here? For example, in the
scenario above, if C2 were to read A and C, it will get X (from A) and the
old value (from C), which will not achieve a quorum - would that then
trigger a read on B? And does this continue for some number of attempts
until a quorum is achieved or a timeout occurs? Under high concurrency,
the value of a hot key might be changing fast.

Thanks,
Ritesh

On Wed, Feb 23, 2011 at 6:05 PM, Anthony John <chirayit...@gmail.com> wrote:

> > Remember the simple rule. Column with highest timestamp is the one that
> > will be considered correct EVENTUALLY. So consider the following case:
>
> I am sorry, but that will return inconsistent results even at QUORUM.
> Timestamps have nothing to do with this - a timestamp is just an
> application-provided artifact and could be anything.
>
> > c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data
> > that was written to node1 will be returned.
>
> In this case - N1 will be identified as a discrepancy and the change will
> be discarded via read repair
>
> On Wed, Feb 23, 2011 at 6:47 PM, Narendra Sharma <
> narendra.sha...@gmail.com> wrote:
>
>> Remember the simple rule. Column with highest timestamp is the one that
>> will be considered correct EVENTUALLY. So consider the following case:
>>
>> Cluster size = 3 (say node1, node2 and node3), RF = 3, Read/Write CL =
>> QUORUM
>> a. QUORUM in this case requires 2 nodes. The write failed, with a
>> successful write to only 1 node, say node1.
>> b. Read with CL = QUORUM. If the read hits node2 and node3, old data
>> will be returned, with read repair triggered in the background. On the
>> next read you will get the data that was written to node1.
>> c. Read with CL = QUORUM. If the read hits node1 and node2/node3, the
>> new data that was written to node1 will be returned.
>>
>> HTH!
>>
>> Thanks,
>> Naren
>>
>> On Wed, Feb 23, 2011 at 3:36 PM, Ritesh Tijoriwala <
>> tijoriwala.rit...@gmail.com> wrote:
>>
>>> Hi Anthony,
>>> I am not talking about the case of CL ANY. I am talking about the case
>>> where your consistency levels satisfy R + W > N and you want to write
>>> to W nodes but only succeed in writing to X (where X < W) nodes, and
>>> hence fail the write to the client.
>>>
>>> thanks,
>>> Ritesh
>>>
>>> On Wed, Feb 23, 2011 at 2:48 PM, Anthony John <chirayit...@gmail.com> wrote:
>>>
>>>> Ritesh,
>>>>
>>>> At CL ANY - if all endpoints are down - a HH (hinted handoff) is
>>>> written. And it is a successful write - not a failed write.
>>>>
>>>> Now that does not guarantee a READ of the value just written - but
>>>> that is a risk you take when you use the ANY CL!
>>>>
>>>> HTH,
>>>>
>>>> -JA
>>>>
>>>> On Wed, Feb 23, 2011 at 4:40 PM, Ritesh Tijoriwala <
>>>> tijoriwala.rit...@gmail.com> wrote:
>>>>
>>>>> Hi Anthony,
>>>>> While you stated the facts right, I don't see how they relate to the
>>>>> question I asked. Can you elaborate specifically on what happens in
>>>>> the case I described above to Dave?
>>>>>
>>>>> thanks,
>>>>> Ritesh
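A minimal sketch of the timestamp-resolution rule being debated above: a toy
coordinator that contacts a quorum of replicas, returns the value with the
highest timestamp, and repairs the stale replicas it contacted. This is
illustrative Python, not Cassandra's actual code, and all names are
invented; real Cassandra can additionally run read repair against the
remaining replicas in the background (the read_repair_chance setting),
which is how case (b) in Narendra's example eventually converges.

    class Replica:
        """One storage node holding a single (value, timestamp) column."""
        def __init__(self, name):
            self.name = name
            self.value, self.timestamp = None, 0

        def read(self):
            return self.value, self.timestamp

        def write(self, value, timestamp):
            # Last-write-wins: apply only if the incoming timestamp is newer.
            if timestamp > self.timestamp:
                self.value, self.timestamp = value, timestamp

    def quorum_read(replicas, quorum):
        contacted = replicas[:quorum]      # contact a quorum, not all nodes
        responses = [(r,) + r.read() for r in contacted]
        # Resolve by highest timestamp -- even if that value came from a
        # write whose client was told it failed.
        _, value, ts = max(responses, key=lambda resp: resp[2])
        # Read repair: push the winning (value, ts) to any stale replica
        # among those contacted.
        for r, _, rts in responses:
            if rts < ts:
                r.write(value, ts)
        return value

    # Narendra's scenario: RF = 3, the new write landed only on node1.
    n1, n2, n3 = Replica("node1"), Replica("node2"), Replica("node3")
    for n in (n1, n2, n3):
        n.write("old", 100)
    n1.write("new", 200)                   # the partially-applied write

    print(quorum_read([n2, n3, n1], 2))    # case b: returns 'old'
    print(quorum_read([n1, n2, n3], 2))    # case c: 'new' wins by timestamp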
>>>>> On Wed, Feb 23, 2011 at 1:57 PM, Anthony John <
>>>>> chirayit...@gmail.com> wrote:
>>>>>
>>>>>> Seems to me that the explanations are getting incredibly complicated
>>>>>> - while I submit the real issue is not!
>>>>>>
>>>>>> Salient points here:
>>>>>> 1. To be guaranteed data consistency, the writes and reads have to
>>>>>> be at QUORUM CL or more.
>>>>>> 2. Any write/read at a lesser CL means that the application has to
>>>>>> handle the inconsistency, or has to be tolerant of it.
>>>>>> 3. Writing at ANY CL - a special case - means that writes will
>>>>>> always go through (as long as any node is up), even if the
>>>>>> destination nodes are not up. This is done via hinted handoff. But
>>>>>> this can result in inconsistent reads, and yes that is a problem -
>>>>>> but refer to pt. 2 above.
>>>>>> 4. At QUORUM CL read/write - after quorum is met, hinted handoffs
>>>>>> are used to handle the case where a particular node is down and the
>>>>>> write needs to be replicated to it. But this will not cause
>>>>>> inconsistent reads, as the hinted handoff (in this case) only
>>>>>> applies after quorum is met - so a quorum read is not dependent on
>>>>>> the down node being up and having got the hint.
>>>>>>
>>>>>> Hope I state this appropriately!
>>>>>>
>>>>>> HTH,
>>>>>>
>>>>>> -JA
>>>>>>
>>>>>> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
>>>>>> tijoriwala.rit...@gmail.com> wrote:
>>>>>>
>>>>>>> > Read repair will probably occur at that point (depending on your
>>>>>>> > config), which would cause the newest value to propagate to more
>>>>>>> > replicas.
>>>>>>>
>>>>>>> Is the newest value the "quorum" value - meaning the old value is
>>>>>>> what gets written back to the nodes holding the "newer non-quorum"
>>>>>>> value - or is the newest value the real new value? :) If the
>>>>>>> latter, then this seems kind of odd to me, and I don't see how it
>>>>>>> would be useful to any application. A bug?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ritesh
>>>>>>>
>>>>>>> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell <d...@meebo-inc.com> wrote:
>>>>>>>
>>>>>>>> Ritesh,
>>>>>>>>
>>>>>>>> You have seen the problem. Clients may read the newly written
>>>>>>>> value even though the client performing the write saw it as a
>>>>>>>> failure. When the client reads, it will use the correct number of
>>>>>>>> replicas for the chosen CL, then return the newest value seen at
>>>>>>>> any replica. This "newest value" could be the result of a failed
>>>>>>>> write.
>>>>>>>>
>>>>>>>> Read repair will probably occur at that point (depending on your
>>>>>>>> config), which would cause the newest value to propagate to more
>>>>>>>> replicas.
>>>>>>>>
>>>>>>>> R + W > N guarantees serial order of operations: any read at
>>>>>>>> CL = R that occurs after a write at CL = W will observe the write.
>>>>>>>> I don't think this property is relevant to your current question,
>>>>>>>> though.
>>>>>>>>
>>>>>>>> Cassandra has no mechanism to "roll back" the partial write, other
>>>>>>>> than to simply write again. This may also fail.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Dave
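Dave's "simply write again" remedy can be made safe on the client side
because a timestamped column write is idempotent under last-write-wins:
retrying the exact same (value, timestamp) either completes the partial
write or changes nothing. A hedged sketch of that pattern - `cluster.write`,
`WriteTimedOut`, and `FlakyCluster` are invented stand-ins, not the API of
any real client library:

    import time

    class WriteTimedOut(Exception):
        """Stand-in for a client library's write-timeout error."""

    def write_with_retries(cluster, key, value, retries=3, backoff_s=0.1):
        # Fix the timestamp once and reuse it on every retry, so all
        # attempts are the *same* write under last-write-wins resolution.
        ts = int(time.time() * 1e6)
        for attempt in range(retries):
            try:
                cluster.write(key, value, timestamp=ts, cl="QUORUM")
                return True
            except WriteTimedOut:
                # The write may or may not be durable on some replicas;
                # the only remedy is to issue the identical write again.
                time.sleep(backoff_s * (2 ** attempt))
        # Still ambiguous after all retries: the value may yet surface on
        # reads (and be propagated further by read repair).
        return False

    class FlakyCluster:
        """Toy cluster whose first write attempt times out."""
        def __init__(self):
            self.calls = 0
        def write(self, key, value, timestamp, cl):
            self.calls += 1
            if self.calls == 1:
                raise WriteTimedOut()

    print(write_with_retries(FlakyCluster(), "K", "X"))   # True (2nd try)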
>>>>>>>> On Wed, Feb 23, 2011 at 10:12 AM, <tijoriwala.rit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Dave,
>>>>>>>>> Thanks for your input. In the steps you mention, what happens
>>>>>>>>> when a client tries to read the value at step 6? Is it possible
>>>>>>>>> that the client may see the new value? My understanding was that
>>>>>>>>> if R + W > N, then the client will not see the new value, as a
>>>>>>>>> quorum of nodes will not agree on the new value. If that is the
>>>>>>>>> case, then it is alright to return failure to the client.
>>>>>>>>> However, if not, then it is difficult to program against: after
>>>>>>>>> every failure, you as a client are not sure whether the failure
>>>>>>>>> is a pseudo-failure with some side effects or a real failure.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ritesh
>>>>>>>>>
>>>>>>>>> <quote author='Dave Revell'>
>>>>>>>>>
>>>>>>>>> Ritesh,
>>>>>>>>>
>>>>>>>>> There is no commit protocol. Writes may be persisted on some
>>>>>>>>> replicas even though the quorum fails. Here's a sequence of
>>>>>>>>> events that shows the "problem":
>>>>>>>>>
>>>>>>>>> 1. Some replica R fails, but recently, so its failure has not yet
>>>>>>>>> been detected
>>>>>>>>> 2. A client writes with consistency > 1
>>>>>>>>> 3. The write goes to all replicas; all replicas except R persist
>>>>>>>>> the write to disk
>>>>>>>>> 4. Replica R never responds
>>>>>>>>> 5. Failure is returned to the client, but the new value is still
>>>>>>>>> in the cluster, on all replicas except R
>>>>>>>>>
>>>>>>>>> Something very similar could happen for CL QUORUM.
>>>>>>>>>
>>>>>>>>> This is a conscious design decision because a commit protocol
>>>>>>>>> would constitute tight coupling between nodes, which goes against
>>>>>>>>> the Cassandra philosophy. But unfortunately you do have to write
>>>>>>>>> your app with this case in mind.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
>>>>>>>>> tijoriwala.rit...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> > Hi,
>>>>>>>>> > I wanted to get details on how Cassandra does synchronous
>>>>>>>>> > writes to W replicas (out of N). Does it do a 2PC? If not, how
>>>>>>>>> > does it deal with failures of nodes before it gets to write to
>>>>>>>>> > W replicas? If the orchestrating node cannot write to W nodes
>>>>>>>>> > successfully, I guess it will fail the write operation, but
>>>>>>>>> > what happens to the completed writes on X (where X < W) nodes?
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> > Ritesh
>>>>>>>>> > --
>>>>>>>>> > View this message in context:
>>>>>>>>> > http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
>>>>>>>>> > Sent from the cassandra-u...@incubator.apache.org mailing list
>>>>>>>>> > archive at Nabble.com.
>>>>>>>>>
>>>>>>>>> </quote>
>>>>>>>>> Quoted from:
>>>>>>>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055408.html
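To make Dave's failure sequence concrete, here is a toy simulation
(invented names, not Cassandra's implementation) of the "very similar ...
for CL QUORUM" variant: two of three replicas are down, the quorum of two
acks is not reached, yet there is no 2PC-style rollback - the write simply
remains, durable, on the replica that took it:

    class Replica:
        """One storage node; alive=False models the failed replica R."""
        def __init__(self, name, alive=True):
            self.name, self.alive = name, alive
            self.value, self.timestamp = None, 0

        def write(self, value, ts):
            if not self.alive:
                return False           # step 4: a dead replica never acks
            if ts > self.timestamp:    # steps 2-3: live replicas persist
                self.value, self.timestamp = value, ts
            return True

    def coordinator_write(replicas, value, ts, required_acks):
        # The coordinator sends the write to *all* replicas...
        acks = sum(r.write(value, ts) for r in replicas)
        # ...and reports failure when fewer than required_acks answered.
        # Step 5: nothing undoes the replicas that did persist the value.
        return acks >= required_acks

    replicas = [Replica("node1"),
                Replica("node2", alive=False),  # down, not yet detected
                Replica("node3", alive=False)]

    ok = coordinator_write(replicas, "new", ts=200, required_acks=2)
    print(ok)                  # False: the client is told the write failed...
    print(replicas[0].value)   # ...yet 'new' is durable on node1, and a
                               # later read touching node1 can return it.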