Re: How does Cassandra handle failure during synchronous writes

2011-02-24 Thread Jonathan Ellis
This is where things start getting subtle.

If Cassandra's failure detector knows ahead of time that not enough
replicas are available, that is the only time we truly fail a write, and
nothing will be written anywhere.  But if a write starts during the
window where a node is failed but we don't know it yet, then it will
return TimedOutException.

This is commonly called a "failed write" but that is incorrect -- the
write is in progress, but we can't guarantee it's been replicated to
the desired number of replicas.

It's important to note that even in this situation, quorum reads +
writes provide strong consistency.  ("Strong consistency" is defined
as "after an update completes, any subsequent access will return the
updated value.") Quorum eads will be unable to complete as well until
enough machines come back to satisfy the quorum, which is the same
number as needed to finish the write.  So either the original writer
retrying, or the first reader will cause the write to be completed,
after which we're on familiar ground.
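
To make the overlap arithmetic concrete, here is a minimal Java sketch
(illustrative values only, not read from a real cluster):

    // Why R + W > N forces read and write quorums to overlap.
    public class QuorumOverlap {
        public static void main(String[] args) {
            int n = 3;         // replicas per key
            int w = n / 2 + 1; // QUORUM write: 2 of 3
            int r = n / 2 + 1; // QUORUM read:  2 of 3
            // Any R-node read set and W-node write set must share at
            // least R + W - N nodes, so a quorum read always touches a
            // replica that acknowledged the last completed quorum write.
            System.out.println("overlap >= " + (r + w - n)); // prints 1
        }
    }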

Consider the simplest non-trivial quorum, where we are replicating to
nodes X, Y, and Z.  For the case we are interested in, the original
quorum write attempt must time out, so 2 of the 3 replicas (Y and Z)
are temporarily unavailable. The write is applied to one replica (X),
and the client gets a TimedOutException. The write is not failed, it
is not succeeded, it is in progress (and the client should retry,
because it doesn't know for sure that it was applied anywhere at all).
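
The retry pattern that implies, sketched in Java (the CassandraClient
interface here is hypothetical, standing in for whatever driver you use;
only the control flow matters):

    import java.util.concurrent.TimeoutException;

    interface CassandraClient {
        void writeQuorum(String key, String value) throws TimeoutException;
    }

    class RetryingWriter {
        static void write(CassandraClient client, String key, String value,
                          int maxAttempts) throws TimeoutException {
            for (int attempt = 1; ; attempt++) {
                try {
                    client.writeQuorum(key, value);
                    return; // quorum acknowledged: the write is complete
                } catch (TimeoutException e) {
                    // A timeout means "in progress", not "failed": the
                    // write may exist on some replicas, so retrying the
                    // same (idempotent) write is safe.
                    if (attempt >= maxAttempts) throw e;
                }
            }
        }
    }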

While Y and Z stay down, quorum reads will be rejected.

When they come back up*, a read could achieve a quorum as {X, Y} or
{X, Z} or {Y, Z}.

{Y, Z} is the more interesting case because neither has the new write
yet.  The client will get the old version back, which is fine
according to our contract since the write is still in-progress.  Read
repair will see the new version on X and send it to Y and Z.  As soon
as it gets to one of them, the original write is complete, and all
subsequent reads will see the new version.

{X, Y} and {X, Z} are equivalent: one node with the write, and one
without. The read will recognize that X's version needs to be sent to
the node that lacks it, and the write will be complete.  This read and
all subsequent ones will see the write.  (The third replica will be
replicated to asynchronously via read repair.)
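
A sketch of that reconciliation step, with simplified stand-in types
(this is not Cassandra's internal code, just the "newest timestamp wins,
repair the stale replicas" rule described above):

    import java.util.*;

    class Version {
        final String value;
        final long timestamp;
        Version(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    class ReadRepairSketch {
        // Pick the newest version among the replicas consulted and
        // push it to any replica that returned an older one.
        static Version resolve(Map<String, Version> byReplica) {
            Version newest = null;
            for (Version v : byReplica.values())
                if (newest == null || v.timestamp > newest.timestamp)
                    newest = v;
            for (Map.Entry<String, Version> e : byReplica.entrySet())
                if (e.getValue().timestamp < newest.timestamp)
                    System.out.println("repair " + e.getKey());
            return newest; // value returned to the client
        }

        public static void main(String[] args) {
            Map<String, Version> replies = new HashMap<>();
            replies.put("X", new Version("new", 2)); // has the write
            replies.put("Y", new Version("old", 1)); // came back without it
            System.out.println(resolve(replies).value); // prints "new"
        }
    }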

*If only one comes back up, then you of course only have the {X, Y} or
{X, Z} case.

The important guarantee this gives you is that once one quorum read
sees the new value, all others will too.  You can't see the newest
version, then see an older version on a subsequent read, which is the
characteristic of non-strong consistency (and which you can see in
Cassandra, temporarily, with lower ConsistencyLevels).

On Tue, Feb 22, 2011 at 10:22 PM, tijoriwala.ritesh
 wrote:
>
> Hi,
> I wanted to get details on how does cassandra do synchronous writes to W
> replicas (out of N)? Does it do a 2PC? If not, how does it deal with
> failures of of nodes before it gets to write to W replicas? If the
> orchestrating node cannot write to W nodes successfully, I guess it will
> fail the write operation but what happens to the completed writes on X (W >
> X) nodes?
>
> Thanks,
> Ritesh
> --
> View this message in context: 
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
> Nabble.com.
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Narendra Sharma
>>c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data
that was written to node1 will be returned.

>>In this case - N1 will be identified as a discrepancy and the change will
be discarded via read repair

[Naren] How will Cassandra know this is a discrepancy?

On Wed, Feb 23, 2011 at 6:05 PM, Anthony John  wrote:

> >Remember the simple rule. Column with highest timestamp is the one that
> will be considered correct EVENTUALLY. So consider following case:
>
> I am sorry, that will return inconsistent results even at QUORUM.
> Timestamps have nothing to do with this; the timestamp is just an
> application-provided artifact and could be anything.
>
> >c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data
> that was written to node1 will be returned.
>
> In this case - N1 will be identified as a discrepancy and the change will
> be discarded via read repair
>
> On Wed, Feb 23, 2011 at 6:47 PM, Narendra Sharma <
> narendra.sha...@gmail.com> wrote:
>
>> Remember the simple rule. Column with highest timestamp is the one that
>> will be considered correct EVENTUALLY. So consider following case:
>>
>> Cluster size = 3 (say node1, node2 and node3), RF = 3, Read/Write CL =
>> QUORUM
>> a. QUORUM in this case requires 2 nodes. Write failed with successful
>> write to only 1 node say node1.
>> b. Read with CL = QUORUM. If read hits node2 and node3, old data will be
>> returned with read repair triggered in background. On next read you will get
>> the data that was written to node1.
>> c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data
>> that was written to node1 will be returned.
>>
>> HTH!
>>
>> Thanks,
>> Naren
>>
>>
>>
>> On Wed, Feb 23, 2011 at 3:36 PM, Ritesh Tijoriwala <
>> tijoriwala.rit...@gmail.com> wrote:
>>
>>> Hi Anthony,
>>> I am not talking about the case of CL ANY. I am talking about the case
>>> where your consistency level is  R + W > N and you want to write to W nodes
>>> but only succeed in writing to X ( where X < W) nodes and hence fail the
>>> write to the client.
>>>
>>> thanks,
>>> Ritesh
>>>
>>> On Wed, Feb 23, 2011 at 2:48 PM, Anthony John wrote:
>>>
 Ritesh,

 At CL ANY - if all endpoints are down - a HH is written. And it is a
 successful write - not a failed write.

 Now that does not guarantee a READ of the value just written - but that
 is a risk that you take when you use the ANY CL!

 HTH,

 -JA


 On Wed, Feb 23, 2011 at 4:40 PM, Ritesh Tijoriwala <
 tijoriwala.rit...@gmail.com> wrote:

> hi Anthony,
> While you stated the facts right, I don't see how it relates to the
> question I ask. Can you elaborate specifically what happens in the case I
> mentioned above to Dave?
>
> thanks,
> Ritesh
>
>
> On Wed, Feb 23, 2011 at 1:57 PM, Anthony John 
> wrote:
>
>> Seems to me that the explanations are getting incredibly complicated -
>> while I submit the real issue is not!
>>
>> Salient points here:-
>> 1. To be guaranteed data consistency - the writes and reads have to be
>> at Quorum CL or more
>> 2. Any W/R at lesser CL means that the application has to handle the
>> inconsistency, or has to be tolerant of it
>> 3. Writing at "ANY" CL - a special case - means that writes will
>> always go through (as long as any node is up), even if the destination 
>> nodes
>> are not up. This is done via hinted handoff. But this can result in
>> inconsistent reads, and yes that is a problem but refer to pt-2 above
>> 4. At QUORUM CL R/W - after Quorum is met, hinted handoffs are used to
>> handle that case where a particular node is down and the write needs to 
>> be
>> replicated to it. But this will not cause inconsistent R as the hinted
>> handoff (in this case) only applies after Quorum is met - so a Quorum R 
>> is
>> not dependent on the down node being up, and having got the hint.
>>
>> Hope I state this appropriately!
>>
>> HTH,
>>
>> -JA
>>
>>
>> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
>> tijoriwala.rit...@gmail.com> wrote:
>>
>>> > Read repair will probably occur at that point (depending on your
>>> config), which would cause the newest value to propagate to more 
>>> replicas.
>>>
>>> Is the newest value the "quorum" value which means it is the old
>>> value that will be written back to the nodes having "newer non-quorum" 
>>> value
>>> or the newest value is the real new value? :) If later, than this seems 
>>> kind
>>> of odd to me and how it will be useful to any application. A bug?
>>>
>>> Thanks,
>>> Ritesh
>>>
>>>
>>> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell wrote:
>>>
 Ritesh,

 You have seen the problem. Clients may read the newly written value
 even though the client performing the write saw it as

Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Ritesh Tijoriwala
>In this case - N1 will be identified as a discrepancy and the change will
be discarded via read repair

Brilliant. This does sound correct :)

One more related question - how are read repairs protected against a quorum
write that is in progress? E.g. say we have nodes A, B, C, and client C1
intends to write K = X at QUORUM (= 2 nodes), say on A & B; meanwhile, just
after it finishes writing on A and before writing on B, client C2 reads at
QUORUM. Does that trigger a read repair that races with C1?

Also, when a client reads at QUORUM, does it read all nodes (A, B, C in
this case) or just some quorum, and if it cannot establish a consistent
value, does it then read more? What is the process here? E.g. in the above
example, if C2 were to read A and C, it will get X and W, which do not
agree, so would that trigger a read on B? And does this continue for some
number of attempts until a quorum is achieved or a timeout occurs? E.g.
under high concurrency for a specific value, values might be changing fast.

Thanks,
Ritesh

On Wed, Feb 23, 2011 at 6:05 PM, Anthony John  wrote:

> >Remember the simple rule. Column with highest timestamp is the one that
> will be considered correct EVENTUALLY. So consider following case:
>
> I am sorry, that will return inconsistent results even at QUORUM.
> Timestamps have nothing to do with this; the timestamp is just an
> application-provided artifact and could be anything.
>
> >c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data
> that was written to node1 will be returned.
>
> In this case - N1 will be identified as a discrepancy and the change will
> be discarded via read repair
>
> On Wed, Feb 23, 2011 at 6:47 PM, Narendra Sharma <
> narendra.sha...@gmail.com> wrote:
>
>> Remember the simple rule. Column with highest timestamp is the one that
>> will be considered correct EVENTUALLY. So consider following case:
>>
>> Cluster size = 3 (say node1, node2 and node3), RF = 3, Read/Write CL =
>> QUORUM
>> a. QUORUM in this case requires 2 nodes. Write failed with successful
>> write to only 1 node say node1.
>> b. Read with CL = QUORUM. If read hits node2 and node3, old data will be
>> returned with read repair triggered in background. On next read you will get
>> the data that was written to node1.
>> c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data
>> that was written to node1 will be returned.
>>
>> HTH!
>>
>> Thanks,
>> Naren
>>
>>
>>
>> On Wed, Feb 23, 2011 at 3:36 PM, Ritesh Tijoriwala <
>> tijoriwala.rit...@gmail.com> wrote:
>>
>>> Hi Anthony,
>>> I am not talking about the case of CL ANY. I am talking about the case
>>> where your consistency level is  R + W > N and you want to write to W nodes
>>> but only succeed in writing to X ( where X < W) nodes and hence fail the
>>> write to the client.
>>>
>>> thanks,
>>> Ritesh
>>>
>>> On Wed, Feb 23, 2011 at 2:48 PM, Anthony John wrote:
>>>
 Ritesh,

 At CL ANY - if all endpoints are down - a HH is written. And it is a
 successful write - not a failed write.

 Now that does not guarantee a READ of the value just written - but that
 is a risk that you take when you use the ANY CL!

 HTH,

 -JA


 On Wed, Feb 23, 2011 at 4:40 PM, Ritesh Tijoriwala <
 tijoriwala.rit...@gmail.com> wrote:

> hi Anthony,
> While you stated the facts right, I don't see how it relates to the
> question I ask. Can you elaborate specifically what happens in the case I
> mentioned above to Dave?
>
> thanks,
> Ritesh
>
>
> On Wed, Feb 23, 2011 at 1:57 PM, Anthony John 
> wrote:
>
>> Seems to me that the explanations are getting incredibly complicated -
>> while I submit the real issue is not!
>>
>> Salient points here:-
>> 1. To be guaranteed data consistency - the writes and reads have to be
>> at Quorum CL or more
>> 2. Any W/R at lesser CL means that the application has to handle the
>> inconsistency, or has to be tolerant of it
>> 3. Writing at "ANY" CL - a special case - means that writes will
>> always go through (as long as any node is up), even if the destination 
>> nodes
>> are not up. This is done via hinted handoff. But this can result in
>> inconsistent reads, and yes that is a problem but refer to pt-2 above
>> 4. At QUORUM CL R/W - after Quorum is met, hinted handoffs are used to
>> handle that case where a particular node is down and the write needs to 
>> be
>> replicated to it. But this will not cause inconsistent R as the hinted
>> handoff (in this case) only applies after Quorum is met - so a Quorum R 
>> is
>> not dependent on the down node being up, and having got the hint.
>>
>> Hope I state this appropriately!
>>
>> HTH,
>>
>> -JA
>>
>>
>> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
>> tijoriwala.rit...@gmail.com> wrote:
>>
>>> > Read repair will

Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Anthony John
>Remember the simple rule. Column with highest timestamp is the one that
will be considered correct EVENTUALLY. So consider following case:

I am sorry, that will return inconsistent results even at QUORUM.
Timestamps have nothing to do with this; the timestamp is just an
application-provided artifact and could be anything.

>c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data that
was written to node1 will be returned.

In this case - N1 will be identified as a discrepancy and the change will be
discarded via read repair

On Wed, Feb 23, 2011 at 6:47 PM, Narendra Sharma
wrote:

> Remember the simple rule. Column with highest timestamp is the one that
> will be considered correct EVENTUALLY. So consider following case:
>
> Cluster size = 3 (say node1, node2 and node3), RF = 3, Read/Write CL =
> QUORUM
> a. QUORUM in this case requires 2 nodes. Write failed with successful write
> to only 1 node say node1.
> b. Read with CL = QUORUM. If read hits node2 and node3, old data will be
> returned with read repair triggered in background. On next read you will get
> the data that was written to node1.
> c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data that
> was written to node1 will be returned.
>
> HTH!
>
> Thanks,
> Naren
>
>
>
> On Wed, Feb 23, 2011 at 3:36 PM, Ritesh Tijoriwala <
> tijoriwala.rit...@gmail.com> wrote:
>
>> Hi Anthony,
>> I am not talking about the case of CL ANY. I am talking about the case
>> where your consistency level is  R + W > N and you want to write to W nodes
>> but only succeed in writing to X ( where X < W) nodes and hence fail the
>> write to the client.
>>
>> thanks,
>> Ritesh
>>
>> On Wed, Feb 23, 2011 at 2:48 PM, Anthony John wrote:
>>
>>> Ritesh,
>>>
>>> At CL ANY - if all endpoints are down - a HH is written. And it is a
>>> successful write - not a failed write.
>>>
>>> Now that does not guarantee a READ of the value just written - but that
>>> is a risk that you take when you use the ANY CL!
>>>
>>> HTH,
>>>
>>> -JA
>>>
>>>
>>> On Wed, Feb 23, 2011 at 4:40 PM, Ritesh Tijoriwala <
>>> tijoriwala.rit...@gmail.com> wrote:
>>>
 hi Anthony,
 While you stated the facts right, I don't see how it relates to the
 question I ask. Can you elaborate specifically what happens in the case I
 mentioned above to Dave?

 thanks,
 Ritesh


 On Wed, Feb 23, 2011 at 1:57 PM, Anthony John wrote:

> Seems to me that the explanations are getting incredibly complicated -
> while I submit the real issue is not!
>
> Salient points here:-
> 1. To be guaranteed data consistency - the writes and reads have to be
> at Quorum CL or more
> 2. Any W/R at lesser CL means that the application has to handle the
> inconsistency, or has to be tolerant of it
> 3. Writing at "ANY" CL - a special case - means that writes will always
> go through (as long as any node is up), even if the destination nodes are
> not up. This is done via hinted handoff. But this can result in 
> inconsistent
> reads, and yes that is a problem but refer to pt-2 above
> 4. At QUORUM CL R/W - after Quorum is met, hinted handoffs are used to
> handle that case where a particular node is down and the write needs to be
> replicated to it. But this will not cause inconsistent R as the hinted
> handoff (in this case) only applies after Quorum is met - so a Quorum R is
> not dependent on the down node being up, and having got the hint.
>
> Hope I state this appropriately!
>
> HTH,
>
> -JA
>
>
> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
> tijoriwala.rit...@gmail.com> wrote:
>
>> > Read repair will probably occur at that point (depending on your
>> config), which would cause the newest value to propagate to more 
>> replicas.
>>
>> Is the newest value the "quorum" value which means it is the old value
>> that will be written back to the nodes having "newer non-quorum" value or
>> the newest value is the real new value? :) If later, than this seems 
>> kind of
>> odd to me and how it will be useful to any application. A bug?
>>
>> Thanks,
>> Ritesh
>>
>>
>> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell wrote:
>>
>>> Ritesh,
>>>
>>> You have seen the problem. Clients may read the newly written value
>>> even though the client performing the write saw it as a failure. When 
>>> the
>>> client reads, it will use the correct number of replicas for the chosen 
>>> CL,
>>> then return the newest value seen at any replica. This "newest value" 
>>> could
>>> be the result of a failed write.
>>>
>>> Read repair will probably occur at that point (depending on your
>>> config), which would cause the newest value to propagate to more 
>>> replicas.
>>>
>>> R+W>N guarantees serial order of operations: any read at CL=R that
>>> occurs after

Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Ritesh Tijoriwala
Thanks Narendra. This is exactly what I was looking for. So the read will
return the old value, but at the same time, repair will occur and subsequent
reads will return the new value. But the new value was never written
successfully in the first place, as quorum was never achieved. Isn't that
semantically incorrect?
Taking the configuration of cluster size = 3 and RF = 3 as you described,
with Read/Write CL = QUORUM:

0. The current value for some key K = W.
1. The client writes K = X. Unfortunately, due to an intermittent network
error, the write cannot be completed on a quorum of nodes (say node2 or
node3). Node1 has successfully written the value X for K. Hence, a failure
is returned to the client. If this X gets propagated behind the scenes
anyway, how is the client supposed to know? This sounds like a major design
flaw. E.g. consider withdrawing $500 from account B. If the client is told
that the withdrawal cannot succeed, he will try again, just to find out
that his account is overdrawn even though he was reading and writing at
QUORUM.

In step 2, after step 1, when the client asks for K, I agree that W should
be returned, but at the same time, I don't know if silently propagating the
failed value to the rest of the nodes is the right behavior.

Thanks,
Ritesh


On Wed, Feb 23, 2011 at 4:47 PM, Narendra Sharma
wrote:

> Remember the simple rule. Column with highest timestamp is the one that
> will be considered correct EVENTUALLY. So consider following case:
>
> Cluster size = 3 (say node1, node2 and node3), RF = 3, Read/Write CL =
> QUORUM
> a. QUORUM in this case requires 2 nodes. Write failed with successful write
> to only 1 node say node1.
> b. Read with CL = QUORUM. If read hits node2 and node3, old data will be
> returned with read repair triggered in background. On next read you will get
> the data that was written to node1.
> c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data that
> was written to node1 will be returned.
>
> HTH!
>
> Thanks,
> Naren
>
>
>
> On Wed, Feb 23, 2011 at 3:36 PM, Ritesh Tijoriwala <
> tijoriwala.rit...@gmail.com> wrote:
>
>> Hi Anthony,
>> I am not talking about the case of CL ANY. I am talking about the case
>> where your consistency level is  R + W > N and you want to write to W nodes
>> but only succeed in writing to X ( where X < W) nodes and hence fail the
>> write to the client.
>>
>> thanks,
>> Ritesh
>>
>> On Wed, Feb 23, 2011 at 2:48 PM, Anthony John wrote:
>>
>>> Ritesh,
>>>
>>> At CL ANY - if all endpoints are down - a HH is written. And it is a
>>> successful write - not a failed write.
>>>
>>> Now that does not guarantee a READ of the value just written - but that
>>> is a risk that you take when you use the ANY CL!
>>>
>>> HTH,
>>>
>>> -JA
>>>
>>>
>>> On Wed, Feb 23, 2011 at 4:40 PM, Ritesh Tijoriwala <
>>> tijoriwala.rit...@gmail.com> wrote:
>>>
 hi Anthony,
 While you stated the facts right, I don't see how it relates to the
 question I ask. Can you elaborate specifically what happens in the case I
 mentioned above to Dave?

 thanks,
 Ritesh


 On Wed, Feb 23, 2011 at 1:57 PM, Anthony John wrote:

> Seems to me that the explanations are getting incredibly complicated -
> while I submit the real issue is not!
>
> Salient points here:-
> 1. To be guaranteed data consistency - the writes and reads have to be
> at Quorum CL or more
> 2. Any W/R at lesser CL means that the application has to handle the
> inconsistency, or has to be tolerant of it
> 3. Writing at "ANY" CL - a special case - means that writes will always
> go through (as long as any node is up), even if the destination nodes are
> not up. This is done via hinted handoff. But this can result in 
> inconsistent
> reads, and yes that is a problem but refer to pt-2 above
> 4. At QUORUM CL R/W - after Quorum is met, hinted handoffs are used to
> handle that case where a particular node is down and the write needs to be
> replicated to it. But this will not cause inconsistent R as the hinted
> handoff (in this case) only applies after Quorum is met - so a Quorum R is
> not dependent on the down node being up, and having got the hint.
>
> Hope I state this appropriately!
>
> HTH,
>
> -JA
>
>
> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
> tijoriwala.rit...@gmail.com> wrote:
>
>> > Read repair will probably occur at that point (depending on your
>> config), which would cause the newest value to propagate to more 
>> replicas.
>>
>> Is the newest value the "quorum" value which means it is the old value
>> that will be written back to the nodes having "newer non-quorum" value or
>> the newest value is the real new value? :) If later, than this seems 
>> kind of
>> odd to me and how it will be useful to any application. A bug?
>>
>> Thanks,
>> Ri

Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Narendra Sharma
Remember the simple rule: the column with the highest timestamp is the one
that will be considered correct EVENTUALLY. So consider the following case:

Cluster size = 3 (say node1, node2 and node3), RF = 3, Read/Write CL =
QUORUM
a. QUORUM in this case requires 2 nodes. The write fails after reaching
only 1 node, say node1.
b. Read with CL = QUORUM. If the read hits node2 and node3, old data will
be returned, with read repair triggered in the background. On the next read
you will get the data that was written to node1.
c. Read with CL = QUORUM. If the read hits node1 and node2/node3, the new
data that was written to node1 will be returned.
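
To see which two-node read sets return the new data before repair, a small
Java sketch (toy timestamps, not Cassandra code):

    import java.util.*;

    public class QuorumCases {
        public static void main(String[] args) {
            Map<String, Long> ts = new HashMap<>();
            ts.put("node1", 2L); // got the write (newer timestamp)
            ts.put("node2", 1L); // still holds the old column
            ts.put("node3", 1L);
            String[][] readSets = {
                {"node1", "node2"}, {"node1", "node3"}, {"node2", "node3"}
            };
            for (String[] set : readSets) {
                long newest = Math.max(ts.get(set[0]), ts.get(set[1]));
                System.out.println(Arrays.toString(set) + " -> "
                    + (newest == 2 ? "new" : "old") + " data");
            }
        }
    }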

HTH!

Thanks,
Naren


On Wed, Feb 23, 2011 at 3:36 PM, Ritesh Tijoriwala <
tijoriwala.rit...@gmail.com> wrote:

> Hi Anthony,
> I am not talking about the case of CL ANY. I am talking about the case
> where your consistency level is  R + W > N and you want to write to W nodes
> but only succeed in writing to X ( where X < W) nodes and hence fail the
> write to the client.
>
> thanks,
> Ritesh
>
> On Wed, Feb 23, 2011 at 2:48 PM, Anthony John wrote:
>
>> Ritesh,
>>
>> At CL ANY - if all endpoints are down - a HH is written. And it is a
>> successful write - not a failed write.
>>
>> Now that does not guarantee a READ of the value just written - but that is
>> a risk that you take when you use the ANY CL!
>>
>> HTH,
>>
>> -JA
>>
>>
>> On Wed, Feb 23, 2011 at 4:40 PM, Ritesh Tijoriwala <
>> tijoriwala.rit...@gmail.com> wrote:
>>
>>> hi Anthony,
>>> While you stated the facts right, I don't see how it relates to the
>>> question I ask. Can you elaborate specifically what happens in the case I
>>> mentioned above to Dave?
>>>
>>> thanks,
>>> Ritesh
>>>
>>>
>>> On Wed, Feb 23, 2011 at 1:57 PM, Anthony John wrote:
>>>
 Seems to me that the explanations are getting incredibly complicated -
 while I submit the real issue is not!

 Salient points here:-
 1. To be guaranteed data consistency - the writes and reads have to be
 at Quorum CL or more
 2. Any W/R at lesser CL means that the application has to handle the
 inconsistency, or has to be tolerant of it
 3. Writing at "ANY" CL - a special case - means that writes will always
 go through (as long as any node is up), even if the destination nodes are
 not up. This is done via hinted handoff. But this can result in 
 inconsistent
 reads, and yes that is a problem but refer to pt-2 above
 4. At QUORUM CL R/W - after Quorum is met, hinted handoffs are used to
 handle that case where a particular node is down and the write needs to be
 replicated to it. But this will not cause inconsistent R as the hinted
 handoff (in this case) only applies after Quorum is met - so a Quorum R is
 not dependent on the down node being up, and having got the hint.

 Hope I state this appropriately!

 HTH,

 -JA


 On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
 tijoriwala.rit...@gmail.com> wrote:

> > Read repair will probably occur at that point (depending on your
> config), which would cause the newest value to propagate to more replicas.
>
> Is the newest value the "quorum" value which means it is the old value
> that will be written back to the nodes having "newer non-quorum" value or
> the newest value is the real new value? :) If later, than this seems kind 
> of
> odd to me and how it will be useful to any application. A bug?
>
> Thanks,
> Ritesh
>
>
> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell wrote:
>
>> Ritesh,
>>
>> You have seen the problem. Clients may read the newly written value
>> even though the client performing the write saw it as a failure. When the
>> client reads, it will use the correct number of replicas for the chosen 
>> CL,
>> then return the newest value seen at any replica. This "newest value" 
>> could
>> be the result of a failed write.
>>
>> Read repair will probably occur at that point (depending on your
>> config), which would cause the newest value to propagate to more 
>> replicas.
>>
>> R+W>N guarantees serial order of operations: any read at CL=R that
>> occurs after a write at CL=W will observe the write. I don't think this
>> property is relevant to your current question, though.
>>
>> Cassandra has no mechanism to "roll back" the partial write, other
>> than to simply write again. This may also fail.
>>
>> Best,
>> Dave
>>
>>
>> On Wed, Feb 23, 2011 at 10:12 AM, wrote:
>>
>>> Hi Dave,
>>> Thanks for your input. In the steps you mention, what happens when
>>> client tries to read the value at step 6? Is it possible that the 
>>> client may
>>> see the new value? My understanding was if R + W > N, then client will 
>>> not
>>> see the new value as Quorum nodes will not agree on the new value. If 
>>> that
>>> is the

Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Ritesh Tijoriwala
Hi Anthony,
I am not talking about the case of CL ANY. I am talking about the case where
your consistency levels satisfy R + W > N and you want to write to W nodes but
only succeed in writing to X (where X < W) nodes and hence fail the write
to the client.

thanks,
Ritesh

On Wed, Feb 23, 2011 at 2:48 PM, Anthony John  wrote:

> Ritesh,
>
> At CL ANY - if all endpoints are down - a HH is written. And it is a
> successful write - not a failed write.
>
> Now that does not guarantee a READ of the value just written - but that is
> a risk that you take when you use the ANY CL!
>
> HTH,
>
> -JA
>
>
> On Wed, Feb 23, 2011 at 4:40 PM, Ritesh Tijoriwala <
> tijoriwala.rit...@gmail.com> wrote:
>
>> hi Anthony,
>> While you stated the facts right, I don't see how it relates to the
>> question I ask. Can you elaborate specifically what happens in the case I
>> mentioned above to Dave?
>>
>> thanks,
>> Ritesh
>>
>>
>> On Wed, Feb 23, 2011 at 1:57 PM, Anthony John wrote:
>>
>>> Seems to me that the explanations are getting incredibly complicated -
>>> while I submit the real issue is not!
>>>
>>> Salient points here:-
>>> 1. To be guaranteed data consistency - the writes and reads have to be at
>>> Quorum CL or more
>>> 2. Any W/R at lesser CL means that the application has to handle the
>>> inconsistency, or has to be tolerant of it
>>> 3. Writing at "ANY" CL - a special case - means that writes will always
>>> go through (as long as any node is up), even if the destination nodes are
>>> not up. This is done via hinted handoff. But this can result in inconsistent
>>> reads, and yes that is a problem but refer to pt-2 above
>>> 4. At QUORUM CL R/W - after Quorum is met, hinted handoffs are used to
>>> handle that case where a particular node is down and the write needs to be
>>> replicated to it. But this will not cause inconsistent R as the hinted
>>> handoff (in this case) only applies after Quorum is met - so a Quorum R is
>>> not dependent on the down node being up, and having got the hint.
>>>
>>> Hope I state this appropriately!
>>>
>>> HTH,
>>>
>>> -JA
>>>
>>>
>>> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
>>> tijoriwala.rit...@gmail.com> wrote:
>>>
 > Read repair will probably occur at that point (depending on your
 config), which would cause the newest value to propagate to more replicas.

 Is the newest value the "quorum" value which means it is the old value
 that will be written back to the nodes having "newer non-quorum" value or
 the newest value is the real new value? :) If later, than this seems kind 
 of
 odd to me and how it will be useful to any application. A bug?

 Thanks,
 Ritesh


 On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell wrote:

> Ritesh,
>
> You have seen the problem. Clients may read the newly written value
> even though the client performing the write saw it as a failure. When the
> client reads, it will use the correct number of replicas for the chosen 
> CL,
> then return the newest value seen at any replica. This "newest value" 
> could
> be the result of a failed write.
>
> Read repair will probably occur at that point (depending on your
> config), which would cause the newest value to propagate to more replicas.
>
> R+W>N guarantees serial order of operations: any read at CL=R that
> occurs after a write at CL=W will observe the write. I don't think this
> property is relevant to your current question, though.
>
> Cassandra has no mechanism to "roll back" the partial write, other than
> to simply write again. This may also fail.
>
> Best,
> Dave
>
>
> On Wed, Feb 23, 2011 at 10:12 AM,  wrote:
>
>> Hi Dave,
>> Thanks for your input. In the steps you mention, what happens when
>> client tries to read the value at step 6? Is it possible that the client 
>> may
>> see the new value? My understanding was if R + W > N, then client will 
>> not
>> see the new value as Quorum nodes will not agree on the new value. If 
>> that
>> is the case, then its alright to return failure to the client. However, 
>> if
>> not, then it is difficult to program as after every failure, you as an
>> client are not sure if failure is a pseudo failure with some side 
>> effects or
>> real failure.
>>
>> Thanks,
>> Ritesh
>>
>> 
>>
>> Ritesh,
>>
>> There is no commit protocol. Writes may be persisted on some replicas
>> even
>> though the quorum fails. Here's a sequence of events that shows the
>> "problem:"
>>
>> 1. Some replica R fails, but recently, so its failure has not yet been
>> detected
>> 2. A client writes with consistency > 1
>> 3. The write goes to all replicas, all replicas except R persist the
>> write
>> to disk
>> 4. Replica R never responds
>> 5. Failure is returned to the client, but the 

Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Anthony John
Ritesh,

At CL ANY - if all endpoints are down - a HH is written. And it is a
successful write - not a failed write.

Now that does not guarantee a READ of the value just written - but that is a
risk that you take when you use the ANY CL!

HTH,

-JA

On Wed, Feb 23, 2011 at 4:40 PM, Ritesh Tijoriwala <
tijoriwala.rit...@gmail.com> wrote:

> hi Anthony,
> While you stated the facts right, I don't see how it relates to the
> question I ask. Can you elaborate specifically what happens in the case I
> mentioned above to Dave?
>
> thanks,
> Ritesh
>
>
> On Wed, Feb 23, 2011 at 1:57 PM, Anthony John wrote:
>
>> Seems to me that the explanations are getting incredibly complicated -
>> while I submit the real issue is not!
>>
>> Salient points here:-
>> 1. To be guaranteed data consistency - the writes and reads have to be at
>> Quorum CL or more
>> 2. Any W/R at lesser CL means that the application has to handle the
>> inconsistency, or has to be tolerant of it
>> 3. Writing at "ANY" CL - a special case - means that writes will always go
>> through (as long as any node is up), even if the destination nodes are not
>> up. This is done via hinted handoff. But this can result in inconsistent
>> reads, and yes that is a problem but refer to pt-2 above
>> 4. At QUORUM CL R/W - after Quorum is met, hinted handoffs are used to
>> handle that case where a particular node is down and the write needs to be
>> replicated to it. But this will not cause inconsistent R as the hinted
>> handoff (in this case) only applies after Quorum is met - so a Quorum R is
>> not dependent on the down node being up, and having got the hint.
>>
>> Hope I state this appropriately!
>>
>> HTH,
>>
>> -JA
>>
>>
>> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
>> tijoriwala.rit...@gmail.com> wrote:
>>
>>> > Read repair will probably occur at that point (depending on your
>>> config), which would cause the newest value to propagate to more replicas.
>>>
>>> Is the newest value the "quorum" value which means it is the old value
>>> that will be written back to the nodes having "newer non-quorum" value or
>>> the newest value is the real new value? :) If later, than this seems kind of
>>> odd to me and how it will be useful to any application. A bug?
>>>
>>> Thanks,
>>> Ritesh
>>>
>>>
>>> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell wrote:
>>>
 Ritesh,

 You have seen the problem. Clients may read the newly written value even
 though the client performing the write saw it as a failure. When the client
 reads, it will use the correct number of replicas for the chosen CL, then
 return the newest value seen at any replica. This "newest value" could be
 the result of a failed write.

 Read repair will probably occur at that point (depending on your
 config), which would cause the newest value to propagate to more replicas.

 R+W>N guarantees serial order of operations: any read at CL=R that
 occurs after a write at CL=W will observe the write. I don't think this
 property is relevant to your current question, though.

 Cassandra has no mechanism to "roll back" the partial write, other than
 to simply write again. This may also fail.

 Best,
 Dave


 On Wed, Feb 23, 2011 at 10:12 AM,  wrote:

> Hi Dave,
> Thanks for your input. In the steps you mention, what happens when
> client tries to read the value at step 6? Is it possible that the client 
> may
> see the new value? My understanding was if R + W > N, then client will not
> see the new value as Quorum nodes will not agree on the new value. If that
> is the case, then its alright to return failure to the client. However, if
> not, then it is difficult to program as after every failure, you as an
> client are not sure if failure is a pseudo failure with some side effects 
> or
> real failure.
>
> Thanks,
> Ritesh
>
> 
>
> Ritesh,
>
> There is no commit protocol. Writes may be persisted on some replicas
> even
> though the quorum fails. Here's a sequence of events that shows the
> "problem:"
>
> 1. Some replica R fails, but recently, so its failure has not yet been
> detected
> 2. A client writes with consistency > 1
> 3. The write goes to all replicas, all replicas except R persist the
> write
> to disk
> 4. Replica R never responds
> 5. Failure is returned to the client, but the new value is still in the
> cluster, on all replicas except R.
>
> Something very similar could happen for CL QUORUM.
>
> This is a conscious design decision because a commit protocol would
> constitute tight coupling between nodes, which goes against the
> Cassandra
> philosophy. But unfortunately you do have to write your app with this
> case
> in mind.
>
> Best,
> Dave
>
> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
> tijori

Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Ritesh Tijoriwala
Hi Anthony,
While you stated the facts right, I don't see how it relates to the question
I asked. Can you elaborate specifically on what happens in the case I
mentioned above to Dave?

thanks,
Ritesh

On Wed, Feb 23, 2011 at 1:57 PM, Anthony John  wrote:

> Seems to me that the explanations are getting incredibly complicated -
> while I submit the real issue is not!
>
> Salient points here:-
> 1. To be guaranteed data consistency - the writes and reads have to be at
> Quorum CL or more
> 2. Any W/R at lesser CL means that the application has to handle the
> inconsistency, or has to be tolerant of it
> 3. Writing at "ANY" CL - a special case - means that writes will always go
> through (as long as any node is up), even if the destination nodes are not
> up. This is done via hinted handoff. But this can result in inconsistent
> reads, and yes that is a problem but refer to pt-2 above
> 4. At QUORUM CL R/W - after Quorum is met, hinted handoffs are used to
> handle that case where a particular node is down and the write needs to be
> replicated to it. But this will not cause inconsistent R as the hinted
> handoff (in this case) only applies after Quorum is met - so a Quorum R is
> not dependent on the down node being up, and having got the hint.
>
> Hope I state this appropriately!
>
> HTH,
>
> -JA
>
>
> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
> tijoriwala.rit...@gmail.com> wrote:
>
>> > Read repair will probably occur at that point (depending on your
>> config), which would cause the newest value to propagate to more replicas.
>>
>> Is the newest value the "quorum" value which means it is the old value
>> that will be written back to the nodes having "newer non-quorum" value or
>> the newest value is the real new value? :) If later, than this seems kind of
>> odd to me and how it will be useful to any application. A bug?
>>
>> Thanks,
>> Ritesh
>>
>>
>> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell  wrote:
>>
>>> Ritesh,
>>>
>>> You have seen the problem. Clients may read the newly written value even
>>> though the client performing the write saw it as a failure. When the client
>>> reads, it will use the correct number of replicas for the chosen CL, then
>>> return the newest value seen at any replica. This "newest value" could be
>>> the result of a failed write.
>>>
>>> Read repair will probably occur at that point (depending on your config),
>>> which would cause the newest value to propagate to more replicas.
>>>
>>> R+W>N guarantees serial order of operations: any read at CL=R that occurs
>>> after a write at CL=W will observe the write. I don't think this property is
>>> relevant to your current question, though.
>>>
>>> Cassandra has no mechanism to "roll back" the partial write, other than
>>> to simply write again. This may also fail.
>>>
>>> Best,
>>> Dave
>>>
>>>
>>> On Wed, Feb 23, 2011 at 10:12 AM,  wrote:
>>>
 Hi Dave,
 Thanks for your input. In the steps you mention, what happens when
 client tries to read the value at step 6? Is it possible that the client 
 may
 see the new value? My understanding was if R + W > N, then client will not
 see the new value as Quorum nodes will not agree on the new value. If that
 is the case, then its alright to return failure to the client. However, if
 not, then it is difficult to program as after every failure, you as an
 client are not sure if failure is a pseudo failure with some side effects 
 or
 real failure.

 Thanks,
 Ritesh

 

 Ritesh,

 There is no commit protocol. Writes may be persisted on some replicas
 even
 though the quorum fails. Here's a sequence of events that shows the
 "problem:"

 1. Some replica R fails, but recently, so its failure has not yet been
 detected
 2. A client writes with consistency > 1
 3. The write goes to all replicas, all replicas except R persist the
 write
 to disk
 4. Replica R never responds
 5. Failure is returned to the client, but the new value is still in the
 cluster, on all replicas except R.

 Something very similar could happen for CL QUORUM.

 This is a conscious design decision because a commit protocol would
 constitute tight coupling between nodes, which goes against the
 Cassandra
 philosophy. But unfortunately you do have to write your app with this
 case
 in mind.

 Best,
 Dave

 On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
 tijoriwala.rit...@gmail.com> wrote:

 >
 > Hi,
 > I wanted to get details on how does cassandra do synchronous writes to
 W
 > replicas (out of N)? Does it do a 2PC? If not, how does it deal with
 > failures of of nodes before it gets to write to W replicas? If the
 > orchestrating node cannot write to W nodes successfully, I guess it
 will
 > fail the write operation but what happens to the completed writes on X
 (W
 > >

Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Anthony John
Seems to me that the explanations are getting incredibly complicated - while
I submit the real issue is not!

Salient points here:-
1. To be guaranteed data consistency - the writes and reads have to be at
Quorum CL or more
2. Any W/R at lesser CL means that the application has to handle the
inconsistency, or has to be tolerant of it
3. Writing at "ANY" CL - a special case - means that writes will always go
through (as long as any node is up), even if the destination nodes are not
up. This is done via hinted handoff. But this can result in inconsistent
reads, and yes that is a problem but refer to pt-2 above
4. At QUORUM CL R/W - after Quorum is met, hinted handoffs are used to
handle that case where a particular node is down and the write needs to be
replicated to it. But this will not cause inconsistent R as the hinted
handoff (in this case) only applies after Quorum is met - so a Quorum R is
not dependent on the down node being up, and having got the hint.
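
Points 1 and 2 above reduce to a one-line predicate. A minimal sketch with
illustrative values (not read from a live cluster):

    public class ConsistencyRegime {
        static String regime(int n, int r, int w) {
            return (r + w > n)
                ? "strongly consistent (point 1)"
                : "application must tolerate staleness (point 2)";
        }

        public static void main(String[] args) {
            int n = 3;
            System.out.println("R=2,W=2: " + regime(n, 2, 2)); // quorum/quorum
            System.out.println("R=1,W=1: " + regime(n, 1, 1)); // fast, stale-prone
        }
    }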

Hope I state this appropriately!

HTH,

-JA

On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
tijoriwala.rit...@gmail.com> wrote:

> > Read repair will probably occur at that point (depending on your config),
> which would cause the newest value to propagate to more replicas.
>
> Is the newest value the "quorum" value which means it is the old value that
> will be written back to the nodes having "newer non-quorum" value or the
> newest value is the real new value? :) If later, than this seems kind of odd
> to me and how it will be useful to any application. A bug?
>
> Thanks,
> Ritesh
>
>
> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell  wrote:
>
>> Ritesh,
>>
>> You have seen the problem. Clients may read the newly written value even
>> though the client performing the write saw it as a failure. When the client
>> reads, it will use the correct number of replicas for the chosen CL, then
>> return the newest value seen at any replica. This "newest value" could be
>> the result of a failed write.
>>
>> Read repair will probably occur at that point (depending on your config),
>> which would cause the newest value to propagate to more replicas.
>>
>> R+W>N guarantees serial order of operations: any read at CL=R that occurs
>> after a write at CL=W will observe the write. I don't think this property is
>> relevant to your current question, though.
>>
>> Cassandra has no mechanism to "roll back" the partial write, other than to
>> simply write again. This may also fail.
>>
>> Best,
>> Dave
>>
>>
>> On Wed, Feb 23, 2011 at 10:12 AM,  wrote:
>>
>>> Hi Dave,
>>> Thanks for your input. In the steps you mention, what happens when client
>>> tries to read the value at step 6? Is it possible that the client may see
>>> the new value? My understanding was if R + W > N, then client will not see
>>> the new value as Quorum nodes will not agree on the new value. If that is
>>> the case, then its alright to return failure to the client. However, if not,
>>> then it is difficult to program as after every failure, you as an client are
>>> not sure if failure is a pseudo failure with some side effects or real
>>> failure.
>>>
>>> Thanks,
>>> Ritesh
>>>
>>> 
>>>
>>> Ritesh,
>>>
>>> There is no commit protocol. Writes may be persisted on some replicas
>>> even
>>> though the quorum fails. Here's a sequence of events that shows the
>>> "problem:"
>>>
>>> 1. Some replica R fails, but recently, so its failure has not yet been
>>> detected
>>> 2. A client writes with consistency > 1
>>> 3. The write goes to all replicas, all replicas except R persist the
>>> write
>>> to disk
>>> 4. Replica R never responds
>>> 5. Failure is returned to the client, but the new value is still in the
>>> cluster, on all replicas except R.
>>>
>>> Something very similar could happen for CL QUORUM.
>>>
>>> This is a conscious design decision because a commit protocol would
>>> constitute tight coupling between nodes, which goes against the Cassandra
>>> philosophy. But unfortunately you do have to write your app with this
>>> case
>>> in mind.
>>>
>>> Best,
>>> Dave
>>>
>>> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
>>> tijoriwala.rit...@gmail.com> wrote:
>>>
>>> >
>>> > Hi,
>>> > I wanted to get details on how does cassandra do synchronous writes to
>>> W
>>> > replicas (out of N)? Does it do a 2PC? If not, how does it deal with
>>> > failures of of nodes before it gets to write to W replicas? If the
>>> > orchestrating node cannot write to W nodes successfully, I guess it
>>> will
>>> > fail the write operation but what happens to the completed writes on X
>>> (W
>>> > >
>>> > X) nodes?
>>> >
>>> > Thanks,
>>> > Ritesh
>>> > --
>>> > View this message in context:
>>> >
>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
>>> > Sent from the cassandra-u...@incubator.apache.org mailing list archive
>>> at
>>> > Nabble.com.
>>> >
>>>
>>> 
>>> Quoted from:
>>>
>>> http://cassandra-user-i

Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Ritesh Tijoriwala
> Read repair will probably occur at that point (depending on your config),
which would cause the newest value to propagate to more replicas.

Is the newest value the "quorum" value, meaning the old value will be written
back to the nodes holding the "newer non-quorum" value, or is the newest value
the real new value? :) If the latter, then this seems kind of odd to me, and I
don't see how it would be useful to any application. A bug?

Thanks,
Ritesh

On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell  wrote:

> Ritesh,
>
> You have seen the problem. Clients may read the newly written value even
> though the client performing the write saw it as a failure. When the client
> reads, it will use the correct number of replicas for the chosen CL, then
> return the newest value seen at any replica. This "newest value" could be
> the result of a failed write.
>
> Read repair will probably occur at that point (depending on your config),
> which would cause the newest value to propagate to more replicas.
>
> R+W>N guarantees serial order of operations: any read at CL=R that occurs
> after a write at CL=W will observe the write. I don't think this property is
> relevant to your current question, though.
>
> Cassandra has no mechanism to "roll back" the partial write, other than to
> simply write again. This may also fail.
>
> Best,
> Dave
>
>
> On Wed, Feb 23, 2011 at 10:12 AM,  wrote:
>
>> Hi Dave,
>> Thanks for your input. In the steps you mention, what happens when client
>> tries to read the value at step 6? Is it possible that the client may see
>> the new value? My understanding was if R + W > N, then client will not see
>> the new value as Quorum nodes will not agree on the new value. If that is
>> the case, then its alright to return failure to the client. However, if not,
>> then it is difficult to program as after every failure, you as an client are
>> not sure if failure is a pseudo failure with some side effects or real
>> failure.
>>
>> Thanks,
>> Ritesh
>>
>> 
>>
>> Ritesh,
>>
>> There is no commit protocol. Writes may be persisted on some replicas even
>> though the quorum fails. Here's a sequence of events that shows the
>> "problem:"
>>
>> 1. Some replica R fails, but recently, so its failure has not yet been
>> detected
>> 2. A client writes with consistency > 1
>> 3. The write goes to all replicas, all replicas except R persist the write
>> to disk
>> 4. Replica R never responds
>> 5. Failure is returned to the client, but the new value is still in the
>> cluster, on all replicas except R.
>>
>> Something very similar could happen for CL QUORUM.
>>
>> This is a conscious design decision because a commit protocol would
>> constitute tight coupling between nodes, which goes against the Cassandra
>> philosophy. But unfortunately you do have to write your app with this case
>> in mind.
>>
>> Best,
>> Dave
>>
>> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
>> tijoriwala.rit...@gmail.com> wrote:
>>
>> >
>> > Hi,
>> > I wanted to get details on how does cassandra do synchronous writes to W
>> > replicas (out of N)? Does it do a 2PC? If not, how does it deal with
>> > failures of of nodes before it gets to write to W replicas? If the
>> > orchestrating node cannot write to W nodes successfully, I guess it will
>> > fail the write operation but what happens to the completed writes on X
>> (W
>> > >
>> > X) nodes?
>> >
>> > Thanks,
>> > Ritesh
>> > --
>> > View this message in context:
>> >
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
>> > Sent from the cassandra-u...@incubator.apache.org mailing list archive
>> at
>> > Nabble.com.
>> >
>>
>> 
>> Quoted from:
>>
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055408.html
>>
>
>


Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Dave Revell
Ritesh,

You have seen the problem. Clients may read the newly written value even
though the client performing the write saw it as a failure. When the client
reads, it will use the correct number of replicas for the chosen CL, then
return the newest value seen at any replica. This "newest value" could be
the result of a failed write.

Read repair will probably occur at that point (depending on your config),
which would cause the newest value to propagate to more replicas.

R+W>N guarantees serial order of operations: any read at CL=R that occurs
after a write at CL=W will observe the write. I don't think this property is
relevant to your current question, though.

Cassandra has no mechanism to "roll back" the partial write, other than to
simply write again. This may also fail.

Best,
Dave


On Wed, Feb 23, 2011 at 10:12 AM,  wrote:

> Hi Dave,
> Thanks for your input. In the steps you mention, what happens when client
> tries to read the value at step 6? Is it possible that the client may see
> the new value? My understanding was if R + W > N, then client will not see
> the new value as Quorum nodes will not agree on the new value. If that is
> the case, then its alright to return failure to the client. However, if not,
> then it is difficult to program as after every failure, you as an client are
> not sure if failure is a pseudo failure with some side effects or real
> failure.
>
> Thanks,
> Ritesh
>
> 
> Ritesh,
>
> There is no commit protocol. Writes may be persisted on some replicas even
> though the quorum fails. Here's a sequence of events that shows the
> "problem:"
>
> 1. Some replica R fails, but recently, so its failure has not yet been
> detected
> 2. A client writes with consistency > 1
> 3. The write goes to all replicas, all replicas except R persist the write
> to disk
> 4. Replica R never responds
> 5. Failure is returned to the client, but the new value is still in the
> cluster, on all replicas except R.
>
> Something very similar could happen for CL QUORUM.
>
> This is a conscious design decision because a commit protocol would
> constitute tight coupling between nodes, which goes against the Cassandra
> philosophy. But unfortunately you do have to write your app with this case
> in mind.
>
> Best,
> Dave
>
> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
> tijoriwala.rit...@gmail.com> wrote:
>
> >
> > Hi,
> > I wanted to get details on how does cassandra do synchronous writes to W
> > replicas (out of N)? Does it do a 2PC? If not, how does it deal with
> > failures of of nodes before it gets to write to W replicas? If the
> > orchestrating node cannot write to W nodes successfully, I guess it will
> > fail the write operation but what happens to the completed writes on X (W
> > >
> > X) nodes?
> >
> > Thanks,
> > Ritesh
> > --
> > View this message in context:
> >
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
> > Sent from the cassandra-u...@incubator.apache.org mailing list archive
> at
> > Nabble.com.
> >
>
> 
> Quoted from:
>
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055408.html
>


Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Aaron Morton
At CL levels higher than ANY, hinted handoff will be used if enabled. It does
not contribute to the number of replicas considered written by the
coordinator, though. E.g. if you ask for QUORUM, and that means 3 nodes, and
only 2 are up, the write will fail without starting. When the write can
proceed, the HH is included in the message sent to one of the up nodes.

At CL ANY, an HH is accepted as a viable replica. Even if all the natural
endpoints are down, the coordinator node will store the HH.
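
A toy model of that coordinator-side decision (simplified; at CL above ANY a
hint does not count toward the required replica count, so the write fails up
front if too few replicas are alive):

    public class HintedHandoffSketch {
        enum CL { ANY, ONE, QUORUM }

        static boolean canStartWrite(CL cl, int aliveReplicas, int n) {
            int required;
            switch (cl) {
                case ANY:    return true; // a stored hint alone suffices
                case ONE:    required = 1; break;
                case QUORUM: required = n / 2 + 1; break;
                default:     throw new AssertionError();
            }
            return aliveReplicas >= required; // else UnavailableException
        }

        public static void main(String[] args) {
            System.out.println(canStartWrite(CL.QUORUM, 2, 3)); // true
            System.out.println(canStartWrite(CL.QUORUM, 1, 3)); // false
            System.out.println(canStartWrite(CL.ANY, 0, 3));    // true
        }
    }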

aaron


Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Javier Canillas
There is something called Hinted Handoff. Suppose you WRITE something
with ConsistencyLevel.ONE on a cluster of 4 nodes. The write is done on
the corresponding node and an OK is returned to the client, even if the
ReplicationFactor of the destination Keyspace is set to a higher value.

If, during that write, one of the replica nodes is down, then the coordinator
node (the one that holds the value in the first place) will mark that
replication message as not sent and will eventually retry, making the
replication happen.

Please correct me if I have explained it wrongly.
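
Roughly, as a toy model (hypothetical Python; the node names, hint list and
function names are all made up to illustrate the flow above, not Cassandra's
actual implementation):

def write_cl_one(replicas, stored, hints, value):
    # replicas: {"n1": True, "n3": False}  node -> is it up
    # stored:   node -> value actually persisted on that node
    # hints:    list of (down_node, value) pairs to replay later
    up = [n for n, alive in replicas.items() if alive]
    if not up:
        raise RuntimeError("UnavailableException: no live replicas")
    for node in up:
        stored[node] = value              # live replicas persist right away
    for node, alive in replicas.items():
        if not alive:
            hints.append((node, value))   # remembered for hinted handoff
    return "OK"                           # one live ack satisfies CL.ONE

def node_recovered(node, replicas, stored, hints):
    replicas[node] = True
    for target, value in [h for h in hints if h[0] == node]:
        stored[target] = value            # hint replay completes replication
    hints[:] = [h for h in hints if h[0] != node]

# The client gets "OK" even though n3 missed the write; n3 catches up later.
replicas, stored, hints = {"n1": True, "n2": True, "n3": False}, {}, []
write_cl_one(replicas, stored, hints, "v1")
node_recovered("n3", replicas, stored, hints)
assert stored["n3"] == "v1"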



Re: How does Cassandra handle failure during synchronous writes

2011-02-23 Thread Aaron Morton
In the case described below, if fewer than CL nodes respond within rpc_timeout 
(from the conf yaml) the client will get a timeout error. I think most 
higher-level clients will automatically retry in this case.

If there are not enough nodes up to start the request, you will get an 
Unavailable exception. Again, the client can retry safely.
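
For example, a client-side wrapper might look like this (a sketch in Python; 
the exception and function names stand in for whatever your client library 
actually raises and calls):

import time

class TimedOutError(Exception):
    """Fewer than CL replicas acked in time; the write may still be applied."""

class UnavailableError(Exception):
    """Too few live replicas to start the request; nothing was written."""

def with_retry(do_write, attempts=3, backoff_s=0.5):
    # Retrying is safe in both cases: a timeout only means the outcome of
    # the earlier attempt is unknown, not that it failed.
    for attempt in range(attempts):
        try:
            return do_write()
        except (TimedOutError, UnavailableError):
            if attempt == attempts - 1:
                raise                               # out of attempts
            time.sleep(backoff_s * (attempt + 1))   # simple linear backoff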

Aaron



Re: How does Cassandra handle failure during synchronous writes

2011-02-22 Thread Dave Revell
Ritesh,

There is no commit protocol. Writes may be persisted on some replicas even
though the quorum fails. Here's a sequence of events that shows the
"problem":

1. Some replica R fails, but only recently, so its failure has not yet been
detected
2. A client writes with consistency > 1
3. The write goes to all replicas, and all replicas except R persist the
write to disk
4. Replica R never responds
5. Failure is returned to the client, but the new value is still in the
cluster, on all replicas except R.

Something very similar could happen for CL QUORUM.

This is a conscious design decision because a commit protocol would
constitute tight coupling between nodes, which goes against the Cassandra
philosophy. But unfortunately you do have to write your app with this case
in mind.
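
One way to handle it is to make retries idempotent by choosing the write
timestamp once and reusing it: columns are reconciled by timestamp (last
write wins), so a replica that saw the first attempt and a replica that only
sees a retry converge to the same value. A sketch in Python, where
client_insert is a placeholder for your client library's insert call:

import time

def insert_with_retry(client_insert, key, column, value, retries=3):
    ts = int(time.time() * 1000000)   # microseconds, chosen once up front
    for attempt in range(retries):
        try:
            # Reusing ts makes every retry byte-identical to the first
            # attempt, so it is safe even if that attempt was partially
            # applied somewhere in the cluster.
            return client_insert(key, column, value, timestamp=ts)
        except Exception:             # e.g. a TimedOutException
            if attempt == retries - 1:
                raise                 # give up; the write may still land
            time.sleep(0.5)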

Best,
Dave

On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
tijoriwala.rit...@gmail.com> wrote:

>
> Hi,
> I wanted to get details on how Cassandra does synchronous writes to W
> replicas (out of N). Does it do a 2PC? If not, how does it deal with
> failures of nodes before it gets to write to W replicas? If the
> orchestrating node cannot write to W nodes successfully, I guess it will
> fail the write operation, but what happens to the completed writes on X
> (W > X) nodes?
>
> Thanks,
> Ritesh
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>