> In this case - N1 will be identified as a discrepancy and the change will
> be discarded via read repair

Brilliant. This does sound correct :) One more related question - how are
read repairs protected against a quorum write that is in progress? For
example, say we have nodes A, B, and C, and client C1 intends to write
K = X at QUORUM (= 2 nodes), say on A and B. Meanwhile, just after C1
finishes writing on A and before it writes on B, client C2 reads with
QUORUM. Does that trigger a read repair that races with C1?

Also, when a client reads with QUORUM, does it read all nodes (A, B, C in
this case) or just any quorum, reading more nodes only if it cannot figure
out a consistent value? What is the process here? For example, in the
scenario above, if C2 were to read A and C, it will get X (from A) and the
old value (from C), which will not achieve a quorum - would that then
trigger a read on B? And does this continue for some number of attempts
until a quorum is achieved or a timeout occurs? Under high concurrency,
the value of a hot key might be changing fast.

Thanks,
Ritesh

On Wed, Feb 23, 2011 at 6:05 PM, Anthony John <chirayit...@gmail.com> wrote:

> > Remember the simple rule. Column with highest timestamp is the one that
> > will be considered correct EVENTUALLY. So consider the following case:
>
> I am sorry, but that will return inconsistent results even at QUORUM.
> Timestamps have nothing to do with this - a timestamp is just an
> application-provided artifact and could be anything.
>
> > c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data
> > that was written to node1 will be returned.
>
> In this case - N1 will be identified as a discrepancy and the change will
> be discarded via read repair
>
> On Wed, Feb 23, 2011 at 6:47 PM, Narendra Sharma <
> narendra.sha...@gmail.com> wrote:
>
>> Remember the simple rule. Column with highest timestamp is the one that
>> will be considered correct EVENTUALLY. So consider the following case:
>>
>> Cluster size = 3 (say node1, node2 and node3), RF = 3, Read/Write CL =
>> QUORUM
>> a. QUORUM in this case requires 2 nodes. The write failed, with a
>> successful write to only 1 node, say node1.
>> b. Read with CL = QUORUM. If the read hits node2 and node3, old data
>> will be returned, with read repair triggered in the background. On the
>> next read you will get the data that was written to node1.
>> c. Read with CL = QUORUM. If the read hits node1 and node2/node3, the
>> new data that was written to node1 will be returned.
>>
>> HTH!
>>
>> Thanks,
>> Naren
>>
>> On Wed, Feb 23, 2011 at 3:36 PM, Ritesh Tijoriwala <
>> tijoriwala.rit...@gmail.com> wrote:
>>
>>> Hi Anthony,
>>> I am not talking about the case of CL ANY. I am talking about the case
>>> where your consistency levels satisfy R + W > N and you want to write
>>> to W nodes but only succeed in writing to X (where X < W) nodes, and
>>> hence fail the write to the client.
>>>
>>> thanks,
>>> Ritesh
>>>
>>> On Wed, Feb 23, 2011 at 2:48 PM, Anthony John <chirayit...@gmail.com> wrote:
>>>
>>>> Ritesh,
>>>>
>>>> At CL ANY - if all endpoints are down - a HH (hinted handoff) is
>>>> written. And it is a successful write - not a failed write.
>>>>
>>>> Now that does not guarantee a READ of the value just written - but
>>>> that is a risk you take when you use the ANY CL!
>>>>
>>>> HTH,
>>>>
>>>> -JA
>>>>
>>>> On Wed, Feb 23, 2011 at 4:40 PM, Ritesh Tijoriwala <
>>>> tijoriwala.rit...@gmail.com> wrote:
>>>>
>>>>> Hi Anthony,
>>>>> While you stated the facts right, I don't see how they relate to the
>>>>> question I asked. Can you elaborate specifically on what happens in
>>>>> the case I described above to Dave?
>>>>>
>>>>> thanks,
>>>>> Ritesh
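A minimal sketch of the timestamp-resolution rule being debated above: a toy
coordinator that contacts a quorum of replicas, returns the value with the
highest timestamp, and repairs the stale replicas it contacted. This is
illustrative Python, not Cassandra's actual code, and all names are
invented; real Cassandra can additionally run read repair against the
remaining replicas in the background (the read_repair_chance setting),
which is how case (b) in Narendra's example eventually converges.

    class Replica:
        """One storage node holding a single (value, timestamp) column."""
        def __init__(self, name):
            self.name = name
            self.value, self.timestamp = None, 0

        def read(self):
            return self.value, self.timestamp

        def write(self, value, timestamp):
            # Last-write-wins: apply only if the incoming timestamp is newer.
            if timestamp > self.timestamp:
                self.value, self.timestamp = value, timestamp

    def quorum_read(replicas, quorum):
        contacted = replicas[:quorum]      # contact a quorum, not all nodes
        responses = [(r,) + r.read() for r in contacted]
        # Resolve by highest timestamp -- even if that value came from a
        # write whose client was told it failed.
        _, value, ts = max(responses, key=lambda resp: resp[2])
        # Read repair: push the winning (value, ts) to any stale replica
        # among those contacted.
        for r, _, rts in responses:
            if rts < ts:
                r.write(value, ts)
        return value

    # Narendra's scenario: RF = 3, the new write landed only on node1.
    n1, n2, n3 = Replica("node1"), Replica("node2"), Replica("node3")
    for n in (n1, n2, n3):
        n.write("old", 100)
    n1.write("new", 200)                   # the partially-applied write

    print(quorum_read([n2, n3, n1], 2))    # case b: returns 'old'
    print(quorum_read([n1, n2, n3], 2))    # case c: 'new' wins by timestamp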
>>>>> On Wed, Feb 23, 2011 at 1:57 PM, Anthony John <
>>>>> chirayit...@gmail.com> wrote:
>>>>>
>>>>>> Seems to me that the explanations are getting incredibly complicated
>>>>>> - while I submit the real issue is not!
>>>>>>
>>>>>> Salient points here:
>>>>>> 1. To be guaranteed data consistency, the writes and reads have to
>>>>>> be at QUORUM CL or more.
>>>>>> 2. Any write/read at a lesser CL means that the application has to
>>>>>> handle the inconsistency, or has to be tolerant of it.
>>>>>> 3. Writing at ANY CL - a special case - means that writes will
>>>>>> always go through (as long as any node is up), even if the
>>>>>> destination nodes are not up. This is done via hinted handoff. But
>>>>>> this can result in inconsistent reads, and yes that is a problem -
>>>>>> but refer to pt. 2 above.
>>>>>> 4. At QUORUM CL read/write - after quorum is met, hinted handoffs
>>>>>> are used to handle the case where a particular node is down and the
>>>>>> write needs to be replicated to it. But this will not cause
>>>>>> inconsistent reads, as the hinted handoff (in this case) only
>>>>>> applies after quorum is met - so a quorum read is not dependent on
>>>>>> the down node being up and having got the hint.
>>>>>>
>>>>>> Hope I state this appropriately!
>>>>>>
>>>>>> HTH,
>>>>>>
>>>>>> -JA
>>>>>>
>>>>>> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
>>>>>> tijoriwala.rit...@gmail.com> wrote:
>>>>>>
>>>>>>> > Read repair will probably occur at that point (depending on your
>>>>>>> > config), which would cause the newest value to propagate to more
>>>>>>> > replicas.
>>>>>>>
>>>>>>> Is the newest value the "quorum" value - meaning the old value is
>>>>>>> what gets written back to the nodes holding the "newer non-quorum"
>>>>>>> value - or is the newest value the real new value? :) If the
>>>>>>> latter, then this seems kind of odd to me, and I don't see how it
>>>>>>> would be useful to any application. A bug?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ritesh
>>>>>>>
>>>>>>> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell <d...@meebo-inc.com> wrote:
>>>>>>>
>>>>>>>> Ritesh,
>>>>>>>>
>>>>>>>> You have seen the problem. Clients may read the newly written
>>>>>>>> value even though the client performing the write saw it as a
>>>>>>>> failure. When the client reads, it will use the correct number of
>>>>>>>> replicas for the chosen CL, then return the newest value seen at
>>>>>>>> any replica. This "newest value" could be the result of a failed
>>>>>>>> write.
>>>>>>>>
>>>>>>>> Read repair will probably occur at that point (depending on your
>>>>>>>> config), which would cause the newest value to propagate to more
>>>>>>>> replicas.
>>>>>>>>
>>>>>>>> R + W > N guarantees serial order of operations: any read at
>>>>>>>> CL = R that occurs after a write at CL = W will observe the write.
>>>>>>>> I don't think this property is relevant to your current question,
>>>>>>>> though.
>>>>>>>>
>>>>>>>> Cassandra has no mechanism to "roll back" the partial write, other
>>>>>>>> than to simply write again. This may also fail.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Dave
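Dave's "simply write again" remedy can be made safe on the client side
because a timestamped column write is idempotent under last-write-wins:
retrying the exact same (value, timestamp) either completes the partial
write or changes nothing. A hedged sketch of that pattern - `cluster.write`,
`WriteTimedOut`, and `FlakyCluster` are invented stand-ins, not the API of
any real client library:

    import time

    class WriteTimedOut(Exception):
        """Stand-in for a client library's write-timeout error."""

    def write_with_retries(cluster, key, value, retries=3, backoff_s=0.1):
        # Fix the timestamp once and reuse it on every retry, so all
        # attempts are the *same* write under last-write-wins resolution.
        ts = int(time.time() * 1e6)
        for attempt in range(retries):
            try:
                cluster.write(key, value, timestamp=ts, cl="QUORUM")
                return True
            except WriteTimedOut:
                # The write may or may not be durable on some replicas;
                # the only remedy is to issue the identical write again.
                time.sleep(backoff_s * (2 ** attempt))
        # Still ambiguous after all retries: the value may yet surface on
        # reads (and be propagated further by read repair).
        return False

    class FlakyCluster:
        """Toy cluster whose first write attempt times out."""
        def __init__(self):
            self.calls = 0
        def write(self, key, value, timestamp, cl):
            self.calls += 1
            if self.calls == 1:
                raise WriteTimedOut()

    print(write_with_retries(FlakyCluster(), "K", "X"))   # True (2nd try)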
>>>>>>>> On Wed, Feb 23, 2011 at 10:12 AM, <tijoriwala.rit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Dave,
>>>>>>>>> Thanks for your input. In the steps you mention, what happens
>>>>>>>>> when a client tries to read the value at step 6? Is it possible
>>>>>>>>> that the client may see the new value? My understanding was that
>>>>>>>>> if R + W > N, then the client will not see the new value, as a
>>>>>>>>> quorum of nodes will not agree on the new value. If that is the
>>>>>>>>> case, then it is alright to return failure to the client.
>>>>>>>>> However, if not, then it is difficult to program against: after
>>>>>>>>> every failure, you as a client are not sure whether the failure
>>>>>>>>> is a pseudo-failure with some side effects or a real failure.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ritesh
>>>>>>>>>
>>>>>>>>> <quote author='Dave Revell'>
>>>>>>>>>
>>>>>>>>> Ritesh,
>>>>>>>>>
>>>>>>>>> There is no commit protocol. Writes may be persisted on some
>>>>>>>>> replicas even though the quorum fails. Here's a sequence of
>>>>>>>>> events that shows the "problem":
>>>>>>>>>
>>>>>>>>> 1. Some replica R fails, but recently, so its failure has not yet
>>>>>>>>> been detected
>>>>>>>>> 2. A client writes with consistency > 1
>>>>>>>>> 3. The write goes to all replicas; all replicas except R persist
>>>>>>>>> the write to disk
>>>>>>>>> 4. Replica R never responds
>>>>>>>>> 5. Failure is returned to the client, but the new value is still
>>>>>>>>> in the cluster, on all replicas except R
>>>>>>>>>
>>>>>>>>> Something very similar could happen for CL QUORUM.
>>>>>>>>>
>>>>>>>>> This is a conscious design decision because a commit protocol
>>>>>>>>> would constitute tight coupling between nodes, which goes against
>>>>>>>>> the Cassandra philosophy. But unfortunately you do have to write
>>>>>>>>> your app with this case in mind.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
>>>>>>>>> tijoriwala.rit...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> > Hi,
>>>>>>>>> > I wanted to get details on how Cassandra does synchronous
>>>>>>>>> > writes to W replicas (out of N). Does it do a 2PC? If not, how
>>>>>>>>> > does it deal with failures of nodes before it gets to write to
>>>>>>>>> > W replicas? If the orchestrating node cannot write to W nodes
>>>>>>>>> > successfully, I guess it will fail the write operation, but
>>>>>>>>> > what happens to the completed writes on X (where X < W) nodes?
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> > Ritesh
>>>>>>>>> > --
>>>>>>>>> > View this message in context:
>>>>>>>>> > http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
>>>>>>>>> > Sent from the cassandra-u...@incubator.apache.org mailing list
>>>>>>>>> > archive at Nabble.com.
>>>>>>>>>
>>>>>>>>> </quote>
>>>>>>>>> Quoted from:
>>>>>>>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055408.html
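To make Dave's failure sequence concrete, here is a toy simulation
(invented names, not Cassandra's implementation) of the "very similar ...
for CL QUORUM" variant: two of three replicas are down, the quorum of two
acks is not reached, yet there is no 2PC-style rollback - the write simply
remains, durable, on the replica that took it:

    class Replica:
        """One storage node; alive=False models the failed replica R."""
        def __init__(self, name, alive=True):
            self.name, self.alive = name, alive
            self.value, self.timestamp = None, 0

        def write(self, value, ts):
            if not self.alive:
                return False           # step 4: a dead replica never acks
            if ts > self.timestamp:    # steps 2-3: live replicas persist
                self.value, self.timestamp = value, ts
            return True

    def coordinator_write(replicas, value, ts, required_acks):
        # The coordinator sends the write to *all* replicas...
        acks = sum(r.write(value, ts) for r in replicas)
        # ...and reports failure when fewer than required_acks answered.
        # Step 5: nothing undoes the replicas that did persist the value.
        return acks >= required_acks

    replicas = [Replica("node1"),
                Replica("node2", alive=False),  # down, not yet detected
                Replica("node3", alive=False)]

    ok = coordinator_write(replicas, "new", ts=200, required_acks=2)
    print(ok)                  # False: the client is told the write failed...
    print(replicas[0].value)   # ...yet 'new' is durable on node1, and a
                               # later read touching node1 can return it.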