I'm pretty new to Cassandra, but I've also written a client in C++ using the thrift API directly. From what I've seen, wrapping writes in a retry loop is pretty much mandatory because if you are pushing a lot of data around, you're basically guaranteed to have TimedOutExceptions. I suppose what I'm getting at is: if you don't have consistency in the case of a TimedOutException, you don't have consistency for any high-throughput application. Is there a solution to this that I am missing?
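Concretely, my retry wrapper is nothing Cassandra-specific: just re-issue the (idempotent) write on timeout, with backoff. A rough sketch, where `do_write` is any callable that performs the write, and Python's builtin `TimeoutError` stands in for thrift's `TimedOutException`:

```python
import time

def write_with_retry(do_write, retries=5, backoff=0.1):
    """Call do_write() until it succeeds, sleeping between attempts.

    This is only safe because the writes are idempotent: re-sending the
    same column with the same client timestamp converges to the same state.
    """
    delay = backoff
    for attempt in range(retries):
        try:
            return do_write()
        except TimeoutError:  # stand-in for thrift's TimedOutException
            if attempt == retries - 1:
                raise  # out of retries, surface the failure to the caller
            time.sleep(delay)
            delay *= 2  # back off so we don't hammer an already-busy cluster
```

The catch, as discussed below, is exactly the window while this loop is still retrying: the write has landed on some replicas but the client doesn't yet consider it successful.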
On Apr 17, 2011, at 9:42 AM, William Oberman wrote:

> At first I was concerned and was going to +1 on a fix, but I think I was
> confused on one detail and I'd like to confirm it:
> - An unsuccessful write implies readers can see either the old or new value?
>
> The trick is using a library: it sounds like there is a period of time when a
> write is unsuccessful but you don't know about it (as the retry is internal).
> But (assuming writes are idempotent), QUORUM is actually consistent from
> successful writes to successful reads... right?
>
> On Sun, Apr 17, 2011 at 1:53 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> Tyler is correct, because Cassandra doesn't wait until repair writes
> are acked before the answer is returned. This is something we can fix.
>
> On Sun, Apr 17, 2011 at 12:05 AM, Sean Bridges <sean.brid...@gmail.com> wrote:
> > Tyler, your answer seems to contradict this email by Jonathan Ellis
> > [1]. In it, Jonathan says:
> >
> > "The important guarantee this gives you is that once one quorum read
> > sees the new value, all others will too. You can't see the newest
> > version, then see an older version on a subsequent write [sic, I
> > assume he meant read], which is the characteristic of non-strong
> > consistency"
> >
> > Jonathan also says:
> >
> > "{X, Y} and {X, Z} are equivalent: one node with the write, and one
> > without. The read will recognize that X's version needs to be sent to
> > Z, and the write will be complete. This read and all subsequent ones
> > will see the write. (Z [sic, I assume he meant Y] will be replicated
> > to asynchronously via read repair.)"
> >
> > To me, the statement "this read and all subsequent ones will see the
> > write" implies that the new value must be committed to Y or Z before
> > the read can return. If not, the statement must be false.
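The overlap guarantee being debated above is just pigeonhole arithmetic: with replication factor N, a read of R replicas and a *completed* write of W replicas must share at least one replica whenever R + W > N, and QUORUM on both sides always satisfies that. A sketch of the check (function names are mine, not anything from Cassandra):

```python
def quorum(n):
    # Cassandra's QUORUM: a strict majority of the n replicas.
    return n // 2 + 1

def overlaps(n, r, w):
    # A read of r replicas and a completed write of w replicas must share
    # at least one replica iff r + w > n (pigeonhole principle).
    return r + w > n

# For any replication factor, QUORUM reads intersect QUORUM writes:
for n in range(1, 10):
    assert overlaps(n, quorum(n), quorum(n))
```

Note this only covers writes that actually completed at QUORUM; the whole thread is about the window where a write timed out after reaching fewer than W replicas.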
> >
> > Sean
> >
> > [1]: http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3caanlktimegp8h87mgs_bxzknck-a59whxf-xx58hca...@mail.gmail.com%3E
> >
> > On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs <ty...@datastax.com> wrote:
> >> Here's what's probably happening:
> >>
> >> I'm assuming RF=3 and QUORUM writes/reads here. I'll call the replicas A,
> >> B, and C.
> >>
> >> 1. Writer process writes sequence number 1 and everything works fine. A,
> >> B, and C all have sequence number 1.
> >> 2. Writer process writes sequence number 2. Replica A writes successfully,
> >> B and C fail to respond in time, and a TimedOutException is returned.
> >> pycassa waits to retry the operation.
> >> 3. Reader process reads and gets a response from A and B. When the rows
> >> from A and B are merged, sequence number 2 is the newest and is returned.
> >> A read repair is pushed to B and C, but they don't yet update their data.
> >> 4. Reader process reads again and gets a response from B and C (before
> >> they've repaired). These both report sequence number 1, so that's returned
> >> to the client. This is where you see a decreasing sequence number.
> >> 5. pycassa eventually retries the write; B and C eventually repair their
> >> data. Either way, both B and C shortly have sequence number 2.
> >>
> >> I've left out some of the details of read repair, and this scenario could
> >> happen in several slightly different ways, but it should give you an idea
> >> of what's happening.
> >>
> >> On Sat, Apr 16, 2011 at 8:35 PM, James Cipar <jci...@cmu.edu> wrote:
> >>>
> >>> Here it is. There is some setup code and global variable definitions that
> >>> I left out of the previous code, but they are pretty similar to the setup
> >>> code here.
> >>> import pycassa
> >>> import random
> >>> import time
> >>>
> >>> consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
> >>> duration = 600
> >>> sleeptime = 0.0
> >>> hostlist = 'worker-hostlist'
> >>>
> >>> def read_servers(fn):
> >>>     f = open(fn)
> >>>     servers = []
> >>>     for line in f:
> >>>         servers.append(line.strip())
> >>>     f.close()
> >>>     return servers
> >>>
> >>> servers = read_servers(hostlist)
> >>> start_time = time.time()
> >>> seqnum = -1
> >>> timestamp = 0
> >>>
> >>> while time.time() < start_time + duration:
> >>>     target_server = random.sample(servers, 1)[0]
> >>>     target_server = '%s:9160' % target_server
> >>>     try:
> >>>         pool = pycassa.connect('Keyspace1', [target_server])
> >>>         cf = pycassa.ColumnFamily(pool, 'Standard1')
> >>>         row = cf.get('foo', read_consistency_level=consistency_level)
> >>>         pool.dispose()
> >>>     except:
> >>>         time.sleep(sleeptime)
> >>>         continue
> >>>     sq = int(row['seqnum'])
> >>>     ts = float(row['timestamp'])
> >>>     if sq < seqnum:
> >>>         print 'Row changed: %i %f -> %i %f' % (seqnum, timestamp, sq, ts)
> >>>     seqnum = sq
> >>>     timestamp = ts
> >>>     if sleeptime > 0.0:
> >>>         time.sleep(sleeptime)
> >>>
> >>>
> >>> On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:
> >>>
> >>> James,
> >>>
> >>> Would you mind sharing your reader process code as well?
> >>>
> >>> On Fri, Apr 15, 2011 at 1:14 PM, James Cipar <jci...@cmu.edu> wrote:
> >>>>
> >>>> I've been experimenting with the consistency model of Cassandra, and I
> >>>> found something that seems a bit unexpected. In my experiment, I have 2
> >>>> processes, a reader and a writer, each accessing a Cassandra cluster
> >>>> with a replication factor greater than 1. In addition, sometimes I
> >>>> generate background traffic to simulate a busy cluster by uploading a
> >>>> large data file to another table.
> >>>>
> >>>> The writer executes a loop where it writes a single row that contains
> >>>> just a sequentially increasing sequence number and a timestamp.
> >>>> In Python this looks something like:
> >>>>
> >>>> while time.time() < start_time + duration:
> >>>>     target_server = random.sample(servers, 1)[0]
> >>>>     target_server = '%s:9160' % target_server
> >>>>
> >>>>     row = {'seqnum': str(seqnum), 'timestamp': str(time.time())}
> >>>>     seqnum += 1
> >>>>     # print 'uploading to server %s, %s' % (target_server, row)
> >>>>
> >>>>     pool = pycassa.connect('Keyspace1', [target_server])
> >>>>     cf = pycassa.ColumnFamily(pool, 'Standard1')
> >>>>     cf.insert('foo', row, write_consistency_level=consistency_level)
> >>>>     pool.dispose()
> >>>>
> >>>>     if sleeptime > 0.0:
> >>>>         time.sleep(sleeptime)
> >>>>
> >>>> The reader simply executes a loop reading this row and reporting
> >>>> whenever a sequence number is *less* than the previous sequence number.
> >>>> As expected, with consistency_level=ConsistencyLevel.ONE there are many
> >>>> inconsistencies, especially with a high replication factor.
> >>>>
> >>>> What is unexpected is that I still detect inconsistencies when it is set
> >>>> at ConsistencyLevel.QUORUM. This is unexpected because the documentation
> >>>> seems to imply that QUORUM will give consistent results. With background
> >>>> traffic the average difference in timestamps was 0.6s, and the maximum
> >>>> was >3.5s. This means that a client sees a version of the row, and can
> >>>> subsequently see another version of the row that is 3.5s older than the
> >>>> previous one.
> >>>>
> >>>> What I imagine is happening is this, but I'd like someone who knows what
> >>>> they're talking about to tell me if it's actually the case:
> >>>>
> >>>> I think Cassandra is not using an atomic commit protocol to commit to
> >>>> the quorum of servers chosen when the write is made. This means that at
> >>>> some point in the middle of the write, some subset of the quorum have
> >>>> seen the write, while others have not.
> >>>> At this time, there is a quorum of servers
> >>>> that have not seen the update, so depending on which quorum the client
> >>>> reads from, it may or may not see the update.
> >>>>
> >>>> Of course, I understand that the client is not *choosing* a bad quorum
> >>>> to read from; it is just the first `q` servers to respond. But in this
> >>>> case it is effectively random, and sometimes a bad quorum is "chosen".
> >>>>
> >>>> Does anyone have any other insight into what is going on here?
> >>>
> >>>
> >>> --
> >>> Tyler Hobbs
> >>> Software Engineer, DataStax
> >>> Maintainer of the pycassa Cassandra Python client library
> >>>
> >>
> >>
> >> --
> >> Tyler Hobbs
> >> Software Engineer, DataStax
> >> Maintainer of the pycassa Cassandra Python client library
> >>
> >
> >
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue, First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) ober...@civicscience.com
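For anyone who wants to see the race without a cluster, the scenario Tyler walks through above can be reproduced in a toy model: three replicas, a write that lands on only one of them, and quorum reads that merge responses by timestamp but return before any repair is acknowledged. This is a hypothetical simulation in plain Python, not Cassandra code:

```python
# Toy model of the race: RF=3, QUORUM=2, one write lands on replica A only.
# Each replica holds a (seqnum, timestamp) pair.
replicas = {'A': (1, 1.0), 'B': (1, 1.0), 'C': (1, 1.0)}

def quorum_read(names):
    # Merge the responses; newest timestamp wins, like Cassandra's
    # reconciliation. As in the thread, read repair would be scheduled
    # here, but the answer is returned before the repair is acked.
    return max((replicas[n] for n in names), key=lambda v: v[1])

replicas['A'] = (2, 2.0)          # partial write: B and C timed out

first = quorum_read(['A', 'B'])   # quorum {A, B} sees seqnum 2
second = quorum_read(['B', 'C'])  # quorum {B, C} sees seqnum 1: backwards!
assert first[0] == 2 and second[0] == 1
```

The second read going backwards is exactly the decreasing sequence number James's reader process observed; once the retried write or the read repair actually lands on B and C, reads become monotonic again.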