I'm pretty new to Cassandra, but I've also written a client in C++ using the thrift API directly. From what I've seen, wrapping writes in a retry loop is pretty much mandatory because if you are pushing a lot of data around, you're basically guaranteed to have TimedOutExceptions. I suppose what I'm getting at is: if you don't have consistency in the case of a TimedOutException, you don't have consistency for any high-throughput application. Is there a solution to this that I am missing?
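Concretely, my retry wrapper is nothing Cassandra-specific: just re-issue the (idempotent) write on timeout, with backoff. A rough sketch, where `do_write` is any callable that performs the write, and Python's builtin `TimeoutError` stands in for thrift's `TimedOutException`:

```python
import time

def write_with_retry(do_write, retries=5, backoff=0.1):
    """Call do_write() until it succeeds, sleeping between attempts.

    This is only safe because the writes are idempotent: re-sending the
    same column with the same client timestamp converges to the same state.
    """
    delay = backoff
    for attempt in range(retries):
        try:
            return do_write()
        except TimeoutError:  # stand-in for thrift's TimedOutException
            if attempt == retries - 1:
                raise  # out of retries, surface the failure to the caller
            time.sleep(delay)
            delay *= 2  # back off so we don't hammer an already-busy cluster
```

The catch, as discussed below, is exactly the window while this loop is still retrying: the write has landed on some replicas but the client doesn't yet consider it successful.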
On Apr 17, 2011, at 9:42 AM, William Oberman wrote:

> At first I was concerned and was going to +1 on a fix, but I think I was
> confused on one detail and I'd like to confirm it:
> - An unsuccessful write implies readers can see either the old or new value?
>
> The trick is using a library: it sounds like there is a period of time when a
> write is unsuccessful but you don't know about it (as the retry is internal).
> But (assuming writes are idempotent), QUORUM is actually consistent from
> successful writes to successful reads... right?
>
> On Sun, Apr 17, 2011 at 1:53 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> Tyler is correct, because Cassandra doesn't wait until repair writes
> are acked before the answer is returned. This is something we can fix.
>
> On Sun, Apr 17, 2011 at 12:05 AM, Sean Bridges <sean.brid...@gmail.com> wrote:
> > Tyler, your answer seems to contradict this email by Jonathan Ellis
> > [1]. In it, Jonathan says:
> >
> > "The important guarantee this gives you is that once one quorum read
> > sees the new value, all others will too. You can't see the newest
> > version, then see an older version on a subsequent write [sic, I
> > assume he meant read], which is the characteristic of non-strong
> > consistency"
> >
> > Jonathan also says:
> >
> > "{X, Y} and {X, Z} are equivalent: one node with the write, and one
> > without. The read will recognize that X's version needs to be sent to
> > Z, and the write will be complete. This read and all subsequent ones
> > will see the write. (Z [sic, I assume he meant Y] will be replicated
> > to asynchronously via read repair.)"
> >
> > To me, the statement "this read and all subsequent ones will see the
> > write" implies that the new value must be committed to Y or Z before
> > the read can return. If not, the statement must be false.
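The overlap guarantee being debated above is just pigeonhole arithmetic: with replication factor N, a read of R replicas and a *completed* write of W replicas must share at least one replica whenever R + W > N, and QUORUM on both sides always satisfies that. A sketch of the check (function names are mine, not anything from Cassandra):

```python
def quorum(n):
    # Cassandra's QUORUM: a strict majority of the n replicas.
    return n // 2 + 1

def overlaps(n, r, w):
    # A read of r replicas and a completed write of w replicas must share
    # at least one replica iff r + w > n (pigeonhole principle).
    return r + w > n

# For any replication factor, QUORUM reads intersect QUORUM writes:
for n in range(1, 10):
    assert overlaps(n, quorum(n), quorum(n))
```

Note this only covers writes that actually completed at QUORUM; the whole thread is about the window where a write timed out after reaching fewer than W replicas.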
> >
> > Sean
> >
> > [1]: http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3caanlktimegp8h87mgs_bxzknck-a59whxf-xx58hca...@mail.gmail.com%3E
> >
> > On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs <ty...@datastax.com> wrote:
> >> Here's what's probably happening:
> >>
> >> I'm assuming RF=3 and QUORUM writes/reads here. I'll call the replicas A,
> >> B, and C.
> >>
> >> 1. Writer process writes sequence number 1 and everything works fine. A,
> >> B, and C all have sequence number 1.
> >> 2. Writer process writes sequence number 2. Replica A writes successfully,
> >> B and C fail to respond in time, and a TimedOutException is returned.
> >> pycassa waits to retry the operation.
> >> 3. Reader process reads and gets a response from A and B. When the rows
> >> from A and B are merged, sequence number 2 is the newest and is returned.
> >> A read repair is pushed to B and C, but they don't yet update their data.
> >> 4. Reader process reads again and gets a response from B and C (before
> >> they've repaired). These both report sequence number 1, so that's returned
> >> to the client. This is where you see a decreasing sequence number.
> >> 5. pycassa eventually retries the write; B and C eventually repair their
> >> data. Either way, both B and C shortly have sequence number 2.
> >>
> >> I've left out some of the details of read repair, and this scenario could
> >> happen in several slightly different ways, but it should give you an idea
> >> of what's happening.
> >>
> >> On Sat, Apr 16, 2011 at 8:35 PM, James Cipar <jci...@cmu.edu> wrote:
> >>>
> >>> Here it is. There is some setup code and global variable definitions that
> >>> I left out of the previous code, but they are pretty similar to the setup
> >>> code here.
> >>> import pycassa
> >>> import random
> >>> import time
> >>>
> >>> consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
> >>> duration = 600
> >>> sleeptime = 0.0
> >>> hostlist = 'worker-hostlist'
> >>>
> >>> def read_servers(fn):
> >>>     f = open(fn)
> >>>     servers = []
> >>>     for line in f:
> >>>         servers.append(line.strip())
> >>>     f.close()
> >>>     return servers
> >>>
> >>> servers = read_servers(hostlist)
> >>> start_time = time.time()
> >>> seqnum = -1
> >>> timestamp = 0
> >>>
> >>> while time.time() < start_time + duration:
> >>>     target_server = random.sample(servers, 1)[0]
> >>>     target_server = '%s:9160' % target_server
> >>>     try:
> >>>         pool = pycassa.connect('Keyspace1', [target_server])
> >>>         cf = pycassa.ColumnFamily(pool, 'Standard1')
> >>>         row = cf.get('foo', read_consistency_level=consistency_level)
> >>>         pool.dispose()
> >>>     except:
> >>>         time.sleep(sleeptime)
> >>>         continue
> >>>     sq = int(row['seqnum'])
> >>>     ts = float(row['timestamp'])
> >>>     if sq < seqnum:
> >>>         print 'Row changed: %i %f -> %i %f' % (seqnum, timestamp, sq, ts)
> >>>     seqnum = sq
> >>>     timestamp = ts
> >>>     if sleeptime > 0.0:
> >>>         time.sleep(sleeptime)
> >>>
> >>>
> >>> On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:
> >>>
> >>> James,
> >>>
> >>> Would you mind sharing your reader process code as well?
> >>>
> >>> On Fri, Apr 15, 2011 at 1:14 PM, James Cipar <jci...@cmu.edu> wrote:
> >>>>
> >>>> I've been experimenting with the consistency model of Cassandra, and I
> >>>> found something that seems a bit unexpected. In my experiment, I have 2
> >>>> processes, a reader and a writer, each accessing a Cassandra cluster
> >>>> with a replication factor greater than 1. In addition, sometimes I
> >>>> generate background traffic to simulate a busy cluster by uploading a
> >>>> large data file to another table.
> >>>>
> >>>> The writer executes a loop where it writes a single row that contains
> >>>> just a sequentially increasing sequence number and a timestamp.
> >>>> In Python this looks something like:
> >>>>
> >>>> while time.time() < start_time + duration:
> >>>>     target_server = random.sample(servers, 1)[0]
> >>>>     target_server = '%s:9160' % target_server
> >>>>
> >>>>     row = {'seqnum': str(seqnum), 'timestamp': str(time.time())}
> >>>>     seqnum += 1
> >>>>     # print 'uploading to server %s, %s' % (target_server, row)
> >>>>
> >>>>     pool = pycassa.connect('Keyspace1', [target_server])
> >>>>     cf = pycassa.ColumnFamily(pool, 'Standard1')
> >>>>     cf.insert('foo', row, write_consistency_level=consistency_level)
> >>>>     pool.dispose()
> >>>>
> >>>>     if sleeptime > 0.0:
> >>>>         time.sleep(sleeptime)
> >>>>
> >>>> The reader simply executes a loop reading this row and reporting
> >>>> whenever a sequence number is *less* than the previous sequence number.
> >>>> As expected, with consistency_level=ConsistencyLevel.ONE there are many
> >>>> inconsistencies, especially with a high replication factor.
> >>>>
> >>>> What is unexpected is that I still detect inconsistencies when it is set
> >>>> at ConsistencyLevel.QUORUM. This is unexpected because the documentation
> >>>> seems to imply that QUORUM will give consistent results. With background
> >>>> traffic the average difference in timestamps was 0.6s, and the maximum
> >>>> was >3.5s. This means that a client sees a version of the row, and can
> >>>> subsequently see another version of the row that is 3.5s older than the
> >>>> previous one.
> >>>>
> >>>> What I imagine is happening is this, but I'd like someone who knows what
> >>>> they're talking about to tell me if it's actually the case:
> >>>>
> >>>> I think Cassandra is not using an atomic commit protocol to commit to
> >>>> the quorum of servers chosen when the write is made. This means that at
> >>>> some point in the middle of the write, some subset of the quorum have
> >>>> seen the write, while others have not.
> >>>> At this time, there is a quorum of servers
> >>>> that have not seen the update, so depending on which quorum the client
> >>>> reads from, it may or may not see the update.
> >>>>
> >>>> Of course, I understand that the client is not *choosing* a bad quorum
> >>>> to read from; it is just the first `q` servers to respond. But in this
> >>>> case it is effectively random, and sometimes a bad quorum is "chosen".
> >>>>
> >>>> Does anyone have any other insight into what is going on here?
> >>>
> >>>
> >>> --
> >>> Tyler Hobbs
> >>> Software Engineer, DataStax
> >>> Maintainer of the pycassa Cassandra Python client library
> >>>
> >>
> >>
> >> --
> >> Tyler Hobbs
> >> Software Engineer, DataStax
> >> Maintainer of the pycassa Cassandra Python client library
> >>
> >
> >
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue, First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) ober...@civicscience.com
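For anyone who wants to see the race without a cluster, the scenario Tyler walks through above can be reproduced in a toy model: three replicas, a write that lands on only one of them, and quorum reads that merge responses by timestamp but return before any repair is acknowledged. This is a hypothetical simulation in plain Python, not Cassandra code:

```python
# Toy model of the race: RF=3, QUORUM=2, one write lands on replica A only.
# Each replica holds a (seqnum, timestamp) pair.
replicas = {'A': (1, 1.0), 'B': (1, 1.0), 'C': (1, 1.0)}

def quorum_read(names):
    # Merge the responses; newest timestamp wins, like Cassandra's
    # reconciliation. As in the thread, read repair would be scheduled
    # here, but the answer is returned before the repair is acked.
    return max((replicas[n] for n in names), key=lambda v: v[1])

replicas['A'] = (2, 2.0)          # partial write: B and C timed out

first = quorum_read(['A', 'B'])   # quorum {A, B} sees seqnum 2
second = quorum_read(['B', 'C'])  # quorum {B, C} sees seqnum 1: backwards!
assert first[0] == 2 and second[0] == 1
```

The second read going backwards is exactly the decreasing sequence number James's reader process observed; once the retried write or the read repair actually lands on B and C, reads become monotonic again.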