Re: Consistency model
That's what I thought was happening, yes. A careful reading of the documentation suggests that this is correct behavior. Tyler says this can also occur because of a TimedOutException on the writes. This worries me because TimedOutExceptions are so frequent (at least for my test cluster), therefore using quorum reads and writes is not sufficient for consistency. Any application that wants consistency needs to have some external way of synchronizing readers and writers so that readers don't read in the middle of a write or in the writers retry loop. Does anyone have any intuition about whether this will happen with consistency_level=ALL? I will try it today, but I'd like to know what the expected behavior is. It seems like it would not happen in this case. On Apr 17, 2011, at 3:01 PM, William Oberman wrote: James: I feel like I understand what's going on in your code now based on this discussion, and I'm ok with the fact that DURING a QW you can get transitional results from a QR in another process (or either the before or after state of the QW). But once the QW succeeds, you must get the new value. That's what we're all saying now, right? In your read, read, read case, all 3 reads are happening during a QW, and some of them see the before and some of them see the after (that's why I specifically said single threaded, not because it's a single thread per se, but because a single thread can't read during a write by definition). will On Sun, Apr 17, 2011 at 1:27 PM, Milind Parikh milindpar...@gmail.com wrote: Same process or not: only successful QR reads after successful QW will behave with this guarantee. /*** sent from my android...please pardon occasional typos as I respond @ the speed of thought / On Apr 17, 2011 10:04 AM, James Cipar jci...@cmu.edu wrote: For a second, I thought this thread was saying I could see value(s) new value(s) within the same... That's exactly what I'm saying. Within a single process I see this behavior, when reading with consistency_level=QUORUM Read value 1 Read value 2 Read value 1 # uh oh! we've gone backwards On Apr 17, 2011, at 12:15 PM, William Oberman wrote: Cool, that is exactly what I was thinkin... -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: Consistency model
Does anyone have any intuition about whether this will happen with consistency_level=ALL? I will try it today, but I'd like to know what the expected behavior is. It seems like it would not happen in this case. Assuming my understanding is correct (see my comment in the JIRA ticket), then I expect that you don't see the value reverting back to an old version in your tests. However, this is not guaranteed. A read on CL.ALL will see the most recent value as returned by *any* node. However, suppose a failed write only is replicated to one node. That node subsequently goes up in smoke and is replaced. Now you may revert back to the old data unless the JIRA ticket is attended do. But then, use of CL.ALL kind of implies that you're not willing to accept downtime of any node. But it's important to keep in mind that if you're looking for the kind of guarantee of not reverting to an older value, then that not loose a single node applies not just to being up and serving reads, but also to maintaining consistency over time. So while I expect that CL.ALL would not fail the test, I would not use that information to conclude that the correct course of action is to use CL.ALL ;) -- / Peter Schuller
Re: Consistency model
If you are reading and writing at quorum, then what you are seeing shouldn't happen. You shouldn't be able to read N+1 until N+1 has been committed to a quorum of servers. At this point you should not be able to read N anymore, since there is no quorum that contains N. Dan - I think you are right, except that quorum reads should be consistent even during a quorum write. You are not guaranteed to read N+1 until *after* a successful quorum write of N+1, but once you see N+1, you should never see N again, even if the write failed. Sean On Fri, Apr 15, 2011 at 1:29 PM, Dan Hendry dan.hendry.j...@gmail.com wrote: So Cassandra does not use an atomic commit protocol at the cluster level. Strong consistency on a quorum read is only guaranteed *after* a successful quorum write. The behaviour you are seeing is possible if you are reading in the middle of a write or the write failed (which should be reported to your code via an exception). Dan -Original Message- From: James Cipar [mailto:jci...@cmu.edu] Sent: April-15-11 14:15 To: user@cassandra.apache.org Subject: Consistency model I've been experimenting with the consistency model of Cassandra, and I found something that seems a bit unexpected. In my experiment, I have 2 processes, a reader and a writer, each accessing a Cassandra cluster with a replication factor greater than 1. In addition, sometimes I generate background traffic to simulate a busy cluster by uploading a large data file to another table. The writer executes a loop where it writes a single row that contains just an sequentially increasing sequence number and a timestamp. In python this looks something like: while time.time() start_time + duration: target_server = random.sample(servers, 1)[0] target_server = '%s:9160'%target_server row = {'seqnum':str(seqnum), 'timestamp':str(time.time())} seqnum += 1 # print 'uploading to server %s, %s'%(target_server, row) pool = pycassa.connect('Keyspace1', [target_server]) cf = pycassa.ColumnFamily(pool, 'Standard1') cf.insert('foo', row, write_consistency_level=consistency_level) pool.dispose() if sleeptime 0.0: time.sleep(sleeptime) The reader simply executes a loop reading this row and reporting whenever a sequence number is *less* than the previous sequence number. As expected, with consistency_level=ConsistencyLevel.ONE there are many inconsistencies, especially with a high replication factor. What is unexpected is that I still detect inconsistencies when it is set at ConsistencyLevel.QUORUM. This is unexpected because the documentation seems to imply that QUORUM will give consistent results. With background traffic the average difference in timestamps was 0.6s, and the maximum was 3.5s. This means that a client sees a version of the row, and can subsequently see another version of the row that is 3.5s older than the previous. What I imagine is happening is this, but I'd like someone who knows that they're talking about to tell me if it's actually the case: I think Cassandra is not using an atomic commit protocol to commit to the quorum of servers chosen when the write is made. This means that at some point in the middle of the write, some subset of the quorum have seen the write, while others have not. At this time, there is a quorum of servers that have not seen the update, so depending on which quorum the client reads from, it may or may not see the update. Of course, I understand that the client is not *choosing* a bad quorum to read from, it is just the first `q` servers to respond, but in this case it is effectively random and sometimes an bad quorum is chosen. Does anyone have any other insight into what is going on here?= No virus found in this incoming message. Checked by AVG - www.avg.com Version: 9.0.894 / Virus Database: 271.1.1/3574 - Release Date: 04/15/11 02:34:00
Re: Consistency model
Here it is. There is some setup code and global variable definitions that I left out of the previous code, but they are pretty similar to the setup code here. import pycassa import random import time consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM duration = 600 sleeptime = 0.0 hostlist = 'worker-hostlist' def read_servers(fn): f = open(fn) servers = [] for line in f: servers.append(line.strip()) f.close() return servers servers = read_servers(hostlist) start_time = time.time() seqnum = -1 timestamp = 0 while time.time() start_time + duration: target_server = random.sample(servers, 1)[0] target_server = '%s:9160'%target_server try: pool = pycassa.connect('Keyspace1', [target_server]) cf = pycassa.ColumnFamily(pool, 'Standard1') row = cf.get('foo', read_consistency_level=consistency_level) pool.dispose() except: time.sleep(sleeptime) continue sq = int(row['seqnum']) ts = float(row['timestamp']) if sq seqnum: print 'Row changed: %i %f - %i %f'%(seqnum, timestamp, sq, ts) seqnum = sq timestamp = ts if sleeptime 0.0: time.sleep(sleeptime) On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote: James, Would you mind sharing your reader process code as well? On Fri, Apr 15, 2011 at 1:14 PM, James Cipar jci...@cmu.edu wrote: I've been experimenting with the consistency model of Cassandra, and I found something that seems a bit unexpected. In my experiment, I have 2 processes, a reader and a writer, each accessing a Cassandra cluster with a replication factor greater than 1. In addition, sometimes I generate background traffic to simulate a busy cluster by uploading a large data file to another table. The writer executes a loop where it writes a single row that contains just an sequentially increasing sequence number and a timestamp. In python this looks something like: while time.time() start_time + duration: target_server = random.sample(servers, 1)[0] target_server = '%s:9160'%target_server row = {'seqnum':str(seqnum), 'timestamp':str(time.time())} seqnum += 1 # print 'uploading to server %s, %s'%(target_server, row) pool = pycassa.connect('Keyspace1', [target_server]) cf = pycassa.ColumnFamily(pool, 'Standard1') cf.insert('foo', row, write_consistency_level=consistency_level) pool.dispose() if sleeptime 0.0: time.sleep(sleeptime) The reader simply executes a loop reading this row and reporting whenever a sequence number is *less* than the previous sequence number. As expected, with consistency_level=ConsistencyLevel.ONE there are many inconsistencies, especially with a high replication factor. What is unexpected is that I still detect inconsistencies when it is set at ConsistencyLevel.QUORUM. This is unexpected because the documentation seems to imply that QUORUM will give consistent results. With background traffic the average difference in timestamps was 0.6s, and the maximum was 3.5s. This means that a client sees a version of the row, and can subsequently see another version of the row that is 3.5s older than the previous. What I imagine is happening is this, but I'd like someone who knows that they're talking about to tell me if it's actually the case: I think Cassandra is not using an atomic commit protocol to commit to the quorum of servers chosen when the write is made. This means that at some point in the middle of the write, some subset of the quorum have seen the write, while others have not. At this time, there is a quorum of servers that have not seen the update, so depending on which quorum the client reads from, it may or may not see the update. Of course, I understand that the client is not *choosing* a bad quorum to read from, it is just the first `q` servers to respond, but in this case it is effectively random and sometimes an bad quorum is chosen. Does anyone have any other insight into what is going on here? -- Tyler Hobbs Software Engineer, DataStax Maintainer of the pycassa Cassandra Python client library
Re: Consistency model
Here's what's probably happening: I'm assuming RF=3 and QUORUM writes/reads here. I'll call the replicas A, B, and C. 1. Writer process writes sequence number 1 and everything works fine. A, B, and C all have sequence number 1. 2. Writer process writes sequence number 2. Replica A writes successfully, B and C fail to respond in time, and a TimedOutException is returned. pycassa waits to retry the operation. 3. Reader process reads, gets a response from A and B. When the row from A and B is merged, sequence number 2 is the newest and is returned. A read repair is pushed to B and C, but they don't yet update their data. 4. Reader process reads again, gets a response from B and C (before they've repaired). These both report sequence number 1, so that's returned to the client. This is were you get a decreasing sequence number. 5. pycassa eventually retries the write; B and C eventually repair their data. Either way, both B and C shortly have sequence number 2. I've left out some of the details of read repair, and this scenario could happen in several slightly different ways, but it should give you an idea of what's happening. On Sat, Apr 16, 2011 at 8:35 PM, James Cipar jci...@cmu.edu wrote: Here it is. There is some setup code and global variable definitions that I left out of the previous code, but they are pretty similar to the setup code here. import pycassa import random import time consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM duration = 600 sleeptime = 0.0 hostlist = 'worker-hostlist' def read_servers(fn): f = open(fn) servers = [] for line in f: servers.append(line.strip()) f.close() return servers servers = read_servers(hostlist) start_time = time.time() seqnum = -1 timestamp = 0 while time.time() start_time + duration: target_server = random.sample(servers, 1)[0] target_server = '%s:9160'%target_server try: pool = pycassa.connect('Keyspace1', [target_server]) cf = pycassa.ColumnFamily(pool, 'Standard1') row = cf.get('foo', read_consistency_level=consistency_level) pool.dispose() except: time.sleep(sleeptime) continue sq = int(row['seqnum']) ts = float(row['timestamp']) if sq seqnum: print 'Row changed: %i %f - %i %f'%(seqnum, timestamp, sq, ts) seqnum = sq timestamp = ts if sleeptime 0.0: time.sleep(sleeptime) On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote: James, Would you mind sharing your reader process code as well? On Fri, Apr 15, 2011 at 1:14 PM, James Cipar jci...@cmu.edu wrote: I've been experimenting with the consistency model of Cassandra, and I found something that seems a bit unexpected. In my experiment, I have 2 processes, a reader and a writer, each accessing a Cassandra cluster with a replication factor greater than 1. In addition, sometimes I generate background traffic to simulate a busy cluster by uploading a large data file to another table. The writer executes a loop where it writes a single row that contains just an sequentially increasing sequence number and a timestamp. In python this looks something like: while time.time() start_time + duration: target_server = random.sample(servers, 1)[0] target_server = '%s:9160'%target_server row = {'seqnum':str(seqnum), 'timestamp':str(time.time())} seqnum += 1 # print 'uploading to server %s, %s'%(target_server, row) pool = pycassa.connect('Keyspace1', [target_server]) cf = pycassa.ColumnFamily(pool, 'Standard1') cf.insert('foo', row, write_consistency_level=consistency_level) pool.dispose() if sleeptime 0.0: time.sleep(sleeptime) The reader simply executes a loop reading this row and reporting whenever a sequence number is *less* than the previous sequence number. As expected, with consistency_level=ConsistencyLevel.ONE there are many inconsistencies, especially with a high replication factor. What is unexpected is that I still detect inconsistencies when it is set at ConsistencyLevel.QUORUM. This is unexpected because the documentation seems to imply that QUORUM will give consistent results. With background traffic the average difference in timestamps was 0.6s, and the maximum was 3.5s. This means that a client sees a version of the row, and can subsequently see another version of the row that is 3.5s older than the previous. What I imagine is happening is this, but I'd like someone who knows that they're talking about to tell me if it's actually the case: I think Cassandra is not using an atomic commit protocol to commit to the quorum of servers chosen when the write is made. This means that at some point in the
Re: Consistency model
Tyler, your answer seems to contradict this email by Jonathan Ellis [1]. In it Jonathan says, The important guarantee this gives you is that once one quorum read sees the new value, all others will too. You can't see the newest version, then see an older version on a subsequent write [sic, I assume he meant read], which is the characteristic of non-strong consistency Jonathan also says, {X, Y} and {X, Z} are equivalent: one node with the write, and one without. The read will recognize that X's version needs to be sent to Z, and the write will be complete. This read and all subsequent ones will see the write. (Z [sic, I assume he meant Y] will be replicated to asynchronously via read repair.) To me, the statement this read and all subsequent ones will see the write implies that the new value must be committed to Y or Z before the read can return. If not, the statement must be false. Sean [1] : http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3caanlktimegp8h87mgs_bxzknck-a59whxf-xx58hca...@mail.gmail.com%3E Sean On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs ty...@datastax.com wrote: Here's what's probably happening: I'm assuming RF=3 and QUORUM writes/reads here. I'll call the replicas A, B, and C. 1. Writer process writes sequence number 1 and everything works fine. A, B, and C all have sequence number 1. 2. Writer process writes sequence number 2. Replica A writes successfully, B and C fail to respond in time, and a TimedOutException is returned. pycassa waits to retry the operation. 3. Reader process reads, gets a response from A and B. When the row from A and B is merged, sequence number 2 is the newest and is returned. A read repair is pushed to B and C, but they don't yet update their data. 4. Reader process reads again, gets a response from B and C (before they've repaired). These both report sequence number 1, so that's returned to the client. This is were you get a decreasing sequence number. 5. pycassa eventually retries the write; B and C eventually repair their data. Either way, both B and C shortly have sequence number 2. I've left out some of the details of read repair, and this scenario could happen in several slightly different ways, but it should give you an idea of what's happening. On Sat, Apr 16, 2011 at 8:35 PM, James Cipar jci...@cmu.edu wrote: Here it is. There is some setup code and global variable definitions that I left out of the previous code, but they are pretty similar to the setup code here. import pycassa import random import time consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM duration = 600 sleeptime = 0.0 hostlist = 'worker-hostlist' def read_servers(fn): f = open(fn) servers = [] for line in f: servers.append(line.strip()) f.close() return servers servers = read_servers(hostlist) start_time = time.time() seqnum = -1 timestamp = 0 while time.time() start_time + duration: target_server = random.sample(servers, 1)[0] target_server = '%s:9160'%target_server try: pool = pycassa.connect('Keyspace1', [target_server]) cf = pycassa.ColumnFamily(pool, 'Standard1') row = cf.get('foo', read_consistency_level=consistency_level) pool.dispose() except: time.sleep(sleeptime) continue sq = int(row['seqnum']) ts = float(row['timestamp']) if sq seqnum: print 'Row changed: %i %f - %i %f'%(seqnum, timestamp, sq, ts) seqnum = sq timestamp = ts if sleeptime 0.0: time.sleep(sleeptime) On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote: James, Would you mind sharing your reader process code as well? On Fri, Apr 15, 2011 at 1:14 PM, James Cipar jci...@cmu.edu wrote: I've been experimenting with the consistency model of Cassandra, and I found something that seems a bit unexpected. In my experiment, I have 2 processes, a reader and a writer, each accessing a Cassandra cluster with a replication factor greater than 1. In addition, sometimes I generate background traffic to simulate a busy cluster by uploading a large data file to another table. The writer executes a loop where it writes a single row that contains just an sequentially increasing sequence number and a timestamp. In python this looks something like: while time.time() start_time + duration: target_server = random.sample(servers, 1)[0] target_server = '%s:9160'%target_server row = {'seqnum':str(seqnum), 'timestamp':str(time.time())} seqnum += 1 # print 'uploading to server %s, %s'%(target_server, row) pool = pycassa.connect('Keyspace1', [target_server]) cf = pycassa.ColumnFamily(pool, 'Standard1') cf.insert('foo', row,
RE: Consistency model
So Cassandra does not use an atomic commit protocol at the cluster level. Strong consistency on a quorum read is only guaranteed *after* a successful quorum write. The behaviour you are seeing is possible if you are reading in the middle of a write or the write failed (which should be reported to your code via an exception). Dan -Original Message- From: James Cipar [mailto:jci...@cmu.edu] Sent: April-15-11 14:15 To: user@cassandra.apache.org Subject: Consistency model I've been experimenting with the consistency model of Cassandra, and I found something that seems a bit unexpected. In my experiment, I have 2 processes, a reader and a writer, each accessing a Cassandra cluster with a replication factor greater than 1. In addition, sometimes I generate background traffic to simulate a busy cluster by uploading a large data file to another table. The writer executes a loop where it writes a single row that contains just an sequentially increasing sequence number and a timestamp. In python this looks something like: while time.time() start_time + duration: target_server = random.sample(servers, 1)[0] target_server = '%s:9160'%target_server row = {'seqnum':str(seqnum), 'timestamp':str(time.time())} seqnum += 1 # print 'uploading to server %s, %s'%(target_server, row) pool = pycassa.connect('Keyspace1', [target_server]) cf = pycassa.ColumnFamily(pool, 'Standard1') cf.insert('foo', row, write_consistency_level=consistency_level) pool.dispose() if sleeptime 0.0: time.sleep(sleeptime) The reader simply executes a loop reading this row and reporting whenever a sequence number is *less* than the previous sequence number. As expected, with consistency_level=ConsistencyLevel.ONE there are many inconsistencies, especially with a high replication factor. What is unexpected is that I still detect inconsistencies when it is set at ConsistencyLevel.QUORUM. This is unexpected because the documentation seems to imply that QUORUM will give consistent results. With background traffic the average difference in timestamps was 0.6s, and the maximum was 3.5s. This means that a client sees a version of the row, and can subsequently see another version of the row that is 3.5s older than the previous. What I imagine is happening is this, but I'd like someone who knows that they're talking about to tell me if it's actually the case: I think Cassandra is not using an atomic commit protocol to commit to the quorum of servers chosen when the write is made. This means that at some point in the middle of the write, some subset of the quorum have seen the write, while others have not. At this time, there is a quorum of servers that have not seen the update, so depending on which quorum the client reads from, it may or may not see the update. Of course, I understand that the client is not *choosing* a bad quorum to read from, it is just the first `q` servers to respond, but in this case it is effectively random and sometimes an bad quorum is chosen. Does anyone have any other insight into what is going on here?= No virus found in this incoming message. Checked by AVG - www.avg.com Version: 9.0.894 / Virus Database: 271.1.1/3574 - Release Date: 04/15/11 02:34:00