Re: Consistency model

2011-04-18 Thread James Cipar
That's what I thought was happening, yes.  A careful reading of the 
documentation suggests that this is correct behavior.

Tyler says this can also occur because of a TimedOutException on the writes.  
This worries me because TimedOutExceptions are so frequent (at least on my 
test cluster) that using quorum reads and writes is not sufficient for 
consistency.  Any application that wants consistency needs some external way 
of synchronizing readers and writers so that readers don't read in the middle 
of a write or in the writer's retry loop.

Does anyone have any intuition about whether this will happen with 
consistency_level=ALL?  I will try it today, but I'd like to know what the 
expected behavior is.  It seems like it would not happen in this case.




On Apr 17, 2011, at 3:01 PM, William Oberman wrote:

 James: I feel like I understand what's going on in your code now based on 
 this discussion, and I'm OK with the fact that DURING a QW you can get 
 transitional results from a QR in another process (or either the before or 
 after state of the QW).  But once the QW succeeds, you must get the new 
 value.  That's what we're all saying now, right?  In your read, read, read 
 case, all 3 reads are happening during a QW, and some of them see the 
 before state and some see the after state (that's why I specifically said 
 single-threaded: not because it's a single thread per se, but because a 
 single thread can't read during its own write by definition).
 
 will
 
 On Sun, Apr 17, 2011 at 1:27 PM, Milind Parikh milindpar...@gmail.com wrote:
 Same process or not: only successful QR reads after successful QW will behave 
 with this guarantee.
 
 /***
 sent from my android...please pardon occasional typos as I respond @ the 
 speed of thought
 ****/
 
 
 On Apr 17, 2011 10:04 AM, James Cipar jci...@cmu.edu wrote:
 
  For a second, I thought this thread was saying I could see old value(s) and new 
  value(s) within the same...
 
 That's exactly what I'm saying.  Within a single process I see this 
 behavior when reading with consistency_level=QUORUM:
 
 Read value 1
 Read value 2
 Read value 1  # uh oh!  we've gone backwards
 
 
 
 
 
 On Apr 17, 2011, at 12:15 PM, William Oberman wrote:
 
  Cool, that is exactly what I was thinkin...
 
 
 
 
 
 -- 
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue, First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com



Re: Consistency model

2011-04-18 Thread Peter Schuller
 Does anyone have any intuition about whether this will happen with
 consistency_level=ALL?  I will try it today, but I'd like to know what the
 expected behavior is.  It seems like it would not happen in this case.

Assuming my understanding is correct (see my comment in the JIRA
ticket), I expect that you won't see the value reverting to an old
version in your tests. However, this is not guaranteed. A read at
CL.ALL will see the most recent value returned by *any* node. But
suppose a failed write is replicated to only one node, and that node
subsequently goes up in smoke and is replaced. Now you may revert to
the old data unless the JIRA ticket is attended to.

But then, use of CL.ALL kind of implies that you're not willing to
accept downtime of any node. It's important to keep in mind that if
you're looking for the kind of guarantee of not reverting to an
older value, the requirement to not lose a single node applies not
just to being up and serving reads, but also to maintaining
consistency over time.

So while I expect that CL.ALL would not fail the test, I would not use
that information to conclude that the correct course of action is to
use CL.ALL ;)

-- 
/ Peter Schuller


Re: Consistency model

2011-04-16 Thread Sean Bridges
If you are reading and writing at quorum, then what you are seeing
shouldn't happen.  You shouldn't be able to read N+1 until N+1 has
been committed to a quorum of servers.  At this point you should not
be able to read N anymore, since there is no quorum that contains N.
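The overlap argument here can be checked mechanically: any two majority quorums over the same replica set must share at least one node. A quick standalone sketch (plain Python, not Cassandra code; replica names are made up):

```python
from itertools import combinations

def majority_quorums(replicas):
    # All subsets of the replica set large enough to form a majority.
    q = len(replicas) // 2 + 1
    return [set(c) for c in combinations(replicas, q)]

replicas = ['A', 'B', 'C']
for write_quorum in majority_quorums(replicas):
    for read_quorum in majority_quorums(replicas):
        # Every read quorum shares at least one node with every write
        # quorum, which is why a read cannot return N once N+1 has been
        # committed to a full quorum.
        assert write_quorum & read_quorum

print('every write quorum overlaps every read quorum')
```

This is only the intersection property; as discussed in the thread, it says nothing about reads that race with an in-flight or timed-out write.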

Dan - I think you are right, except that quorum reads should be
consistent even during a quorum write.  You are not guaranteed to read
N+1 until *after* a successful quorum write of N+1, but once you see
N+1, you should never see N again, even if the write failed.

Sean

On Fri, Apr 15, 2011 at 1:29 PM, Dan Hendry dan.hendry.j...@gmail.com wrote:
 So Cassandra does not use an atomic commit protocol at the cluster level.
 Strong consistency on a quorum read is only guaranteed *after* a successful
 quorum write. The behaviour you are seeing is possible if you are reading in
 the middle of a write or the write failed (which should be reported to your
 code via an exception).

 Dan

 -Original Message-
 From: James Cipar [mailto:jci...@cmu.edu]
 Sent: April-15-11 14:15
 To: user@cassandra.apache.org
 Subject: Consistency model

 I've been experimenting with the consistency model of Cassandra, and I found
 something that seems a bit unexpected.  In my experiment, I have 2
 processes, a reader and a writer, each accessing a Cassandra cluster with a
 replication factor greater than 1.  In addition, sometimes I generate
 background traffic to simulate a busy cluster by uploading a large data file
 to another table.

 The writer executes a loop where it writes a single row that contains just
 a sequentially increasing sequence number and a timestamp.  In python this
 looks something like:

    while time.time() < start_time + duration:
        target_server = random.sample(servers, 1)[0]
        target_server = '%s:9160'%target_server

        row = {'seqnum':str(seqnum), 'timestamp':str(time.time())}
        seqnum += 1
        # print 'uploading to server %s, %s'%(target_server, row)


        pool = pycassa.connect('Keyspace1', [target_server])
        cf = pycassa.ColumnFamily(pool, 'Standard1')
        cf.insert('foo', row, write_consistency_level=consistency_level)
        pool.dispose()

        if sleeptime > 0.0:
            time.sleep(sleeptime)


 The reader simply executes a loop reading this row and reporting whenever a
 sequence number is *less* than the previous sequence number.  As expected,
 with consistency_level=ConsistencyLevel.ONE there are many inconsistencies,
 especially with a high replication factor.

 What is unexpected is that I still detect inconsistencies when it is set at
 ConsistencyLevel.QUORUM.  This is unexpected because the documentation seems
 to imply that QUORUM will give consistent results.  With background traffic
 the average difference in timestamps was 0.6s, and the maximum was 3.5s.
 This means that a client sees a version of the row, and can subsequently see
 another version of the row that is 3.5s older than the previous.

 What I imagine is happening is this, but I'd like someone who knows what
 they're talking about to tell me if it's actually the case:

 I think Cassandra is not using an atomic commit protocol to commit to the
 quorum of servers chosen when the write is made.  This means that at some
 point in the middle of the write, some subset of the quorum have seen the
 write, while others have not.  At this time, there is a quorum of servers
 that have not seen the update, so depending on which quorum the client reads
 from, it may or may not see the update.

 Of course, I understand that the client is not *choosing* a bad quorum to
 read from, it is just the first `q` servers to respond, but in this case it
 is effectively random and sometimes a bad quorum is chosen.

 Does anyone have any other insight into what is going on here?




Re: Consistency model

2011-04-16 Thread James Cipar
Here it is.  There is some setup code and global variable definitions that I 
left out of the previous code, but they are pretty similar to the setup code 
here.

import pycassa
import random
import time

consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
duration = 600
sleeptime = 0.0
hostlist = 'worker-hostlist'

def read_servers(fn):
    f = open(fn)
    servers = []
    for line in f:
        servers.append(line.strip())
    f.close()
    return servers

servers = read_servers(hostlist)
start_time = time.time()
seqnum = -1
timestamp = 0

while time.time() < start_time + duration:
    target_server = random.sample(servers, 1)[0]
    target_server = '%s:9160' % target_server

    try:
        pool = pycassa.connect('Keyspace1', [target_server])
        cf = pycassa.ColumnFamily(pool, 'Standard1')
        row = cf.get('foo', read_consistency_level=consistency_level)
        pool.dispose()
    except:
        time.sleep(sleeptime)
        continue

    sq = int(row['seqnum'])
    ts = float(row['timestamp'])

    if sq < seqnum:
        print 'Row changed: %i %f -> %i %f' % (seqnum, timestamp, sq, ts)
    seqnum = sq
    timestamp = ts

    if sleeptime > 0.0:
        time.sleep(sleeptime)




On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:

 James,
 
 Would you mind sharing your reader process code as well?
 
 On Fri, Apr 15, 2011 at 1:14 PM, James Cipar jci...@cmu.edu wrote:
 I've been experimenting with the consistency model of Cassandra, and I found 
 something that seems a bit unexpected.  In my experiment, I have 2 processes, 
 a reader and a writer, each accessing a Cassandra cluster with a replication 
 factor greater than 1.  In addition, sometimes I generate background traffic 
 to simulate a busy cluster by uploading a large data file to another table.
 
 The writer executes a loop where it writes a single row that contains just a 
 sequentially increasing sequence number and a timestamp.  In python this 
 looks something like:
 
    while time.time() < start_time + duration:
        target_server = random.sample(servers, 1)[0]
        target_server = '%s:9160' % target_server

        row = {'seqnum': str(seqnum), 'timestamp': str(time.time())}
        seqnum += 1
        # print 'uploading to server %s, %s' % (target_server, row)

        pool = pycassa.connect('Keyspace1', [target_server])
        cf = pycassa.ColumnFamily(pool, 'Standard1')
        cf.insert('foo', row, write_consistency_level=consistency_level)
        pool.dispose()

        if sleeptime > 0.0:
            time.sleep(sleeptime)
 
 
 The reader simply executes a loop reading this row and reporting whenever a 
 sequence number is *less* than the previous sequence number.  As expected, 
 with consistency_level=ConsistencyLevel.ONE there are many inconsistencies, 
 especially with a high replication factor.
 
 What is unexpected is that I still detect inconsistencies when it is set at 
 ConsistencyLevel.QUORUM.  This is unexpected because the documentation seems 
 to imply that QUORUM will give consistent results.  With background traffic 
 the average difference in timestamps was 0.6s, and the maximum was 3.5s.  
 This means that a client sees a version of the row, and can subsequently see 
 another version of the row that is 3.5s older than the previous.
 
 What I imagine is happening is this, but I'd like someone who knows what 
 they're talking about to tell me if it's actually the case:
 
 I think Cassandra is not using an atomic commit protocol to commit to the 
 quorum of servers chosen when the write is made.  This means that at some 
 point in the middle of the write, some subset of the quorum have seen the 
 write, while others have not.  At this time, there is a quorum of servers 
 that have not seen the update, so depending on which quorum the client reads 
 from, it may or may not see the update.
 
 Of course, I understand that the client is not *choosing* a bad quorum to 
 read from, it is just the first `q` servers to respond, but in this case it 
 is effectively random and sometimes a bad quorum is chosen.
 
 Does anyone have any other insight into what is going on here?
 
 
 
 -- 
 Tyler Hobbs
 Software Engineer, DataStax
 Maintainer of the pycassa Cassandra Python client library
 



Re: Consistency model

2011-04-16 Thread Tyler Hobbs
Here's what's probably happening:

I'm assuming RF=3 and QUORUM writes/reads here.  I'll call the replicas A,
B, and C.

1.  Writer process writes sequence number 1 and everything works fine.  A,
B, and C all have sequence number 1.
2.  Writer process writes sequence number 2.  Replica A writes successfully,
B and C fail to respond in time, and a TimedOutException is returned.
pycassa waits to retry the operation.
3.  Reader process reads, gets a response from A and B.  When the row from A
and B is merged, sequence number 2 is the newest and is returned.  A read
repair is pushed to B and C, but they don't yet update their data.
4.  Reader process reads again, gets a response from B and C (before they've
repaired).  These both report sequence number 1, so that's returned to the
client.  This is where you get a decreasing sequence number.
5.  pycassa eventually retries the write; B and C eventually repair their
data.  Either way, both B and C shortly have sequence number 2.

I've left out some of the details of read repair, and this scenario could
happen in several slightly different ways, but it should give you an idea of
what's happening.
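The five steps above can be sketched as a toy simulation (illustrative Python only: the replica map, the merge-by-timestamp rule, and the delayed read repair are deliberate simplifications, not Cassandra internals):

```python
# Each replica stores (seqnum, write_timestamp) for the row.
replicas = {'A': (1, 1.0), 'B': (1, 1.0), 'C': (1, 1.0)}  # step 1: all agree

def quorum_read(responding):
    # Merge rule: the value with the newest write timestamp wins.
    return max((replicas[n] for n in responding), key=lambda v: v[1])[0]

# Step 2: the write of seqnum 2 lands on A only; B and C time out.
replicas['A'] = (2, 2.0)

first = quorum_read(['A', 'B'])   # step 3: merge of {2, 1} -> 2
second = quorum_read(['B', 'C'])  # step 4: merge of {1, 1} -> 1, a decrease!

# Step 5: the retried write (or read repair) catches B and C up.
replicas['B'] = replicas['C'] = (2, 2.0)
third = quorum_read(['B', 'C'])   # back to 2, and it stays there

print(first, second, third)
```

The decreasing read in step 4 is exactly the `Read value 1 / 2 / 1` sequence James reported.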

On Sat, Apr 16, 2011 at 8:35 PM, James Cipar jci...@cmu.edu wrote:

 Here it is.  There is some setup code and global variable definitions that
 I left out of the previous code, but they are pretty similar to the setup
 code here.

 import pycassa
 import random
 import time

 consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
 duration = 600
 sleeptime = 0.0
 hostlist = 'worker-hostlist'

 def read_servers(fn):
     f = open(fn)
     servers = []
     for line in f:
         servers.append(line.strip())
     f.close()
     return servers

 servers = read_servers(hostlist)
 start_time = time.time()
 seqnum = -1
 timestamp = 0

 while time.time() < start_time + duration:
     target_server = random.sample(servers, 1)[0]
     target_server = '%s:9160' % target_server

     try:
         pool = pycassa.connect('Keyspace1', [target_server])
         cf = pycassa.ColumnFamily(pool, 'Standard1')
         row = cf.get('foo', read_consistency_level=consistency_level)
         pool.dispose()
     except:
         time.sleep(sleeptime)
         continue

     sq = int(row['seqnum'])
     ts = float(row['timestamp'])

     if sq < seqnum:
         print 'Row changed: %i %f -> %i %f' % (seqnum, timestamp, sq, ts)
     seqnum = sq
     timestamp = ts

     if sleeptime > 0.0:
         time.sleep(sleeptime)




 On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:

 James,

 Would you mind sharing your reader process code as well?

 On Fri, Apr 15, 2011 at 1:14 PM, James Cipar jci...@cmu.edu wrote:

 I've been experimenting with the consistency model of Cassandra, and I
 found something that seems a bit unexpected.  In my experiment, I have 2
 processes, a reader and a writer, each accessing a Cassandra cluster with a
 replication factor greater than 1.  In addition, sometimes I generate
 background traffic to simulate a busy cluster by uploading a large data file
 to another table.

 The writer executes a loop where it writes a single row that contains just
 a sequentially increasing sequence number and a timestamp.  In python this
 looks something like:

    while time.time() < start_time + duration:
        target_server = random.sample(servers, 1)[0]
        target_server = '%s:9160' % target_server

        row = {'seqnum': str(seqnum), 'timestamp': str(time.time())}
        seqnum += 1
        # print 'uploading to server %s, %s' % (target_server, row)

        pool = pycassa.connect('Keyspace1', [target_server])
        cf = pycassa.ColumnFamily(pool, 'Standard1')
        cf.insert('foo', row, write_consistency_level=consistency_level)
        pool.dispose()

        if sleeptime > 0.0:
            time.sleep(sleeptime)


 The reader simply executes a loop reading this row and reporting whenever
 a sequence number is *less* than the previous sequence number.  As expected,
 with consistency_level=ConsistencyLevel.ONE there are many inconsistencies,
 especially with a high replication factor.

 What is unexpected is that I still detect inconsistencies when it is set
 at ConsistencyLevel.QUORUM.  This is unexpected because the documentation
 seems to imply that QUORUM will give consistent results.  With background
 traffic the average difference in timestamps was 0.6s, and the maximum was
 3.5s.  This means that a client sees a version of the row, and can
 subsequently see another version of the row that is 3.5s older than the
 previous.

 What I imagine is happening is this, but I'd like someone who knows what
 they're talking about to tell me if it's actually the case:

 I think Cassandra is not using an atomic commit protocol to commit to the
 quorum of servers chosen when the write is made.  This means that at some
 point in the 

Re: Consistency model

2011-04-16 Thread Sean Bridges
Tyler, your answer seems to contradict this email by Jonathan Ellis
[1].  In it Jonathan says,

"The important guarantee this gives you is that once one quorum read
sees the new value, all others will too.  You can't see the newest
version, then see an older version on a subsequent write [sic, I
assume he meant read], which is the characteristic of non-strong
consistency."

Jonathan also says,

"{X, Y} and {X, Z} are equivalent: one node with the write, and one
without. The read will recognize that X's version needs to be sent to
Z, and the write will be complete.  This read and all subsequent ones
will see the write.  (Z [sic, I assume he meant Y] will be replicated
to asynchronously via read repair.)"

To me, the statement "this read and all subsequent ones will see the
write" implies that the new value must be committed to Y or Z before
the read can return.  If not, the statement must be false.

Sean


[1] : 
http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3caanlktimegp8h87mgs_bxzknck-a59whxf-xx58hca...@mail.gmail.com%3E

Sean

On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs ty...@datastax.com wrote:
 Here's what's probably happening:

 I'm assuming RF=3 and QUORUM writes/reads here.  I'll call the replicas A,
 B, and C.

 1.  Writer process writes sequence number 1 and everything works fine.  A,
 B, and C all have sequence number 1.
 2.  Writer process writes sequence number 2.  Replica A writes successfully,
 B and C fail to respond in time, and a TimedOutException is returned.
 pycassa waits to retry the operation.
 3.  Reader process reads, gets a response from A and B.  When the row from A
 and B is merged, sequence number 2 is the newest and is returned.  A read
 repair is pushed to B and C, but they don't yet update their data.
 4.  Reader process reads again, gets a response from B and C (before they've
 repaired).  These both report sequence number 1, so that's returned to the
 client.  This is where you get a decreasing sequence number.
 5.  pycassa eventually retries the write; B and C eventually repair their
 data.  Either way, both B and C shortly have sequence number 2.

 I've left out some of the details of read repair, and this scenario could
 happen in several slightly different ways, but it should give you an idea of
 what's happening.

 On Sat, Apr 16, 2011 at 8:35 PM, James Cipar jci...@cmu.edu wrote:

 Here it is.  There is some setup code and global variable definitions that
 I left out of the previous code, but they are pretty similar to the setup
 code here.
     import pycassa
     import random
     import time
     consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
     duration = 600
     sleeptime = 0.0
     hostlist = 'worker-hostlist'
     def read_servers(fn):
         f = open(fn)
         servers = []
         for line in f:
             servers.append(line.strip())
         f.close()
         return servers
     servers = read_servers(hostlist)
     start_time = time.time()
     seqnum = -1
     timestamp = 0
     while time.time() < start_time + duration:
         target_server = random.sample(servers, 1)[0]
         target_server = '%s:9160'%target_server
         try:
             pool = pycassa.connect('Keyspace1', [target_server])
             cf = pycassa.ColumnFamily(pool, 'Standard1')
             row = cf.get('foo', read_consistency_level=consistency_level)
             pool.dispose()
         except:
             time.sleep(sleeptime)
             continue
         sq = int(row['seqnum'])
         ts = float(row['timestamp'])
         if sq < seqnum:
             print 'Row changed: %i %f -> %i %f' % (seqnum, timestamp, sq, ts)
         seqnum = sq
         timestamp = ts
         if sleeptime > 0.0:
             time.sleep(sleeptime)



 On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:

 James,

 Would you mind sharing your reader process code as well?

 On Fri, Apr 15, 2011 at 1:14 PM, James Cipar jci...@cmu.edu wrote:

 I've been experimenting with the consistency model of Cassandra, and I
 found something that seems a bit unexpected.  In my experiment, I have 2
 processes, a reader and a writer, each accessing a Cassandra cluster with a
 replication factor greater than 1.  In addition, sometimes I generate
 background traffic to simulate a busy cluster by uploading a large data file
 to another table.

 The writer executes a loop where it writes a single row that contains
 just a sequentially increasing sequence number and a timestamp.  In python
 this looks something like:

    while time.time() < start_time + duration:
        target_server = random.sample(servers, 1)[0]
        target_server = '%s:9160'%target_server

        row = {'seqnum':str(seqnum), 'timestamp':str(time.time())}
        seqnum += 1
        # print 'uploading to server %s, %s'%(target_server, row)

        pool = pycassa.connect('Keyspace1', [target_server])
        cf = pycassa.ColumnFamily(pool, 'Standard1')
        cf.insert('foo', row, 

RE: Consistency model

2011-04-15 Thread Dan Hendry
So Cassandra does not use an atomic commit protocol at the cluster level.
Strong consistency on a quorum read is only guaranteed *after* a successful
quorum write. The behaviour you are seeing is possible if you are reading in
the middle of a write or the write failed (which should be reported to your
code via an exception). 
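As Dan notes, a failed write surfaces as an exception, and as Tyler's scenario shows, a timed-out write may already be applied on some replicas. A client retry loop therefore has to assume the write is idempotent (which a last-write-wins column insert is). A minimal sketch of such a loop, using a stand-in exception class rather than the real Thrift TimedOutException:

```python
import time

class TimedOutError(Exception):
    """Stand-in for the timeout exception a Cassandra client raises."""

def write_with_retry(do_insert, retries=3, backoff=0.1):
    # Retry on timeout only; other exceptions mean the write definitely
    # failed.  After a timeout the write may be partially applied, so
    # do_insert must be safe to repeat, e.g. re-inserting the same column.
    for attempt in range(retries):
        try:
            return do_insert()
        except TimedOutError:
            time.sleep(backoff * (attempt + 1))
    raise TimedOutError('gave up after %d attempts' % retries)

# Usage with a fake insert that times out once, then succeeds:
calls = {'n': 0}
def fake_insert():
    calls['n'] += 1
    if calls['n'] == 1:
        raise TimedOutError()
    return 'ok'

print(write_with_retry(fake_insert))  # ok
```

Note that between the timeout and the successful retry, readers can observe exactly the old/new flip-flopping reported earlier in the thread; the retry restores consistency but does not make the write atomic.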

Dan

-Original Message-
From: James Cipar [mailto:jci...@cmu.edu] 
Sent: April-15-11 14:15
To: user@cassandra.apache.org
Subject: Consistency model

I've been experimenting with the consistency model of Cassandra, and I found
something that seems a bit unexpected.  In my experiment, I have 2
processes, a reader and a writer, each accessing a Cassandra cluster with a
replication factor greater than 1.  In addition, sometimes I generate
background traffic to simulate a busy cluster by uploading a large data file
to another table.

The writer executes a loop where it writes a single row that contains just
a sequentially increasing sequence number and a timestamp.  In python this
looks something like:

while time.time() < start_time + duration:
    target_server = random.sample(servers, 1)[0]
    target_server = '%s:9160' % target_server

    row = {'seqnum': str(seqnum), 'timestamp': str(time.time())}
    seqnum += 1
    # print 'uploading to server %s, %s' % (target_server, row)

    pool = pycassa.connect('Keyspace1', [target_server])
    cf = pycassa.ColumnFamily(pool, 'Standard1')
    cf.insert('foo', row, write_consistency_level=consistency_level)
    pool.dispose()

    if sleeptime > 0.0:
        time.sleep(sleeptime)


The reader simply executes a loop reading this row and reporting whenever a
sequence number is *less* than the previous sequence number.  As expected,
with consistency_level=ConsistencyLevel.ONE there are many inconsistencies,
especially with a high replication factor.

What is unexpected is that I still detect inconsistencies when it is set at
ConsistencyLevel.QUORUM.  This is unexpected because the documentation seems
to imply that QUORUM will give consistent results.  With background traffic
the average difference in timestamps was 0.6s, and the maximum was 3.5s.
This means that a client sees a version of the row, and can subsequently see
another version of the row that is 3.5s older than the previous.

What I imagine is happening is this, but I'd like someone who knows what
they're talking about to tell me if it's actually the case:

I think Cassandra is not using an atomic commit protocol to commit to the
quorum of servers chosen when the write is made.  This means that at some
point in the middle of the write, some subset of the quorum have seen the
write, while others have not.  At this time, there is a quorum of servers
that have not seen the update, so depending on which quorum the client reads
from, it may or may not see the update.

Of course, I understand that the client is not *choosing* a bad quorum to
read from, it is just the first `q` servers to respond, but in this case it
is effectively random and sometimes a bad quorum is chosen.

Does anyone have any other insight into what is going on here?