Re: Crash when uploading large data sets
-XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=8080 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dlog4j.configuration=log4j-server.properties -Dlog4j.defaultInitOverride=true
java_command: org.apache.cassandra.thrift.CassandraDaemon
Launcher Type: SUN_STANDARD

Environment Variables:
PATH=/h/jcipar/SOFTWARE/ROOTS/Linux/x86_64/bin:/h/jcipar/bin:/h/jcipar/SOFTWARE/ROOTS/All/bin:/h/jcipar/SOFTWARE/ant/apache-ant-1.8.1/bin/:~mabdelm/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
LD_LIBRARY_PATH=/usr/lib/jvm/java-6-openjdk/jre/lib/amd64/server:/usr/lib/jvm/java-6-openjdk/jre/lib/amd64:/usr/lib/jvm/java-6-openjdk/jre/../lib/amd64
SHELL=/bin/bash

Signal Handlers:
SIGSEGV: [libjvm.so+0x5d2630], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGBUS: [libjvm.so+0x5d2630], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGFPE: [libjvm.so+0x4ab9d0], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGPIPE: [libjvm.so+0x4ab9d0], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGXFSZ: [libjvm.so+0x4ab9d0], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGILL: [libjvm.so+0x4ab9d0], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGUSR1: SIG_DFL, sa_mask[0]=0x, sa_flags=0x
SIGUSR2: [libjvm.so+0x4ab380], sa_mask[0]=0x, sa_flags=0x1004
SIGHUP: [libjvm.so+0x4ad520], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGINT: SIG_IGN, sa_mask[0]=0x, sa_flags=0x
SIGTERM: [libjvm.so+0x4ad520], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGQUIT: [libjvm.so+0x4ad520], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004

--- S Y S T E M ---

OS:5.0.6
uname:Linux 2.6.26-2-amd64 #1 SMP Thu Sep 16 15:56:38 UTC 2010 x86_64
libc:glibc 2.7 NPTL 2.7
rlimit: STACK 8192k, CORE 0k, NPROC 124096, NOFILE 1024, AS infinity
load average:3.09 3.56 3.72
CPU:total 8 (1 cores per cpu, 1 threads per core) family 6 model 2 stepping 3, cmov, cx8, fxsr, mmx, sse, sse2, sse3
Memory: 4k page, physical 15075756k(6082384k free), swap 0k(0k free)
vm_info: OpenJDK 64-Bit Server VM (1.6.0_0-b11) for linux-amd64 JRE (1.6.0_0-b11), built on Apr 9 2009 19:35:18 by pbuilder with gcc 4.3.2
time: Tue May 10 13:01:39 2011
elapsed time: 2175 seconds

On May 12, 2011, at 9:30 PM, Jeffrey Kesselman wrote:

Is this a 64bit VM? A 32bit Java VM with default c-heap settings can only actually use about 2GB of Java Heap.

On Thu, May 12, 2011 at 8:08 PM, James Cipar jci...@cmu.edu wrote:

Oh, forgot this detail: I have no swap configured, so swapping is not the cause of the crash. Could it be that I'm running out of memory on a 15GB machine? That seems unlikely. I grepped dmesg for oom and didn't see anything from the oom killer, and I used the instructions from the following web page and didn't see that the oom killer had killed anything.

http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer

jcipar@172-19-149-62:~$ sudo cat /var/log/messages | grep --ignore-case "killed process"
jcipar@172-19-149-62:~$

Also, this is pretty subjective, so I can't say for sure until it finishes, but this seems to be running *much* slower after setting the heap size and setting up JNA.

On May 12, 2011, at 7:52 PM, James Cipar wrote:

It looks like MAX_HEAP_SIZE is set in cassandra-env.sh to be half of my physical memory. These are 15GB VMs, so that's 7.5GB for Cassandra. I would have expected that to work, but I will override to 13 GB just to see what happens. I've also got the JNA thing set up. Do you think this would cause the crashes, or is it just a performance improvement?

On May 12, 2011, at 7:27 PM, Sameer Farooqui wrote:

The key JVM options for Cassandra are in cassandra.in.sh. What are your min and max heap sizes? The default setting of max heap size is 1GB. How much RAM do your nodes have? You may want to increase this setting. You can also set the -Xmx and -Xms options to the same value to keep Java from having to manage heap growth. On a 32-bit machine, you can get a max of about 1.6 GB of heap; you can get a lot more on 64-bit. Try messing with some of the other settings in the cassandra.in.sh file.

You may not have DEBUG mode turned on for Cassandra and therefore may not be getting the full details of what's going on when the server crashes. In the cassandra-home/conf/log4j-server.properties file, set this line from the default of INFO to DEBUG:

log4j.rootLogger=INFO,stdout,R

Also, you haven't configured JNA on this server. Here's some info about it and how to configure it:

JNA provides Java programs easy access to native shared libraries without writing anything but Java code.

Note from Cassandra developers for why JNA is needed:

Linux aggressively swaps out infrequently used memory to make more room
Crash when uploading large data sets
I'm using Cassandra 0.7.5, and uploading about 200 GB of data total (20 GB unique data), to a cluster of 10 servers. I'm using batch_mutate, and breaking the data up into chunks of about 10k records. Each record is about 5KB, so a total of about 50MB per batch. When I upload a smaller 2 GB data set, everything works fine. When I upload the 20 GB data set, servers will occasionally crash. Currently I have my client code automatically detect this and restart the server, but that is less than ideal.

I'm not sure what information to gather to determine what's going on here. Here is a sample of a log file from when a crash occurred. The crash was immediately after the log entry tagged 2011-05-12 19:02:19,377. Any idea what's going on here? Any other info I can gather to try to debug this?

INFO [ScheduledTasks:1] 2011-05-12 19:02:07,855 GCInspector.java (line 128) GC for ParNew: 375 ms, 576641232 reclaimed leaving 5471432144 used; max is 7774142464
INFO [ScheduledTasks:1] 2011-05-12 19:02:08,857 GCInspector.java (line 128) GC for ParNew: 450 ms, -63738232 reclaimed leaving 5546942544 used; max is 7774142464
INFO [COMMIT-LOG-WRITER] 2011-05-12 19:02:10,652 CommitLogSegment.java (line 50) Creating new commitlog segment /mnt/scratch/jcipar/cassandra/commitlog/CommitLog-1305241330652.log
INFO [MutationStage:24] 2011-05-12 19:02:10,680 ColumnFamilyStore.java (line 1070) Enqueuing flush of Memtable-Standard1@1256245282(51921529 bytes, 1115783 operations)
INFO [FlushWriter:1] 2011-05-12 19:02:10,680 Memtable.java (line 158) Writing Memtable-Standard1@1256245282(51921529 bytes, 1115783 operations)
INFO [ScheduledTasks:1] 2011-05-12 19:02:12,932 GCInspector.java (line 128) GC for ParNew: 249 ms, 571827736 reclaimed leaving 3165899760 used; max is 7774142464
INFO [ScheduledTasks:1] 2011-05-12 19:02:15,253 GCInspector.java (line 128) GC for ParNew: 341 ms, 561823592 reclaimed leaving 1764208800 used; max is 7774142464
INFO [FlushWriter:1] 2011-05-12 19:02:16,743 Memtable.java (line 165) Completed flushing /mnt/scratch/jcipar/cassandra/data/Keyspace1/Standard1-f-74-Data.db (53646223 bytes)
INFO [COMMIT-LOG-WRITER] 2011-05-12 19:02:16,745 CommitLog.java (line 440) Discarding obsolete commit log:CommitLogSegment(/mnt/scratch/jcipar/cassandra/commitlog/CommitLog-1305241306438.log)
INFO [ScheduledTasks:1] 2011-05-12 19:02:18,256 GCInspector.java (line 128) GC for ParNew: 305 ms, 544491840 reclaimed leaving 865198712 used; max is 7774142464
INFO [MutationStage:19] 2011-05-12 19:02:19,000 ColumnFamilyStore.java (line 1070) Enqueuing flush of Memtable-Standard1@479849353(51941121 bytes, 1115783 operations)
INFO [FlushWriter:1] 2011-05-12 19:02:19,000 Memtable.java (line 158) Writing Memtable-Standard1@479849353(51941121 bytes, 1115783 operations)
INFO [NonPeriodicTasks:1] 2011-05-12 19:02:19,310 SSTable.java (line 147) Deleted /mnt/scratch/jcipar/cassandra/data/Keyspace1/Standard1-f-51
INFO [NonPeriodicTasks:1] 2011-05-12 19:02:19,324 SSTable.java (line 147) Deleted /mnt/scratch/jcipar/cassandra/data/Keyspace1/Standard1-f-55
INFO [NonPeriodicTasks:1] 2011-05-12 19:02:19,339 SSTable.java (line 147) Deleted /mnt/scratch/jcipar/cassandra/data/Keyspace1/Standard1-f-58
INFO [NonPeriodicTasks:1] 2011-05-12 19:02:19,357 SSTable.java (line 147) Deleted /mnt/scratch/jcipar/cassandra/data/Keyspace1/Standard1-f-67
INFO [NonPeriodicTasks:1] 2011-05-12 19:02:19,377 SSTable.java (line 147) Deleted /mnt/scratch/jcipar/cassandra/data/Keyspace1/Standard1-f-61
INFO [main] 2011-05-12 19:02:21,026 AbstractCassandraDaemon.java (line 78) Logging initialized
INFO [main] 2011-05-12 19:02:21,040 AbstractCassandraDaemon.java (line 96) Heap size: 7634681856/7635730432
INFO [main] 2011-05-12 19:02:21,042 CLibrary.java (line 61) JNA not found. Native methods will be disabled.
INFO [main] 2011-05-12 19:02:21,052 DatabaseDescriptor.java (line 121) Loading settings from file:/h/jcipar/Projects/HP/OtherDBs/Cassandra/apache-cassandra-0.7.5/conf/cassandra.yaml
INFO [main] 2011-05-12 19:02:21,178 DatabaseDescriptor.java (line 181) DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
INFO [main] 2011-05-12 19:02:21,310 SSTableReader.java (line 154) Opening /mnt/scratch/jcipar/cassandra/data/system/Schema-f-1
INFO [main] 2011-05-12 19:02:21,327 SSTableReader.java (line 154) Opening /mnt/scratch/jcipar/cassandra/data/system/Schema-f-2
INFO [main] 2011-05-12 19:02:21,336 SSTableReader.java (line 154) Opening /mnt/scratch/jcipar/cassandra/data/system/Migrations-f-1
INFO [main] 2011-05-12 19:02:21,337 SSTableReader.java (line 154) Opening /mnt/scratch/jcipar/cassandra/data/system/Migrations-f-2
INFO [main] 2011-05-12 19:02:21,342 SSTableReader.java (line 154) Opening /mnt/scratch/jcipar/cassandra/data/system/LocationInfo-f-2
INFO [main] 2011-05-12 19:02:21,344 SSTableReader.java (line 154) Opening /mnt/scratch/jcipar/cassandra/data/system/LocationInfo-f-1
INFO
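One way to check whether heap pressure lines up with the crash is to pull the numbers out of the GCInspector entries in the log. The following is only a sketch: the regular expression is written against the GCInspector format visible in this excerpt, and the names `GC_LINE`, `parse_gc_line`, and `heap_fraction` are made up for illustration, not part of Cassandra.

```python
import re

# Pattern for the GCInspector entries as they appear in the excerpt above.
# "reclaimed" may be negative, as in the 19:02:08,857 entry.
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms, "
    r"(?P<reclaimed>-?\d+) reclaimed leaving (?P<used>\d+) used; "
    r"max is (?P<max>\d+)"
)

def parse_gc_line(line):
    """Return a dict of GC statistics, or None if the line is not a GC entry."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    stats = {}
    for key in ("ms", "reclaimed", "used", "max"):
        stats[key] = int(m.group(key))
    stats["collector"] = m.group("collector")
    # Fraction of the maximum heap still in use after this collection.
    stats["heap_fraction"] = stats["used"] / float(stats["max"])
    return stats

# Three GC entries copied from the log excerpt.
sample = [
    "INFO [ScheduledTasks:1] 2011-05-12 19:02:07,855 GCInspector.java (line 128) "
    "GC for ParNew: 375 ms, 576641232 reclaimed leaving 5471432144 used; max is 7774142464",
    "INFO [ScheduledTasks:1] 2011-05-12 19:02:08,857 GCInspector.java (line 128) "
    "GC for ParNew: 450 ms, -63738232 reclaimed leaving 5546942544 used; max is 7774142464",
    "INFO [ScheduledTasks:1] 2011-05-12 19:02:18,256 GCInspector.java (line 128) "
    "GC for ParNew: 305 ms, 544491840 reclaimed leaving 865198712 used; max is 7774142464",
]

for line in sample:
    stats = parse_gc_line(line)
    if stats is not None:
        print("%4d ms pause, heap %.0f%% of max" % (stats["ms"], stats["heap_fraction"] * 100))
```

On this excerpt, the last GC entry before the restart reports well under a quarter of the maximum heap in use, which would argue against plain Java heap exhaustion, though a single log sample is not conclusive.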
Re: Crash when uploading large data sets
It looks like MAX_HEAP_SIZE is set in cassandra-env.sh to be half of my physical memory. These are 15GB VMs, so that's 7.5GB for Cassandra. I would have expected that to work, but I will override to 13 GB just to see what happens. I've also got the JNA thing set up. Do you think this would cause the crashes, or is it just a performance improvement?

On May 12, 2011, at 7:27 PM, Sameer Farooqui wrote:

The key JVM options for Cassandra are in cassandra.in.sh. What are your min and max heap sizes? The default setting of max heap size is 1GB. How much RAM do your nodes have? You may want to increase this setting. You can also set the -Xmx and -Xms options to the same value to keep Java from having to manage heap growth. On a 32-bit machine, you can get a max of about 1.6 GB of heap; you can get a lot more on 64-bit. Try messing with some of the other settings in the cassandra.in.sh file.

You may not have DEBUG mode turned on for Cassandra and therefore may not be getting the full details of what's going on when the server crashes. In the cassandra-home/conf/log4j-server.properties file, set this line from the default of INFO to DEBUG:

log4j.rootLogger=INFO,stdout,R

Also, you haven't configured JNA on this server. Here's some info about it and how to configure it:

JNA provides Java programs easy access to native shared libraries without writing anything but Java code.

Note from Cassandra developers for why JNA is needed:

Linux aggressively swaps out infrequently used memory to make more room for its file system buffer cache. Unfortunately, modern generational garbage collectors like the JVM's leave parts of its heap un-touched for relatively large amounts of time, leading Linux to swap it out. When the JVM finally goes to use or GC that memory, swap hell ensues. Setting swappiness to zero can mitigate this behavior but does not eliminate it entirely. Turning off swap entirely is effective. But to avoid surprising people who don't know about this behavior, the best solution is to tell Linux not to swap out the JVM, and that is what we do now with mlockall via JNA.

Because of licensing issues, we can't distribute JNA with Cassandra, so you must manually add it to the Cassandra lib/ directory or otherwise place it on the classpath. If the JNA jar is not present, Cassandra will continue as before.

Get JNA with:

cd ~
wget http://debian.riptano.com/debian/pool/libjna-java_3.2.7-0~nmu.2_amd64.deb

To install:

techlabs@cassandraN1:~$ sudo dpkg -i libjna-java_3.2.7-0~nmu.2_amd64.deb
(Reading database ... 44334 files and directories currently installed.)
Preparing to replace libjna-java 3.2.4-2 (using libjna-java_3.2.7-0~nmu.2_amd64.deb) ...
Unpacking replacement libjna-java ...
Setting up libjna-java (3.2.7-0~nmu.2) ...

The deb package will install the JNA jar file to /usr/share/java/jna.jar, but Cassandra only loads it if it's in the class path. The easy way to do this is to create a symlink into your Cassandra lib directory (note: replace /home/techlabs with your home dir location):

ln -s /usr/share/java/jna.jar /home/techlabs/apache-cassandra-0.7.0/lib

Research: http://journal.paul.querna.org/articles/2010/11/11/enabling-jna-in-cassandra/

- Sameer

On Thu, May 12, 2011 at 4:15 PM, James Cipar jci...@cmu.edu wrote:

I'm using Cassandra 0.7.5, and uploading about 200 GB of data total (20 GB unique data), to a cluster of 10 servers. I'm using batch_mutate, and breaking the data up into chunks of about 10k records. Each record is about 5KB, so a total of about 50MB per batch. When I upload a smaller 2 GB data set, everything works fine. When I upload the 20 GB data set, servers will occasionally crash. Currently I have my client code automatically detect this and restart the server, but that is less than ideal.

I'm not sure what information to gather to determine what's going on here. Here is a sample of a log file from when a crash occurred. The crash was immediately after the log entry tagged 2011-05-12 19:02:19,377. Any idea what's going on here? Any other info I can gather to try to debug this?

INFO [ScheduledTasks:1] 2011-05-12 19:02:07,855 GCInspector.java (line 128) GC for ParNew: 375 ms, 576641232 reclaimed leaving 5471432144 used; max is 7774142464
INFO [ScheduledTasks:1] 2011-05-12 19:02:08,857 GCInspector.java (line 128) GC for ParNew: 450 ms, -63738232 reclaimed leaving 5546942544 used; max is 7774142464
INFO [COMMIT-LOG-WRITER] 2011-05-12 19:02:10,652 CommitLogSegment.java (line 50) Creating new commitlog segment /mnt/scratch/jcipar/cassandra/commitlog/CommitLog-1305241330652.log
INFO [MutationStage:24] 2011-05-12 19:02:10,680 ColumnFamilyStore.java (line 1070) Enqueuing flush of Memtable-Standard1@1256245282(51921529 bytes, 1115783 operations)
INFO [FlushWriter:1] 2011-05-12 19:02:10,680 Memtable.java (line 158
Re: Crash when uploading large data sets
Oh, forgot this detail: I have no swap configured, so swapping is not the cause of the crash. Could it be that I'm running out of memory on a 15GB machine? That seems unlikely. I grepped dmesg for oom and didn't see anything from the oom killer, and I used the instructions from the following web page and didn't see that the oom killer had killed anything.

http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer

jcipar@172-19-149-62:~$ sudo cat /var/log/messages | grep --ignore-case "killed process"
jcipar@172-19-149-62:~$

Also, this is pretty subjective, so I can't say for sure until it finishes, but this seems to be running *much* slower after setting the heap size and setting up JNA.

On May 12, 2011, at 7:52 PM, James Cipar wrote:

It looks like MAX_HEAP_SIZE is set in cassandra-env.sh to be half of my physical memory. These are 15GB VMs, so that's 7.5GB for Cassandra. I would have expected that to work, but I will override to 13 GB just to see what happens. I've also got the JNA thing set up. Do you think this would cause the crashes, or is it just a performance improvement?

On May 12, 2011, at 7:27 PM, Sameer Farooqui wrote:

The key JVM options for Cassandra are in cassandra.in.sh. What are your min and max heap sizes? The default setting of max heap size is 1GB. How much RAM do your nodes have? You may want to increase this setting. You can also set the -Xmx and -Xms options to the same value to keep Java from having to manage heap growth. On a 32-bit machine, you can get a max of about 1.6 GB of heap; you can get a lot more on 64-bit. Try messing with some of the other settings in the cassandra.in.sh file.

You may not have DEBUG mode turned on for Cassandra and therefore may not be getting the full details of what's going on when the server crashes. In the cassandra-home/conf/log4j-server.properties file, set this line from the default of INFO to DEBUG:

log4j.rootLogger=INFO,stdout,R

Also, you haven't configured JNA on this server. Here's some info about it and how to configure it:

JNA provides Java programs easy access to native shared libraries without writing anything but Java code.

Note from Cassandra developers for why JNA is needed:

Linux aggressively swaps out infrequently used memory to make more room for its file system buffer cache. Unfortunately, modern generational garbage collectors like the JVM's leave parts of its heap un-touched for relatively large amounts of time, leading Linux to swap it out. When the JVM finally goes to use or GC that memory, swap hell ensues. Setting swappiness to zero can mitigate this behavior but does not eliminate it entirely. Turning off swap entirely is effective. But to avoid surprising people who don't know about this behavior, the best solution is to tell Linux not to swap out the JVM, and that is what we do now with mlockall via JNA.

Because of licensing issues, we can't distribute JNA with Cassandra, so you must manually add it to the Cassandra lib/ directory or otherwise place it on the classpath. If the JNA jar is not present, Cassandra will continue as before.

Get JNA with:

cd ~
wget http://debian.riptano.com/debian/pool/libjna-java_3.2.7-0~nmu.2_amd64.deb

To install:

techlabs@cassandraN1:~$ sudo dpkg -i libjna-java_3.2.7-0~nmu.2_amd64.deb
(Reading database ... 44334 files and directories currently installed.)
Preparing to replace libjna-java 3.2.4-2 (using libjna-java_3.2.7-0~nmu.2_amd64.deb) ...
Unpacking replacement libjna-java ...
Setting up libjna-java (3.2.7-0~nmu.2) ...

The deb package will install the JNA jar file to /usr/share/java/jna.jar, but Cassandra only loads it if it's in the class path. The easy way to do this is to create a symlink into your Cassandra lib directory (note: replace /home/techlabs with your home dir location):

ln -s /usr/share/java/jna.jar /home/techlabs/apache-cassandra-0.7.0/lib

Research: http://journal.paul.querna.org/articles/2010/11/11/enabling-jna-in-cassandra/

- Sameer

On Thu, May 12, 2011 at 4:15 PM, James Cipar jci...@cmu.edu wrote:

I'm using Cassandra 0.7.5, and uploading about 200 GB of data total (20 GB unique data), to a cluster of 10 servers. I'm using batch_mutate, and breaking the data up into chunks of about 10k records. Each record is about 5KB, so a total of about 50MB per batch. When I upload a smaller 2 GB data set, everything works fine. When I upload the 20 GB data set, servers will occasionally crash. Currently I have my client code automatically detect this and restart the server, but that is less than ideal.

I'm not sure what information to gather to determine what's going on here. Here is a sample of a log file from when a crash occurred. The crash was immediately after the log entry tagged 2011-05-12 19:02:19,377. Any idea what's going on here? Any other info I can
Re: Consistency model
That's what I thought was happening, yes. A careful reading of the documentation suggests that this is correct behavior.

Tyler says this can also occur because of a TimedOutException on the writes. This worries me because TimedOutExceptions are so frequent (at least on my test cluster); it means that using quorum reads and writes is not sufficient for consistency. Any application that wants consistency needs to have some external way of synchronizing readers and writers so that readers don't read in the middle of a write or in the writer's retry loop.

Does anyone have any intuition about whether this will happen with consistency_level=ALL? I will try it today, but I'd like to know what the expected behavior is. It seems like it would not happen in this case.

On Apr 17, 2011, at 3:01 PM, William Oberman wrote:

James: I feel like I understand what's going on in your code now based on this discussion, and I'm ok with the fact that DURING a QW you can get transitional results from a QR in another process (or either the before or after state of the QW). But once the QW succeeds, you must get the new value. That's what we're all saying now, right?

In your read, read, read case, all 3 reads are happening during a QW, and some of them see the before and some of them see the after (that's why I specifically said single threaded, not because it's a single thread per se, but because a single thread can't read during a write by definition).

will

On Sun, Apr 17, 2011 at 1:27 PM, Milind Parikh milindpar...@gmail.com wrote:

Same process or not: only successful QR reads after successful QW will behave with this guarantee.

/*** sent from my android...please pardon occasional typos as I respond @ the speed of thought /

On Apr 17, 2011 10:04 AM, James Cipar jci...@cmu.edu wrote:

For a second, I thought this thread was saying I could see value(s) new value(s) within the same...

That's exactly what I'm saying. Within a single process I see this behavior, when reading with consistency_level=QUORUM:

Read value 1
Read value 2
Read value 1 # uh oh! we've gone backwards

On Apr 17, 2011, at 12:15 PM, William Oberman wrote:

Cool, that is exactly what I was thinkin...

--
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com
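The guarantee discussed above ("once the QW succeeds, you must get the new value") rests on every read quorum sharing at least one replica with every write quorum. A small brute-force check makes that concrete. This is a sketch only: `quorum` uses the usual majority size of N // 2 + 1, and `always_overlap` is a hypothetical helper that models replicas as plain integers, not anything from Cassandra or pycassa.

```python
from itertools import combinations

def quorum(n):
    """Majority quorum size for n replicas."""
    return n // 2 + 1

def always_overlap(n, r, w):
    """True if every set of r replicas shares a member with every set of w replicas.

    Brute force over all combinations, so only sensible for small n.
    """
    replicas = range(n)
    for readers in combinations(replicas, r):
        for writers in combinations(replicas, w):
            if not set(readers) & set(writers):
                return False
    return True

n = 3
q = quorum(n)
print("N=%d quorum=%d overlap=%s" % (n, q, always_overlap(n, q, q)))
print("N=%d ONE/ONE overlap=%s" % (n, always_overlap(n, 1, 1)))
```

The overlap argument only covers reads that start after a write has been acknowledged by a full quorum. It says nothing about reads issued while the write is still in flight or after it has timed out, which is the case this thread is about.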
Re: Consistency model
Here it is. There is some setup code and global variable definitions that I left out of the previous code, but they are pretty similar to the setup code here.

    import pycassa
    import random
    import time

    consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
    duration = 600
    sleeptime = 0.0
    hostlist = 'worker-hostlist'

    def read_servers(fn):
        # One server hostname per line.
        f = open(fn)
        servers = []
        for line in f:
            servers.append(line.strip())
        f.close()
        return servers

    servers = read_servers(hostlist)
    start_time = time.time()
    seqnum = -1
    timestamp = 0

    while time.time() < start_time + duration:
        # Pick a random server for every read.
        target_server = random.sample(servers, 1)[0]
        target_server = '%s:9160'%target_server
        try:
            pool = pycassa.connect('Keyspace1', [target_server])
            cf = pycassa.ColumnFamily(pool, 'Standard1')
            row = cf.get('foo', read_consistency_level=consistency_level)
            pool.dispose()
        except:
            time.sleep(sleeptime)
            continue
        sq = int(row['seqnum'])
        ts = float(row['timestamp'])
        # Report whenever the sequence number goes backwards.
        if sq < seqnum:
            print 'Row changed: %i %f - %i %f'%(seqnum, timestamp, sq, ts)
        seqnum = sq
        timestamp = ts
        if sleeptime > 0.0:
            time.sleep(sleeptime)

On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:

James, Would you mind sharing your reader process code as well?

On Fri, Apr 15, 2011 at 1:14 PM, James Cipar jci...@cmu.edu wrote:

I've been experimenting with the consistency model of Cassandra, and I found something that seems a bit unexpected. In my experiment, I have 2 processes, a reader and a writer, each accessing a Cassandra cluster with a replication factor greater than 1. In addition, sometimes I generate background traffic to simulate a busy cluster by uploading a large data file to another table.

The writer executes a loop where it writes a single row that contains just a sequentially increasing sequence number and a timestamp.
In python this looks something like:

    while time.time() < start_time + duration:
        target_server = random.sample(servers, 1)[0]
        target_server = '%s:9160'%target_server
        row = {'seqnum':str(seqnum), 'timestamp':str(time.time())}
        seqnum += 1
        # print 'uploading to server %s, %s'%(target_server, row)
        pool = pycassa.connect('Keyspace1', [target_server])
        cf = pycassa.ColumnFamily(pool, 'Standard1')
        cf.insert('foo', row, write_consistency_level=consistency_level)
        pool.dispose()
        if sleeptime > 0.0:
            time.sleep(sleeptime)

The reader simply executes a loop reading this row and reporting whenever a sequence number is *less* than the previous sequence number. As expected, with consistency_level=ConsistencyLevel.ONE there are many inconsistencies, especially with a high replication factor.

What is unexpected is that I still detect inconsistencies when it is set at ConsistencyLevel.QUORUM. This is unexpected because the documentation seems to imply that QUORUM will give consistent results. With background traffic the average difference in timestamps was 0.6s, and the maximum was 3.5s. This means that a client sees a version of the row, and can subsequently see another version of the row that is 3.5s older than the previous.

What I imagine is happening is this, but I'd like someone who knows what they're talking about to tell me if it's actually the case:

I think Cassandra is not using an atomic commit protocol to commit to the quorum of servers chosen when the write is made. This means that at some point in the middle of the write, some subset of the quorum have seen the write, while others have not. At this time, there is a quorum of servers that have not seen the update, so depending on which quorum the client reads from, it may or may not see the update.

Of course, I understand that the client is not *choosing* a bad quorum to read from, it is just the first `q` servers to respond, but in this case it is effectively random and sometimes a bad quorum is chosen.

Does anyone have any other insight into what is going on here?

--
Tyler Hobbs
Software Engineer, DataStax
Maintainer of the pycassa Cassandra Python client library
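The reader's detection logic can be separated from the cluster access so it can be exercised on recorded readings. The sketch below is an illustration only: `find_regressions` and `staleness` are hypothetical helper names, and the readings are made up rather than taken from the experiment.

```python
def find_regressions(observations):
    """Return (previous, current) pairs where the sequence number went backwards.

    observations is an iterable of (seqnum, timestamp) tuples in the order
    they were read. Each reading is compared with the one just before it,
    matching the reader loop, which always updates its remembered value.
    """
    regressions = []
    previous = None
    for current in observations:
        if previous is not None and current[0] < previous[0]:
            regressions.append((previous, current))
        previous = current
    return regressions

def staleness(regressions):
    """Seconds by which each stale row is older than the row read before it."""
    return [prev[1] - cur[1] for prev, cur in regressions]

# Made-up readings: (seqnum, writer timestamp). The third read goes backwards.
reads = [(1, 100.0), (2, 100.5), (1, 100.0), (3, 101.0)]
found = find_regressions(reads)
gaps = staleness(found)
print("regressions: %d, max staleness: %.1fs" % (len(found), max(gaps)))
```

Averaging and taking the maximum of `gaps` over a real run would give figures comparable to the 0.6s and 3.5s quoted above, assuming those were computed from the writer's timestamps in the same way.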
Consistency model
I've been experimenting with the consistency model of Cassandra, and I found something that seems a bit unexpected. In my experiment, I have 2 processes, a reader and a writer, each accessing a Cassandra cluster with a replication factor greater than 1. In addition, sometimes I generate background traffic to simulate a busy cluster by uploading a large data file to another table.

The writer executes a loop where it writes a single row that contains just a sequentially increasing sequence number and a timestamp. In python this looks something like:

    while time.time() < start_time + duration:
        target_server = random.sample(servers, 1)[0]
        target_server = '%s:9160'%target_server
        row = {'seqnum':str(seqnum), 'timestamp':str(time.time())}
        seqnum += 1
        # print 'uploading to server %s, %s'%(target_server, row)
        pool = pycassa.connect('Keyspace1', [target_server])
        cf = pycassa.ColumnFamily(pool, 'Standard1')
        cf.insert('foo', row, write_consistency_level=consistency_level)
        pool.dispose()
        if sleeptime > 0.0:
            time.sleep(sleeptime)

The reader simply executes a loop reading this row and reporting whenever a sequence number is *less* than the previous sequence number. As expected, with consistency_level=ConsistencyLevel.ONE there are many inconsistencies, especially with a high replication factor.

What is unexpected is that I still detect inconsistencies when it is set at ConsistencyLevel.QUORUM. This is unexpected because the documentation seems to imply that QUORUM will give consistent results. With background traffic the average difference in timestamps was 0.6s, and the maximum was 3.5s. This means that a client sees a version of the row, and can subsequently see another version of the row that is 3.5s older than the previous.

What I imagine is happening is this, but I'd like someone who knows what they're talking about to tell me if it's actually the case:

I think Cassandra is not using an atomic commit protocol to commit to the quorum of servers chosen when the write is made. This means that at some point in the middle of the write, some subset of the quorum have seen the write, while others have not. At this time, there is a quorum of servers that have not seen the update, so depending on which quorum the client reads from, it may or may not see the update.

Of course, I understand that the client is not *choosing* a bad quorum to read from, it is just the first `q` servers to respond, but in this case it is effectively random and sometimes a bad quorum is chosen.

Does anyone have any other insight into what is going on here?
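The hypothesis above can be illustrated with a toy model. This is a sketch under loud assumptions: replicas are modelled as a plain list of (timestamp, value) pairs, a quorum read returns the newest value among the replicas it contacts, and read repair, hinted handoff, and timeouts are all left out. `quorum_read` and `possible_reads` are made-up names, and none of this is Cassandra code.

```python
from itertools import combinations

def quorum_read(replicas, members):
    """Newest (timestamp, value) pair among the contacted replicas."""
    return max(replicas[i] for i in members)

def possible_reads(replicas, q):
    """Every value a read could return, over all quorums of size q."""
    indexes = range(len(replicas))
    return set(quorum_read(replicas, m)[1] for m in combinations(indexes, q))

n, q = 3, 2
old = (1, "seqnum=1")
new = (2, "seqnum=2")

# Before the write: every replica holds the old row.
replicas = [old, old, old]

# Mid-write: only the first replica has applied the new row.
replicas[0] = new
mid_write = possible_reads(replicas, q)

# Write complete: a quorum of replicas has applied the new row.
replicas[1] = new
after_write = possible_reads(replicas, q)

print("mid-write reads:   %s" % sorted(mid_write))
print("after-write reads: %s" % sorted(after_write))
```

In this model a reader that happens to contact replicas 0 and 1 mid-write and then replicas 1 and 2 would see the new value followed by the old one, which matches the backwards step the reader reports. Once the write has reached a quorum, every read quorum returns the new value.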