Re: Crash when uploading large data sets

2011-05-13 Thread James Cipar
 
-XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 
-XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true 
-Dcom.sun.management.jmxremote.port=8080 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dlog4j.configuration=log4j-server.properties -Dlog4j.defaultInitOverride=true 
java_command: org.apache.cassandra.thrift.CassandraDaemon
Launcher Type: SUN_STANDARD

Environment Variables:
PATH=/h/jcipar/SOFTWARE/ROOTS/Linux/x86_64/bin:/h/jcipar/bin:/h/jcipar/SOFTWARE/ROOTS/All/bin:/h/jcipar/SOFTWARE/ant/apache-ant-1.8.1/bin/:~mabdelm/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
LD_LIBRARY_PATH=/usr/lib/jvm/java-6-openjdk/jre/lib/amd64/server:/usr/lib/jvm/java-6-openjdk/jre/lib/amd64:/usr/lib/jvm/java-6-openjdk/jre/../lib/amd64
SHELL=/bin/bash

Signal Handlers:
SIGSEGV: [libjvm.so+0x5d2630], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGBUS: [libjvm.so+0x5d2630], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGFPE: [libjvm.so+0x4ab9d0], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGPIPE: [libjvm.so+0x4ab9d0], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGXFSZ: [libjvm.so+0x4ab9d0], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGILL: [libjvm.so+0x4ab9d0], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGUSR1: SIG_DFL, sa_mask[0]=0x, sa_flags=0x
SIGUSR2: [libjvm.so+0x4ab380], sa_mask[0]=0x, sa_flags=0x1004
SIGHUP: [libjvm.so+0x4ad520], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGINT: SIG_IGN, sa_mask[0]=0x, sa_flags=0x
SIGTERM: [libjvm.so+0x4ad520], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004
SIGQUIT: [libjvm.so+0x4ad520], sa_mask[0]=0x7ffbfeff, sa_flags=0x1004


---  S Y S T E M  ---

OS:5.0.6

uname:Linux 2.6.26-2-amd64 #1 SMP Thu Sep 16 15:56:38 UTC 2010 x86_64
libc:glibc 2.7 NPTL 2.7 
rlimit: STACK 8192k, CORE 0k, NPROC 124096, NOFILE 1024, AS infinity
load average:3.09 3.56 3.72

CPU:total 8 (1 cores per cpu, 1 threads per core) family 6 model 2 stepping 3, 
cmov, cx8, fxsr, mmx, sse, sse2, sse3

Memory: 4k page, physical 15075756k(6082384k free), swap 0k(0k free)

vm_info: OpenJDK 64-Bit Server VM (1.6.0_0-b11) for linux-amd64 JRE 
(1.6.0_0-b11), built on Apr  9 2009 19:35:18 by pbuilder with gcc 4.3.2

time: Tue May 10 13:01:39 2011
elapsed time: 2175 seconds










On May 12, 2011, at 9:30 PM, Jeffrey Kesselman wrote:

 Is this a 64-bit VM?
 
 A 32bit Java VM with default c-heap settings can only actually use
 about 2GB of Java Heap.
 
 On Thu, May 12, 2011 at 8:08 PM, James Cipar jci...@cmu.edu wrote:
 Oh, forgot this detail:  I have no swap configured, so swapping is not the 
 cause of the crash.  Could it be that I'm running out of memory on a 15GB 
 machine?  That seems unlikely.  I grepped dmesg for oom and didn't see 
 anything from the oom killer, and I used the instructions from the following 
 web page and didn't see that the oom killer had killed anything.
 
 http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
 
 jcipar@172-19-149-62:~$ sudo cat /var/log/messages | grep --ignore-case killed process
 jcipar@172-19-149-62:~$
 
 
 
 Also, this is pretty subjective, so I can't say for sure until it finishes, 
 but this seems to be running *much* slower after setting the heap size and 
 setting up JNA.
 
 
 

Crash when uploading large data sets

2011-05-12 Thread James Cipar
I'm using Cassandra 0.7.5, and uploading about 200 GB of data total (20 GB 
unique data), to a cluster of 10 servers.  I'm using batch_mutate, and breaking 
the data up into chunks of about 10k records.  Each record is about 5KB, so a 
total of about 50MB per batch.  When I upload a smaller 2 GB data set, 
everything works fine.  When I upload the 20 GB data set, servers will 
occasionally crash.  Currently I have my client code automatically detect this 
and restart the server, but that is less than ideal.
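
As a rough sanity check on the batch sizing described above, here is a minimal sketch (not the actual upload client) that splits a record stream into fixed-size chunks and estimates the payload per batch. The helper name `chunked` and the decimal units (1 KB = 1000 bytes) are assumptions made for illustration only.

```python
from itertools import islice

def chunked(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

RECORD_BYTES = 5 * 1000     # "about 5KB" per record (decimal units assumed)
BATCH_RECORDS = 10 * 1000   # "about 10k records" per batch_mutate call

# Approximate payload of one batch, ignoring Thrift and column overhead.
batch_bytes = RECORD_BYTES * BATCH_RECORDS
print(batch_bytes / 1e6, "MB per batch")

# Number of batches needed for the 20 GB of unique data.
unique_bytes = 20 * 10**9
print(unique_bytes // batch_bytes, "batches for the unique data")

# Small demonstration of the chunking helper on 25 dummy records.
sizes = [len(b) for b in chunked(range(25), 10)]
print(sizes)
```

The same helper could be used with a smaller `batch_size` to test whether the crashes depend on the size of each batch.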

I'm not sure what information to gather to determine what's going on here.  
Here is a sample of a log file from when a crash occurred.  The crash was 
immediately after the log entry tagged 2011-05-12 19:02:19,377.  Any idea 
what's going on here?  Any other info I can gather to try to debug this?







 INFO [ScheduledTasks:1] 2011-05-12 19:02:07,855 GCInspector.java (line 128) GC for ParNew: 375 ms, 576641232 reclaimed leaving 5471432144 used; max is 7774142464
 INFO [ScheduledTasks:1] 2011-05-12 19:02:08,857 GCInspector.java (line 128) GC for ParNew: 450 ms, -63738232 reclaimed leaving 5546942544 used; max is 7774142464
 INFO [COMMIT-LOG-WRITER] 2011-05-12 19:02:10,652 CommitLogSegment.java (line 50) Creating new commitlog segment /mnt/scratch/jcipar/cassandra/commitlog/CommitLog-1305241330652.log
 INFO [MutationStage:24] 2011-05-12 19:02:10,680 ColumnFamilyStore.java (line 1070) Enqueuing flush of Memtable-Standard1@1256245282(51921529 bytes, 1115783 operations)
 INFO [FlushWriter:1] 2011-05-12 19:02:10,680 Memtable.java (line 158) Writing Memtable-Standard1@1256245282(51921529 bytes, 1115783 operations)
 INFO [ScheduledTasks:1] 2011-05-12 19:02:12,932 GCInspector.java (line 128) GC for ParNew: 249 ms, 571827736 reclaimed leaving 3165899760 used; max is 7774142464
 INFO [ScheduledTasks:1] 2011-05-12 19:02:15,253 GCInspector.java (line 128) GC for ParNew: 341 ms, 561823592 reclaimed leaving 1764208800 used; max is 7774142464
 INFO [FlushWriter:1] 2011-05-12 19:02:16,743 Memtable.java (line 165) Completed flushing /mnt/scratch/jcipar/cassandra/data/Keyspace1/Standard1-f-74-Data.db (53646223 bytes)
 INFO [COMMIT-LOG-WRITER] 2011-05-12 19:02:16,745 CommitLog.java (line 440) Discarding obsolete commit log:CommitLogSegment(/mnt/scratch/jcipar/cassandra/commitlog/CommitLog-1305241306438.log)
 INFO [ScheduledTasks:1] 2011-05-12 19:02:18,256 GCInspector.java (line 128) GC for ParNew: 305 ms, 544491840 reclaimed leaving 865198712 used; max is 7774142464
 INFO [MutationStage:19] 2011-05-12 19:02:19,000 ColumnFamilyStore.java (line 1070) Enqueuing flush of Memtable-Standard1@479849353(51941121 bytes, 1115783 operations)
 INFO [FlushWriter:1] 2011-05-12 19:02:19,000 Memtable.java (line 158) Writing Memtable-Standard1@479849353(51941121 bytes, 1115783 operations)
 INFO [NonPeriodicTasks:1] 2011-05-12 19:02:19,310 SSTable.java (line 147) Deleted /mnt/scratch/jcipar/cassandra/data/Keyspace1/Standard1-f-51
 INFO [NonPeriodicTasks:1] 2011-05-12 19:02:19,324 SSTable.java (line 147) Deleted /mnt/scratch/jcipar/cassandra/data/Keyspace1/Standard1-f-55
 INFO [NonPeriodicTasks:1] 2011-05-12 19:02:19,339 SSTable.java (line 147) Deleted /mnt/scratch/jcipar/cassandra/data/Keyspace1/Standard1-f-58
 INFO [NonPeriodicTasks:1] 2011-05-12 19:02:19,357 SSTable.java (line 147) Deleted /mnt/scratch/jcipar/cassandra/data/Keyspace1/Standard1-f-67
 INFO [NonPeriodicTasks:1] 2011-05-12 19:02:19,377 SSTable.java (line 147) Deleted /mnt/scratch/jcipar/cassandra/data/Keyspace1/Standard1-f-61
 INFO [main] 2011-05-12 19:02:21,026 AbstractCassandraDaemon.java (line 78) Logging initialized
 INFO [main] 2011-05-12 19:02:21,040 AbstractCassandraDaemon.java (line 96) Heap size: 7634681856/7635730432
 INFO [main] 2011-05-12 19:02:21,042 CLibrary.java (line 61) JNA not found. Native methods will be disabled.
 INFO [main] 2011-05-12 19:02:21,052 DatabaseDescriptor.java (line 121) Loading settings from file:/h/jcipar/Projects/HP/OtherDBs/Cassandra/apache-cassandra-0.7.5/conf/cassandra.yaml
 INFO [main] 2011-05-12 19:02:21,178 DatabaseDescriptor.java (line 181) DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
 INFO [main] 2011-05-12 19:02:21,310 SSTableReader.java (line 154) Opening /mnt/scratch/jcipar/cassandra/data/system/Schema-f-1
 INFO [main] 2011-05-12 19:02:21,327 SSTableReader.java (line 154) Opening /mnt/scratch/jcipar/cassandra/data/system/Schema-f-2
 INFO [main] 2011-05-12 19:02:21,336 SSTableReader.java (line 154) Opening /mnt/scratch/jcipar/cassandra/data/system/Migrations-f-1
 INFO [main] 2011-05-12 19:02:21,337 SSTableReader.java (line 154) Opening /mnt/scratch/jcipar/cassandra/data/system/Migrations-f-2
 INFO [main] 2011-05-12 19:02:21,342 SSTableReader.java (line 154) Opening /mnt/scratch/jcipar/cassandra/data/system/LocationInfo-f-2
 INFO [main] 2011-05-12 19:02:21,344 SSTableReader.java (line 154) Opening /mnt/scratch/jcipar/cassandra/data/system/LocationInfo-f-1
 INFO 
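
One way to get more out of the GCInspector entries in a log like the one above is to pull the numbers out and look at heap occupancy over time. The following is a minimal sketch, assuming each log entry has first been joined onto a single line; the regular expression is written against the message text shown in this log, and the function name `parse_gc_line` is made up for illustration.

```python
import re

# Pattern for the GCInspector message text as it appears in the log above.
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<pause_ms>\d+) ms, "
    r"(?P<reclaimed>-?\d+) reclaimed leaving (?P<used>\d+) used; "
    r"max is (?P<max>\d+)"
)

def parse_gc_line(line):
    """Return a dict of GC figures, or None if the line is not a GC entry."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    return {
        "collector": m.group("collector"),
        "pause_ms": int(m.group("pause_ms")),
        "reclaimed": int(m.group("reclaimed")),
        "used": int(m.group("used")),
        "max": int(m.group("max")),
    }

sample = (" INFO [ScheduledTasks:1] 2011-05-12 19:02:07,855 GCInspector.java "
          "(line 128) GC for ParNew: 375 ms, 576641232 reclaimed leaving "
          "5471432144 used; max is 7774142464")

parsed = parse_gc_line(sample)
heap_fraction = parsed["used"] / parsed["max"]
print("%s pause %d ms, heap %.1f%% full"
      % (parsed["collector"], parsed["pause_ms"], 100 * heap_fraction))
```

For the first entry this gives a heap that is roughly 70% full, and the later entries show usage dropping well below that before the restart at 19:02:21.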

Re: Crash when uploading large data sets

2011-05-12 Thread James Cipar
It looks like MAX_HEAP_SIZE is set in cassandra-env.sh to be half of my 
physical memory.  These are 15GB VMs, so that's 7.5GB for Cassandra.  I would 
have expected that to work, but I will override to 13 GB just to see what 
happens.
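
For reference, the arithmetic behind those heap figures can be sketched as follows. The physical memory value is the one reported in the crash dump ("physical 15075756k"); treating "k" as KiB and "13 GB" as 13 GiB are assumptions.

```python
KIB = 1024
GIB = 1024 ** 3

physical_bytes = 15075756 * KIB      # from the crash report, "k" assumed to be KiB
half_bytes = physical_bytes // 2     # the "half of physical memory" default

total_gib = physical_bytes / GIB
half_gib = half_bytes / GIB

# Memory left for the OS, page cache and mmap'd files with a 13 GiB heap.
override_gib = 13
left_over_gib = total_gib - override_gib

print("physical: %.2f GiB" % total_gib)
print("half:     %.2f GiB" % half_gib)
print("left over with a %d GiB heap: %.2f GiB" % (override_gib, left_over_gib))
```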

I've also got the JNA thing set up.  Do you think this would cause the crashes, 
or is it just a performance improvement?



On May 12, 2011, at 7:27 PM, Sameer Farooqui wrote:

 The key JVM options for Cassandra are in cassandra.in.sh.
 
 What is your min and max heap size?
 
 The default setting of max heap size is 1GB. How much RAM do your nodes have? 
 You may want to increase this setting. You can also set the -Xmx and -Xms 
 options to the same value to keep Java from having to manage heap growth. On 
 a 32-bit machine, you can get a max of about 1.6 GB of heap; you can get a 
 lot more on 64-bit.
 
 Try messing with some of the other settings in the cassandra.in.sh file.
 
 You may not have DEBUG mode turned on for Cassandra and therefore may not be 
 getting the full details of what's going on when the server crashes. In the 
 cassandra-home/conf/log4j-server.properties file, set this line from the 
 default of INFO to DEBUG:
 
 log4j.rootLogger=INFO,stdout,R
 
 
 Also, you haven't configured JNA on this server. Here's some info about it 
 and how to configure it:
 
 JNA provides Java programs easy access to native shared libraries without 
 writing anything but Java code.
 
 Note from Cassandra developers for why JNA is needed:
 Linux aggressively swaps out infrequently used memory to make more room for 
 its file system buffer cache. Unfortunately, modern generational garbage 
 collectors like the JVM's leave parts of its heap un-touched for relatively 
 large amounts of time, leading Linux to swap it out. When the JVM finally 
 goes to use or GC that memory, swap hell ensues.
 
 Setting swappiness to zero can mitigate this behavior but does not eliminate 
 it entirely. Turning off swap entirely is effective. But to avoid surprising 
 people who don't know about this behavior, the best solution is to tell Linux 
 not to swap out the JVM, and that is what we do now with mlockall via JNA.
 
 Because of licensing issues, we can't distribute JNA with Cassandra, so you 
 must manually add it to the Cassandra lib/ directory or otherwise place it on 
 the classpath. If the JNA jar is not present, Cassandra will continue as 
 before.
 
 Get JNA with: 
 cd ~
 wget http://debian.riptano.com/debian/pool/libjna-java_3.2.7-0~nmu.2_amd64.deb
 
 To install: 
 techlabs@cassandraN1:~$ sudo dpkg -i libjna-java_3.2.7-0~nmu.2_amd64.deb
 (Reading database ... 44334 files and directories currently installed.)
 Preparing to replace libjna-java 3.2.4-2 (using 
 libjna-java_3.2.7-0~nmu.2_amd64.deb) ...
 Unpacking replacement libjna-java ...
 Setting up libjna-java (3.2.7-0~nmu.2) ...
 
 
 The deb package will install the JNA jar file to /usr/share/java/jna.jar, but 
 Cassandra only loads it if it's on the classpath. The easy way to do this is 
 to create a symlink in your Cassandra lib directory (note: replace 
 /home/techlabs with your home dir location):
 ln -s /usr/share/java/jna.jar /home/techlabs/apache-cassandra-0.7.0/lib
 
 Research:
 http://journal.paul.querna.org/articles/2010/11/11/enabling-jna-in-cassandra/
 
 
 - Sameer
 
 

Re: Crash when uploading large data sets

2011-05-12 Thread James Cipar
Oh, forgot this detail:  I have no swap configured, so swapping is not the 
cause of the crash.  Could it be that I'm running out of memory on a 15GB 
machine?  That seems unlikely.  I grepped dmesg for oom and didn't see 
anything from the oom killer, and I used the instructions from the following 
web page and didn't see that the oom killer had killed anything.

http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer

jcipar@172-19-149-62:~$ sudo cat /var/log/messages | grep --ignore-case killed process
jcipar@172-19-149-62:~$ 



Also, this is pretty subjective, so I can't say for sure until it finishes, but 
this seems to be running *much* slower after setting the heap size and setting 
up JNA.




Re: Consistency model

2011-04-18 Thread James Cipar
That's what I thought was happening, yes.  A careful reading of the 
documentation suggests that this is correct behavior.

Tyler says this can also occur because of a TimedOutException on the writes.  
This worries me because TimedOutExceptions are so frequent (at least on my 
test cluster) that using quorum reads and writes is not sufficient for 
consistency.  Any application that wants consistency needs some external way 
of synchronizing readers and writers so that readers don't read in the middle 
of a write or in the writer's retry loop.

Does anyone have any intuition about whether this will happen with 
consistency_level=ALL?  I will try it today, but I'd like to know what the 
expected behavior is.  It seems like it would not happen in this case.




On Apr 17, 2011, at 3:01 PM, William Oberman wrote:

 James: I feel like I understand what's going on in your code now based on 
 this discussion, and I'm ok with the fact that DURING a QW you can get 
 transitional results from a QR in another process (or either the before or 
 after state of the QW).  But once the QW succeeds, you must get the new 
 value.  That's what we're all saying now, right?  In your read, read, read 
 case, all 3 reads are happening during a QW, and some of them see the 
 before and some of them see the after (that's why I specifically said 
 single threaded, not because it's a single thread per se, but because a 
 single thread can't read during a write by definition).
 
 will
 
 On Sun, Apr 17, 2011 at 1:27 PM, Milind Parikh milindpar...@gmail.com wrote:
 Same process or not: only successful QR reads after successful QW will behave 
 with this guarantee.
 
 /***
 sent from my android...please pardon occasional typos as I respond @ the 
 speed of thought
 /
 
 
 On Apr 17, 2011 10:04 AM, James Cipar jci...@cmu.edu wrote:
 
  For a second, I thought this thread was saying I could see value(s)  new 
  value(s) within the same...
 
 That's exactly what I'm saying.  Within a single process I see this 
 behavior, when reading with consistency_level=QUORUM
 
 Read value 1
 Read value 2
 Read value 1  # uh oh!  we've gone backwards
 
 
 
 
 
 On Apr 17, 2011, at 12:15 PM, William Oberman wrote:
 
  Cool, that is exactly what I was thinkin...
 
 
 
 
 
 -- 
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com



Re: Consistency model

2011-04-16 Thread James Cipar
Here it is.  There is some setup code and global variable definitions that I 
left out of the previous code, but they are pretty similar to the setup code 
here.

import pycassa
import random
import time

consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
duration = 600
sleeptime = 0.0
hostlist = 'worker-hostlist'

def read_servers(fn):
    # The host list file has one server name per line.
    f = open(fn)
    servers = []
    for line in f:
        servers.append(line.strip())
    f.close()
    return servers

servers = read_servers(hostlist)
start_time = time.time()
seqnum = -1
timestamp = 0

while time.time() < start_time + duration:
    # Pick a random server for each read.
    target_server = random.sample(servers, 1)[0]
    target_server = '%s:9160'%target_server

    try:
        pool = pycassa.connect('Keyspace1', [target_server])
        cf = pycassa.ColumnFamily(pool, 'Standard1')
        row = cf.get('foo', read_consistency_level=consistency_level)
        pool.dispose()
    except:
        time.sleep(sleeptime)
        continue

    sq = int(row['seqnum'])
    ts = float(row['timestamp'])

    # Report whenever the sequence number goes backwards.
    if sq < seqnum:
        print('Row changed: %i %f - %i %f'%(seqnum, timestamp, sq, ts))
    seqnum = sq
    timestamp = ts

    if sleeptime > 0.0:
        time.sleep(sleeptime)




On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:

 James,
 
 Would you mind sharing your reader process code as well?
 
 
 
 
 -- 
 Tyler Hobbs
 Software Engineer, DataStax
 Maintainer of the pycassa Cassandra Python client library
 



Consistency model

2011-04-15 Thread James Cipar
I've been experimenting with the consistency model of Cassandra, and I found 
something that seems a bit unexpected.  In my experiment, I have 2 processes, a 
reader and a writer, each accessing a Cassandra cluster with a replication 
factor greater than 1.  In addition, sometimes I generate background traffic to 
simulate a busy cluster by uploading a large data file to another table.

The writer executes a loop where it writes a single row that contains just a 
sequentially increasing sequence number and a timestamp.  In Python this looks 
something like:

while time.time() < start_time + duration:
    target_server = random.sample(servers, 1)[0]
    target_server = '%s:9160'%target_server

    row = {'seqnum':str(seqnum), 'timestamp':str(time.time())}
    seqnum += 1
    # print 'uploading to server %s, %s'%(target_server, row)

    pool = pycassa.connect('Keyspace1', [target_server])
    cf = pycassa.ColumnFamily(pool, 'Standard1')
    cf.insert('foo', row, write_consistency_level=consistency_level)
    pool.dispose()

    if sleeptime > 0.0:
        time.sleep(sleeptime)


The reader simply executes a loop reading this row and reporting whenever a 
sequence number is *less* than the previous sequence number.  As expected, with 
consistency_level=ConsistencyLevel.ONE there are many inconsistencies, 
especially with a high replication factor.
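
The check the reader performs can be isolated from Cassandra entirely. Below is a minimal sketch of that detection logic as a pure function over a list of (seqnum, timestamp) observations; the function name `find_regressions` and the sample values are hypothetical and only illustrate the idea.

```python
def find_regressions(observations):
    """Return (previous, current) pairs where the sequence number went backwards.

    observations is an iterable of (seqnum, timestamp) tuples in the order
    the reader saw them.
    """
    regressions = []
    prev = None
    for current in observations:
        if prev is not None and current[0] < prev[0]:
            regressions.append((prev, current))
        prev = current
    return regressions

# Hypothetical reads: version 1, then version 2, then version 1 again.
reads = [(1, 100.0), (2, 100.5), (1, 100.0)]
found = find_regressions(reads)

for (old_seq, old_ts), (new_seq, new_ts) in found:
    # The second read in each pair is older than the first by this many seconds.
    print("went backwards: %d -> %d, %.1fs older" % (old_seq, new_seq, old_ts - new_ts))
```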

What is unexpected is that I still detect inconsistencies when it is set at 
ConsistencyLevel.QUORUM.  This is unexpected because the documentation seems to 
imply that QUORUM will give consistent results.  With background traffic the 
average difference in timestamps was 0.6s, and the maximum was 3.5s.  This 
means that a client sees a version of the row, and can subsequently see another 
version of the row that is 3.5s older than the previous.

What I imagine is happening is this, but I'd like someone who knows what 
they're talking about to tell me if it's actually the case:

I think Cassandra is not using an atomic commit protocol to commit to the 
quorum of servers chosen when the write is made.  This means that at some point 
in the middle of the write, some subset of the quorum have seen the write, 
while others have not.  At this time, there is a quorum of servers that have 
not seen the update, so depending on which quorum the client reads from, it may 
or may not see the update.

Of course, I understand that the client is not *choosing* a bad quorum to read 
from; it is just the first `q` servers to respond. But in this case the choice 
is effectively random, and sometimes a bad quorum is chosen.
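
The scenario above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Cassandra's actual read path: it assumes a replication factor of 3, a quorum of 2, and that a read returns the newest version held by any replica in the quorum that answers. The function name `quorum_reads` is made up for the example.

```python
from itertools import combinations

def quorum_reads(replica_versions, quorum_size):
    """Map each possible read quorum to the version it would return.

    The newest version among the responding replicas wins.
    """
    results = {}
    for members in combinations(range(len(replica_versions)), quorum_size):
        results[members] = max(replica_versions[i] for i in members)
    return results

N = 3                 # assumed replication factor
QUORUM = N // 2 + 1   # 2 of 3

# Write of version 2 is in flight: only replica 0 has applied it so far.
in_flight = quorum_reads([2, 1, 1], QUORUM)

# Write of version 2 has been applied by a quorum of replicas.
completed = quorum_reads([2, 2, 1], QUORUM)

print("during the write:", in_flight)
print("after the write: ", completed)
```

During the write, the quorum made of replicas 1 and 2 still returns version 1 while the other two quorums return version 2, so successive reads can go backwards. Once two replicas hold version 2, every quorum of size 2 overlaps with them and returns version 2.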

Does anyone have any other insight into what is going on here?