Re: How to invoke getNaturalEndpoints with jconsole?
On 5/13/11 10:08 AM, Maki Watanabe wrote:
> I wrote a small JMX client to invoke getNaturalEndpoints. It works fine in my
> test environment, but it throws an NPE for the keyspace we will use for our
> application (both 0.7.5). Does anyone know of a quick resolution before I set
> up Cassandra in Eclipse to inspect what happens :) Thanks.
>
> Exception in thread "main" javax.management.RuntimeMBeanException: java.lang.NullPointerException
>         at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrow(DefaultMBeanServerInterceptor.java:877)
>         [snip]
>         at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:993)
>         at my.test.getNaturalEndpoints.main(getNaturalEndpoints.java:32)
> Caused by: java.lang.NullPointerException
>         at org.apache.cassandra.db.Table.createReplicationStrategy(Table.java:266)
>         at org.apache.cassandra.db.Table.<init>(Table.java:212)
>         at org.apache.cassandra.db.Table.open(Table.java:106)
>         at org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1497)
>         [snip]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:636)

Did you by chance see this after dropping the keyspace? I believe I've seen this as well. If so (and if I'm interpreting the stack trace and code correctly), the cause may be the server queuing an operation for a keyspace that has been dropped, without checking whether its metadata is null, rather than anything in your code.
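For reference, a minimal JMX client along these lines might look as follows. This is a sketch only: the host, keyspace, and key are placeholders, the getNaturalEndpoints signature shown is the 0.7-era StorageServiceMBean one as I recall it (it may differ in other versions), and 0.7 exposes JMX on port 8080 by default (see cassandra-env.sh).

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class GetNaturalEndpoints {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
            // "[B" is the JMX type name for byte[]
            Object endpoints = mbs.invoke(ss, "getNaturalEndpoints",
                    new Object[] { "Keyspace1", "somekey".getBytes("UTF-8") },
                    new String[] { "java.lang.String", "[B" });
            System.out.println(endpoints);
            jmxc.close();
        }
    }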
Re: Ec2 Stress Results
Hey Adrian -

> Why did you choose four big instances rather than more smaller ones?

Mostly to see the impact of additional CPUs on a write-only load. The portion of the application we're migrating from MySQL is very write intensive. The other 8-core option was c1.xl with 7GB of RAM. I will very likely need more than that once I add reads, as some things can benefit significantly from the row cache. I also thought that m2.4xls would come with 4 disks instead of two.

> For $8/hr you get four m2.4xl with a total of 8 disks. For $8.16/hr you could
> have twelve m1.xl with a total of 48 disks, 3x the disk space, a bit less
> total RAM, and much more CPU. When an instance fails, you have a 25% loss of
> capacity with 4 instances versus an 8% loss of capacity with 12. I don't
> think it makes sense (especially on EC2) to run fewer than 6 instances; we
> are mostly starting at 12-15. We can also spread the instances over three EC2
> availability zones, with RF=3 and one copy of the data in each zone.

Agree on all points. The reason I'm keeping the cluster small for now is to more easily monitor what's going on and find where things break down. Eventually it will be an 8+ node cluster spread across AZs as you mentioned (and likely m2.4xls, as they do seem to provide the most value/$ for this type of system). I'm interested in hearing about your experience(s) and will continue to share mine. Alex.
Re: Ec2 Stress Results
On 5/9/11 9:49 PM, Jonathan Ellis wrote:
> On Mon, May 9, 2011 at 5:58 PM, Alex Araujo wrote:
>>> How many replicas are you writing?
>> Replication factor is 3.
> So you're actually spot on the predicted numbers: you're pushing 20k*3=60k
> "raw" rows/s across your 4 machines. You might get another 10% or so from
> increasing memtable thresholds, but the bottom line is you're right around
> what we'd expect to see. Furthermore, CPU is the primary bottleneck, which is
> what you want to see on a pure write workload.

That makes a lot more sense. I upgraded the cluster to 4 m2.4xlarge instances (68GB of RAM/8 CPU cores) in preparation for application stress tests, and the results were impressive at 200 threads per client:

Server Nodes | Client Nodes | --keep-going | Columns | Client Threads | Total Threads | Rep Factor | Test Rate (writes/s) | Cluster Rate (writes/s)
4            | 3            | N            | 1000    | 200            | 600           | 3          | 44644                | 133931

The issue I'm seeing with app stress tests is that the rate will be comparable/acceptable at first (~100k w/s) and will degrade considerably (~48k w/s) until a flush and restart. CPU usage will correspondingly be high at first (500-700%) and taper down to 50-200%. My data model is pretty standard (<...> is pseudo-type information):

Users
    "UserId<32CharHash>": {
        "email": "a...@b.com",
        "first_name": "John",
        "last_name": "Doe"
    }

UserGroups
    "GroupId": {
        "UserId<32CharHash>": {
            "date_joined": "2011-05-10 13:14.789",
            "date_left": "2011-05-11 13:14.789",
            "active": "0|1"
        }
    }

UserGroupTimeline
    "GroupId": {
        "date_joined": "UserId<32CharHash>"
    }

UserGroupStatus
    "CompositeId('GroupId:UserId<32CharHash>')": {
        "active": "0|1"
    }

Every new User has a row in Users and a ColumnOrSuperColumn in the other 3 CFs (a total of 4 operations). One notable difference is that the RAID0 on this instance type (surprisingly) only contains two ephemeral volumes and appears a bit more saturated in iostat, although not enough to clearly stand out as the bottleneck. Is the bottleneck in this scenario likely memtable flush and/or commitlog rotation settings? RF = 2; ConsistencyLevel = One; -Xmx = 6GB; concurrent_writes: 64; all other settings are the defaults. Thanks, Alex.
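For concreteness, the model above could be rendered in cassandra-cli roughly as follows. This is a sketch under assumptions: the keyspace name, the UTF8 comparators, and RF=2 are inferred from this thread, and the TimeUUID comparator for UserGroupTimeline is inferred from the timeuuid() timeline columns described in the "Atomicity Strategies" thread below.

    create keyspace App
        with replication_factor = 2
        and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';
    use App;

    create column family Users with comparator = UTF8Type;

    create column family UserGroups
        with column_type = 'Super'
        and comparator = UTF8Type
        and subcomparator = UTF8Type;

    create column family UserGroupTimeline with comparator = TimeUUIDType;

    create column family UserGroupStatus with comparator = UTF8Type;

UserGroups maps to a super column family because each group row nests per-user subcolumns; the other three are standard column families.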
Re: Data types for cross language access
On 5/11/11 5:27 AM, Oliver Dungey wrote:
> I am currently working on a system with Cassandra that is written purely in
> Java. I know our end solution will require other languages to access the data
> in Cassandra (Python, C++, etc.). What is the best way to store data to
> ensure I can do this? Should I serialize everything to strings/JSON/XML prior
> to the byte conversion? We currently use the Hector serializer; I wondered if
> we should just switch this to something like Jackson/JAXB? Any thoughts very
> welcome.

I believe most high-level (non-Thrift) clients convert types to/from bytes consistently without additional serialization (XML, JSON, etc.). There may be a few tricks to working with TimeUUIDs for slices, but at least the Java and Python versions appear to be compatible. It's probably worth writing a few tests using your target languages to make sure: http://wiki.apache.org/cassandra/ClientOptions

I don't see a C++ client there, but a quick Google search turned up: http://github.com/posulliv/libcassandra
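To illustrate why raw byte conversion usually round-trips cleanly across languages, here is a minimal sketch of the conventional encodings (my own example, not Hector code): longs as 8-byte big-endian values and strings as raw UTF-8. Python can read these back with struct.unpack('>q', ...) and bytes.decode('utf-8'), and C++ can do the same with ntohll-style byte swaps.

    import java.nio.ByteBuffer;

    public class PortableBytes {
        // Longs are conventionally stored as 8 big-endian bytes
        // (big-endian is ByteBuffer's default byte order).
        static byte[] longToBytes(long v) {
            return ByteBuffer.allocate(8).putLong(v).array();
        }

        // Strings are conventionally stored as raw UTF-8 bytes.
        static byte[] stringToBytes(String s) throws Exception {
            return s.getBytes("UTF-8");
        }

        public static void main(String[] args) throws Exception {
            System.out.println(longToBytes(42L).length);       // 8
            System.out.println(stringToBytes("email").length); // 5
        }
    }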
Re: Ec2 Stress Results
On 5/6/11 9:47 PM, Jonathan Ellis wrote:
> On Fri, May 6, 2011 at 5:13 PM, Alex Araujo wrote:
>> I raised the default MAX_HEAP setting from the AMI to 12GB (~80% of
>> available memory).
> This is going to make GC pauses larger for no good reason.

Good point - I'm only doing writes at the moment. I will revert the change and raise this conservatively once I add reads to the mix.

>> raised concurrent_writes to 300 based on a (perhaps arbitrary?)
>> recommendation in 'Cassandra: The Definitive Guide'
> That's never been a good recommendation.

It seemed to contradict the '8 * number of cores' rule of thumb. I set that back to the default of 32.

>> Based on the above, would I be correct in assuming that frequent memtable
>> flushes and/or commitlog I/O are the likely bottlenecks?
> Did I miss where you said what CPU usage was?

I observed a consistent 200-350% initially; 300-380% once 'hot' for all runs. Here is an average case sample:

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15108 cassandr 20   0 5406m 4.5g  15m S  331 30.4 89:32.50  jsvc

> How many replicas are you writing?

Replication factor is 3.

> Recent testing suggests that putting the commitlog on the raid0 volume is
> better than on the root volume on ec2, since the root isn't really a
> separate device.

I migrated the commitlog to the raid0 volume and retested with the above changes. I/O appeared more consistent in iostat. Here's an average case (%util in the teens):

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          36.84   4.05    13.97     3.04   18.42  23.68

Device:  rrqm/s  wrqm/s  r/s   w/s     rsec/s  wsec/s    avgrq-sz  avgqu-sz  await   svctm  %util
xvdap1   0.00    0.00    0.00  0.00    0.00    0.00      0.00      0.00      0.00    0.00   0.00
xvdb     0.00    0.00    0.00  222.00  0.00    18944.00  85.33     13.80     62.16   0.59   13.00
xvdc     0.00    0.00    0.00  231.00  0.00    19480.00  84.33     5.80      25.11   0.78   18.00
xvdd     0.00    0.00    0.00  228.00  0.00    19456.00  85.33     17.43     76.45   0.57   13.00
xvde     0.00    0.00    0.00  229.00  0.00    19464.00  85.00     10.41     45.46   0.44   10.00
md0      0.00    0.00    0.00  910.00  0.00    77344.00  84.99     0.00      0.00    0.00   0.00

and worst case (%util above 60):

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          44.33   0.00    24.54     0.82   15.46  14.85

Device:  rrqm/s  wrqm/s  r/s   w/s      rsec/s  wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
xvdap1   0.00    1.00    0.00  4.00     0.00    40.00      10.00     0.15      37.50   22.50  9.00
xvdb     0.00    0.00    0.00  427.00   0.00    36440.00   85.34     54.12     147.85  1.69   72.00
xvdc     0.00    0.00    1.00  295.00   8.00    25072.00   84.73     34.56     84.32   2.13   63.00
xvdd     0.00    0.00    0.00  355.00   0.00    30296.00   85.34     94.49     257.61  2.17   77.00
xvde     0.00    0.00    0.00  373.00   0.00    31768.00   85.17     68.50     189.33  1.88   70.00
md0      0.00    0.00    1.00  1418.00  8.00    120824.00  85.15     0.00      0.00    0.00   0.00

Overall, results were roughly the same. The most noticeable difference was no timeouts until the number of client threads reached 350 (previously 200):

Server Nodes | Client Nodes | --keep-going | Columns | Client Threads | Total Threads | Combined Rate (writes/s)
4            | 3            | N            | 1000    | 150            | 450           | 21241
4            | 3            | N            | 1000    | 200            | 600           | 21536
4            | 3            | N            | 1000    | 250            | 750           | 19451
4            | 3            | N            | 1000    | 300            | 900           | 19741

Those results are after I compiled/deployed the latest cassandra-0.7 with the patch for https://issues.apache.org/jira/browse/CASSANDRA-2578. Thoughts?
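For reference, after these changes the relevant cassandra.yaml lines would look roughly like this (a sketch: the commitlog path on the raid0 volume is my assumption; only the data directory path appears earlier in the thread):

    # rule of thumb: 8 * number of cores (4 cores on m1.xlarge)
    concurrent_writes: 32
    # moved from the root volume onto the raid0 (md0) volume; exact path assumed
    commitlog_directory: /mnt/cassandra-logs
    data_file_directories:
        - /mnt/cassandra-data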
Re: Ec2 Stress Results
[snip]
10.210.154.63   Up  Normal  330.19 MB  25.00%  42535295865117307932921825928971026432
10.110.63.247   Up  Normal  361.38 MB  25.00%  85070591730234615865843651857942052864
10.46.143.223   Up  Normal  1.6 GB     25.00%  127605887595351923798765477786913079296

and after runs:

[i-94e8d2fb] alex@cassandra-qa-1:~$ nodetool -h localhost ring
Address         Status  State   Load      Owns    Token
                                                  127605887595351923798765477786913079296
10.240.114.143  Up      Normal  3.9 GB    25.00%  0
10.210.154.63   Up      Normal  2.05 GB   25.00%  42535295865117307932921825928971026432
10.110.63.247   Up      Normal  2.07 GB   25.00%  85070591730234615865843651857942052864
10.46.143.223   Up      Normal  3.33 GB   25.00%  127605887595351923798765477786913079296

Based on the above, would I be correct in assuming that frequent memtable flushes and/or commitlog I/O are the likely bottlenecks? Could %steal be partially contributing to the low throughput numbers as well? If a single XL node can do ~12k writes/s, would it be reasonable to expect ~40k writes/s with the above workload and number of nodes?

Thanks for your help, Alex.

On 4/25/11 11:23 AM, Joaquin Casares wrote:
> Did the images have EBS storage or Instance Store storage? Typically EBS
> volumes aren't the best to be benchmarking against:
> http://www.mail-archive.com/user@cassandra.apache.org/msg11022.html
>
> Joaquin Casares
> DataStax
> Software Engineer/Support
>
> On Wed, Apr 20, 2011 at 5:12 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> A few months ago I was seeing 12k writes/s on a single EC2 XL. So something
>> is wrong. My first suspicion is that your client node may be the bottleneck.
>>
>> On Wed, Apr 20, 2011 at 2:56 PM, Alex Araujo
>> <cassandra-us...@alex.otherinbox.com> wrote:
>>> Does anyone have any Ec2 benchmarks/experiences they can share? I am trying
>>> to get a sense for what to expect from a production cluster on Ec2 so that
>>> I can compare my application's performance against a sane baseline.
>>> [snip]
Ec2 Stress Results
Does anyone have any Ec2 benchmarks/experiences they can share? I am trying to get a sense for what to expect from a production cluster on Ec2 so that I can compare my application's performance against a sane baseline. What I have done so far is:

1. Launched a 4 node cluster of m1.xlarge instances in the same availability zone using PyStratus (https://github.com/digitalreasoning/PyStratus). Each node has the following specs (according to Amazon):
   15 GB memory
   8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
   1,690 GB instance storage
   64-bit platform

2. Changed the default PyStratus directories in order to have commit logs on the root partition and data files on ephemeral storage:
   commitlog_directory: /var/cassandra-logs
   data_file_directories: [/mnt/cassandra-data]

3. Gave each node 10GB of MAX_HEAP and 1GB HEAP_NEWSIZE in conf/cassandra-env.sh.

4. Ran `contrib/stress/bin/stress -d node1,..,node4 -n 1000 -t 100` on a separate m1.large instance:
   total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
   ...
   9832712,7120,7120,0.004948514851485148,842
   9907616,7490,7490,0.0043189949802413755,852
   9978357,7074,7074,0.004560353967289125,863
   1000,2164,2164,0.004065933558194335,867

5. Truncated Keyspace1.Standard1:
   # /usr/local/apache-cassandra/bin/cassandra-cli -host localhost -port 9160
   Connected to: "Test Cluster" on x.x.x.x/9160
   Welcome to cassandra CLI.

   Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.
   [default@unknown] use Keyspace1;
   Authenticated to keyspace: Keyspace1
   [default@Keyspace1] truncate Standard1;
   null

6. Expanded the cluster to 8 nodes using PyStratus and sanity checked using nodetool:
   # /usr/local/apache-cassandra/bin/nodetool -h localhost ring
   Address   Status  State   Load     Owns    Token
   x.x.x.x   Up      Normal  1.3 GB   12.50%  21267647932558653966460912964485513216
   x.x.x.x   Up      Normal  3.06 GB  12.50%  42535295865117307932921825928971026432
   x.x.x.x   Up      Normal  1.16 GB  12.50%  63802943797675961899382738893456539648
   x.x.x.x   Up      Normal  2.43 GB  12.50%  85070591730234615865843651857942052864
   x.x.x.x   Up      Normal  1.22 GB  12.50%  106338239662793269832304564822427566080
   x.x.x.x   Up      Normal  2.74 GB  12.50%  127605887595351923798765477786913079296
   x.x.x.x   Up      Normal  1.22 GB  12.50%  148873535527910577765226390751398592512
   x.x.x.x   Up      Normal  2.57 GB  12.50%  170141183460469231731687303715884105728

7. Ran `contrib/stress/bin/stress -d node1,..,node8 -n 1000 -t 100` on a separate m1.large instance again:
   total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
   ...
   9880360,9649,9649,0.003210443956226165,720
   9942718,6235,6235,0.003206934154398794,731
   9997035,5431,5431,0.0032615939761032457,741
   1000,296,296,0.002660033726812816,742

In a nutshell, 4 nodes inserted at 11,534 writes/sec and 8 nodes inserted at 13,477 writes/sec. Those numbers seem a little low to me, but I don't have anything to compare them to. I'd like to hear others' opinions before I spin my wheels with the number of nodes, threads, memtable, memory, and/or GC settings. Cheers, Alex.
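Incidentally, the evenly spaced tokens above are what you get from dividing the RandomPartitioner's 0..2^127 token space by the node count. A quick sketch of the calculation (my illustration, not a PyStratus utility):

    import java.math.BigInteger;

    public class InitialTokens {
        public static void main(String[] args) {
            int nodes = 8; // size of the ring
            // RandomPartitioner tokens live in [0, 2^127)
            BigInteger range = BigInteger.valueOf(2).pow(127);
            for (int i = 0; i < nodes; i++) {
                // initial_token for node i = i * 2^127 / nodes; for i=1 this prints
                // 21267647932558653966460912964485513216, matching the ring above
                System.out.println(range.multiply(BigInteger.valueOf(i))
                                        .divide(BigInteger.valueOf(nodes)));
            }
        }
    }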
Re: Atomicity Strategies
On 4/8/11 5:46 PM, Drew Kutcharian wrote:
> I'm interested in this too, but I don't think this can be done with Cassandra
> alone. Cassandra doesn't support transactions. I think Hector can retry
> operations, but I'm not sure about the atomicity of the whole thing.
>
> On Apr 8, 2011, at 1:26 PM, Alex Araujo wrote:
>> Hi, I was wondering if there are any patterns/best practices for creating
>> atomic units of work when dealing with several column families and their
>> inverted indices. [snip]

Thanks Drew. I'm familiar with the lack of transactions and have read about people using ZooKeeper (possibly Cages as well?) to accomplish this, but since inverted indices seem commonplace, I'm interested in how anyone is mitigating the lack of atomicity to any extent without the use of such tools. It appears that Hector and Pelops have retrying built into their APIs, and I'm fairly confident that proper use of those capabilities would help. Just trying to cover all bases. Hopefully someone can share their approaches and/or experiences. Cheers, Alex.
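For what it's worth, Hector's retry hook is its FailoverPolicy. A sketch of wiring it up, assuming the 0.7-era Hector API (the exact factory overloads and class names vary by version):

    import me.prettyprint.cassandra.model.QuorumAllConsistencyLevelPolicy;
    import me.prettyprint.cassandra.service.CassandraHostConfigurator;
    import me.prettyprint.cassandra.service.FailoverPolicy;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;

    public class RetryingKeyspace {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("Test Cluster",
                    new CassandraHostConfigurator("node1:9160,node2:9160,node3:9160"));
            // On a transport failure, retry the operation against every other
            // known host before surfacing an exception to the application.
            Keyspace ks = HFactory.createKeyspace("Keyspace1", cluster,
                    new QuorumAllConsistencyLevelPolicy(),
                    FailoverPolicy.ON_FAIL_TRY_ALL_AVAILABLE);
            System.out.println(ks.getKeyspaceName());
        }
    }

Note that this retries transport-level failures only; nothing is rolled back, so a write that already landed on some replicas stays applied.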
Atomicity Strategies
Hi, I was wondering if there are any patterns/best practices for creating atomic units of work when dealing with several column families and their inverted indices. For example, if I have Users and Groups column families and did something like:

Users.insert( user_id, columns )
UserGroupTimeline.insert( group_id, { timeuuid() : user_id } )
UserGroupStatus.insert( group_id + ":" + user_id, { "Active" : "True" } )
UserEvents.insert( timeuuid(), { "user_id" : user_id, "group_id" : group_id, "event_type" : "join" } )

Would I want the client to retry all subsequent operations that failed against other nodes after n succeeded, maintain an "undo" queue of operations to run, batch the mutations and choose a strong consistency level, some combination of these/others, etc.? Thanks, Alex
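As one concrete reading of the "batch the mutations" option, here is a sketch using a Hector Mutator so all four inserts travel in a single batch_mutate call. The names, serializers, and helper calls are illustrative (0.7-era Hector API), and note that a batch only shrinks the failure window; it is not atomic across rows, so a retry or undo strategy is still needed on top.

    import java.util.UUID;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.cassandra.serializers.UUIDSerializer;
    import me.prettyprint.cassandra.utils.TimeUUIDUtils;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class JoinGroup {
        public static void main(String[] args) {
            StringSerializer ss = StringSerializer.get();
            Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");
            Keyspace ks = HFactory.createKeyspace("Keyspace1", cluster);

            String userId = "user123", groupId = "group456";
            UUID now = TimeUUIDUtils.getUniqueTimeUUIDinMillis();

            // Queue all four logical inserts client-side, then send one batch_mutate
            Mutator<String> m = HFactory.createMutator(ks, ss);
            m.addInsertion(userId, "Users",
                    HFactory.createStringColumn("email", "a@b.com"));
            m.addInsertion(groupId, "UserGroupTimeline",
                    HFactory.createColumn(now, userId, UUIDSerializer.get(), ss));
            m.addInsertion(groupId + ":" + userId, "UserGroupStatus",
                    HFactory.createStringColumn("Active", "True"));
            m.addInsertion(now.toString(), "UserEvents",
                    HFactory.createStringColumn("event_type", "join"));
            m.execute();
        }
    }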