Re: How to invoke getNaturalEndpoints with jconsole?

2011-05-13 Thread Alex Araujo

On 5/13/11 10:08 AM, Maki Watanabe wrote:

I wrote a small JMX client to invoke getNaturalEndpoints.
It works fine in my test environment, but throws an NPE for the keyspace we
will use for our application (both 0.7.5).
Does anyone know of a quick resolution for that before I set up
Cassandra in Eclipse to inspect what's happening? :)

thanks

Exception in thread "main" javax.management.RuntimeMBeanException: java.lang.NullPointerException
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrow(DefaultMBeanServerInterceptor.java:877)
        [snip]
        at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:993)
        at my.test.getNaturalEndpoints.main(getNaturalEndpoints.java:32)
Caused by: java.lang.NullPointerException
        at org.apache.cassandra.db.Table.createReplicationStrategy(Table.java:266)
        at org.apache.cassandra.db.Table.<init>(Table.java:212)
        at org.apache.cassandra.db.Table.open(Table.java:106)
        at org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:1497)
        [snip]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:636)
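
(For reference, a minimal JMX client along these lines might look like the sketch below. The MBean name is the standard StorageService one and 8080 was the default JMX port in 0.7.x, but the operation's parameter list, assumed here to be keyspace, column family, and key as Strings, and its return type should be checked against StorageServiceMBean in your 0.7.5 tree; host, keyspace, and column family values are placeholders.)

import java.util.List;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GetNaturalEndpoints {
    public static void main(String[] args) throws Exception {
        // Placeholder host; 8080 was the default JMX port for Cassandra 0.7.x.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName storageService =
                    new ObjectName("org.apache.cassandra.db:type=StorageService");

            // Assumed operation signature (keyspace, columnFamily, key); verify it
            // against StorageServiceMBean for the exact Cassandra version in use.
            Object endpoints = mbs.invoke(
                    storageService,
                    "getNaturalEndpoints",
                    new Object[] { "MyKeyspace", "MyColumnFamily", "somekey" },
                    new String[] { "java.lang.String", "java.lang.String", "java.lang.String" });

            for (Object endpoint : (List<?>) endpoints) {
                System.out.println(endpoint);
            }
        } finally {
            connector.close();
        }
    }
}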

Did you by chance see this after dropping the keyspace?  I believe I've 
seen this as well.  If so (and if I'm interpreting the stack trace and 
code correctly), it might be related to Cassandra queuing an operation 
for a keyspace that has been dropped without checking whether its 
metadata is null, rather than anything in your code.
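
(If that is the cause, one client-side workaround until the server-side NPE is addressed is to confirm the keyspace still exists before invoking the operation. This is only a sketch using the 0.7 Thrift API's describe_keyspace call; host, port, and keyspace name are placeholders.)

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.KsDef;
import org.apache.cassandra.thrift.NotFoundException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class KeyspaceStillExists {
    public static void main(String[] args) throws Exception {
        // 0.7 defaults to the framed transport on port 9160.
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        try {
            // Throws NotFoundException if the keyspace has been dropped.
            KsDef ksDef = client.describe_keyspace("MyKeyspace");
            System.out.println("Keyspace is defined; safe to call getNaturalEndpoints: "
                    + ksDef.getName());
        } catch (NotFoundException nfe) {
            System.out.println("Keyspace no longer exists; skipping getNaturalEndpoints");
        } finally {
            transport.close();
        }
    }
}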


Re: Ec2 Stress Results

2011-05-11 Thread Alex Araujo

Hey Adrian -

Why did you choose four big instances rather than more smaller ones?
Mostly to see the impact of additional CPUs on a write-only load.  The 
portion of the application we're migrating from MySQL is very 
write-intensive.  The other 8-core option was c1.xl with 7GB of RAM.  I 
will very likely need more than that once I add reads, as some things 
can benefit significantly from the row cache.  I also thought that 
m2.4xls would come with 4 disks instead of two.

For $8/hr you get four m2.4xl with a total of 8 disks.
For $8.16/hr you could have twelve m1.xl with a total of 48 disks, 3x
the disk space, a bit less total RAM, and much more CPU.

When an instance fails, you have a 25% loss of capacity with 4 or an
8% loss of capacity with 12.

I don't think it makes sense (especially on EC2) to run fewer than 6
instances; we are mostly starting at 12-15.
We can also spread the instances over three EC2 availability zones,
with RF=3 and one copy of the data in each zone.
Agree on all points.  The reason I'm keeping the cluster small now is to 
more easily monitor what's going on/find where things break down.  
Eventually it will be an 8+ node cluster spread across AZs as you 
mentioned (and likely m2.4xls as they do seem to provide the most 
value/$ for this type of system).


I'm interested in hearing about your experience(s) and will continue to 
share mine.  Alex.


Re: Ec2 Stress Results

2011-05-11 Thread Alex Araujo

On 5/9/11 9:49 PM, Jonathan Ellis wrote:

On Mon, May 9, 2011 at 5:58 PM, Alex Araujo wrote:
How many replicas are you writing?

Replication factor is 3.

So you're actually spot on the predicted numbers: you're pushing
20k*3=60k "raw" rows/s across your 4 machines.

You might get another 10% or so from increasing memtable thresholds,
but the bottom line is you're right around what we'd expect to see.
Furthermore, CPU is the primary bottleneck, which is what you want to
see on a pure write workload.

That makes a lot more sense.  I upgraded the cluster to 4 m2.4xlarge 
instances (68GB of RAM/8 CPU cores) in preparation for application 
stress tests and the results were impressive @ 200 threads per client:


+--------------+--------------+--------------+---------+----------------+---------------+------------+----------------------+-------------------------+
| Server Nodes | Client Nodes | --keep-going | Columns | Client Threads | Total Threads | Rep Factor | Test Rate (writes/s) | Cluster Rate (writes/s) |
+==============+==============+==============+=========+================+===============+============+======================+=========================+
|      4       |      3       |      N       |  1000   |      200       |      600      |     3      |        44644         |         133931          |
+--------------+--------------+--------------+---------+----------------+---------------+------------+----------------------+-------------------------+

The issue I'm seeing with app stress tests is that the rate is 
comparable/acceptable at first (~100k w/s) and then degrades considerably 
(~48k w/s) until a flush and restart.  CPU usage is correspondingly 
high at first (500-700%) and tapers down to 50-200%.  My data model is 
pretty standard (<...> denotes pseudo-type information):


Users
"UserId<32CharHash>" : {
"email": "a...@b.com",
"first_name": "John",
"last_name": "Doe"
}

UserGroups
"GroupId": {
"UserId<32CharHash>": {
"date_joined": "2011-05-10 13:14.789",
"date_left": "2011-05-11 13:14.789",
"active": "0|1"
}
}

UserGroupTimeline
"GroupId": {
"date_joined": "UserId<32CharHash>"
}

UserGroupStatus
"CompositeId('GroupId:UserId<32CharHash>')": {
"active": "0|1"
}

Every new User has a row in Users and a ColumnOrSuperColumn in the other 
3 CFs (a total of 4 operations).  One notable difference is that the RAID0 
on this instance type (surprisingly) only contains two ephemeral volumes 
and appears a bit more saturated in iostat, although not enough to 
clearly stand out as the bottleneck.  Is the bottleneck in this scenario 
likely memtable flush and/or commitlog rotation settings?


RF = 2; ConsistencyLevel = One; -Xmx = 6GB; concurrent_writes: 64; all 
other settings are the defaults.  Thanks, Alex.


Re: Data types for cross language access

2011-05-11 Thread Alex Araujo

On 5/11/11 5:27 AM, Oliver Dungey wrote:
I am currently working on a system with Cassandra that is written 
purely in Java. I know our end solution will require other languages 
to access the data in Cassandra (Python, C++, etc.). What is the best 
way to store data to ensure I can do this? Should I serialize 
everything to strings/JSON/XML prior to the byte conversion? We 
currently use the Hector serializer; I wondered if we should just 
switch this to something like Jackson/JAXB? Any thoughts very welcome.

I believe most high-level (non-Thrift) clients convert types to/from 
bytes consistently without additional serialization (XML, JSON, etc.).  
There may be a few tricks to working with TimeUUIDs for slices, but at 
least the Java and Python versions appear to be compatible. It's 
probably worth writing a few tests using your target languages to make sure:


http://wiki.apache.org/cassandra/ClientOptions

I don't see a C++ client listed there, but a quick Google search turned up:

http://github.com/posulliv/libcassandra
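
(As a quick, self-contained illustration of what consistent byte conversion looks like, the hypothetical helper below encodes values the same way a Python client could read them back with struct.unpack('>q', ...) and bytes.decode('utf-8'); nothing here is Cassandra-specific, and the class and method names are only for the example.)

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.Arrays;

public class CrossLanguageBytes {
    // Longs as 8 big-endian bytes, the same layout Python's struct.pack('>q', v) produces.
    static byte[] longToBytes(long value) {
        return ByteBuffer.allocate(8).putLong(value).array();
    }

    // Strings as UTF-8, matching Python's v.encode('utf-8').
    static byte[] stringToBytes(String value) {
        return value.getBytes(Charset.forName("UTF-8"));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(longToBytes(42L)));       // [0, 0, 0, 0, 0, 0, 0, 42]
        System.out.println(Arrays.toString(stringToBytes("hello"))); // [104, 101, 108, 108, 111]
    }
}

If every client sticks to conventions like these for keys, column names, and values, an extra JSON/XML layer shouldn't be necessary just for interoperability.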


Re: Ec2 Stress Results

2011-05-09 Thread Alex Araujo

On 5/6/11 9:47 PM, Jonathan Ellis wrote:

On Fri, May 6, 2011 at 5:13 PM, Alex Araujo wrote:

I raised the default MAX_HEAP setting from the AMI to 12GB (~80% of
available memory).

This is going to make GC pauses larger for no good reason.

Good point - I'm only doing writes at the moment.  I will revert the change 
and raise it conservatively once I add reads to the mix.



raised
concurrent_writes to 300 based on a (perhaps arbitrary?) recommendation in
'Cassandra: The Definitive Guide'

That's never been a good recommendation.

It did seem to contradict the '8 * number of cores' rule of thumb.  I set 
it back to the default of 32.



Based on the above, would I be correct in assuming that frequent memtable
flushes and/or commitlog I/O are the likely bottlenecks?

Did I miss where you said what CPU usage was?

I observed a consistent 200-350% initially, and 300-380% once 'hot', for all 
runs.  Here is an average case sample:


  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15108 cassandr  20   0 5406m 4.5g  15m S  331 30.4  89:32.50 jsvc


How many replicas are you writing?


Replication factor is 3.


Recent testing suggests that putting the commitlog on the raid0 volume
is better than on the root volume on ec2, since the root isn't really
a separate device.

I migrated the commitlog to the raid0 volume and retested with the above 
changes.  I/O appeared more consistent in iostat.  Here's an average 
case (%util in the teens):


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          36.84    4.05   13.97    3.04   18.42   23.68

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdap1     0.00    0.00   0.00    0.00     0.00      0.00     0.00     0.00    0.00   0.00   0.00
xvdb       0.00    0.00   0.00  222.00     0.00  18944.00    85.33    13.80   62.16   0.59  13.00
xvdc       0.00    0.00   0.00  231.00     0.00  19480.00    84.33     5.80   25.11   0.78  18.00
xvdd       0.00    0.00   0.00  228.00     0.00  19456.00    85.33    17.43   76.45   0.57  13.00
xvde       0.00    0.00   0.00  229.00     0.00  19464.00    85.00    10.41   45.46   0.44  10.00
md0        0.00    0.00   0.00  910.00     0.00  77344.00    84.99     0.00    0.00   0.00   0.00


and worst case (%util above 60):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          44.33    0.00   24.54    0.82   15.46   14.85

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdap1     0.00    1.00   0.00    4.00     0.00     40.00    10.00     0.15   37.50  22.50   9.00
xvdb       0.00    0.00   0.00  427.00     0.00  36440.00    85.34    54.12  147.85   1.69  72.00
xvdc       0.00    0.00   1.00  295.00     8.00  25072.00    84.73    34.56   84.32   2.13  63.00
xvdd       0.00    0.00   0.00  355.00     0.00  30296.00    85.34    94.49  257.61   2.17  77.00
xvde       0.00    0.00   0.00  373.00     0.00  31768.00    85.17    68.50  189.33   1.88  70.00
md0        0.00    0.00   1.00 1418.00     8.00 120824.00    85.15     0.00    0.00   0.00   0.00


Overall, results were roughly the same.  The most noticeable difference 
was no timeouts until the number of client threads reached 350 (previously 200):


+--------------+--------------+--------------+---------+----------------+---------------+--------------------------+
| Server Nodes | Client Nodes | --keep-going | Columns | Client Threads | Total Threads | Combined Rate (writes/s) |
+==============+==============+==============+=========+================+===============+==========================+
|      4       |      3       |      N       |  1000   |      150       |      450      |          21241           |
+--------------+--------------+--------------+---------+----------------+---------------+--------------------------+
|      4       |      3       |      N       |  1000   |      200       |      600      |          21536           |
+--------------+--------------+--------------+---------+----------------+---------------+--------------------------+
|      4       |      3       |      N       |  1000   |      250       |      750      |          19451           |
+--------------+--------------+--------------+---------+----------------+---------------+--------------------------+
|      4       |      3       |      N       |  1000   |      300       |      900      |          19741           |
+--------------+--------------+--------------+---------+----------------+---------------+--------------------------+

Those results are after I compiled/deployed the latest cassandra-0.7 
with the patch for 
https://issues.apache.org/jira/browse/CASSANDRA-2578.  Thoughts?





Re: Ec2 Stress Results

2011-05-06 Thread Alex Araujo
10.210.154.63   Up     Normal  330.19 MB  25.00%  42535295865117307932921825928971026432
10.110.63.247   Up     Normal  361.38 MB  25.00%  85070591730234615865843651857942052864
10.46.143.223   Up     Normal  1.6 GB     25.00%  127605887595351923798765477786913079296


and after runs:

[i-94e8d2fb] alex@cassandra-qa-1:~$ nodetool -h localhost ring
Address         Status State   Load     Owns    Token
                                                 127605887595351923798765477786913079296
10.240.114.143  Up     Normal  3.9 GB   25.00%  0
10.210.154.63   Up     Normal  2.05 GB  25.00%  42535295865117307932921825928971026432
10.110.63.247   Up     Normal  2.07 GB  25.00%  85070591730234615865843651857942052864
10.46.143.223   Up     Normal  3.33 GB  25.00%  127605887595351923798765477786913079296


Based on the above, would I be correct in assuming that frequent 
memtable flushes and/or commitlog I/O are the likely bottlenecks?  Could 
%steal be partially contributing to the low throughput numbers as well?  
If a single XL node can do ~12k writes/s, would it be reasonable to 
expect ~40k writes/s with the above workload and number of nodes?


Thanks for your help, Alex.

On 4/25/11 11:23 AM, Joaquin Casares wrote:

Did the images have EBS storage or Instance Store storage?

Typically EBS volumes aren't the best to be benchmarking against:
http://www.mail-archive.com/user@cassandra.apache.org/msg11022.html

Joaquin Casares
DataStax
Software Engineer/Support



On Wed, Apr 20, 2011 at 5:12 PM, Jonathan Ellis <jbel...@gmail.com> wrote:


A few months ago I was seeing 12k writes/s on a single EC2 XL. So
something is wrong.

My first suspicion is that your client node may be the bottleneck.

On Wed, Apr 20, 2011 at 2:56 PM, Alex Araujo
<cassandra-us...@alex.otherinbox.com> wrote:
> Does anyone have any Ec2 benchmarks/experiences they can share?  I am trying
> to get a sense for what to expect from a production cluster on
Ec2 so that I
> can compare my application's performance against a sane
baseline.  What I
> have done so far is:
>
> 1. Launched a 4 node cluster of m1.xlarge instances in the same
availability
> zone using PyStratus
(https://github.com/digitalreasoning/PyStratus).  Each
> node has the following specs (according to Amazon):
> 15 GB memory
> 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> 1,690 GB instance storage
> 64-bit platform
>
> 2. Changed the default PyStratus directories in order to have
commit logs on
> the root partition and data files on ephemeral storage:
> commitlog_directory: /var/cassandra-logs
> data_file_directories: [/mnt/cassandra-data]
>
> 3. Gave each node 10GB of MAX_HEAP; 1GB HEAP_NEWSIZE in
> conf/cassandra-env.sh
>
> 4. Ran `contrib/stress/bin/stress -d node1,..,node4 -n 1000
-t 100` on a
> separate m1.large instance:
> total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
> ...
> 9832712,7120,7120,0.004948514851485148,842
> 9907616,7490,7490,0.0043189949802413755,852
> 9978357,7074,7074,0.004560353967289125,863
> 1000,2164,2164,0.004065933558194335,867
>
> 5. Truncated Keyspace1.Standard1:
> # /usr/local/apache-cassandra/bin/cassandra-cli -host localhost
-port 9160
> Connected to: "Test Cluster" on x.x.x.x/9160
> Welcome to cassandra CLI.
>
> Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.
> [default@unknown] use Keyspace1;
> Authenticated to keyspace: Keyspace1
> [default@Keyspace1] truncate Standard1;
> null
>
> 6. Expanded the cluster to 8 nodes using PyStratus and sanity
checked using
> nodetool:
> # /usr/local/apache-cassandra/bin/nodetool -h localhost ring
> Address Status State   LoadOwns
> Token
> x.x.x.x  Up Normal  1.3 GB  12.50%
> 21267647932558653966460912964485513216
> x.x.x.x   Up Normal  3.06 GB 12.50%
> 42535295865117307932921825928971026432
> x.x.x.x Up Normal  1.16 GB 12.50%
> 63802943797675961899382738893456539648
> x.x.x.x   Up Normal  2.43 GB 12.50%
> 85070591730234615865843651857942052864
> x.x.x.x   Up Normal  1.22 GB 12.50%
> 106338239662793269832304564822427566080
> x.x.x.xUp Normal  2.74 GB 12.50%
> 127605887595351923798765477786913079296
> x.x.x.xUp Normal  1.22 GB 12.50%
> 148873535527910577765226390751398592512
>

Ec2 Stress Results

2011-04-20 Thread Alex Araujo
Does anyone have any Ec2 benchmarks/experiences they can share?  I am 
trying to get a sense for what to expect from a production cluster on 
Ec2 so that I can compare my application's performance against a sane 
baseline.  What I have done so far is:


1. Launched a 4 node cluster of m1.xlarge instances in the same 
availability zone using PyStratus 
(https://github.com/digitalreasoning/PyStratus).  Each node has the 
following specs (according to Amazon):

15 GB memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
1,690 GB instance storage
64-bit platform

2. Changed the default PyStratus directories in order to have commit 
logs on the root partition and data files on ephemeral storage:

commitlog_directory: /var/cassandra-logs
data_file_directories: [/mnt/cassandra-data]

3. Gave each node 10GB of MAX_HEAP; 1GB HEAP_NEWSIZE in 
conf/cassandra-env.sh


4. Ran `contrib/stress/bin/stress -d node1,..,node4 -n 1000 -t 100` 
on a separate m1.large instance:

total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
...
9832712,7120,7120,0.004948514851485148,842
9907616,7490,7490,0.0043189949802413755,852
9978357,7074,7074,0.004560353967289125,863
1000,2164,2164,0.004065933558194335,867

5. Truncated Keyspace1.Standard1:
# /usr/local/apache-cassandra/bin/cassandra-cli -host localhost -port 9160
Connected to: "Test Cluster" on x.x.x.x/9160
Welcome to cassandra CLI.

Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.
[default@unknown] use Keyspace1;
Authenticated to keyspace: Keyspace1
[default@Keyspace1] truncate Standard1;
null

6. Expanded the cluster to 8 nodes using PyStratus and sanity checked 
using nodetool:

# /usr/local/apache-cassandra/bin/nodetool -h localhost ring
Address Status State   LoadOwnsToken
x.x.x.x  Up Normal  1.3 GB  12.50%  
21267647932558653966460912964485513216
x.x.x.x   Up Normal  3.06 GB 12.50%  
42535295865117307932921825928971026432
x.x.x.x Up Normal  1.16 GB 12.50%  
63802943797675961899382738893456539648
x.x.x.x   Up Normal  2.43 GB 12.50%  
85070591730234615865843651857942052864
x.x.x.x   Up Normal  1.22 GB 12.50%  
106338239662793269832304564822427566080
x.x.x.xUp Normal  2.74 GB 12.50%  
127605887595351923798765477786913079296
x.x.x.xUp Normal  1.22 GB 12.50%  
148873535527910577765226390751398592512
x.x.x.x   Up Normal  2.57 GB 12.50%  
170141183460469231731687303715884105728


7. Ran `contrib/stress/bin/stress -d node1,..,node8 -n 1000 -t 100` 
on a separate m1.large instance again:

total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
...
9880360,9649,9649,0.003210443956226165,720
9942718,6235,6235,0.003206934154398794,731
9997035,5431,5431,0.0032615939761032457,741
1000,296,296,0.002660033726812816,742

In a nutshell, 4 nodes inserted at 11,534 writes/sec and 8 nodes 
inserted at 13,477 writes/sec (roughly the ~10M total rows divided by 
the 867 s and 742 s elapsed times above).


Those numbers seem a little low to me, but I don't have anything to 
compare them to.  I'd like to hear others' opinions before I spin my wheels 
with the number of nodes, threads, memtable, memory, and/or GC 
settings.  Cheers, Alex.


Re: Atomicity Strategies

2011-04-08 Thread Alex Araujo

On 4/8/11 5:46 PM, Drew Kutcharian wrote:

I'm interested in this too, but I don't think this can be done with Cassandra 
alone. Cassandra doesn't support transactions. I think Hector can retry 
operations, but I'm not sure about the atomicity of the whole thing.



On Apr 8, 2011, at 1:26 PM, Alex Araujo wrote:


Hi, I was wondering if there are any patterns/best practices for creating 
atomic units of work when dealing with several column families and their 
inverted indices.

For example, if I have Users and Groups column families and did something like:

Users.insert( user_id, columns )
UserGroupTimeline.insert( group_id, { timeuuid() : user_id } )
UserGroupStatus.insert( group_id + ":" + user_id, { "Active" : "True" } )
UserEvents.insert( timeuuid(), { "user_id" : user_id, "group_id" : group_id, "event_type" 
: "join" } )

Would I want the client to retry all subsequent operations that failed against other 
nodes after n succeeded,  maintain an "undo" queue of operations to run, batch 
the mutations and choose a strong consistency level, some combination of these/others, 
etc?

Thanks,
Alex


Thanks Drew.  I'm familiar with the lack of transactions and have read about 
people using ZK (possibly Cages as well?) to accomplish this, but since 
it seems that inverted indices are commonplace, I'm interested in how 
anyone is mitigating the lack of atomicity to any extent without the use of 
such tools.  It appears that Hector and Pelops have retrying built into 
their APIs, and I'm fairly confident that proper use of those 
capabilities may help.  Just trying to cover all bases.  Hopefully 
someone can share their approaches and/or experiences.  Cheers, Alex.
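
(To make the batching option from the question concrete: with Hector, the four inserts from the example could be queued on a single Mutator and sent as one batch, which gives a single place to catch failures and retry; it is still not a true cross-row transaction. The class and method names below are from my recollection of the Hector 0.7-era API (HFactory, Mutator, StringSerializer) and should be verified against the version you use; cluster, keyspace, and key values are placeholders.)

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class UserJoinsGroup {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster);

        String userId = "userId32CharHash";       // placeholder
        String groupId = "groupId";               // placeholder
        String eventId = "timeuuid-placeholder";  // would be a TimeUUID in practice

        // Queue all four column-family writes on one Mutator and send them together.
        Mutator<String> m = HFactory.createMutator(keyspace, StringSerializer.get());
        m.addInsertion(userId, "Users", HFactory.createStringColumn("email", "a@b.com"));
        m.addInsertion(groupId, "UserGroupTimeline", HFactory.createStringColumn(eventId, userId));
        m.addInsertion(groupId + ":" + userId, "UserGroupStatus",
                HFactory.createStringColumn("Active", "True"));
        m.addInsertion(eventId, "UserEvents", HFactory.createStringColumn("user_id", userId));
        m.addInsertion(eventId, "UserEvents", HFactory.createStringColumn("group_id", groupId));
        m.addInsertion(eventId, "UserEvents", HFactory.createStringColumn("event_type", "join"));

        // One failure point: catch exceptions from execute() and retry, or record an undo, here.
        m.execute();
    }
}

Combined with client-side retries and a strong write consistency level (e.g. QUORUM), an approach like this may cover most partial-failure cases without pulling in ZK/Cages, though truly atomic multi-row updates still aren't possible.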


Atomicity Strategies

2011-04-08 Thread Alex Araujo
Hi, I was wondering if there are any patterns/best practices for 
creating atomic units of work when dealing with several column families 
and their inverted indices.


For example, if I have Users and Groups column families and did 
something like:


Users.insert( user_id, columns )
UserGroupTimeline.insert( group_id, { timeuuid() : user_id } )
UserGroupStatus.insert( group_id + ":" + user_id, { "Active" : "True" } )
UserEvents.insert( timeuuid(), { "user_id" : user_id, "group_id" : 
group_id, "event_type" : "join" } )


Would I want the client to retry all subsequent operations that failed 
against other nodes after n succeeded,  maintain an "undo" queue of 
operations to run, batch the mutations and choose a strong consistency 
level, some combination of these/others, etc?


Thanks,
Alex