high level of MemtablePostFlusher pending events

2012-03-13 Thread David Hawthorne
5 node cluster running 1.0.2, doing about 1300 reads and 1300 writes/sec into 3
column families in the same keyspace.  2 client machines are doing about the same
amount of reads/writes, but one has an average response time in the 4-40ms
range and the other in the 200-800ms range.  Both are running identical homebrew
software using the hector-1.0-3 client.

Traffic was peaking out at 6k reads and 6k writes/sec, according to reporting 
from our software, and now it's topping out at 1300/sec each.  The cpus on the 
cassy boxes are bored.  None of the threads within cassandra are chewing more 
than 3% cpu.  Disk is only 10% full on the most loaded box.

Pool Name             Active   Pending  Completed
MemtablePostFlusher        1       102         36

Not all servers have the same number of pending tasks.  They have 0, 1, 17, 37, 
and 105.

It looks like it's stuck and not recovering, because it's been like this for an
hour.  I've attached the end of the cassandra.log from the server with the most
pending tasks.  There are some interesting exceptions in there.

As always, any help is appreciated!  :p






Re: Replicate On Write behavior

2011-09-09 Thread David Hawthorne
They are evenly distributed.  5 nodes * 40 connections each using hector, and I
can confirm that all 200 were active when this happened (from hector's
perspective, from graphing the hector jmx data).  All 5 nodes saw roughly 40
connections, and all were receiving traffic over those connections (netstat +
ntop + trafshow, etc).  I can also confirm that I changed my insert strategy to
break up the rows using composite row keys, which reduced the row lengths and
gave me an almost perfectly even data distribution among the nodes.  That was
when I started to really dig into why the ROWs were still backing up on one
node specifically, and why 2 nodes weren't seeing any.  It was the 20%, 20%,
60% ROW distribution that really got me thinking, and when I took the 60% node
out of the cluster, that ROW load jumped back to the node with the next-lowest
IP address, and the 2 nodes that weren't seeing any *still* weren't seeing any
ROWs.  At that point I tore down the cluster and recreated it as a 3 node
cluster several times, using various permutations of the 5 nodes available, and
ROW load was *always* on the node with the lowest IP address.

The theory might not be right, but it certainly matches the behavior I saw.


On Sep 9, 2011, at 12:17 AM, Sylvain Lebresne wrote:

 We'll solve #2890 and we should have done it sooner.
 
 That being said, a quick question: how do you do your inserts from the
 clients?  Are you evenly distributing the inserts among the nodes?  Or are
 you always hitting the same coordinator?
 
 Because provided the nodes are correctly distributed on the ring, if you
 distribute the insert (increment) requests across the nodes (again I'm
 talking of client requests), you should not see the behavior you observe.
 
 --
 Sylvain
 
 On Thu, Sep 8, 2011 at 9:48 PM, David Hawthorne dha...@gmx.3crowd.com wrote:
 It was exactly due to 2890, and the fact that the first replica is always 
 the one with the lowest value IP address.  I patched cassandra to pick a 
 random node out of the replica set in StorageProxy.java findSuitableEndpoint:
 
 Random rng = new Random();
 
 return endpoints.get(rng.nextInt(endpoints.size()));  // instead of return endpoints.get(0);
 
 Now work load is evenly balanced among all 5 nodes and I'm getting 2.5x the 
 inserts/sec throughput.
 
 Here's the behavior I saw, and disk work refers to the ReplicateOnWrite 
 load of a counter insert:
 
 One node will get RF/n of the disk work.  Two nodes will always get 0 disk 
 work.
 
 in a 3 node cluster, 1 node gets disk hit really hard.  You get the 
 performance of a one-node cluster.
 in a 6 node cluster, 1 node gets hit with 50% of the disk work, giving you 
 the performance of ~2 node cluster.
 in a 10 node cluster, 1 node gets 30% of the disk work, giving you the 
 performance of a ~3 node cluster.
 
 I confirmed this behavior with a 3, 4, and 5 node cluster size.
 
 
 
 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...
 
 Address     DC          Rack   Status  State   Load       Owns     Token
                                                                     136112946768375385385349842972707284580
 10.0.0.57   datacenter1 rack1  Up      Normal  2.26 GB    20.00%   0
 10.0.0.56   datacenter1 rack1  Up      Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
 10.0.0.55   datacenter1 rack1  Up      Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
 10.0.0.54   datacenter1 rack1  Up      Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
 10.0.0.72   datacenter1 rack1  Up      Normal  383.25 MB  20.00%   136112946768375385385349842972707284580
 
 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.
 
 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
 
 --
 Sylvain
 
 



Re: Disable hector stats

2011-09-09 Thread David Hawthorne
I do it with a log4j properties file:

log4j.appender.null=org.apache.log4j.varia.NullAppender

log4j.category.me.prettyprint.cassandra.hector.TimingLogger=INFO, null
log4j.additivity.me.prettyprint.cassandra.hector.TimingLogger=false
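
If editing log4j.properties isn't convenient, the same effect can usually be had
programmatically at client startup.  A hedged sketch using the standard log4j 1.x
API (the class name here is just illustrative):

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class SilenceHectorTimings {
    public static void main(String[] args) {
        // Same net effect as the properties above: detach hector's TimingLogger
        // from the root appenders and turn its level off so nothing is emitted.
        Logger timing = Logger.getLogger("me.prettyprint.cassandra.hector.TimingLogger");
        timing.setLevel(Level.OFF);
        timing.setAdditivity(false);
    }
}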


On Sep 9, 2011, at 2:07 PM, Daning wrote:

 Hi,
 
 How to disable hector stats? We keep getting this in log
 
 (PeriodicalLog.java:221) INFO  Thread-53040 2011-09-09 13:24:03,290 Statistics from Fri Sep 09 13:23:03 PDT 2011 to Fri Sep 09 13:24:03 PDT 2011
 (PeriodicalLog.java:221) INFO  Thread-53040 2011-09-09 13:24:03,291 Tag   Avg(ms)  Min  Max  Std Dev  95th   Count
 (PeriodicalLog.java:221) INFO  Thread-53040 2011-09-09 13:24:03,291 ...
 
 We have tried to set log4j like this but that does not work,
 
 log4j.logger.com.ecyrd.speed4j.log.PeriodicalLog=ERROR
 
 Thanks,
 
 Daning
 



Re: Replicate On Write behavior

2011-09-08 Thread David Hawthorne
It was exactly due to 2890, and the fact that the first replica is always the 
one with the lowest value IP address.  I patched cassandra to pick a random 
node out of the replica set in StorageProxy.java findSuitableEndpoint:

Random rng = new Random();

return endpoints.get(rng.nextInt(endpoints.size()));  // instead of return endpoints.get(0);
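
For reference, a self-contained sketch of the idea behind that patch; the real
findSuitableEndpoint in 0.8-era StorageProxy.java has a different signature and
extra logic (snitch and local-node preference), so treat this as an illustration
rather than the actual diff:

import java.net.InetAddress;
import java.util.List;
import java.util.Random;

public class RandomReplicaPicker {
    private static final Random RNG = new Random();

    // Stock behavior was effectively endpoints.get(0), which with this ring
    // layout always resolves to the same (lowest-IP) replica.  Picking a random
    // live replica spreads the ReplicateOnWrite work around.
    public static InetAddress pick(List<InetAddress> liveReplicas) {
        return liveReplicas.get(RNG.nextInt(liveReplicas.size()));
    }
}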

Now the workload is evenly balanced among all 5 nodes and I'm getting 2.5x the
inserts/sec throughput.

Here's the behavior I saw, where "disk work" refers to the ReplicateOnWrite load
of a counter insert:

One node will get RF/n of the disk work.  Two nodes will always get 0 disk work.

In a 3 node cluster, one node gets hit really hard on disk, and you get the
performance of a one-node cluster.
In a 6 node cluster, one node gets hit with 50% of the disk work, giving you the
performance of a ~2-node cluster.
In a 10 node cluster, one node gets 30% of the disk work, giving you the
performance of a ~3-node cluster.

I confirmed this behavior with a 3, 4, and 5 node cluster size.


 
 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...
 
 Address     DC          Rack   Status  State   Load       Owns     Token
                                                                     136112946768375385385349842972707284580
 10.0.0.57   datacenter1 rack1  Up      Normal  2.26 GB    20.00%   0
 10.0.0.56   datacenter1 rack1  Up      Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
 10.0.0.55   datacenter1 rack1  Up      Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
 10.0.0.54   datacenter1 rack1  Up      Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
 10.0.0.72   datacenter1 rack1  Up      Normal  383.25 MB  20.00%   136112946768375385385349842972707284580
 
 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.
 
 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
 
 --
 Sylvain



Solved: NoSuchMethodError with google guava/collections starting embedded cassandra service

2011-09-06 Thread David Hawthorne
I ran into this problem today.  It's common enough that it shows up in google, 
but not common enough to have a documented resolution, so here's one.

[junit] com.google.common.collect.ImmutableSet.copyOf(Ljava/util/Collection;)Lcom/google/common/collect/ImmutableSet;
[junit] java.lang.NoSuchMethodError: com.google.common.collect.ImmutableSet.copyOf(Ljava/util/Collection;)Lcom/google/common/collect/ImmutableSet;
[junit] at org.apache.cassandra.db.DataTracker$View.replace(DataTracker.java:479)
[junit] at org.apache.cassandra.db.DataTracker.replace(DataTracker.java:248)
[junit] at org.apache.cassandra.db.DataTracker.addSSTables(DataTracker.java:219)
[junit] at org.apache.cassandra.db.ColumnFamilyStore.init(ColumnFamilyStore.java:294)
[junit] at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:466)
[junit] at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:436)
[junit] at org.apache.cassandra.db.Table.initCf(Table.java:369)
[junit] at org.apache.cassandra.db.Table.init(Table.java:306)
[junit] at org.apache.cassandra.db.Table.open(Table.java:111)
[junit] at org.apache.cassandra.db.SystemTable.checkHealth(SystemTable.java:212)
[junit] at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:149)
[junit] at org.apache.cassandra.service.AbstractCassandraDaemon.init(AbstractCassandraDaemon.java:237)
[junit] at org.apache.cassandra.service.EmbeddedCassandraService.start(EmbeddedCassandraService.java:57)

Resolution:

I was creating a monolithic jar file that takes the contents of all component
jar files and puts them into one big jar, to simplify deployment of the
cassandra client I'm writing.  One of the component jars was also a monolithic
jar and had the contents of google-collections in it.  Guava is a superset of
google-collections, and the google-collections classes were overwriting the
expansion of the guava jar.

Removing the component jar file (which was an option in this case) solved the 
problem.
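
For anyone hitting the same error, one quick way to confirm which archive a class
is really being loaded from at runtime (a generic JVM trick, nothing
Cassandra-specific):

public class WhereIsImmutableSet {
    public static void main(String[] args) {
        // Prints the jar (or directory) that ImmutableSet was loaded from, which
        // makes overlapping/shaded jar contents easy to spot.
        Class<?> c = com.google.common.collect.ImmutableSet.class;
        System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
    }
}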




Re: Replicate On Write behavior

2011-09-02 Thread David Hawthorne
That's interesting.  I did an experiment wherein I added some entropy to the
row name based on the time when the increment came in (e.g. row = row + "/" +
(timestamp - (timestamp % 300))), and now not only is the load (in GB) on my
cluster more balanced, the performance has not decayed and has stayed steady
(inserts/sec) with a relatively low average ms/insert.  Each row is now
significantly shorter as a result of this change.
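
A minimal sketch of that kind of time-bucketed row key, assuming epoch-second
timestamps and 5-minute windows (the key names are made up for the example):

public final class RowKeys {
    private static final long BUCKET_SECONDS = 300; // 5-minute windows

    // Append the start of the current time bucket so a hot logical row is split
    // across many shorter physical rows.
    public static String bucketed(String logicalKey, long epochSeconds) {
        long bucketStart = epochSeconds - (epochSeconds % BUCKET_SECONDS);
        return logicalKey + "/" + bucketStart;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000L;
        System.out.println(bucketed("pageviews/site42", now)); // e.g. pageviews/site42/1315555200
    }
}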



On Sep 2, 2011, at 12:30 AM, Sylvain Lebresne wrote:

 On Thu, Sep 1, 2011 at 8:52 PM, David Hawthorne dha...@gmx.3crowd.com wrote:
 I'm curious... digging through the source, it looks like replicate on write 
 triggers a read of the entire row, and not just the columns/supercolumns 
 that are affected by the counter update.  Is this the case?  It would 
 certainly explain why my inserts/sec decay over time and why the average 
 insert latency increases over time.  The strange thing is that I'm not 
 seeing disk read IO increase over that same period, but that might be due to 
 the OS buffer cache...
 
 It does not. It only reads the columns/supercolumns affected by the
 counter update.
 In the source, this happens in CounterMutation.java. If you look at
 addReadCommandFromColumnFamily you'll see that it does a query by name
 only for the column involved in the update (the update is basically
 the content of the columnFamily parameter there).
 
 And Cassandra does *not* always read a full row. Never has, never will.
 
 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...
 
 Address     DC          Rack   Status  State   Load       Owns     Token
                                                                     136112946768375385385349842972707284580
 10.0.0.57   datacenter1 rack1  Up      Normal  2.26 GB    20.00%   0
 10.0.0.56   datacenter1 rack1  Up      Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
 10.0.0.55   datacenter1 rack1  Up      Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
 10.0.0.54   datacenter1 rack1  Up      Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
 10.0.0.72   datacenter1 rack1  Up      Normal  383.25 MB  20.00%   136112946768375385385349842972707284580
 
 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.
 
 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
 
 --
 Sylvain



JMX TotalReadLatencyMicros sanity check

2011-09-02 Thread David Hawthorne
I've graphed the rate of change of the TotalReadLatencyMicros counter over the 
last 12 hours, and divided by 1,000,000 to get it in seconds.  I'm grabbing it 
every 10 seconds, so I divided by another 10 to get per-second rates.

The result is that I have a CF doing 10 seconds of read *every second*.

Does that make sense?

If I divide it by the number of reads done, it matches up with the latency I'm 
seeing from cfstats:  1.5ms/read.
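
For what it's worth, the arithmetic looks right.  A worked example with invented
numbers of the same shape as those JMX samples:

public class ReadLatencyRate {
    public static void main(String[] args) {
        // TotalReadLatencyMicros polled 10 seconds apart, plus the Read Count
        // delta over the same interval (values are made up for illustration).
        long prevTotalMicros = 3400000000L;
        long curTotalMicros  = 3500000000L; // delta = 100,000,000 us
        int  intervalSeconds = 10;
        long readsInInterval = 66667;

        double readSecondsPerSecond =
            (curTotalMicros - prevTotalMicros) / (double) intervalSeconds / 1000000.0; // = 10.0
        double msPerRead =
            (curTotalMicros - prevTotalMicros) / (double) readsInInterval / 1000.0;    // ~= 1.5

        System.out.printf("%.1f s of read time per second, %.2f ms/read%n",
                          readSecondsPerSecond, msPerRead);
    }
}

Read that way, 10 seconds of accumulated read latency per wall-clock second works
out to roughly ten reads in flight on average (about 6,700 reads/sec at 1.5 ms
each), which seems plausible rather than alarming.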

Re: Replicate On Write behavior

2011-09-02 Thread David Hawthorne
Does it always pick the node with the lowest IP address?  All of my hosts are 
in the same /24.  The fourth node in the 5 node cluster has the lowest value in 
the 4th octet (54).  I erased the cluster and rebuilt it from scratch as a 3 
node cluster using the first 3 nodes, and now the ReplicateOnWrites are all 
going to the third node, which is also the lowest valued IP address (55).

That would explain why only 1 node gets writes in a 3 node cluster (RF=3) and 
why 3 nodes get writes in a 5 node cluster, and why one of those 3 is taking 
66% of the writes.


 
 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...
 
 Address     DC          Rack   Status  State   Load       Owns     Token
                                                                     136112946768375385385349842972707284580
 10.0.0.57   datacenter1 rack1  Up      Normal  2.26 GB    20.00%   0
 10.0.0.56   datacenter1 rack1  Up      Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
 10.0.0.55   datacenter1 rack1  Up      Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
 10.0.0.54   datacenter1 rack1  Up      Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
 10.0.0.72   datacenter1 rack1  Up      Normal  383.25 MB  20.00%   136112946768375385385349842972707284580
 
 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.
 
 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
 
 --
 Sylvain



Replicate On Write behavior

2011-09-01 Thread David Hawthorne
I'm curious... digging through the source, it looks like replicate on write 
triggers a read of the entire row, and not just the columns/supercolumns that 
are affected by the counter update.  Is this the case?  It would certainly 
explain why my inserts/sec decay over time and why the average insert latency 
increases over time.  The strange thing is that I'm not seeing disk read IO 
increase over that same period, but that might be due to the OS buffer cache...

On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that normal?  
I'm using RandomPartitioner...

Address     DC          Rack   Status  State   Load       Owns     Token
                                                                    136112946768375385385349842972707284580
10.0.0.57   datacenter1 rack1  Up      Normal  2.26 GB    20.00%   0
10.0.0.56   datacenter1 rack1  Up      Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
10.0.0.55   datacenter1 rack1  Up      Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
10.0.0.54   datacenter1 rack1  Up      Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
10.0.0.72   datacenter1 rack1  Up      Normal  383.25 MB  20.00%   136112946768375385385349842972707284580

The nodes with ReplicateOnWrites are the 3 in the middle.  The first node and
last node both have a count of 0.  This is a clean cluster, and I've been doing
3k down to 2.5k (decaying performance) inserts/sec for the last 12 hours.  The
last time this test ran, it went all the way down to 500 inserts/sec before I
killed it.

Re: nodetool tpstats feature request

2011-08-31 Thread David Hawthorne
watch doesn't calculate diffs

On Aug 31, 2011, at 2:29 PM, Peter Sanford wrote:

 I use `watch` to do this:
 
 watch -n 5 nodetool -h localhost tpstats
 
 -psanford
 
 On Wed, Aug 31, 2011 at 1:59 PM, David Hawthorne dha...@gmx.3crowd.com 
 wrote:
 It would be very useful to be able to get refreshing statistics from 
 tpstats, a la top.
 
 nodetool -h localhost tpstats [n]
 
 Refresh every n seconds: show me the Active, Pending, and Blocked columns as 
 they currently exist, but under Completed show me a per-second rate based on 
 the delta from the previous counter and the number of seconds (n) specified 
 on the command line.
 
 nodetool -h localhost tpstats 1 to update every second
 nodetool -h localhost tpstats 5 to update every 5 seconds, etc
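
A rough sketch of how that could be approximated today with an external wrapper,
rather than a real nodetool option; it assumes nodetool is on the PATH and that
pool names contain no spaces, and the parsing is deliberately simplistic:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

public class TpstatsRates {
    public static void main(String[] args) throws Exception {
        int interval = args.length > 0 ? Integer.parseInt(args[0]) : 5;
        Map<String, Long> prev = new HashMap<String, Long>();
        while (true) {
            Map<String, Long> completed = sampleCompleted();
            for (Map.Entry<String, Long> e : completed.entrySet()) {
                Long before = prev.get(e.getKey());
                if (before != null) {
                    // Per-second rate for the Completed column, from the delta.
                    System.out.printf("%-25s %d/sec%n", e.getKey(),
                                      (e.getValue() - before) / interval);
                }
            }
            System.out.println();
            prev = completed;
            Thread.sleep(interval * 1000L);
        }
    }

    // Parse "PoolName Active Pending Completed" rows out of the tpstats output.
    private static Map<String, Long> sampleCompleted() throws Exception {
        Process p = Runtime.getRuntime().exec(
                new String[] {"nodetool", "-h", "localhost", "tpstats"});
        Map<String, Long> counts = new HashMap<String, Long>();
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = r.readLine()) != null) {
            String[] f = line.trim().split("\\s+");
            if (f.length == 4 && f[1].matches("\\d+") && f[3].matches("\\d+")) {
                counts.put(f[0], Long.parseLong(f[3]));
            }
        }
        r.close();
        p.waitFor();
        return counts;
    }
}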



randompartitioner cluster unbalanced

2011-08-31 Thread David Hawthorne
$ ./nodetool -h localhost ring
Address     DC          Rack   Status  State   Load      Owns     Token
                                                                   136112946768375385385349842972707284580
10.0.0.57   datacenter1 rack1  Up      Normal  8.31 GB   20.00%   0
10.0.0.56   datacenter1 rack1  Up      Normal  13.7 GB   20.00%   34028236692093846346337460743176821145
10.0.0.55   datacenter1 rack1  Up      Normal  13.87 GB  20.00%   68056473384187692692674921486353642290
10.0.0.54   datacenter1 rack1  Up      Normal  8.03 GB   20.00%   102084710076281539039012382229530463435
10.0.0.72   datacenter1 rack1  Up      Normal  1.77 GB   20.00%   136112946768375385385349842972707284580


This is a brand new cluster we just brought up and started loading data into a 
few days ago.  It's using the RandomPartitioner, RF=3 on everything, and we're 
doing QUORUM writes.  All keyspaces and CFs are for counter super columns.  All 
keys are moderately sized ascii strings with good variation between them.  All 
supercolumn names are longs.  All column names are ascii strings.  No 
decrements are done, no rows or columns are deleted, and read load is almost 
nonexistent.  Column values may get overwritten on account of being incremented 
because they are counters.  This is expected to happen quite a bit.  Not all 
rows are the same length.

Insert latency from my hector client box to the cluster averages 70ms - 200ms,
which is really high.  Inserts/sec from hector's perspective peaks out at
750/sec, and consistently drops down to (and stays at) 120/sec.  This is not
due to compactions, based on the output of nodetool compactionstats.

I wiped the cluster this afternoon, started from scratch, and I'm seeing the 
same distribution on a smaller scale, with the same latencies.  Inserts

Going by statistics from cassandra via jmx, it looks like all hosts are getting
about the same number of MutationStage Completed Tasks/sec.  However, I see one
host that consistently has Pending MutationStage and ReplicateOnWriteStage tasks
(ranging from 50/30 to 211/42 respectively throughout the day).  Now, I know
that ReplicateOnWrite can go really slow if you have large SuperColumns, but I
do not.  I'm working on proving that at the moment, pending a couple of code
pushes.  This same box typically runs CPU up around 600-700%, and it's all user
space cpu, not IO wait.  We monitor these boxes like crazy, and we've tweaked
them a bit to try to rule things out (enabled mmap'd io, disabled swap, mounted
ext4 with noatime), none of which has made a single bit of difference.  If I
kill cassandra on that one box, the load moves to the box before it in the
ring, ruling out bad hardware on this one box.  Mutations and ROWs back up, and
cpu jumps to 600%.  Heap memory usage sits at 600MB-2GB and heap size is 4G on
all 5 boxes.

CPU usage and Mutations/ROWs are not affected by hector client connections;  if 
I remove this single host from the hector configuration and confirm that there 
are 0 connections from my client to this one box, I still see high Mutations 
and ROWs and CPU usage.  If I increase the number of client connections in the 
hector pool, performance does not change.

concurrent_writes is set to 48, concurrent_reads to 32, and each box has 8
cores.  The memtable flush size is 28MB and the ops-based flush threshold is
131k.  Our memtables flush every 3 minutes (based on graphs, and this aligns
exactly with 131k / (Mutations/sec each box is doing)).  Commitlog and data are
on the same disk, but our disks seem bored.

The key cache is enabled and I see an almost perfect 100% hit rate.  The row
cache is disabled.

My questions are:

Is it normal to see load unevenly spread out when using RandomPartitioner?
How do I fix it?  Do I need to assign token ranges manually even with
RandomPartitioner?
Is there a way to see the total row counts assigned to each box?
Why is this one host running 600% cpu while the rest are sitting at 0%?


For reference, here's cfstats taken from the host with the high cpu usage.

Keyspace: STATS_TEST
Read Count: 18744838
Read Latency: 2.568355930309987 ms.
Write Count: 18744845
Write Latency: 0.020453476835898085 ms.
Pending Tasks: 0
Column Family: rollup1h
SSTable count: 4
Space used (live): 194724367
Space used (total): 260574143
Number of Keys (estimate): 11904
Memtable Columns Count: 34708
Memtable Data Size: 27280700
Memtable Switch Count: 67
Read Count: 9255646
Read Latency: 2.498 ms.
Write Count: 9255658
Write Latency: 0.021 ms.
Pending Tasks: 0
   

ReplicateOnWrite issues

2011-07-12 Thread David Hawthorne
This is now my fourth attempt to get the message through.  Apologies if you see 
multiple copies.

I've tried to give as much relevant data as I can think of, but please let me 
know if you need any other info.  I spent the day getting jmxtrans to talk to 
statsd with the cassandra JMX data, so I can provide graphs of just about 
anything you want.  I can also re-run the tests to replicate the problem if 
need be.  As always, thanks for taking a look!

I have a 4 host test cluster running 0.8.1 that I'm writing counters into, and
I'm writing to it with hector at CL.ONE or CL.QUORUM (the test was run with each
and both gave the same results).  Partitioner is the RandomPartitioner, and
RF=3.  nodetool ring shows that actual data is well balanced:

Address     DC          Rack   Status  State   Load     Owns     Token
                                                                  127605887595351923798765477786913079296
10.0.0.57   datacenter1 rack1  Up      Normal  1.59 GB  25.00%   0
10.0.0.56   datacenter1 rack1  Up      Normal  1.39 GB  25.00%   42535295865117307932921825928971026432
10.0.0.55   datacenter1 rack1  Up      Normal  1.24 GB  25.00%   85070591730234615865843651857942052864
10.0.0.54   datacenter1 rack1  Up      Normal  1.34 GB  25.00%   127605887595351923798765477786913079296

I also tested against a 1-box test cluster with RF=1.

Both show the same results:  high performance for a while, and then 
ReplicateOnWrite backs way up (I've seen it as high as a million), the 
cassandra process becomes unresponsive, and the hector client starts throwing 
exceptions.  High performance in this case equates to about 5-20k inserts/sec 
on both the 4-box and 1-box cluster (I'm inserting one long row at a time, so 
the performance is bounded by the one box owning the row).  The cassandra 
process never seems to recover, even if left for 24 hours.  It's still chewing 
through those ReplicateOnWrite pending tasks.

GC doesn't seem to be a factor.  Logs show GCs completing in < 1s on all 4
boxes in the test cluster, as well as on the 1 standalone server.

I'm also seeing highly erratic performance from the cluster in general:  
inserts/sec usually start at around 3k, but then maybe creep up towards 6k, 
then drop back down to below 1k for an extended period of time, then maybe 1% 
of the time they'll spike up to 8k or even 15k, but that doesn't last more than 
30 seconds.

Here's the tpstats output from all 4 boxes:

Slowest/most problematic/highest cpu usage one:

This tpstats output is also representative of the one box cluster about 20 
minutes into my insert run.

4 $ ./nodetool -h localhost tpstats
Pool Name                    Active   Pending   Completed
ReadStage                         0         0           0
RequestResponseStage              0         0    59095840
MutationStage                     0         0    31182942
ReadRepairStage                   0         0           0
ReplicateOnWriteStage            32    928042    28246716
GossipStage                       1         1      449464
AntiEntropyStage                  0         0           0
MigrationStage                    0         0           4
MemtablePostFlusher               0         0         445
StreamStage                       0         0           0
FlushWriter                       0         0         445
FILEUTILS-DELETE-POOL             0         0          38
MiscStage                         0         0           0
FlushSorter                       0         0           0
InternalResponseStage             0         0           4
HintedHandoff                     0         0           0

Second highest cpu usage one:

3 $ ./nodetool -h localhost tpstats
Pool Name                    Active   Pending   Completed
ReadStage                         0         0           0
RequestResponseStage              0         0    27413910
MutationStage                     0         0    27523094
ReadRepairStage                   0         0           0
ReplicateOnWriteStage             0         0        1971
GossipStage                       0         0      335422
AntiEntropyStage                  0         0           0
MigrationStage                    0         0           4
MemtablePostFlusher               0         0          66
StreamStage                       0         0           0
FlushWriter                       0         0          66
FILEUTILS-DELETE-POOL             0         0          31
MiscStage                         0         0           0
FlushSorter                       0         0           0
InternalResponseStage             0         0           4
HintedHandoff                     0         0           0

The other 2 that see 

Re: ReplicateOnWrite issues

2011-07-12 Thread David Hawthorne
It's definitely for counters, and some of the rows I'm inserting are long-ish, 
if 1.3MB is long.

Maybe it would help if I said I was using counter super columns.  I'm also 
writing to only a handful of rows at a time, until they are full.  It looks 
like the counter super column code in addReadCommandFromColumnFamily, in the 
ReplicateOnWrite handling, might not be optimized for this kind of load.

It looks like I may be able to turn off replicate_on_write for the CFs as long 
as I use CL.ONE to get around this in the short term, but I'd like to know what 
happens when replicate on write is true and when it is false so I can make a 
more informed choice.

On Jul 12, 2011, at 12:20 AM, Yang wrote:

 interesting,
 
 first just to make sure: since replicateOnWrite is for Counters, you
 are using counters (you use the word "insert" instead of
 "add"/"increment"), right?
 
 if you are using counters, supposedly the leader runs
 replicateOnWrite, somehow all your adds find the one box as leader,
 that's probably something worth investigating.
 
 finally, u could try debugging into those replicateOnWrite tasks and
 see what they are doing exactly, maybe there is something immediately
 wrong
 
 
 
 On Mon, Jul 11, 2011 at 11:57 PM, David Hawthorne dha...@gmx.3crowd.com 
 wrote:
 This is now my fourth attempt to get the message through.  Apologies if you 
 see multiple copies.
 
 I've tried to give as much relevant data as I can think of, but please let 
 me know if you need any other info.  I spent the day getting jmxtrans to 
 talk to statsd with the cassandra JMX data, so I can provide graphs of just 
 about anything you want.  I can also re-run the tests to replicate the 
 problem if need be.  As always, thanks for taking a look!
 
 I have a 4 host test cluster that I'm writing counters into running 0.8.1, 
 and I'm writing to it with hector with CL.ONE or CL.QUORUM (test was run 
 with each and both gave the same results).  Partitioner is the 
 RandomPartitioner, and RF=3.  nodetool ring shows that actual data is well 
 balanced:
 
 Address     DC          Rack   Status  State   Load     Owns     Token
                                                                   127605887595351923798765477786913079296
 10.0.0.57   datacenter1 rack1  Up      Normal  1.59 GB  25.00%   0
 10.0.0.56   datacenter1 rack1  Up      Normal  1.39 GB  25.00%   42535295865117307932921825928971026432
 10.0.0.55   datacenter1 rack1  Up      Normal  1.24 GB  25.00%   85070591730234615865843651857942052864
 10.0.0.54   datacenter1 rack1  Up      Normal  1.34 GB  25.00%   127605887595351923798765477786913079296
 
 I also tested against a 1-box test cluster with RF=1.
 
 Both show the same results:  high performance for a while, and then 
 ReplicateOnWrite backs way up (I've seen it as high as a million), the 
 cassandra process becomes unresponsive, and the hector client starts 
 throwing exceptions.  High performance in this case equates to about 5-20k 
 inserts/sec on both the 4-box and 1-box cluster (I'm inserting one long row 
 at a time, so the performance is bounded by the one box owning the row).  
 The cassandra process never seems to recover, even if left for 24 hours.  
 It's still chewing through those ReplicateOnWrite pending tasks.
 
 GC doesn't seem to be a factor.  Logs show GCs completing in  1s on all 4 
 boxes in the test cluster, as well as on the 1 standalone server.
 
 I'm also seeing highly erratic performance from the cluster in general:  
 inserts/sec usually start at around 3k, but then maybe creep up towards 6k, 
 then drop back down to below 1k for an extended period of time, then maybe 
 1% of the time they'll spike up to 8k or even 15k, but that doesn't last 
 more than 30 seconds.
 
 Here's the tpstats output from all 4 boxes:
 
 Slowest/most problematic/highest cpu usage one:
 
 This tpstats output is also representative of the one box cluster about 20 
 minutes into my insert run.
 
 4 $ ./nodetool -h localhost tpstats
 Pool NameActive   Pending  Completed
 ReadStage 0 0  0
 RequestResponseStage  0 0   59095840
 MutationStage 0 0   31182942
 ReadRepairStage   0 0  0
 ReplicateOnWriteStage32928042   28246716
 GossipStage   1 1 449464
 AntiEntropyStage  0 0  0
 MigrationStage0 0  4
 MemtablePostFlusher   0 0445
 StreamStage   0 0  0
 FlushWriter   0 0445
 FILEUTILS-DELETE-POOL 0 0 38
 MiscStage 0 0  0
 FlushSorter   0 0  0

Re: ReplicateOnWrite issues

2011-07-12 Thread David Hawthorne
Well, I was using a large number of clients:  I tried configuring a hector pool
of 20-200 to see what effect that had on throughput.  There's definitely a
point after which there's no gain, so I dialed it back down.  To clarify a few
other things, when I say "inserts" I mean increments, as this test only involves
counters.  When I say a row is "full", I mean I've written all of the data into
it that belongs in it from our application's perspective.  Not all rows are the
same length, and the average number of subcolumns per supercolumn is higher for
some rows than for others.

I'm curious, is it necessary for a read to happen for replication when using 
RF=1?

I am currently running a test against a single node cluster, RF=1, CL.ONE, 4 
connections/concurrent increments, and graphing various data points.  If 
something interesting happens, I'll send a graph.

When you say "stagger writes", do you mean just lower the overall number of
increments/sec my client is attempting, or do you mean I should stagger the
load amongst many CFs/rows at once?  Right now I'm writing heavily into a small
set of rows, each going into 3 CFs, before moving on to the next small set of
rows.


On Jul 12, 2011, at 12:26 PM, Sylvain Lebresne wrote:

 When you do a counter increment at CL.ONE, the write is acknowledged as
 soon as the first replica getting the write has pushed the
 increment into its memtable. However, there is a read happening for
 the replication to the other replicas (this is necessary to the
 counter design). What is happening is that our nodes are not able to
 do those reads fast enough. Or, another way to see it is that you are
 writing faster than your cluster can handle.
 
 I've opened https://issues.apache.org/jira/browse/CASSANDRA-2889 to do
 something about that, but for now you'll have to stagger your writes.
 I'm more surprised that tasks on the replicate on write stage would
 stack up at QUORUM, unless you're using a very large number of
 clients.
 
 --
 Sylvain
 
 On Tue, Jul 12, 2011 at 8:57 AM, David Hawthorne dha...@gmx.3crowd.com 
 wrote:
 This is now my fourth attempt to get the message through.  Apologies if you 
 see multiple copies.
 
 I've tried to give as much relevant data as I can think of, but please let 
 me know if you need any other info.  I spent the day getting jmxtrans to 
 talk to statsd with the cassandra JMX data, so I can provide graphs of just 
 about anything you want.  I can also re-run the tests to replicate the 
 problem if need be.  As always, thanks for taking a look!
 
 I have a 4 host test cluster that I'm writing counters into running 0.8.1, 
 and I'm writing to it with hector with CL.ONE or CL.QUORUM (test was run 
 with each and both gave the same results).  Partitioner is the 
 RandomPartitioner, and RF=3.  nodetool ring shows that actual data is well 
 balanced:
 
 Address     DC          Rack   Status  State   Load     Owns     Token
                                                                   127605887595351923798765477786913079296
 10.0.0.57   datacenter1 rack1  Up      Normal  1.59 GB  25.00%   0
 10.0.0.56   datacenter1 rack1  Up      Normal  1.39 GB  25.00%   42535295865117307932921825928971026432
 10.0.0.55   datacenter1 rack1  Up      Normal  1.24 GB  25.00%   85070591730234615865843651857942052864
 10.0.0.54   datacenter1 rack1  Up      Normal  1.34 GB  25.00%   127605887595351923798765477786913079296
 
 I also tested against a 1-box test cluster with RF=1.
 
 Both show the same results:  high performance for a while, and then 
 ReplicateOnWrite backs way up (I've seen it as high as a million), the 
 cassandra process becomes unresponsive, and the hector client starts 
 throwing exceptions.  High performance in this case equates to about 5-20k 
 inserts/sec on both the 4-box and 1-box cluster (I'm inserting one long row 
 at a time, so the performance is bounded by the one box owning the row).  
 The cassandra process never seems to recover, even if left for 24 hours.  
 It's still chewing through those ReplicateOnWrite pending tasks.
 
 GC doesn't seem to be a factor.  Logs show GCs completing in  1s on all 4 
 boxes in the test cluster, as well as on the 1 standalone server.
 
 I'm also seeing highly erratic performance from the cluster in general:  
 inserts/sec usually start at around 3k, but then maybe creep up towards 6k, 
 then drop back down to below 1k for an extended period of time, then maybe 
 1% of the time they'll spike up to 8k or even 15k, but that doesn't last 
 more than 30 seconds.
 
 Here's the tpstats output from all 4 boxes:
 
 Slowest/most problematic/highest cpu usage one:
 
 This tpstats output is also representative of the one box cluster about 20 
 minutes into my insert run.
 
 4 $ ./nodetool -h localhost tpstats
 Pool NameActive   Pending  Completed
 ReadStage 0 0  0

Re: ReplicateOnWrite issues

2011-07-12 Thread David Hawthorne

On Jul 12, 2011, at 3:02 PM, Sylvain Lebresne wrote:

 On Tue, Jul 12, 2011 at 11:42 PM, David Hawthorne dha...@gmx.3crowd.com 
 wrote:
 Well, I was using a large number of clients:  I tried configuring a hector 
 pool of 20-200 to see what affect that had on throughput. There's definitely 
 a point after which there's no gain, so I dialed it back down.  To clarify a 
 few other things, when I say inserts I mean increments, as this test only 
 involves counters.  When I say a row is full, I mean I've written all of the 
 data into it that belongs in it from our application's perspective.  Not all 
 rows are the same length, and the average number of subcolumns per 
 supercolumn is higher for some rows than for others.
 
 
 Oh yes, you're using supercolumns. That could help magnify the problem
 a little bit, given that to read any subcolumn of a supercolumn, we
 have to read the whole supercolumn. It depends how many counters per
 supercolumn we're talking about.

Does this imply that incrementing 50 counters in one supercolumn in one row in 
one CF requires 50! reads of the supercolumn?  One read of the entire 
supercolumn for each insert? 


 
 I'm curious, is it necessary for a read to happen for replication when using 
 RF=1?
 
 No, it's not necessary with RF=1 ... but we still do it :(. Ok, it's
 basically a bug, we'll
 fix that for 0.8.2.
 
 I am currently running a test against a single node cluster, RF=1, CL.ONE, 4 
 connections/concurrent increments, and graphing various data points.  If 
 something interesting happens, I'll send a graph.
 
 When you say stagger writes, do you mean just lower the overall number of 
 increments/sec my client is attempting,
 
 Yes, this is what I mean.
 
 or do you mean I should stagger the load amongst many CFs/rows at once?  
 Right now I'm writing heavily into a small set of rows, each going into 3 
 CFs, before moving on to the next small set of rows.
 
 
 On Jul 12, 2011, at 12:26 PM, Sylvain Lebresne wrote:
 
 When you do a counter increment at CL.ONE, the write is acknowledged as
 soon as the first replica getting the write has pushed the
 increment into its memtable. However, there is a read happening for
 the replication to the other replicas (this is necessary to the
 counter design). What is happening is that our nodes are not able to
 do those reads fast enough. Or, another way to see it is that you are
 writing faster than your cluster can handle.
 
 I've opened https://issues.apache.org/jira/browse/CASSANDRA-2889 to do
 something about that, but for now you'll have to stagger your writes.
 I'm more surprised that tasks on the replicate on write stage would
 stack up at QUORUM, unless you're using a very large number of
 clients.
 
 --
 Sylvain
 
 On Tue, Jul 12, 2011 at 8:57 AM, David Hawthorne dha...@gmx.3crowd.com 
 wrote:
 This is now my fourth attempt to get the message through.  Apologies if 
 you see multiple copies.
 
 I've tried to give as much relevant data as I can think of, but please let 
 me know if you need any other info.  I spent the day getting jmxtrans to 
 talk to statsd with the cassandra JMX data, so I can provide graphs of 
 just about anything you want.  I can also re-run the tests to replicate 
 the problem if need be.  As always, thanks for taking a look!
 
 I have a 4 host test cluster that I'm writing counters into running 0.8.1, 
 and I'm writing to it with hector with CL.ONE or CL.QUORUM (test was run 
 with each and both gave the same results).  Partitioner is the 
 RandomPartitioner, and RF=3.  nodetool ring shows that actual data is well 
 balanced:
 
 Address     DC          Rack   Status  State   Load     Owns     Token
                                                                   127605887595351923798765477786913079296
 10.0.0.57   datacenter1 rack1  Up      Normal  1.59 GB  25.00%   0
 10.0.0.56   datacenter1 rack1  Up      Normal  1.39 GB  25.00%   42535295865117307932921825928971026432
 10.0.0.55   datacenter1 rack1  Up      Normal  1.24 GB  25.00%   85070591730234615865843651857942052864
 10.0.0.54   datacenter1 rack1  Up      Normal  1.34 GB  25.00%   127605887595351923798765477786913079296
 
 I also tested against a 1-box test cluster with RF=1.
 
 Both show the same results:  high performance for a while, and then 
 ReplicateOnWrite backs way up (I've seen it as high as a million), the 
 cassandra process becomes unresponsive, and the hector client starts 
 throwing exceptions.  High performance in this case equates to about 5-20k 
 inserts/sec on both the 4-box and 1-box cluster (I'm inserting one long 
 row at a time, so the performance is bounded by the one box owning the 
 row).  The cassandra process never seems to recover, even if left for 24 
 hours.  It's still chewing through those ReplicateOnWrite pending tasks.
 
 GC doesn't seem to be a factor.  Logs show GCs completing in  1s on all 4 
 boxes in the test cluster

possible 'coming back to life' bug with counters

2011-06-10 Thread David Hawthorne
Please take a look at this thread over in the hector-users mailing list:

http://groups.google.com/group/hector-users/browse_thread/thread/99835159b9ea1766

It looks as if the deleted columns are coming back to life when they shouldn't 
be.

I don't want to open a bug on something if it's already got one that I just 
couldn't find when I scanned the list of open bugs.

I'm using hector 0.8 against cassandra 0.8 release.  I can give you whatever 
logs or files you'd like.

Re: RTG/MRTG/Cricket replacement using Cassandra?

2011-05-23 Thread David Hawthorne
Have you looked at graphite?  It would be very cool to see graphite using 
cassandra as a backend, and then to have cacti feeding data into cassandra.

On May 20, 2011, at 4:48 PM, Edward Capriolo wrote:

 The first love of my open source life was cacti. I am going to discuss with
 them porting some of the system to cassandra.
 
 On Friday, May 20, 2011, yangyangyyy tedd...@gmail.com wrote:
 hi Ryan:
 
 
 Thanks for the link.
 I read the slides, could you please provide some more details on how the
 temporal aggregation is implemented?
 do you use time+granularity as the key ? or as column names ?
 
 Thanks
 Yang
 
 In reply to this post by Aaron Turner
 We have a solution for time series data on cassandra at Twitter that
 we'd like to open source, but it requires 0.8/trunk so we're not going
 to release it until that's stable.
 
 See
 http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
 
 -ryan
 
 On Thu, Mar 31, 2011 at 3:56 PM, Aaron Turner [hidden email] wrote:
 
 I've been looking at replacing our PostgreSQL backend for RTG (a SNMP
 based polling and graphing solution for network traffic/ports) with
 something using Cassandra in order to solve our scalability and
 redundancy requirements.  Based on a lot of what I've read, Cassandra
 is an ideal data store for this  time series data.  In fact, Eric
 Evans in his presentation on the Cassandra home page suggests that
 this kind of use case is perfect for Cassandra.
 
 So this got me wondering if someone had already come up with a CF
 model for this kind of data, including daily/weekly/monthly/yearly
 rollups.  Perhaps there's even an open source project or two
 implementing this sorta thing?  I've found flewton
 (https://github.com/flewton/flewton), which is possibly relevant, but
 my Java skills are pretty non-existent so I'm having a hard time
 figuring it out.
 
 Thanks,
 Aaron
 
 --
 Aaron Turner
  http://synfin.net/ Twitter: @synfinatic
 http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix 
 Windows
 Those who would give up essential Liberty, to purchase a little temporary
 Safety, deserve neither Liberty nor Safety.
 -- Benjamin Franklin
 carpe diem quam minimum credula postero
 
 
 



Re: RTG/MRTG/Cricket replacement using Cassandra?

2011-03-31 Thread David Hawthorne
I know cloudkick is doing something like this, and we're developing our own 
in-house method, but it would be nice for there to be a generically-available 
package that would do this.  Lately I've been wishing that someone would take 
graphite (written in python) and put the frontend on top of cassandra instead 
of their proprietary whisper databases.  The counter support in 0.8 should help 
with keeping the rollups up-to-date.

On Mar 31, 2011, at 3:56 PM, Aaron Turner wrote:

 I've been looking at replacing our PostgreSQL backend for RTG (a SNMP
 based polling and graphing solution for network traffic/ports) with
 something using Cassandra in order to solve our scalability and
 redundancy requirements.  Based on a lot of what I've read, Cassandra
 is an ideal data store for this  time series data.  In fact, Eric
 Evans in his presentation on the Cassandra home page suggests that
 this kind of use case is perfect for Cassandra.
 
 So this got me wondering if someone had already come up with a CF
 model for this kind of data, including daily/weekly/monthly/yearly
 rollups.  Perhaps there's even an open source project or two
 implementing this sorta thing?  I've found flewton
 (https://github.com/flewton/flewton), which is possibly relevant, but
 my Java skills are pretty non-existent so I'm having a hard time
 figuring it out.
 
 Thanks,
 Aaron
 
 -- 
 Aaron Turner
 http://synfin.net/ Twitter: @synfinatic
 http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix  
 Windows
 Those who would give up essential Liberty, to purchase a little temporary
 Safety, deserve neither Liberty nor Safety.
 -- Benjamin Franklin
 carpe diem quam minimum credula postero



questions about cassandra-1072/1546

2010-12-06 Thread David Hawthorne
Can we get an update?  After reading through the comments on 1072, it looks 
like this is getting close to finished, but it's hard for someone not knee-deep 
in the project to tell.  I'm primarily interested in the timeline you foresee 
for getting the increment support into trunk for 0.7, and some documentation 
around how counters will be supported from the user's perspective - chiefly 
what a Column and SuperColumn will look like with counters and what the thrift 
API will be.  Some documentation about the remaining issues and concerns we 
should be aware of when using counters would be good, too, since it looks like 
there were some in the comments.  Again, as someone not knee-deep in the 
project, it's hard to tell how severe they are or how or when they would apply 
in general use.