Re: Slowdowns during repair

2011-06-16 Thread aaron morton
Look for log messages at the ERROR level first to find out why it's crashing. 

Check for GC pressure during the repair, either using JConsole or log messages 
from the GCInspector. 
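
As a rough sketch of the two log checks above, something like the following could pull out the interesting lines. It assumes the log level is the first token on each line and that GC messages carry the text "GCInspector"; the function name scan_log is only an illustration, so adjust it to whatever your log4j layout actually produces.

```python
def scan_log(lines):
    """Split log lines into ERROR-level entries and GCInspector entries.

    Assumption: the level (e.g. ERROR) starts the line once leading
    whitespace is removed, and GC messages mention "GCInspector".
    """
    errors, gc_events = [], []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("ERROR"):
            errors.append(stripped)
        elif "GCInspector" in stripped:
            gc_events.append(stripped)
    return errors, gc_events
```

Pass it an open file handle for the Cassandra system log and compare the timestamps of the GC entries against the repair window.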

Check nodetool tpstats to get an idea of whether the nodes are saturated, i.e. 
are there tasks in the pending list, or are they just running with high latency?
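
A minimal sketch for spotting backed-up stages in saved tpstats output. It assumes each data row is a pool name followed by numeric Active and Pending columns (plus at least one more column); the helper name pending_pools is hypothetical, and the column layout may differ between versions.

```python
def pending_pools(tpstats_text):
    """Return {pool_name: pending} for pools with pending tasks.

    Assumption: rows look like "<name> <active> <pending> <completed> ...".
    Header lines and anything else that does not match are skipped.
    """
    backed_up = {}
    for line in tpstats_text.splitlines():
        tokens = line.split()
        if len(tokens) < 4:
            continue
        name, active, pending = tokens[0], tokens[1], tokens[2]
        if not (active.isdigit() and pending.isdigit()):
            continue  # header row or unrelated output
        if int(pending) > 0:
            backed_up[name] = int(pending)
    return backed_up
```

An empty result while writes are still slow would point at latency rather than saturation.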

If a node crashes when calculating the Merkle trees for its neighbours, the 
repair will hang (for 48 hours, I think) on the node that initiated the repair. 
I don't think this is immediately obvious through tpstats.

Start with why it's crashing and what's happening with the GC. 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 16 Jun 2011, at 10:20, Aurynn Shaw wrote:

 Hey all;
 
 So, we have Cassandra running on a 5-server ring, with a RF of 3, and we're 
 regularly seeing major slowdowns in read & write performance while running 
 nodetool repair, as well as the occasional Cassandra crash during the repair 
 window - slowdowns past 10 seconds to perform a single write.
 
 The repair cycle runs nightly on a different server, so each server has it 
 run once a week.
 
 We're running 0.7.0 currently, and we'll be upgrading to 0.7.6 shortly.
 
 System load on the Cassandra servers is never more than 10% CPU and utterly 
 minimal IO usage, so I wouldn't think we'd be seeing issues quite like this.
 
 What sort of knobs should I be looking at tuning to reduce the impact that 
 nodetool repair has on Cassandra? What questions should I be asking as to why 
 Cassandra slows down to the level that it does, and what I should be 
 optimizing?
 
 Additionally, what should I be looking for in the logs when this is 
 happening? There's a lot in the logs, but I'm not sure what to look for.
 
 Cassandra is, in this instance, backing a system that supports around a 
 million requests a day, so not terribly heavy traffic.
 
 Thanks,
 
 Aurynn



Slowdowns during repair

2011-06-15 Thread Aurynn Shaw

Hey all;

So, we have Cassandra running on a 5-server ring, with a RF of 3, and 
we're regularly seeing major slowdowns in read & write performance while 
running nodetool repair, as well as the occasional Cassandra crash 
during the repair window - slowdowns past 10 seconds to perform a single 
write.


The repair cycle runs nightly on a different server, so each server has 
it run once a week.


We're running 0.7.0 currently, and we'll be upgrading to 0.7.6 shortly.

System load on the Cassandra servers is never more than 10% CPU and 
utterly minimal IO usage, so I wouldn't think we'd be seeing issues 
quite like this.


What sort of knobs should I be looking at tuning to reduce the impact 
that nodetool repair has on Cassandra? What questions should I be asking 
as to why Cassandra slows down to the level that it does, and what I 
should be optimizing?


Additionally, what should I be looking for in the logs when this is 
happening? There's a lot in the logs, but I'm not sure what to look for.


Cassandra is, in this instance, backing a system that supports around a 
million requests a day, so not terribly heavy traffic.


Thanks,

Aurynn