Look for log messages at the ERROR level first to find out why it's crashing. 

Check for GC pressure during the repair, either using JConsole or log messages 
from the GCInspector. 
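
As a rough sketch of the two log checks above, something like this pulls out the ERROR lines and the GCInspector lines in one pass. It assumes each log line starts with its level (the default log4j layout, as far as I know) and that GC messages contain the text "GCInspector". The function name triage_log is just made up for illustration, so adjust to your setup.

```python
import re

# Assumption: a log line starts with its level, e.g. "ERROR [thread] date ...".
# Change this pattern if your log4j layout differs.
LEVEL_RE = re.compile(r"^\s*(ERROR|WARN|INFO|DEBUG)\b")


def triage_log(lines):
    """Split an iterable of log lines into (error_lines, gc_lines)."""
    errors, gc = [], []
    for line in lines:
        line = line.rstrip("\n")
        match = LEVEL_RE.match(line)
        if match and match.group(1) == "ERROR":
            errors.append(line)
        # Assumption: GC pause reports mention the GCInspector class name.
        if "GCInspector" in line:
            gc.append(line)
    return errors, gc


# Usage (the log path is install-specific, so it is left to the caller):
#   with open(path_to_system_log, errors="replace") as f:
#       errors, gc = triage_log(f)
#   print("\n".join(errors[:20]))
#   print("\n".join(gc[-20:]))
```

Read the ERROR lines first, then see whether the GCInspector lines cluster around the repair window.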

Check nodetool tpstats to get an idea of whether the nodes are saturated, i.e. 
whether tasks are backing up in the pending list, or whether the nodes are 
just running with high latency. 
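
If you want to watch this over the repair window rather than eyeball it, here is a minimal sketch that picks the pools with pending tasks out of tpstats output. It assumes the columns are pool name, active, pending, completed, in that order, which is what I'd expect from tpstats but may differ between versions. The name pending_pools is made up for the example.

```python
def pending_pools(tpstats_text):
    """Return {pool_name: pending_count} for pools with pending > 0."""
    result = {}
    for line in tpstats_text.splitlines():
        parts = line.split()
        # Assumed row layout: name, active, pending, completed[, ...].
        # Header and blank lines fail the numeric check and are skipped.
        if len(parts) < 4 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        pending = int(parts[2])
        if pending > 0:
            result[parts[0]] = pending
    return result


# Usage, assuming nodetool is on the PATH and the node is local:
#   import subprocess
#   out = subprocess.run(["nodetool", "-h", "localhost", "tpstats"],
#                        capture_output=True, text=True).stdout
#   print(pending_pools(out))
```

Pools that keep a non-zero pending count across several samples suggest saturation. An empty result while requests are still slow points more towards latency.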

If a node crashes while calculating the Merkle trees for its neighbours, the 
repair will hang (for 48 hours, I think) on the node that initiated the repair. 
I don't think this is immediately obvious through tpstats.

Start with why it's crashing and what's happening with the GC. 

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 16 Jun 2011, at 10:20, Aurynn Shaw wrote:

> Hey all;
> 
> So, we have Cassandra running on a 5-server ring, with a RF of 3, and we're 
> regularly seeing major slowdowns in read & write performance while running 
> nodetool repair, as well as the occasional Cassandra crash during the repair 
> window - slowdowns past 10 seconds to perform a single write.
> 
> The repair cycle runs nightly on a different server, so each server has it 
> run once a week.
> 
> We're running 0.7.0 currently, and we'll be upgrading to 0.7.6 shortly.
> 
> System load on the Cassandra servers is never more than 10% CPU and utterly 
> minimal IO usage, so I wouldn't think we'd be seeing issues quite like this.
> 
> What sort of knobs should I be looking at tuning to reduce the impact that 
> nodetool repair has on Cassandra? What questions should I be asking as to why 
> Cassandra slows down to the level that it does, and what I should be 
> optimizing?
> 
> Additionally, what should I be looking for in the logs when this is 
> happening? There's a lot in the logs, but I'm not sure what to look for.
> 
> Cassandra is, in this instance, backing a system that supports around a 
> million requests a day, so not terribly heavy traffic.
> 
> Thanks,
> 
> Aurynn
