> Latency is fine, basically the service suddenly freezes. On top of that to
> reduce the number of reads I have memcache fronting this @ a 92% hit rate

Ok. In that case it feels most likely to me that you're not throwing too
much read traffic at it consistently, but rather that there is either
(a) some gradual accumulation of e.g. stuck threads in the read stage or
similar, or (b) something happening more suddenly that is directly
causing your problem (e.g., a sudden extreme spike in traffic triggering
some bug, touching some particular row that goes to an unreadable part
of disk, etc).

> I have very detailed stats on the exceptions thrown from the cassandra
> client. For about 3-5 days I have a 99% success ratio with connections +
> service of pulling a single hash key with a single column.
>
> i.e. {$sn}_{$userid}_{$click} => {$when}
>
> then I have about a 25-40% failure rate when the hang occurs.

Are the 1% failures you normally experience (pre-freeze) timeouts?

--
/ Peter Schuller