> Latency is fine, basically the service suddenly freezes. On top of that to
> reduce the number of reads I have memcache fronting this @ a 92% hit rate

Ok. In that case it feels most likely to me that you're not throwing
too much read traffic at it consistently, but rather that there is
either

  (a) some gradual accumulation of, e.g., stuck threads in the read
      stage or similar, or
  (b) something happening more suddenly that is directly causing your
      problem (e.g., a sudden extreme spike in traffic triggering some
      bug, touching some particular row that goes to an unreadable
      part of disk, etc.).

> I have very detailed stats on the exceptions thrown from the cassandra
> client. For about 3-5 days I have a 99% success ratio with connections +
> service of pulling a single hash key with a single column.
>
> i.e. {$sn}_{$userid}_{$click} => {$when}
>
> then I have about a 25-40% failure rate when the hang occurs.

Are the 1% failures you normally experience (pre-freeze) timeouts?
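
If your stats already record the exception type per failed request, a
breakdown along these lines would answer that quickly. This is only a
rough sketch: tally_failures is a made-up helper, and I'm assuming your
client surfaces the Thrift exception class names (TimedOutException,
UnavailableException, and so on) as plain strings in your stats.

```python
from collections import Counter


def tally_failures(exception_names):
    """Count failures per exception class name.

    exception_names is assumed to be an iterable of strings, one per
    failed request, as recorded by the client-side stats.
    Returns (counts, timeout_share), where timeout_share is the fraction
    of failures that were timeouts (0.0 if there were no failures).
    """
    counts = Counter(exception_names)
    total = sum(counts.values())
    # "TimedOutException" is assumed to be the name the client reports
    # for server-side request timeouts.
    timeouts = counts.get("TimedOutException", 0)
    share = timeouts / total if total else 0.0
    return counts, share


if __name__ == "__main__":
    sample = [
        "TimedOutException",
        "TimedOutException",
        "UnavailableException",
        "TTransportException",
    ]
    counts, share = tally_failures(sample)
    for name, n in counts.most_common():
        print(f"{name}: {n}")
    print(f"timeout share of failures: {share:.0%}")
```

Running it separately over a normal period and over a freeze period
would show whether the mix of failure types changes when the hang
occurs.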

-- 
/ Peter Schuller
