Heya!

I’ve been observing some strange and worrying behaviour all this week with row 
cache hits taking hundreds of milliseconds.

Cassandra 1.2.15, Datastax CQL driver 1.0.4.
EC2 m1.xlarge instances
RF=3, N=4
vnodes in use
key cache: 200M
row cache: 200M
row_cache_provider: SerializingCacheProvider
Query: PreparedStatement SELECT * from uniques3 WHERE hash=? AND item=? AND 
event=?. All values are < 20 bytes. 
All data is written with a TTL of days.

Row is not particularly wide (see cfhistograms in the pastebin).  Row cache hit 
can take hundreds of milliseconds, pretty much screwing performance. My initial 
thought was garbage collection, but I collected traces and GC logs to the 
pastebin below, so while there *is* plenty of GC going on, I don’t think it’s 
the reason.  We also have other column families accessed through Thrift which 
do not exhibit this behaviour at all. There are no abnormal query times for 
cache misses.

http://pastebin.com/ac6PVHhm

Notice also the weird “triple hump” on the cfhistograms - I’m kinda used to 
seeing two humps, one for cache hits and one for disk access, but this one has 
clearly three humps, one at the 200ms area. Also odd is the very large fp false 
ratio, but that might be just our data.

Armed with the traces I formed a hypothesis that perhaps row cache is a bad 
idea, turned it off for this CF, and hey! The average read latencies dropped to 
about 2 milliseconds.  So I’m kinda fine here now, but I would really 
appreciate it if someone could explain to me what is going on, and why would a 
row cache hit ever take up to 450 milliseconds? In our usecase, this CF does 
contain some hot data, and the row cache hit ratio is around 80%, so keeping it 
would be kinda useful.

(The pastebin contains a couple of traces, GC logs from all servers noted in 
the trace, cfstats, cfhistograms and schema.)

/Janne

Reply via email to