Hi all,

We're running a small Cassandra cluster (v1.0.10) serving data to our web application, and as traffic grows we're starting to see some odd issues. The biggest of these is that a single node occasionally becomes unresponsive: new connections can't be opened, requests can't be sent, or a request is sent but never answered. Our client library is set to retry against another node when this happens, and the retried request is usually served instantly. So it's not that some requests are inherently expensive; it's that a node is sometimes just "busy", and we have no idea why or what it's doing.
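To catch a spike in the act, we've been considering a small watcher that snapshots `nodetool tpstats` and `nodetool compactionstats` the moment the load average crosses a threshold. A rough sketch (the threshold and log path are just placeholders, and it assumes `nodetool` is on the PATH and talks to the local node):

```python
# Sketch: poll the 1-minute load average and, when it spikes, snapshot
# what Cassandra is doing so the "busy" periods can be diagnosed later.
import os
import subprocess
import time


def load_1min():
    """1-minute load average of this host."""
    return os.getloadavg()[0]


def snapshot_if_busy(threshold=4.0, logfile="busy-snapshots.log"):
    """If load is above `threshold`, append thread-pool and compaction
    stats to `logfile` and return True; otherwise return False."""
    if load_1min() <= threshold:
        return False
    with open(logfile, "a") as out:
        out.write("=== %s load=%.2f ===\n" % (time.ctime(), load_1min()))
        for cmd in (["nodetool", "tpstats"], ["nodetool", "compactionstats"]):
            out.write(subprocess.check_output(cmd).decode())
    return True


# Usage idea: run from cron every minute, or in a simple loop:
#     while True: snapshot_if_busy(); time.sleep(5)
```

The idea is just to correlate the load spikes with something concrete (a backed-up thread pool, a running compaction) instead of guessing after the fact.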
We monitor the servers with MRTG and Monit. MRTG shows average CPU usage around 5% on our quad-core Xeon servers with SSDs, but through Monit we occasionally see the 1-minute load average climb above 4, and this usually coincides with the issues above. Is this common? Does it happen to everyone else? And why the spikiness in load? We can't find anything in the Cassandra logs indicating that something is up (such as a slow GC or a compaction), and there's no corresponding traffic spike in the application either. Should we just add more nodes whenever a single one gets CPU spikes?

Another explanation could be that we've configured something wrong. We're running pretty much the default config. Each node has 16GB of RAM and a 4GB heap, with no row cache and a sizeable key cache. Despite that, we occasionally get OOM exceptions and crashing nodes, maybe a few times per month. Should we increase the heap size? Or move to 1.1 and enable off-heap caching?

We have quite a lot of traffic to the cluster: a single keyspace with two column families, RF=3, compression enabled, and the default size-tiered compaction.

Column family A has 60GB of actual data. Each row has a single column, and that column holds binary data that varies in size up to 500kB. When we update a row, we write a new value to this single column, effectively replacing the entire row. We do ~1000 updates/s, totalling ~10MB/s in writes.

Column family B also has 60GB of actual data, but each row has a variable number (~100-10000) of supercolumns, and each supercolumn has a fixed number of columns with a fixed amount of data, totalling ~1kB. On this column family we add supercolumns to rows at a rate of ~200/s, and occasionally we bulk-delete supercolumns from a row.

The config options we are unsure about are things like commit log sizes, memtable flushing thresholds, commit log syncing, compaction throughput, etc.
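For what it's worth, a back-of-envelope on our write rate suggests the memtables fill up quickly under the defaults. This assumes we've understood the 1.0 defaults correctly (total memtable space capped at roughly a third of the heap via memtable_total_space_in_mb) — if that's wrong, so are these numbers:

```python
# Back-of-envelope for our write load, assuming 1.0 defaults as we
# understand them: total memtable space capped at ~1/3 of the heap,
# 4GB heap, ~10MB/s of incoming writes.
HEAP_MB = 4 * 1024
MEMTABLE_CAP_MB = HEAP_MB / 3.0   # assumed default: one third of heap
WRITE_RATE_MB_S = 10.0

# Seconds until the global memtable threshold forces a flush.
flush_interval_s = MEMTABLE_CAP_MB / WRITE_RATE_MB_S

print("memtable cap: ~%.0f MB" % MEMTABLE_CAP_MB)
print("forced flush roughly every %.0f s" % flush_interval_s)
```

If that arithmetic holds, a flush (plus whatever compactions it triggers) lands every couple of minutes, which might explain periodic load spikes even with nothing obvious in the logs.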
Are we at a point, with our data size and write load, where the defaults aren't good enough anymore? Should we stick with size-tiered compaction even though our application is update-heavy?

Many thanks,
/Henrik