> - About 300 million records are stored in single cassandra node. (16
> cores, 32GB memory, SSD disk)
> - One record is avg. 1KB.

So about 300 GB of disk space in total?

> - Multi-thread client on different server does read and write:
>  - 120 reader threads continuously read random records from cassandra
> and 1/10 of read data are written back to cassandra with 20 writer
> threads.

They are definitely (pseudo-)random, uniformly across the data? I.e.,
there is no particular hot active set?

> - Use JNA library to suppress GC for cassandra program.

I assume you mean dropping JNA in there to enable cassandra to do
mlockall(). Does it succeed? Have you confirmed there is no swapping
going on?

(Just to be sure; the long GC times sound a bit suspicious)
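
One quick way to check both on Linux is to look at the VmLck and
VmSwap fields in /proc/<pid>/status for the cassandra JVM. Here is a
rough sketch (note that VmSwap is only reported by reasonably recent
kernels, so treat its absence as "unknown" rather than "not swapping"):

    #!/usr/bin/env python
    # Rough check: print locked/swapped/resident memory for a process by
    # reading /proc/<pid>/status (Linux only). Non-zero VmLck suggests
    # mlockall() took effect; non-zero VmSwap means part of the process
    # is swapped out.
    import sys

    def mem_status(pid):
        wanted = ('VmLck', 'VmSwap', 'VmRSS')
        with open('/proc/%s/status' % pid) as f:
            for line in f:
                if line.startswith(wanted):
                    print(line.strip())

    if __name__ == '__main__':
        mem_status(sys.argv[1])  # pass the cassandra JVM's pid

If VmLck stays at 0 kB, mlockall() did not actually take effect.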

> It didn't help. More frequent fluctuation occurs and zero disk read
> happens twice in one hour. The disk read throughput is lower than
> previous one.

Hrmpf. How about compaction; is compaction somehow correlated with
the trends you see in the graphs?

For the long pauses, do you have GCInspector log entries indicating
the before/after heap sizes for the CMS GC passes?

Regarding the graph-ssd-stat.jna.operations1M.png:

The behavior seems very consistent with having low cache locality at
the start, being disk-bound. As disk activity goes down because the
hot reads increasingly hit cache, CPU usage goes up instead. However I
would expect this to be
correlated with a higher operations/second rate. Are the different
graphs over the exact same time period, such that they show that the
overall 'reads per second' rate actually goes *down* as CPU usage goes
up and disk usage goes down?

If so, that indicates there is in fact excessive activity going on
wasting CPU. Since it is periodic, I'm thinking maybe the heap simply
grows too full, causing GC pressure to decrease overall efficiency
(without OOM:ing) until the time-based criterion triggers a memtable
flush. Once that happens and the memtables get GC:ed, the application
(cassandra) can suddenly make progress again.
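
If you want to test that theory against the graphs, one option is to
pull the memtable flush lines out of system.log and see whether their
timestamps line up with the points where throughput recovers. A rough
sketch; the "Enqueuing flush"/"Completed flushing" strings are what I
would expect to see, but adjust them to whatever your version actually
logs:

    #!/usr/bin/env python
    # Sketch: print memtable-flush related lines from system.log so their
    # timestamps can be compared with the throughput/GC graphs. The match
    # strings are assumptions; adjust them to your log's actual messages.
    import sys

    FLUSH_MARKERS = ('Enqueuing flush', 'Completed flushing')

    def flush_lines(path):
        for line in open(path):
            if any(m in line for m in FLUSH_MARKERS):
                print(line.rstrip())

    if __name__ == '__main__':
        flush_lines(sys.argv[1])  # e.g. /var/log/cassandra/system.log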

> The last image Screenshot.png is GC log graph created by gcview. The
> GC spikes match to the performance drop timing (blue line means used
> heap size and green line height shows GC time).
> We believe that some strange thing is going on after memtable flush.
> Does anyone has a fix or advice?

Hmmm. I'm not sure how to correlate the GC graphs, especially because
of the resolution. Can you find some GCInspector log entries and
report back what the heap size is before/after a CMS sweep (1) during
the "good" period of high throughput and (2) during the "bad" period
of low throughput? I.e., find ConcurrentMarkSweep GCInspector lines
from those periods; each line should contain the heap size and the
amount freed.
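
To save some manual digging, here is a rough sketch that pulls those
lines out and derives before/after figures from them. It assumes the
GCInspector message looks roughly like "GC for ConcurrentMarkSweep:
<ms> ms, <bytes> reclaimed leaving <bytes> used; max is <bytes>"; if
your version logs something different, adjust the regex accordingly:

    #!/usr/bin/env python
    # Sketch: extract ConcurrentMarkSweep GCInspector lines and print the
    # implied heap-before/heap-after for each CMS pass. The regex encodes
    # an assumed message format; check it against your actual log.
    import re
    import sys

    CMS_RE = re.compile(r'GC for ConcurrentMarkSweep: (\d+) ms, '
                        r'(\d+) reclaimed leaving (\d+) used; max is (\d+)')

    def cms_passes(path):
        for line in open(path):
            m = CMS_RE.search(line)
            if m:
                ms, reclaimed, used, heap_max = map(int, m.groups())
                print('%6d ms  before=%d  after=%d  max=%d'
                      % (ms, used + reclaimed, used, heap_max))

    if __name__ == '__main__':
        cms_passes(sys.argv[1])  # e.g. /var/log/cassandra/system.log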

In general, I think the GC times are pretty surprising unless it is
because CMS is failing to start early enough such that the excessive
times are due to fallback to full GC.

You may want to modify the VM args to run with -Xloggc:path/to/gclog
-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. That should
make any concurrent mode failures and full GC:s visible and in general
give some more information about what is going on at each GC. One can
then look at the log surrounding the instances of prolonged GC pauses.
Could you run it with those options, trigger the behavior, and
perhaps post the GC log?
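
Once you have such a log, the lines worth looking for around the long
pauses are the "concurrent mode failure", "promotion failed" and
"Full GC" entries. Something like this trivial sketch will pull them
out:

    #!/usr/bin/env python
    # Sketch: scan a -Xloggc/-XX:+PrintGCDetails log for the events that
    # typically explain multi-second pauses: CMS concurrent mode failures,
    # promotion failures and full collections.
    import sys

    MARKERS = ('concurrent mode failure', 'promotion failed', 'Full GC')

    def interesting_events(path):
        for n, line in enumerate(open(path), 1):
            if any(m in line for m in MARKERS):
                print('%d: %s' % (n, line.rstrip()))

    if __name__ == '__main__':
        interesting_events(sys.argv[1])  # the path given to -Xloggc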

I think the most important thing to figure out is what CPU time is
spent on. If it is GC it should almost certainly be fixable as long as
one knows *why* (it should be due to heap usage, although I don't see
anything in your numbers that would indicate to me why the heap would
have so much live data as to cause problems given your 16 gig heap
size).

-- 
/ Peter Schuller
