> With the large new-gen, you were actually seeing fallbacks to full GC?
> You weren't just still experiencing problems because at 10 gig, the new-gen
> will be so slow to compact to effectively be similar to a full gc in terms of
> affecting latency?

Yes, we were seeing fallbacks to full GC with a large young generation.

Surprisingly, the young generation was still collecting quickly (under 0.25s).

Perhaps that means the young gen was full of fewer, larger objects?

Doing some experimentation now: with a 250MB young gen, ParNew collections take 
around 0.01-0.02s of wall time. With a 2GB young gen, they are around 
0.02-0.07s. Finally, with ~10GB new (9.5GB young), they are around 0.05-0.15s, 
with most being <0.1s. If no compactions have occurred, we occasionally see a 
0.2s ParNew collection.
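
For reference, here is roughly how I'm pulling those wall times out of the GC 
log. It's only a sketch: the path and regex are guesses at my exact setup and 
log format, and it grabs every stop-the-world [Times: ...] line (CMS phases 
included), not just ParNew:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: print the "real" (wall clock) time of each stop-the-world pause
    // recorded in gc.log. Path and log format are assumptions about my setup.
    public class GcPauseTimes {
        private static final Pattern REAL = Pattern.compile("real=(\\d+\\.\\d+) secs");

        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new FileReader("gc.log"));
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = REAL.matcher(line);
                if (line.contains("[Times:") && m.find()) {
                    System.out.println(m.group(1) + "s");
                }
            }
            in.close();
        }
    }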

109.126: [GC 109.126: [ParNew
Desired survivor size 298254336 bytes, new threshold 3 (max 3)
- age   1:   34897008 bytes,   34897008 total
- age   2:   17763192 bytes,   52660200 total
- age   3:  135342264 bytes,  188002464 total
: 9524607K->204678K(9903232K), 0.0565180 secs] 9527767K->229803K(15146112K), 
0.0567610 secs] [Times: user=0.31 sys=0.01, real=0.06 secs]
128.997: [GC 128.998: [ParNew
Desired survivor size 298254336 bytes, new threshold 2 (max 3)
- age   1:  294294360 bytes,  294294360 total
- age   2:   16201608 bytes,  310495968 total
- age   3:   17755920 bytes,  328251888 total
: 9500071K->349162K(9903232K), 0.1070140 secs] 9525196K->400440K(15146112K), 
0.1072430 secs] [Times: user=0.60 sys=0.00, real=0.11 secs]

These were taken during compaction of a non-troublesome CF. We are still seeing 
new generation allocations in the 1 GB/s range, and process read I/O reported 
in /proc/<pid>/io is about 135MB/s. A 7.5x ratio of memory allocations to read 
I/O does seem pretty high. Granted, the host was servicing some thrift requests 
at the time, and these obviously contribute to object allocations.
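
The 135MB/s figure comes from sampling the cumulative counters in 
/proc/<pid>/io over an interval, along these lines (a sketch; I'm using 
read_bytes rather than rchar, and the pid argument and 10s window are just 
placeholders):

    import java.io.BufferedReader;
    import java.io.FileReader;

    // Sketch: sample /proc/<pid>/io twice and report the read_bytes delta as MB/s.
    // read_bytes counts bytes actually fetched from the storage layer.
    public class ProcIoReadRate {
        private static long readBytes(String pid) throws Exception {
            BufferedReader in = new BufferedReader(new FileReader("/proc/" + pid + "/io"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("read_bytes:")) {
                        return Long.parseLong(line.split("\\s+")[1]);
                    }
                }
            } finally {
                in.close();
            }
            return -1;
        }

        public static void main(String[] args) throws Exception {
            String pid = args[0];           // Cassandra's pid
            long before = readBytes(pid);
            Thread.sleep(10000);            // 10 second sample window
            long after = readBytes(pid);
            System.out.printf("read I/O: %.1f MB/s%n", (after - before) / 10.0 / (1 << 20));
        }
    }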

Without any compactions, we blow through the ~9.5GB young generation in 50-60s 
on average, which means our baseline allocation rate (from servicing thrift 
requests and other misc background work) is a (more reasonable) 200MB/s. 
Assuming there isn't much else going on in Cassandra, that means that ~80% of 
the allocated space during compactions is coming from the compaction and that 
the memory allocation overhead to process read bytes for compactions is pretty 
high (around 6:1).
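
Spelling out that arithmetic (all of the inputs are rounded observations, so 
this is strictly back-of-the-envelope):

    // Back-of-the-envelope check on the figures above; all inputs are rounded
    // observations, not precise measurements.
    public class AllocVsRead {
        public static void main(String[] args) {
            double baseline = 9.5 * 1024 / 55;      // ~177 MB/s: 9.5GB young filled in ~55s
            double totalAlloc = 1024;               // ~1 GB/s allocation rate during compaction
            double compactionAlloc = totalAlloc - baseline;   // ~850 MB/s from compaction itself
            double readRate = 135;                  // MB/s read I/O from /proc/<pid>/io

            System.out.printf("compaction share of allocations: %.0f%%%n",
                              100 * compactionAlloc / totalAlloc);        // ~80%
            System.out.printf("allocation-to-read ratio: %.1f:1%n",
                              compactionAlloc / readRate);                // ~6:1
        }
    }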

I understand there will be some overhead from the language runtime, but a ratio 
in the 6:1 range of JVM allocations to I/O reads seems a bit high. Since 
we are talking about many gigabytes of memory, I would expect the JVM 
allocation size to be dominated by the column values. This leads me to believe 
that the column value buffers are being excessively copied or the sstables 
aren't being read as efficiently as possible. Whatever the root cause, there 
definitely seems to be room for improvement. But, I'm not a Java expert and 
don't know the compaction algorithm too well, so maybe a ratio of 6:1 is pretty 
good.

> If there is any suspicion that the above is happening, maybe try decreasing
> in_memory_compaction_limit_in_mb (preparing to see lots of stuff logged
> to console, assuming that's still happening in the 0.6.
> version you're running).

I don't believe an in-memory compaction limit config option exists in 0.6.
 
> I"m not sure on what you're basing that, but unless I have fatally failed to
> grok something fundamental about the interaction between new-gen and
> old-gen with CMS, object's aren't being allocated *period* while the "young
> generation is being collected" as that is a stop-the-world pause. (This is 
> also
> why I said before that at 10 gig new-gen size, the observed behavior on
> young gen collections may be similar to fallback-to-full-gc cases, but not 
> quite
> since it would be parallel rather than serial)

The grok fail is probably on my end. I couldn't find any documentation to back 
either of our claims, so I'll defer to your experience.

Greg
