I have the same problem here. I analyzed the hprof file with MAT and, as you said, a LinkedBlockingQueue was holding 2.6 GB. I think Cassandra's thread pools should limit their queue size.
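For what it's worth, the usual java.util.concurrent way to cap that growth is a bounded queue plus a rejection policy that pushes back on the producer. A minimal sketch, assuming made-up thread counts and a made-up queue bound (this is not Cassandra's actual stage configuration):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedStagePool {
        public static void main(String[] args) {
            // Bound the queue at 4096 tasks instead of leaving the
            // LinkedBlockingQueue unbounded (the default), so pending work
            // cannot grow until the heap is exhausted.
            BlockingQueue<Runnable> queue = new LinkedBlockingQueue<Runnable>(4096);

            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    8, 8,                  // core and max threads (made up)
                    60L, TimeUnit.SECONDS, // keep-alive for idle threads
                    queue,
                    // When the queue is full, run the task in the caller's
                    // thread; that slows the producer down instead of OOMing.
                    new ThreadPoolExecutor.CallerRunsPolicy());

            pool.execute(new Runnable() {
                public void run() {
                    System.out.println("task ran");
                }
            });
            pool.shutdown();
        }
    }

With CallerRunsPolicy the submitting thread does the work itself once the queue is full, which is a crude but effective form of backpressure.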
cassandra 0.6.1

java version:
$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

iostat:
$ iostat -x -l 1
Device:  rrqm/s   wrqm/s     r/s    w/s     rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda       81.00  8175.00  224.00  17.00  23984.00  2728.00    221.68      1.01   1.86   0.76  18.20

tpstats (of course, this node is still alive):
$ ./nodetool -host localhost tpstats
Pool Name                  Active  Pending   Completed
FILEUTILS-DELETE-POOL           0        0        1281
STREAM-STAGE                    0        0           0
RESPONSE-STAGE                  0        0   473617241
ROW-READ-STAGE                  0        0           0
LB-OPERATIONS                   0        0           0
MESSAGE-DESERIALIZER-POOL       0        0   718355184
GMFD                            0        0      132509
LB-TARGET                       0        0           0
CONSISTENCY-MANAGER             0        0           0
ROW-MUTATION-STAGE              0        0   293735704
MESSAGE-STREAMING-POOL          0        0           6
LOAD-BALANCER-STAGE             0        0           0
FLUSH-SORTER-POOL               0        0           0
MEMTABLE-POST-FLUSHER           0        0        1870
FLUSH-WRITER-POOL               0        0        1870
AE-SERVICE-STAGE                0        0           5
HINTED-HANDOFF-POOL             0        0          21

On Tue, Apr 27, 2010 at 3:32 AM, Chris Goffinet <goffi...@digg.com> wrote:
> Upgrade to b20 of Sun's version of the JVM. This OOM might be related to
> LinkedBlockingQueue issues that were fixed.
>
> -Chris
>
>
> 2010/4/26 Roland Hänel <rol...@haenel.me>
>
>> Cassandra Version 0.6.1
>> OpenJDK Server VM (build 14.0-b16, mixed mode)
>> Import speed is about 10 MB/s for the full cluster; if a compaction is
>> going on, the individual node is I/O limited.
>> tpstats: caught me, didn't know this. I will set up a test and try to
>> catch a node during the critical time.
>>
>> Thanks,
>> Roland
>>
>>
>> 2010/4/26 Chris Goffinet <goffi...@digg.com>
>>
>>> Which version of Cassandra?
>>> Which version of the Java JVM are you using?
>>> What do your I/O stats look like when bulk importing?
>>> When you run `nodeprobe -host XXXX tpstats`, is any thread pool backing
>>> up during the import?
>>>
>>> -Chris
>>>
>>>
>>> 2010/4/26 Roland Hänel <rol...@haenel.me>
>>>
>>>> I have a cluster of 5 machines building a Cassandra datastore, and I
>>>> load bulk data into it using the Java Thrift API. The first ~250 GB runs
>>>> fine; then one of the nodes starts to throw OutOfMemory exceptions. I'm
>>>> not using any row or index caches, and since I only have 5 CFs and some
>>>> 2.5 GB of RAM allocated to the JVM (-Xmx2500M), in theory, that shouldn't
>>>> happen. All inserts are done with consistency level ALL.
>>>>
>>>> I hope with this I have avoided all the 'usual dummy errors' that lead
>>>> to OOMs. I have begun to troubleshoot the issue with JMX; however, it's
>>>> difficult to catch the JVM at the right moment because it runs well for
>>>> several hours before this thing happens.
>>>>
>>>> One thing comes to mind, and maybe one of the experts could confirm or
>>>> reject this idea for me: is it possible that when one machine slows down
>>>> a little bit (for example, because a big compaction is going on), the
>>>> memtables don't get flushed to disk as fast as they build up under the
>>>> continuing bulk import? That would result in a downward spiral: the
>>>> system gets slower and slower on disk I/O, but since more and more data
>>>> keeps arriving over Thrift, it finally OOMs.
>>>>
>>>> I'm using the "periodic" commit log sync; could this also create a
>>>> situation where the commit log writer is too slow to keep up with the
>>>> data intake, resulting in ever-growing memory usage?
>>>>
>>>> Maybe these thoughts are just bullshit. Let me know if so... ;-)
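On the downward-spiral theory in Roland's mail quoted above: if the server really can't keep up with the intake, one workaround is to throttle the bulk loader on the client side. A rough sketch of a byte-rate throttle the importer could call before each Thrift write (batch_mutate, insert, ...); the 5 MB/s budget mentioned below and the batch-size accounting are placeholders, not numbers from this thread:

    import java.util.concurrent.TimeUnit;

    /**
     * Tiny client-side throttle for a bulk loader: caps the average number
     * of bytes handed to Thrift per second so a slow node can catch up.
     */
    public class ImportThrottle {
        private final long bytesPerSecond;
        private long windowStart = System.nanoTime();
        private long bytesInWindow = 0;

        public ImportThrottle(long bytesPerSecond) {
            this.bytesPerSecond = bytesPerSecond;
        }

        /** Call before each write with the batch's approximate size in bytes. */
        public synchronized void acquire(long bytes) throws InterruptedException {
            bytesInWindow += bytes;
            long elapsed = System.nanoTime() - windowStart;
            long allowed = (long) (bytesPerSecond * (elapsed / 1e9));
            if (bytesInWindow > allowed) {
                // Ahead of budget; sleep long enough to fall back under it.
                long excess = bytesInWindow - allowed;
                Thread.sleep(excess * 1000 / bytesPerSecond);
            }
            if (elapsed > TimeUnit.SECONDS.toNanos(10)) {
                // Start a fresh accounting window so old history doesn't dominate.
                windowStart = System.nanoTime();
                bytesInWindow = 0;
            }
        }
    }

The loader would construct something like new ImportThrottle(5 * 1024 * 1024) and call throttle.acquire(approxBatchBytes) before each write; the actual budget would have to be tuned against what the slowest node can flush and compact.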