When you can't get the number of threads over JMX, that usually means you have way too many running (8,000+).
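You can also read the thread count straight from the operating system, which avoids JMX entirely. Below is a minimal sketch. It assumes a Linux host with procfs mounted at /proc. `count_threads` is a hypothetical helper name and is not part of any Cassandra tooling.

```shell
#!/bin/sh
# Minimal sketch (Linux only): count the threads of one process
# without going through JMX. Assumes procfs is mounted at /proc.

# count_threads PID -> prints the number of threads (LWPs) in PID.
# Each thread of a process appears as a directory under /proc/PID/task.
count_threads() {
  pid="$1"
  if [ ! -d "/proc/$pid/task" ]; then
    echo "no such process: $pid" >&2
    return 1
  fi
  # tr strips any padding some wc implementations add.
  ls "/proc/$pid/task" | wc -l | tr -d ' '
}
```

For example, `count_threads "$(pgrep -f CassandraDaemon)"` prints the count for the Cassandra JVM. This assumes exactly one process has `CassandraDaemon` on its command line, which is an assumption about how the daemon appears in the process list. On procps-based systems, `ps -o nlwp= -p PID` reports the same number.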
Try running `ps -eLf | grep cassandra`. How many threads?

-Chris

On Jul 29, 2010, at 8:40 PM, Dathan Pattishall wrote:

> To follow up on this thread: I blew away the data for my entire cluster,
> waited a few days of user activity, and within 3 days the server was
> hanging requests in the same way.
>
> Background info:
> Around 60 million requests per day
> 70% reads
> 30% writes
> An F5 load balancer (BIGIP-LTM) in a round-robin config
>
> iostat info:
> 3 MB a second of writing data @ 13% iowait
>
> vmstat info:
> Still shows a lot of blocking procs at low CPU utilization.
>
> Data size:
> 6 GB of data per node, and there are 4 nodes.
>
> cass01: Pool Name                   Active   Pending   Completed
> cass01: FILEUTILS-DELETE-POOL            0         0          27
> cass01: STREAM-STAGE                     0         0           8
> cass01: RESPONSE-STAGE                   0         0    66439845
> cass01: ROW-READ-STAGE                   8      4098    77243463
> cass01: LB-OPERATIONS                    0         0           0
> cass01: MESSAGE-DESERIALIZER-POOL        1  14223148   139627123
> cass01: GMFD                             0         0      772032
> cass01: LB-TARGET                        0         0           0
> cass01: CONSISTENCY-MANAGER              0         0    35518593
> cass01: ROW-MUTATION-STAGE               0         0    19809347
> cass01: MESSAGE-STREAMING-POOL           0         0          24
> cass01: LOAD-BALANCER-STAGE              0         0           0
> cass01: FLUSH-SORTER-POOL                0         0           0
> cass01: MEMTABLE-POST-FLUSHER            0         0          74
> cass01: FLUSH-WRITER-POOL                0         0          74
> cass01: AE-SERVICE-STAGE                 0         0           0
> cass01: HINTED-HANDOFF-POOL              0         0           9
>
> Keyspace: TimeFrameClicks
>     Read Count: 42686
>     Read Latency: 47.21777100220213 ms.
>     Write Count: 18398
>     Write Latency: 0.17457457332318732 ms.
>     Pending Tasks: 0
>         Column Family: Standard2
>         SSTable count: 9
>         Space used (live): 6561033040
>         Space used (total): 6561033040
>         Memtable Columns Count: 6711
>         Memtable Data Size: 241596
>         Memtable Switch Count: 1
>         Read Count: 42552
>         Read Latency: 41.851 ms.
>         Write Count: 18398
>         Write Latency: 0.031 ms.
>         Pending Tasks: 0
>         Key cache capacity: 200000
>         Key cache size: 81499
>         Key cache hit rate: 0.2495154675604193
>         Row cache: disabled
>         Compacted row minimum size: 0
>         Compacted row maximum size: 0
>         Compacted row mean size: 0
>
> Attached is the jconsole memory use.
> I would attach the thread use, but I could not get any info from JMX on the
> threads. Clicking "Detect Deadlock" just hangs, and I never see the
> expected "No deadlock detected."
>
> Based on feedback from this list by jbellis, I'm hitting Cassandra too hard.
> So I removed the offending server from the LB. I waited about 20 minutes,
> and the pending queue did not clear at all.
>
> After killing Cassandra and restarting it, this box recovered.
>
> So from my point of view, I think there is a bug in Cassandra. Do you agree?
> Possibly a deadlock in the SEDA implementation of the ROW-READ-STAGE?
>
> On Tue, Jul 27, 2010 at 12:28 AM, Peter Schuller
> <peter.schul...@infidyne.com> wrote:
> > average queue size column too. But given the vmstat output I doubt
> > this is the case since you should either be seeing a lot more wait
> > time or a lot less idle time.
>
> Hmm, another thing: you mention 16 i7 cores. I presume that's 16 in
> total, counting hyper-threading? Because that means 8 threads should
> be able to saturate 50% (as perceived by the operating system). If you
> have 32 (can you get this yet anyway?) virtual cores then I'd say that
> your vmstat output could be consistent with ROW-READ-STAGE being CPU
> bound rather than disk bound (presumably with data fitting in cache
> and not having to go down to disk). If this is the case, increasing
> read concurrency should at least make the actual problem more obvious
> (i.e., achieving CPU saturation), though it probably won't increase
> throughput much unless Cassandra is very friendly to
> hyper-threading...
>
> --
> / Peter Schuller
>
> <memory_use.PNG>
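Peter asks whether the 16 i7 cores include hyper-threading. You can check this on the box itself. Below is a minimal sketch. It assumes Linux with an x86-style /proc/cpuinfo that exposes `physical id` and `core id` fields; on other platforms `physical_cores` may print 0. Both function names are hypothetical.

```shell
#!/bin/sh
# Minimal sketch (Linux only): compare logical CPUs (what the OS
# schedules on, hyper-threads included) with physical cores.

# Logical CPUs: one "processor" line per schedulable CPU.
logical_cpus() {
  grep -c '^processor' /proc/cpuinfo
}

# Physical cores: count unique (physical id, core id) pairs.
# Assumes "physical id" precedes "core id" in each processor block,
# as on x86; prints 0 if those fields are absent.
physical_cores() {
  awk -F: '/^physical id/ {p = $2} /^core id/ {print p ":" $2}' /proc/cpuinfo \
    | sort -u | wc -l | tr -d ' '
}
```

If `logical_cpus` prints twice the value of `physical_cores`, hyper-threading is on. By Peter's arithmetic, 8 busy ROW-READ-STAGE threads on 16 logical CPUs would then show up as 8/16 = 50% utilization. The "Thread(s) per core" line in `lscpu` output gives the same answer.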