Re: Cassandra Nodes Freeze/Down for ConcurrentMarkSweep GC?
> [4] Is GC ConcurrentMarkSweep a Stop-The-World situation? Where the > JVM cannot do anything else? Hence then node is technically Down? > Correct? No; the concurrent mark/sweep phase runs concurrently with your application. CMS will cause a stop-the-world full pause it it fails to complete a CMS sweep in time and you hit the maximum heap size, but unless that happens, CMS will run concurrently (though there are stop-the-world pauses involved, that are typically very short, the mark/sweep phase is concurrent). As jbellis pointed out, you're almost certainly swapping. -- / Peter Schuller
Re: Cassandra Nodes Freeze/Down for ConcurrentMarkSweep GC?
http://riptano.blip.tv/file/4012133/ On Sun, Aug 22, 2010 at 12:11 PM, Moleza Moleza wrote: > Hi, > I am setting up a cluster on a linux box. > Everything seems to be working great and I am watching the ring with: > watch -d -n 2 nodetool -h localhost ring > Suddenly, I see that one of the nodes just went down (at 14:07): > Status changed from Up to Down. > 13 minutes later (without any intervention) the node comes back Up (by > itself). > I check the logs (see at end of text) on that node and see that there > is nothing in the log from 14:07 until 14:20 (13 minutes later). > I also notice the GC ConcurrentMarkSweep took 13 minutes. > Here are my questions: > [1] Is this behavior normal? > [2] Has it been observed by someone else before? > [3] The node being down means that nodetool, and any other client, > wont be able to connect to it (clients should use other nodes in > cluster to write data). Correct? > [4] Is GC ConcurrentMarkSweep a Stop-The-World situation? Where the > JVM cannot do anything else? Hence then node is technically Down? > Correct? > [5] Why is this GC taking such a long time? (see JMV ARGS posted bellow). > [6] Any JMV Args (switches) I can use to prevent this? > -- > JVM_OPTS=" \ > -Dprog=Cassandra \ > -ea \ > -Xms12G \ > -Xmx12G \ > -XX:+UseParNewGC \ > -XX:+UseConcMarkSweepGC \ > -XX:+CMSParallelRemarkEnabled \ > -XX:SurvivorRatio=8 \ > -XX:MaxTenuringThreshold=1 \ > -XX:+HeapDumpOnOutOfMemoryError \ > -Dcom.sun.management.jmxremote.port=8080 \ > -Dcom.sun.management.jmxremote.ssl=false \ > -Dcom.sun.management.jmxremote.authenticate=false" > > > Log Extract ## > INFO [GC inspection] 2010-08-22 14:06:48,622 GCInspector.java (line > 116) GC for ParNew: 235 ms, 134504976 reclaimed leaving 12721498296 > used; max is 13005881344 > INFO [FLUSH-TIMER] 2010-08-22 14:19:45,429 ColumnFamilyStore.java > (line 357)HintsColumnFamily has reached its threshold; switching in a > fresh Memtable at > CommitLogContext(file='/var/nes/data1/cassandra_commitlog/CommitLog-1282500306160.log', > position=55517352) > INFO [FLUSH-TIMER] 2010-08-22 14:19:45,429 ColumnFamilyStore.java > (line 609) Enqueuing flush of > memtable-hintscolumnfam...@1935604258(3147 bytes, 433 operations) > INFO [FLUSH-WRITER-POOL:1] 2010-08-22 14:19:45,430 Memtable.java (line > 148) Writing memtable-hintscolumnfam...@1935604258(3147 bytes, 433 > operations) > INFO [GC inspection] 2010-08-22 14:19:45,917 GCInspector.java (line > 116) GC for ParNew: 215 ms, 130254256 reclaimed leaving 12742982208 > used; max is 13005881344 > INFO [GC inspection] 2010-08-22 14:19:45,973 GCInspector.java (line > 116)GC for ConcurrentMarkSweep: 775679 ms, 12685881488 reclaimed > leaving 196692400 used; max is 13005881344 > -- >
Re: Cassandra Nodes Freeze/Down for ConcurrentMarkSweep GC?
GCs never take that long unless you're swapping. On Sun, Aug 22, 2010 at 2:11 PM, Moleza Moleza wrote: > Hi, > I am setting up a cluster on a linux box. > Everything seems to be working great and I am watching the ring with: > watch -d -n 2 nodetool -h localhost ring > Suddenly, I see that one of the nodes just went down (at 14:07): > Status changed from Up to Down. > 13 minutes later (without any intervention) the node comes back Up (by > itself). > I check the logs (see at end of text) on that node and see that there > is nothing in the log from 14:07 until 14:20 (13 minutes later). > I also notice the GC ConcurrentMarkSweep took 13 minutes. > Here are my questions: > [1] Is this behavior normal? > [2] Has it been observed by someone else before? > [3] The node being down means that nodetool, and any other client, > wont be able to connect to it (clients should use other nodes in > cluster to write data). Correct? > [4] Is GC ConcurrentMarkSweep a Stop-The-World situation? Where the > JVM cannot do anything else? Hence then node is technically Down? > Correct? > [5] Why is this GC taking such a long time? (see JMV ARGS posted bellow). > [6] Any JMV Args (switches) I can use to prevent this? > -- > JVM_OPTS=" \ > -Dprog=Cassandra \ > -ea \ > -Xms12G \ > -Xmx12G \ > -XX:+UseParNewGC \ > -XX:+UseConcMarkSweepGC \ > -XX:+CMSParallelRemarkEnabled \ > -XX:SurvivorRatio=8 \ > -XX:MaxTenuringThreshold=1 \ > -XX:+HeapDumpOnOutOfMemoryError \ > -Dcom.sun.management.jmxremote.port=8080 \ > -Dcom.sun.management.jmxremote.ssl=false \ > -Dcom.sun.management.jmxremote.authenticate=false" > > > Log Extract ## > INFO [GC inspection] 2010-08-22 14:06:48,622 GCInspector.java (line > 116) GC for ParNew: 235 ms, 134504976 reclaimed leaving 12721498296 > used; max is 13005881344 > INFO [FLUSH-TIMER] 2010-08-22 14:19:45,429 ColumnFamilyStore.java > (line 357)HintsColumnFamily has reached its threshold; switching in a > fresh Memtable at > CommitLogContext(file='/var/nes/data1/cassandra_commitlog/CommitLog-1282500306160.log', > position=55517352) > INFO [FLUSH-TIMER] 2010-08-22 14:19:45,429 ColumnFamilyStore.java > (line 609) Enqueuing flush of > memtable-hintscolumnfam...@1935604258(3147 bytes, 433 operations) > INFO [FLUSH-WRITER-POOL:1] 2010-08-22 14:19:45,430 Memtable.java (line > 148) Writing memtable-hintscolumnfam...@1935604258(3147 bytes, 433 > operations) > INFO [GC inspection] 2010-08-22 14:19:45,917 GCInspector.java (line > 116) GC for ParNew: 215 ms, 130254256 reclaimed leaving 12742982208 > used; max is 13005881344 > INFO [GC inspection] 2010-08-22 14:19:45,973 GCInspector.java (line > 116)GC for ConcurrentMarkSweep: 775679 ms, 12685881488 reclaimed > leaving 196692400 used; max is 13005881344 > -- > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Cassandra Nodes Freeze/Down for ConcurrentMarkSweep GC?
Hi, I am setting up a cluster on a linux box. Everything seems to be working great and I am watching the ring with: watch -d -n 2 nodetool -h localhost ring Suddenly, I see that one of the nodes just went down (at 14:07): Status changed from Up to Down. 13 minutes later (without any intervention) the node comes back Up (by itself). I check the logs (see at end of text) on that node and see that there is nothing in the log from 14:07 until 14:20 (13 minutes later). I also notice the GC ConcurrentMarkSweep took 13 minutes. Here are my questions: [1] Is this behavior normal? [2] Has it been observed by someone else before? [3] The node being down means that nodetool, and any other client, wont be able to connect to it (clients should use other nodes in cluster to write data). Correct? [4] Is GC ConcurrentMarkSweep a Stop-The-World situation? Where the JVM cannot do anything else? Hence then node is technically Down? Correct? [5] Why is this GC taking such a long time? (see JMV ARGS posted bellow). [6] Any JMV Args (switches) I can use to prevent this? -- JVM_OPTS=" \ -Dprog=Cassandra \ -ea \ -Xms12G \ -Xmx12G \ -XX:+UseParNewGC \ -XX:+UseConcMarkSweepGC \ -XX:+CMSParallelRemarkEnabled \ -XX:SurvivorRatio=8 \ -XX:MaxTenuringThreshold=1 \ -XX:+HeapDumpOnOutOfMemoryError \ -Dcom.sun.management.jmxremote.port=8080 \ -Dcom.sun.management.jmxremote.ssl=false \ -Dcom.sun.management.jmxremote.authenticate=false" Log Extract ## INFO [GC inspection] 2010-08-22 14:06:48,622 GCInspector.java (line 116) GC for ParNew: 235 ms, 134504976 reclaimed leaving 12721498296 used; max is 13005881344 INFO [FLUSH-TIMER] 2010-08-22 14:19:45,429 ColumnFamilyStore.java (line 357)HintsColumnFamily has reached its threshold; switching in a fresh Memtable at CommitLogContext(file='/var/nes/data1/cassandra_commitlog/CommitLog-1282500306160.log', position=55517352) INFO [FLUSH-TIMER] 2010-08-22 14:19:45,429 ColumnFamilyStore.java (line 609) Enqueuing flush of memtable-hintscolumnfam...@1935604258(3147 bytes, 433 operations) INFO [FLUSH-WRITER-POOL:1] 2010-08-22 14:19:45,430 Memtable.java (line 148) Writing memtable-hintscolumnfam...@1935604258(3147 bytes, 433 operations) INFO [GC inspection] 2010-08-22 14:19:45,917 GCInspector.java (line 116) GC for ParNew: 215 ms, 130254256 reclaimed leaving 12742982208 used; max is 13005881344 INFO [GC inspection] 2010-08-22 14:19:45,973 GCInspector.java (line 116)GC for ConcurrentMarkSweep: 775679 ms, 12685881488 reclaimed leaving 196692400 used; max is 13005881344 --