Re: Node Stuck while restarting
We took a backup of the commit logs and restarted the node, and it started fine. As the node was down for more than a day, we can say for sure that it was stuck and not processing. Wondering how we can tune our settings to avoid a similar scenario in the future, ideally without resorting to a hacky measure.

On Sun, May 29, 2016 at 7:12 PM, Bhuvan Rawal wrote:
Re: Node Stuck while restarting
Hi Mike,

PFA the details you asked for, and some others in case they help. We are using the JVM params:

-Xms8G
-Xmx8G

MAX_HEAP_SIZE and HEAP_NEWSIZE are not being set explicitly, so they are presumably calculated by the calculate_heap_sizes function (i.e. we are using the default calculations). Here are the results, please correct me if I'm wrong:

system_memory_in_mb : 64544
system_cpu_cores : 16

For MAX_HEAP_SIZE:

# set max heap size based on the following
# max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB))
# calculate 1/2 ram and cap to 1024MB
# calculate 1/4 ram and cap to 8192MB
# pick the max

From this I figure that MAX_HEAP_SIZE is 8GB: 1/2 ram is capped to 1024MB, 1/4 ram is capped to 8192MB, and the max of the two is 8192MB.

For HEAP_NEWSIZE:

max_sensible_yg_per_core_in_mb="100"
max_sensible_yg_in_mb=`expr $max_sensible_yg_per_core_in_mb "*" $system_cpu_cores`   # 100 * 16 = 1600 MB
desired_yg_in_mb=`expr $max_heap_size_in_mb / 4`                                     # 8192 / 4 = 2048 MB

if [ "$desired_yg_in_mb" -gt "$max_sensible_yg_in_mb" ]
then
    HEAP_NEWSIZE="${max_sensible_yg_in_mb}M"
else
    HEAP_NEWSIZE="${desired_yg_in_mb}M"
fi

Since 2048 > 1600, that should set HEAP_NEWSIZE to 1600MB via the first branch.

memtable_allocation_type: heap_buffers

memtable_cleanup_threshold - we are using the default:
# memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1)
# memtable_cleanup_threshold: 0.11

memtable_flush_writers - default (2). We can increase this, as we are using SSDs with around 300 IOPS.

memtable_heap_space_in_mb - default values:
# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048

We are using the G1 garbage collector and jdk1.8.0_45.

Best Regards,

On Sun, May 29, 2016 at 5:07 PM, Mike Yeap wrote:
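For reference, the default sizing calculation walked through above can be reproduced as a self-contained sketch. The values are hard-coded to this node's 64544 MB of RAM and 16 cores; the real cassandra-env.sh reads them from the system, and this is a paraphrase of its logic rather than the script itself:

```shell
# Sketch of the default heap sizing for a 64544 MB / 16-core node.
# cassandra-env.sh detects these values at startup; here they are hard-coded.
system_memory_in_mb=64544
system_cpu_cores=16

# MAX_HEAP_SIZE = max(min(1/2 ram, 1024MB), min(1/4 ram, 8192MB))
half_system_memory_in_mb=`expr $system_memory_in_mb / 2`
quarter_system_memory_in_mb=`expr $half_system_memory_in_mb / 2`
if [ "$half_system_memory_in_mb" -gt "1024" ]; then
    half_system_memory_in_mb="1024"
fi
if [ "$quarter_system_memory_in_mb" -gt "8192" ]; then
    quarter_system_memory_in_mb="8192"
fi
if [ "$half_system_memory_in_mb" -gt "$quarter_system_memory_in_mb" ]; then
    max_heap_size_in_mb="$half_system_memory_in_mb"
else
    max_heap_size_in_mb="$quarter_system_memory_in_mb"
fi
MAX_HEAP_SIZE="${max_heap_size_in_mb}M"

# HEAP_NEWSIZE = min(100MB per core, 1/4 of the heap)
max_sensible_yg_in_mb=`expr 100 \* $system_cpu_cores`
desired_yg_in_mb=`expr $max_heap_size_in_mb / 4`
if [ "$desired_yg_in_mb" -gt "$max_sensible_yg_in_mb" ]; then
    HEAP_NEWSIZE="${max_sensible_yg_in_mb}M"
else
    HEAP_NEWSIZE="${desired_yg_in_mb}M"
fi

echo "MAX_HEAP_SIZE=$MAX_HEAP_SIZE HEAP_NEWSIZE=$HEAP_NEWSIZE"
```

Running it prints MAX_HEAP_SIZE=8192M HEAP_NEWSIZE=1600M, matching the numbers worked out by hand above.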
Re: Node Stuck while restarting
Hi Bhuvan, how big are your current commit logs on the failed node, and what are the sizes of MAX_HEAP_SIZE and HEAP_NEWSIZE?

Also, the values of the following properties in cassandra.yaml?

memtable_allocation_type
memtable_cleanup_threshold
memtable_flush_writers
memtable_heap_space_in_mb
memtable_offheap_space_in_mb

Regards,
Mike Yeap

On Sun, May 29, 2016 at 6:18 PM, Bhuvan Rawal wrote:
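As a sketch, the information asked for above can be gathered on the affected node with a few commands. The default locations below are assumptions (COMMITLOG_DIR matches the path seen in the "Replaying" line of system.log; CASSANDRA_YAML assumes a package install) and can be overridden via the environment:

```shell
# Sketch: gather commit log size and memtable settings. The default paths
# here are assumptions; override them via the environment for your install.
COMMITLOG_DIR="${COMMITLOG_DIR:-/commitlog/data}"
CASSANDRA_YAML="${CASSANDRA_YAML:-/etc/cassandra/cassandra.yaml}"

# Total size of commit log segments pending replay
if [ -d "$COMMITLOG_DIR" ]; then
    du -sh "$COMMITLOG_DIR"
else
    echo "commit log dir not found: $COMMITLOG_DIR"
fi

# Memtable-related settings (commented-out lines show the defaults)
if [ -f "$CASSANDRA_YAML" ]; then
    grep -E '^#? ?memtable_' "$CASSANDRA_YAML"
else
    echo "cassandra.yaml not found: $CASSANDRA_YAML"
fi
```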
Node Stuck while restarting
Hi,

We are running a 6 node cluster across 2 DCs on DSC 3.0.3, with 3 nodes in each. One of the nodes was showing UNREACHABLE for the others in nodetool describecluster, and on that node all the others were showing UNREACHABLE, so as a measure we restarted the node.

But on doing that it appears stuck, with these messages in system.log:

DEBUG [SlabPoolCleaner] 2016-05-29 14:07:28,156 ColumnFamilyStore.java:829 - Enqueuing flush of batches: 226784704 (11%) on-heap, 0 (0%) off-heap
DEBUG [main] 2016-05-29 14:07:28,576 CommitLogReplayer.java:415 - Replaying /commitlog/data/CommitLog-6-1464508993391.log (CL version 6, messaging version 10, compression null)
DEBUG [main] 2016-05-29 14:07:28,781 ColumnFamilyStore.java:829 - Enqueuing flush of batches: 207333510 (10%) on-heap, 0 (0%) off-heap

It is stuck in the MemtablePostFlush / MemtableFlushWriter stages with pending messages. This has been their status per nodetool tpstats for a long time:

MemtablePostFlush     Active - 1    Pending - 52    Completed - 16
MemtableFlushWriter   Active - 2    Pending - 13    Completed - 15

We restarted the node after setting the log level to TRACE, but in vain. What could be a possible contingency plan in such a scenario?

Best Regards,
Bhuvan
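One low-risk way to tell a genuinely stuck replay from a merely slow one is to sample the flush stages periodically and see whether the completed counts move at all. A sketch (the interval and sample count are arbitrary choices, not anything Cassandra prescribes):

```shell
# Sketch: sample the flush stages a few times so "stuck" (counts frozen)
# can be told apart from "slow" (completed counts creeping up).
SAMPLES=10
INTERVAL=30

if command -v nodetool >/dev/null 2>&1; then
    i=1
    while [ "$i" -le "$SAMPLES" ]; do
        date
        nodetool tpstats | grep -E 'MemtablePostFlush|MemtableFlushWriter'
        sleep "$INTERVAL"
        i=`expr $i + 1`
    done
else
    echo "nodetool not on PATH; run this on the Cassandra node"
fi
```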