Re: Node Stuck while restarting

2016-05-29 Thread Bhuvan Rawal
Hi Mike,

Please find the details you asked for, plus a few others that may help.
We are using the JVM params:
-Xms8G
-Xmx8G

MAX_HEAP_SIZE and HEAP_NEWSIZE are not set explicitly, so they are presumably
being calculated by the calculate_heap_sizes function (i.e. we are on the
default calculations). Here is how the numbers work out; please correct me if
I'm wrong:
system_memory_in_mb : 64544
system_cpu_cores : 16

for MAX_HEAP_SIZE:

# set max heap size based on the following
# max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB))
# calculate 1/2 ram and cap to 1024MB
# calculate 1/4 ram and cap to 8192MB
# pick the max

From this I figure MAX_HEAP_SIZE comes out to 8GB: min(1/2 RAM, 1024MB) =
1024MB, min(1/4 RAM, 8192MB) = 8192MB, and picking the max of the two gives
8192MB.
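
As a sanity check, here is a minimal sketch of that arithmetic with our
numbers plugged in (it mirrors the cassandra-env.sh logic but is not copied
verbatim from it):

system_memory_in_mb=64544
half_system_memory_in_mb=`expr $system_memory_in_mb / 2`            # 32272
quarter_system_memory_in_mb=`expr $half_system_memory_in_mb / 2`    # 16136
# cap 1/2 RAM at 1024MB and 1/4 RAM at 8192MB
if [ "$half_system_memory_in_mb" -gt "1024" ]
then
    half_system_memory_in_mb="1024"
fi
if [ "$quarter_system_memory_in_mb" -gt "8192" ]
then
    quarter_system_memory_in_mb="8192"
fi
# pick the max of the two capped values
if [ "$half_system_memory_in_mb" -gt "$quarter_system_memory_in_mb" ]
then
    max_heap_size_in_mb="$half_system_memory_in_mb"
else
    max_heap_size_in_mb="$quarter_system_memory_in_mb"
fi
echo "MAX_HEAP_SIZE=${max_heap_size_in_mb}M"    # prints MAX_HEAP_SIZE=8192M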

for HEAP_NEWSIZE:

max_sensible_yg_per_core_in_mb="100"
max_sensible_yg_in_mb=`expr $max_sensible_yg_per_core_in_mb "*" $system_cpu_cores`    # 100 * 16 = 1600 MB
desired_yg_in_mb=`expr $max_heap_size_in_mb / 4`                                      # 8192 / 4 = 2048 MB (2GB)

if [ "$desired_yg_in_mb" -gt "$max_sensible_yg_in_mb" ]
then
HEAP_NEWSIZE="${max_sensible_yg_in_mb}M"
else
HEAP_NEWSIZE="${desired_yg_in_mb}M"
fi

Since desired_yg_in_mb (2048MB) is greater than max_sensible_yg_in_mb
(1600MB), the first branch is taken and HEAP_NEWSIZE should come out to 1600M.
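
Or, as a quick one-liner check of the branch taken (assuming the two values
computed above):

desired_yg_in_mb=2048
max_sensible_yg_in_mb=1600
# 2048 is greater than 1600, so the first branch fires
[ "$desired_yg_in_mb" -gt "$max_sensible_yg_in_mb" ] && echo "HEAP_NEWSIZE=${max_sensible_yg_in_mb}M"    # HEAP_NEWSIZE=1600M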


memtable_allocation_type: heap_buffers

memtable_cleanup_threshold - we are using the default:
# memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1)
# memtable_cleanup_threshold: 0.11

memtable_flush_writers - default (2)
We could increase this since we are on SSDs with around 300 IOPS.
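
Spelling out the formula from the yaml comment above (just arithmetic, using
the 2 flush writers noted here; the commented-out 0.11 would correspond to 8
flush writers, since 1/9 is roughly 0.11):

memtable_flush_writers=2
awk -v w="$memtable_flush_writers" \
    'BEGIN { printf "memtable_cleanup_threshold defaults to %.2f\n", 1 / (w + 1) }'
# prints: memtable_cleanup_threshold defaults to 0.33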

memtable_heap_space_in_mb - default values
# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048

We are using the G1 garbage collector with jdk1.8.0_45.
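
In case it helps, this is how we can double-check what the running JVM has
actually picked up (a rough sketch; it assumes jcmd from the JDK is on the
PATH, that it is run as the same user as the Cassandra process, and that the
process matches "CassandraDaemon"):

pid=$(pgrep -f CassandraDaemon)
# dump the effective JVM flags and keep the GC / heap related ones
jcmd "$pid" VM.flags | tr ' ' '\n' | egrep 'UseG1GC|HeapSize|NewSize'
java -version    # should report 1.8.0_45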

Best Regards,


On Sun, May 29, 2016 at 5:07 PM, Mike Yeap  wrote:

> Hi Bhuvan, how big are your current commit logs on the failed node, and
> what are the values of MAX_HEAP_SIZE and HEAP_NEWSIZE?
>
> Also, what are the values of the following properties in cassandra.yaml?
>
> memtable_allocation_type
> memtable_cleanup_threshold
> memtable_flush_writers
> memtable_heap_space_in_mb
> memtable_offheap_space_in_mb
>
>
> Regards,
> Mike Yeap
>
>
>
> On Sun, May 29, 2016 at 6:18 PM, Bhuvan Rawal  wrote:
>
>> Hi,
>>
>> We are running a 6-node cluster across 2 DCs on DSC 3.0.3, with 3 nodes in
>> each DC. One of the nodes was showing as UNREACHABLE to the other nodes in
>> nodetool describecluster, and on that node all the others were showing as
>> UNREACHABLE, so as a measure we restarted it.
>>
>> But after doing that it appears to be stuck, with these messages in
>> system.log:
>>
>> DEBUG [SlabPoolCleaner] 2016-05-29 14:07:28,156
>> ColumnFamilyStore.java:829 - Enqueuing flush of batches: 226784704 (11%)
>> on-heap, 0 (0%) off-heap
>> DEBUG [main] 2016-05-29 14:07:28,576 CommitLogReplayer.java:415 -
>> Replaying /commitlog/data/CommitLog-6-1464508993391.log (CL version 6,
>> messaging version 10, compression null)
>> DEBUG [main] 2016-05-29 14:07:28,781 ColumnFamilyStore.java:829 -
>> Enqueuing flush of batches: 207333510 (10%) on-heap, 0 (0%) off-heap
>>
>> It appears to be stuck in the MemtablePostFlush / MemtableFlushWriter
>> stages with pending tasks. This has been their status as per nodetool
>> tpstats for a long time:
>>
>> MemtablePostFlush     Active - 1    Pending - 52    Completed - 16
>> MemtableFlushWriter   Active - 2    Pending - 13    Completed - 15
>>
>>
>> We restarted the node with the log level set to TRACE, but in vain. What
>> could be a possible contingency plan in such a scenario?
>>
>> Best Regards,
>> Bhuvan
>>
>>
>


Re: Node Stuck while restarting

2016-05-29 Thread Mike Yeap
Hi Bhuvan, how big are your current commit logs on the failed node, and
what are the values of MAX_HEAP_SIZE and HEAP_NEWSIZE?

Also, what are the values of the following properties in cassandra.yaml?

memtable_allocation_type
memtable_cleanup_threshold
memtable_flush_writers
memtable_heap_space_in_mb
memtable_offheap_space_in_mb
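
For what it's worth, here is one way to pull those numbers together on the
node (a sketch only: the commitlog path is taken from the log line in your
mail, and the cassandra.yaml path assumes a package install, so adjust both
as needed):

# total size of the commit log segments on the failed node
du -sh /commitlog/data

# heap flags of the running process
ps -ef | grep '[C]assandraDaemon' | tr ' ' '\n' | grep '^\-Xm'

# memtable-related settings (assumed package path for cassandra.yaml)
grep -E '^#? ?memtable_' /etc/cassandra/cassandra.yaml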


Regards,
Mike Yeap



On Sun, May 29, 2016 at 6:18 PM, Bhuvan Rawal  wrote:

> Hi,
>
> We are running a 6-node cluster across 2 DCs on DSC 3.0.3, with 3 nodes in
> each DC. One of the nodes was showing as UNREACHABLE to the other nodes in
> nodetool describecluster, and on that node all the others were showing as
> UNREACHABLE, so as a measure we restarted it.
>
> But after doing that it appears to be stuck, with these messages in
> system.log:
>
> DEBUG [SlabPoolCleaner] 2016-05-29 14:07:28,156 ColumnFamilyStore.java:829
> - Enqueuing flush of batches: 226784704 (11%) on-heap, 0 (0%) off-heap
> DEBUG [main] 2016-05-29 14:07:28,576 CommitLogReplayer.java:415 -
> Replaying /commitlog/data/CommitLog-6-1464508993391.log (CL version 6,
> messaging version 10, compression null)
> DEBUG [main] 2016-05-29 14:07:28,781 ColumnFamilyStore.java:829 -
> Enqueuing flush of batches: 207333510 (10%) on-heap, 0 (0%) off-heap
>
> It appears to be stuck in the MemtablePostFlush / MemtableFlushWriter
> stages with pending tasks. This has been their status as per nodetool
> tpstats for a long time:
>
> MemtablePostFlush     Active - 1    Pending - 52    Completed - 16
> MemtableFlushWriter   Active - 2    Pending - 13    Completed - 15
>
>
> We restarted the node with the log level set to TRACE, but in vain. What
> could be a possible contingency plan in such a scenario?
>
> Best Regards,
> Bhuvan
>
>


Node Stuck while restarting

2016-05-29 Thread Bhuvan Rawal
Hi,

We are running a 6-node cluster across 2 DCs on DSC 3.0.3, with 3 nodes in
each DC. One of the nodes was showing as UNREACHABLE to the other nodes in
nodetool describecluster, and on that node all the others were showing as
UNREACHABLE, so as a measure we restarted it.

But after doing that it appears to be stuck, with these messages in system.log:

DEBUG [SlabPoolCleaner] 2016-05-29 14:07:28,156 ColumnFamilyStore.java:829
- Enqueuing flush of batches: 226784704 (11%) on-heap, 0 (0%) off-heap
DEBUG [main] 2016-05-29 14:07:28,576 CommitLogReplayer.java:415 - Replaying
/commitlog/data/CommitLog-6-1464508993391.log (CL version 6, messaging
version 10, compression null)
DEBUG [main] 2016-05-29 14:07:28,781 ColumnFamilyStore.java:829 - Enqueuing
flush of batches: 207333510 (10%) on-heap, 0 (0%) off-heap

It appears to be stuck in the MemtablePostFlush / MemtableFlushWriter stages
with pending tasks. This has been their status as per nodetool tpstats for a
long time:

MemtablePostFlush     Active - 1    Pending - 52    Completed - 16
MemtableFlushWriter   Active - 2    Pending - 13    Completed - 15
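
For reference, a simple way to keep an eye on those two stages while the
node is replaying (assuming nodetool is on the PATH on that node):

# refresh the flush-related thread pool stats every 10 seconds
watch -n 10 'nodetool tpstats | egrep "MemtablePostFlush|MemtableFlushWriter"'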


We restarted the node with the log level set to TRACE, but in vain. What
could be a possible contingency plan in such a scenario?

Best Regards,
Bhuvan