Hi Bowen,
There is only one major table taking 90% of the writes. I will try increasing the 
"commitlog_segment_size_in_mb" value to 1 GB and setting "max_mutation_size_in_kb" 
to 16 MB. Currently we don't set "commitlog_total_space_in_mb", so it should be 
using the default of 8192 MB (8 GB). What value do you suggest for this parameter?
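For clarity, here is roughly what I plan to put in cassandra.yaml (a sketch; the 
commented-out line just shows the current default, not a value I am proposing):

    commitlog_segment_size_in_mb: 1024   # 1 GB, up from our current 32
    max_mutation_size_in_kb: 16384       # 16 MB, set explicitly as you suggested
    # commitlog_total_space_in_mb: 8192  # currently unset, so the 8 GB default applies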
By the way, we don't back up the commit log.

Since our Reaper repair is not working effectively as of now, we still need to rely 
on hints to replay writes across nodes and DCs. Do you have any suggestion for how 
old hints could be safely removed without impacting data consistency? I think the 
answer may depend on many factors, but I was wondering if there is any kind of rule 
of thumb?
Thanks,
Jiayong Sun

On Sunday, August 15, 2021, 05:58:11 AM PDT, Bowen Song <bo...@bso.ng> wrote:
 
  
Hi Jiayong,
 

 
 
Based on this statement:
 
 
> We see the commit logs switch about 10 times per minute
 
I'd also like to know, roughly speaking, how many tables in Cassandra are being 
written to frequently. I'm asking because the commit log segments are being 
created (and recycled) so frequently (roughly every 6 seconds) that I suspect a 
lot of tables are involved in each commit log segment, which leads to many 
SSTable flushes.
 
You could try to drain a node, remove all commit logs from it, increase the 
"commitlog_segment_size_in_mb" value to something much larger (say, 1 GB), and 
also increase "commitlog_total_space_in_mb" accordingly on this node, and see if 
that helps improve the situation. Note that you may also want to manually set 
"max_mutation_size_in_kb" to 16 MB (the default value is half of the commit log 
segment size) to prevent unexpectedly large mutations from being accepted on 
this node and then failing on other nodes. Please also note that this may 
interfere with backup tools that back up the commit log segments.
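For reference, a rough sketch of that procedure on one node might look like the 
following (the service name and the commit log path are assumptions based on a 
typical package install; adjust them to your deployment):

    nodetool drain                   # flush memtables and stop accepting writes on this node
    sudo systemctl stop cassandra    # stop the node (assuming a systemd-managed service)
    sudo rm /var/lib/cassandra/commitlog/CommitLog-*.log   # remove old segments (assuming the default commitlog directory)
    # edit cassandra.yaml: commitlog_segment_size_in_mb, commitlog_total_space_in_mb, max_mutation_size_in_kb
    sudo systemctl start cassandra   # start the node again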
 
 
In addition to that, if you periodically purge the hints, you are probably 
better off just disabling hinted handoff and making sure you always run repair 
within gc_grace_seconds.
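If you do go that route, a minimal sketch of disabling hinted handoff would be 
something like this (double-check the exact option and subcommand names against 
your Cassandra version):

    nodetool disablehandoff    # stop storing new hints at runtime (does not persist across restarts)
    nodetool truncatehints     # optionally remove hints already accumulated on this node
    # to make it permanent, set the following in cassandra.yaml and restart:
    #     hinted_handoff_enabled: false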
 
 

 
 
Cheers,
 
Bowen
 
 On 14/08/2021 03:33, Jiayong Sun wrote:
  
 
 Hi Bowen, 
Thanks for digging so deep into the source code. Here are the answers to your 
questions:
   - Does your application change the table schema frequently? Specifically: 
alter table, drop table and drop keyspace. - No, neither admins nor apps 
frequently alter/drop/create table schemas at runtime.
   - Do you have the memtable_flush_period_in_ms table property set to non-zero 
on any table at all? - No, all tables use the default 
"memtable_flush_period_in_ms = 0".
   - Is the timing of frequent small SSTable flushes coincident with streaming 
activities? - The repair job is paused and we don't see streaming occurring in 
system.log.
   - What's your commitlog_segment_size_in_mb and commitlog_total_space_in_mb 
in cassandra.yaml and what's your free space size on the disk where commit log 
is located? - "commitlog_segment_size_in_mb: 32"; "commitlog_total_space_in_mb" 
is not set. The commit logs are on a separate disk and there is no chance it 
would fill up. The data disk is about 50% used.
   - How fast do you fill up a commit log segment? I.e.: how fast are you 
writing to Cassandra? - We see the commit logs switch about 10 times per 
minute, and lots of hints accumulate on disk and are replayed constantly. This 
could be due to many nodes being unresponsive because of this ongoing issue.
   - Anything invoking the "nodetool flush" or "nodetool drain" command? - No, 
we don't issue these commands unless a node is restarted manually or 
automatically through a monitoring mechanism, which does not happen frequently.
I wonder if this could be related to the huge amount of hint replay. The cluster 
is stressed by heavy writes through Spark jobs and there are some hot-spot 
partitions. Huge amounts of hints (e.g. >50 GB per node) are not uncommon in this 
cluster, especially since this issue started causing many nodes to lose gossip. 
We have had to set up a daily cron job to clear the older hints from disk, but 
we are not sure whether this hurts data consistency among nodes and DCs.
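For context, the cron job is essentially a one-liner along these lines (a 
sketch; the hints directory and the age threshold shown here are placeholders 
rather than our exact values):

    # remove hint files older than one day (assuming the default hints directory)
    find /var/lib/cassandra/hints -name '*.hints' -mtime +1 -delete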
  Thoughts? 
Thanks,
Jiayong Sun

On Friday, August 13, 2021, 03:39:44 PM PDT, Bowen Song <bo...@bso.ng> wrote:
  
     
Hi Jiayong,
 

 
 
I'm sorry to hear that. I did not know many nodes were/are experiencing the 
same issue. A bit of digging in the source code indicates the log below comes 
from the ColumnFamilyStore.logFlush() method.
 
 
 
DEBUG [NativePoolCleaner] <timestamp> ColumnFamilyStore.java:932 - Enqueuing 
flush of sstable_activity: 0.408KiB (0%) on-heap, 0.154KiB (0%) off-heap
 
 The ColumnFamilyStore.logFlush() method is a private method and the only place 
referencing it is the ColumnFamilyStore.switchMemtable() method in the same 
file, which is itself referenced in two places - ColumnFamilyStore.reload() 
and ColumnFamilyStore.switchMemtableIfCurrent(). Ignoring secondary indexes and 
MVs, the former is only called during node startup and on table schema changes, 
so it's an unlikely suspect (unless you are frequently changing the schema or 
restarting the node). The latter is referenced in two methods: 
ColumnFamilyStore.forceFlush() and ColumnFamilyStore.FlushLargestColumnFamily.run(). 
The FlushLargestColumnFamily.run() method is only called by the 
MemtableCleanerThread, and we have pretty much ruled that out in the previous 
conversations. The forceFlush() method is invoked if the table property 
memtable_flush_period_in_ms is set, when Cassandra is preparing to send/receive 
files via streaming, when an old commit log segment is recycled, on 
"nodetool drain", and, again, on schema changes (drop keyspace/table).
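As a quick way to rule out the first of those triggers, something like this 
should list the setting for every table (a sketch; system_schema.tables is the 
schema table in 3.x):

    cqlsh -e "SELECT keyspace_name, table_name, memtable_flush_period_in_ms FROM system_schema.tables;"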
   
  So, my questions would be:
    
   - Does your application change the table schema frequently? Specifically: 
alter table, drop table and drop keyspace.
   - Do you have the memtable_flush_period_in_ms table property set to non-zero 
on any table at all?
   - Is the timing of frequent small SSTable flushes coincident with streaming 
activities?
   - What's your commitlog_segment_size_in_mb and commitlog_total_space_in_mb 
in cassandra.yaml and what's your free space size on the disk where commit log 
is located?
   - How fast do you fill up a commit log segment? I.e.: how fast are you 
writing to Cassandra?
   - Anything invoking the "nodetool flush" or "nodetool drain" command?   
 
  I hope the above questions will help you find the root cause. 
  Cheers, Bowen
  
   On 13/08/2021 22:47, Jiayong Sun wrote:
  
 
      Hi Bowen, 
  There are many nodes having this issue, and some of them have it repeatedly. 
Replacing a node by wiping out everything and streaming in SSTables in good 
shape would work, but if we don't know the root cause the node would end up in 
bad shape again. Yes, we know the Reaper repair running for so long (weeks) is 
not good, which is most likely due to the multiple DCs with large rings. We are 
planning to upgrade to a newer version of Reaper to see if that helps. We do 
have debug.log turned on but didn't catch anything helpful other than the 
constant enqueuing/flushing/deleting of memtables and SSTables (I listed a few 
example messages at the beginning of this email thread). Thanks for all your 
thoughts; I really appreciate it.
  Thanks, Jiayong Sun   
      On Friday, August 13, 2021, 01:36:21 PM PDT, Bowen Song <bo...@bso.ng> 
wrote:  
  
     
Hi Jiayong,
 

 
 
That doesn't really match the situation described in the SO question. I 
suspected it was related to repairing a table with MV and large partitions, but 
based on the information you've given, I was clearly wrong.
 
Partitions of a few hundred MB are not exactly unusual, and I don't see how 
that alone could lead to frequent SSTable flushing. A repair session that takes 
weeks to complete is a bit worrying in terms of performance and 
maintainability, but again it should not cause this issue.

Since we don't know the cause, I can see two possible solutions - either 
replace the "broken" node, or dig into the logs (remember to turn on the debug 
logs) and try to identify the root cause. I personally would recommend 
replacing the problematic node as a quick win.
 
 

 
 
Cheers,
 
Bowen
 
  On 13/08/2021 20:31, Jiayong Sun wrote:
  
 
      Hi Bowen, 
  We do have a Reaper repair job scheduled periodically, and it can take days 
or even weeks to complete one round of repair due to the large number of 
rings/nodes. However, we have paused the repair since we started facing this 
issue. We do not use MVs in this cluster. There is a major table taking 95% of 
disk storage and workload, but its partition size is around 30 MB. There are a 
couple of small tables with a max partition size of over several hundred MB, 
but their total data size is just a few GB.
  Any thoughts? 
  Thanks, Jiayong 
      On Friday, August 13, 2021, 03:32:45 AM PDT, Bowen Song <bo...@bso.ng> 
wrote:  
  
     
Hi Jiayong,
 

 
 
Sorry I didn't make it clear in my previous email. When I commented on the 
RAID0 setup, it was only about RAID0 vs JBOD, and not in relation to the 
SSTable flushing issue. The part of my previous email after the "On the 
frequent SSTable flush issue" line is the part related to the SSTable flushing 
issue, and those two questions at the end of it remain valid:
 
    
   - Did you run repair?
   - Do you use materialized views?
 
and, if I may, I'd also like to add another question:
    
   - Do you have large (> 100 MB) partitions?
 Those are the 3 things mentioned in the SO question. I'm trying to find the 
connection between the issue you are experiencing and the issue described in 
the SO question.
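In case it helps with the last question, partition sizes can be checked per 
table with something like this (<keyspace> and <table> are placeholders):

    nodetool tablestats <keyspace>.<table>       # look for "Compacted partition maximum bytes"
    nodetool tablehistograms <keyspace> <table>  # percentile breakdown of partition sizes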
 

 
 
Cheers,
 
Bowen
 
 

 
  On 13/08/2021 01:36, Jiayong Sun wrote:
  
 
      Hello Bowen, 
  Thanks for your response. Yes, we are aware of the RAID0 vs. individual JBOD 
trade-off, but all of our clusters use this RAID0 configuration through Azure, 
and we see this issue only on this cluster, so it's hard to conclude that the 
root cause is the disk layout. This seems more workload-related, and we are 
seeking feedback here on any other parameters in the yaml that we could tune 
for this.
  Thanks again, Jiayong Sun 
      On Thursday, August 12, 2021, 04:55:51 AM PDT, Bowen Song <bo...@bso.ng> 
wrote:  
  
     
Hello Jiayong,
 

 
 
Using multiple disks in a RAID0 for the Cassandra data directory is not 
recommended. You will get better fault tolerance, and often better performance 
too, with multiple data directories, one on each disk.

If you stick with RAID0, it's not 4 disks but 1 from Cassandra's point of 
view, because any read or write operation has to touch all 4 member disks. 
Therefore, 4 flush writers doesn't make much sense.
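For reference, the JBOD layout would be expressed in cassandra.yaml roughly 
like this (a sketch; the mount points are placeholders):

    data_file_directories:
        - /mnt/disk1/cassandra/data
        - /mnt/disk2/cassandra/data
        - /mnt/disk3/cassandra/data
        - /mnt/disk4/cassandra/data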
 
On the frequent SSTable flush issue, a quick internet search leads me to:
 
 
* an old bug in Cassandra 2.1 - CASSANDRA-8409 - which shouldn't affect 3.x at all

* a StackOverflow question that may be related
 
 
Did you run repair? Do you use materialized views?
 
 

 
 
Regards,
 
Bowen
 
 

 
  On 11/08/2021 15:58, Jiayong Sun wrote:
  
 
      Hi Erick, 
  The nodes have 4 SSDs (1 TB each, but we only use 2.4 TB of space; current 
disk usage is about 50%) in RAID0. Based on the number of disks we increased 
memtable_flush_writers to 4 instead of the default of 2. 

  For the following we set:
   - max heap size - 31 GB
   - memtable_heap_space_in_mb - default
   - memtable_offheap_space_in_mb - default
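For completeness, here is roughly what that amounts to in cassandra.yaml terms 
(a sketch; the 1/4-of-heap figures come from the documented defaults, not from 
values we set explicitly):

    memtable_flush_writers: 4
    # memtable_heap_space_in_mb: left unset, so ~1/4 of the 31 GB heap (~7936 MB)
    # memtable_offheap_space_in_mb: left unset, so also ~1/4 of the heap (~7936 MB)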
  In the logs, we also noticed the system.sstable_activity table has hundreds 
of MB or GB of data and is constantly flushing:
  DEBUG [NativePoolCleaner] <timestamp> ColumnFamilyStore.java:932 - Enqueuing flush of sstable_activity: 0.293KiB (0%) on-heap, 0.107KiB (0%) off-heap
  DEBUG [NonPeriodicTasks:1] <timestamp> SSTable.java:105 - Deleting sstable: /app/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/md-103645-big
  DEBUG [NativePoolCleaner] <timestamp> ColumnFamilyStore.java:1322 - Flushing largest CFS(Keyspace='system', ColumnFamily='sstable_activity') to free up room. Used total: 0.06/1.00, live: 0.00/0.00, flushing: 0.02/0.29, this: 0.00/0.00
Thanks,
Jiayong Sun

On Wednesday, August 11, 2021, 12:06:27 AM PDT, Erick Ramirez <erick.rami...@datastax.com> wrote:
  
       4 flush writers isn't bad since the default is 2. It doesn't make a 
difference if you have fast disks (like NVMe SSDs) because only 1 thread gets 
used. 
  But if flushes are slow, the work gets distributed to 4 flush writers so you 
end up with smaller flush sizes although it's difficult to tell how tiny the 
SSTables would be without analysing the logs and overall performance of your 
cluster. 
  Was there a specific reason you decided to bump it up to 4? I'm just trying 
to get a sense of why you did it since it might provide some clues. Out of 
curiosity, what do you have set for the following?
   - max heap size
   - memtable_heap_space_in_mb
   - memtable_offheap_space_in_mb
             
                                     
