Your partition sizes aren't ridiculous... those are kinda big cells if a ~12 MB partition only holds 4 cells, but I still don't think that is ludicrous.
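(Rough math from the histograms below: 12108970 bytes / 4 cells is about 3 MB per cell at the max, and 11864 / 4 is about 3 KB at the median.) If you want to pull those numbers per table, the per-table histograms/stats are the usual place to look; a minimal sketch, assuming the key_space_01/cf_01 names from your paste and a nodetool that still has the cf* command names (newer versions call them tablehistograms/tablestats):

    nodetool cfhistograms key_space_01 cf_01    # SSTables-per-read, latency, partition size, cell count percentiles
    nodetool cfstats key_space_01.cf_01         # per-table read/write counts, SSTable count, space used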
Whelp, I'm out of ideas from my "pay grade". Honestly, with AZ/racks you theoretically should have been able to take the nodes off simultaneously, but (disclaimer) I've never done that. ?Rolling restart? <-- definitely indicates I have no ideas :-)

On Thu, Feb 22, 2018 at 8:15 AM, Fd Habash <fmhab...@gmail.com> wrote:

> One more observation …
>
> When we compare read latencies between the non-prod cluster (where nodes
> were removed) and the prod cluster, even though the node load as measured
> by the size of the /data dir is similar, read latencies are 5 times slower
> in the downsized non-prod cluster.
>
> The only difference we see is that prod reads from 4 sstables whereas
> non-prod reads from 5, per cfhistograms.
>
> Non-prod /data size
> ---------------------------------
> Filesystem     Size  Used  Avail  Use%  Mounted on
> /dev/nvme0n1   885G  454G  432G   52%   /data
> /dev/nvme0n1   885G  439G  446G   50%   /data
> /dev/nvme0n1   885G  368G  518G   42%   /data
> /dev/nvme0n1   885G  431G  455G   49%   /data
> /dev/nvme0n1   885G  463G  423G   53%   /data
> /dev/nvme0n1   885G  406G  479G   46%   /data
> /dev/nvme0n1   885G  419G  466G   48%   /data
>
> Prod /data size
> ----------------------------
> Filesystem     Size  Used  Avail  Use%  Mounted on
> /dev/nvme0n1   885G  352G  534G   40%   /data
> /dev/nvme0n1   885G  423G  462G   48%   /data
> /dev/nvme0n1   885G  431G  454G   49%   /data
> /dev/nvme0n1   885G  442G  443G   50%   /data
> /dev/nvme0n1   885G  454G  431G   52%   /data
>
> Cfhistograms: comparing prod to non-prod
> ---------------------------------
>
> Non-prod
> --------------
> 08:21:38  Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
> 08:21:38                        (micros)       (micros)      (bytes)
> 08:21:38  50%         1.00      24.60          4055.27       11864           4
> 08:21:38  75%         2.00      35.43          14530.76      17084           4
> 08:21:38  95%         4.00      126.93         89970.66      35425           4
> 08:21:38  98%         5.00      219.34         155469.30     73457           4
> 08:21:38  99%         5.00      219.34         186563.16     105778          4
> 08:21:38  Min         0.00      5.72           17.09         87              3
> 08:21:38  Max         7.00      20924.30       1386179.89    14530764        4
>
> Prod
> -----------
> 07:41:42  Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
> 07:41:42                        (micros)       (micros)      (bytes)
> 07:41:42  50%         1.00      24.60          2346.80       11864           4
> 07:41:42  75%         2.00      29.52          4866.32       17084           4
> 07:41:42  95%         3.00      73.46          14530.76      29521           4
> 07:41:42  98%         4.00      182.79         25109.16      61214           4
> 07:41:42  99%         4.00      182.79         36157.19      88148           4
> 07:41:42  Min         0.00      9.89           20.50         87              0
> 07:41:42  Max         5.00      219.34         155469.30     12108970        4
>
> ----------------
> Thank you
>
> *From: *Fd Habash <fmhab...@gmail.com>
> *Sent: *Thursday, February 22, 2018 9:00 AM
> *To: *user@cassandra.apache.org
> *Subject: *RE: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster
>
> “data was allowed to fully rebalance/repair/drain before the next node was taken off?”
> --------------------------------------------------------------
> Judging by the messages, the decomm was healthy.
> As an example:
>
> StorageService.java:3425 - Announcing that I have left the ring for 30000ms
> …
> INFO [RMI TCP Connection(4)-127.0.0.1] 2016-01-07 06:00:52,662 StorageService.java:1191 - DECOMMISSIONED
>
> I do not believe repairs were run after each node removal. I'll double-check.
>
> I'm not sure what you mean by 'rebalance'. How do you check whether a node is balanced? Load/size of the data dir?
>
> As for the drain, there was no need to drain, and I believe it is not something you do as part of decommissioning a node.
>
> “did you take 1 off per rack/AZ?”
> --------------------------------------------------------------
> We removed 3 nodes, one from each AZ, in sequence.
>
> These are some of the cfhistograms metrics. Read latencies are high after the removal of the nodes.
> --------------------------------------------------------------
> You can see reads of 186 ms at the 99th percentile, from 5 sstables. These are awfully high numbers given that these metrics measure C* storage-layer read performance.
>
> Does this mean removing the nodes undersized the cluster?
>
> key_space_01/cf_01 histograms
> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>                       (micros)       (micros)      (bytes)
> 50%         1.00      24.60          4055.27       11864           4
> 75%         2.00      35.43          14530.76      17084           4
> 95%         4.00      126.93         89970.66      35425           4
> 98%         5.00      219.34         155469.30     73457           4
> 99%         5.00      219.34         186563.16     105778          4
> Min         0.00      5.72           17.09         87              3
> Max         7.00      20924.30       1386179.89    14530764        4
>
> key_space_01/cf_01 histograms
> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>                       (micros)       (micros)      (bytes)
> 50%         1.00      29.52          4055.27       11864           4
> 75%         2.00      42.51          10090.81      17084           4
> 95%         4.00      152.32         52066.35      35425           4
> 98%         4.00      219.34         89970.66      73457           4
> 99%         5.00      219.34         155469.30     88148           4
> Min         0.00      9.89           24.60         87              0
> Max         6.00      1955.67        557074.61     14530764        4
>
> ----------------
> Thank you
>
> *From: *Carl Mueller <carl.muel...@smartthings.com>
> *Sent: *Wednesday, February 21, 2018 4:33 PM
> *To: *user@cassandra.apache.org
> *Subject: *Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster
>
> Hm, nodetool decommission performs the streamout of the replicated data, and you said that was apparently without error...
>
> But if you dropped three nodes from one AZ/rack of five nodes with RF 3, then we have a missing RF factor unless NetworkTopologyStrategy fails over to another AZ. But that would also entail cross-AZ streaming and queries and repair.
>
> On Wed, Feb 21, 2018 at 3:30 PM, Carl Mueller <carl.muel...@smartthings.com> wrote:
>
> sorry for the idiot questions...
>
> data was allowed to fully rebalance/repair/drain before the next node was taken off?
>
> did you take 1 off per rack/AZ?
>
> On Wed, Feb 21, 2018 at 12:29 PM, Fred Habash <fmhab...@gmail.com> wrote:
>
> One node at a time
>
> On Feb 21, 2018 10:23 AM, "Carl Mueller" <carl.muel...@smartthings.com> wrote:
>
> What is your replication factor?
> Single datacenter, three availability zones, is that right?
> You removed one node at a time or three at once?
>
> On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash <fmhab...@gmail.com> wrote:
>
> We have had a 15-node cluster across three zones, and cluster repairs using 'nodetool repair -pr' took about 3 hours to finish. Lately, we shrunk the cluster to 12.
> Since then, the same repair job has taken up to 12 hours to finish, and most times it never does.
>
> More importantly, at some point during the repair cycle, we see read latencies jumping to 1-2 seconds, and applications immediately notice the impact.
>
> stream_throughput_outbound_megabits_per_sec is set at 200 and compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is around ~500GB at 44% usage.
>
> When shrinking the cluster, the 'nodetool decommission' runs were uneventful. They completed successfully with no issues.
>
> What could possibly cause repairs to have this impact following cluster downsizing? Taking three nodes out does not seem like it should have such a drastic effect on repair and read latency.
>
> Any expert insights will be appreciated.
>
> ----------------
> Thank you
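For what it's worth, a few standard nodetool commands can help narrow down whether the post-decommission data placement or the repair itself is what's hurting reads. This is only a minimal sketch, assuming a 2.x/3.x nodetool and the key_space_01 keyspace from the paste above; exact command names and the throttle values shown are illustrative, not recommendations:

    # confirm token ownership and load actually evened out after the decommissions
    nodetool status key_space_01

    # while latency spikes, see what repair is streaming and compacting
    nodetool netstats
    nodetool compactionstats

    # throttle streaming/compaction on the fly if repair is starving reads
    nodetool getstreamthroughput
    nodetool setstreamthroughput 100
    nodetool setcompactionthroughput 32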