Your partition sizes aren't ridiculous... those are kinda big cells if a ~12 MB partition only holds 4 cells, but I still don't think that is ludicrous.
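(Rough math from the histograms below: 12108970 bytes / 4 cells is about 3 MB per cell at the max, and 11864 / 4 is about 3 KB at the median.) If you want to pull those numbers per table, the per-table histograms/stats are the usual place to look; a minimal sketch, assuming the key_space_01/cf_01 names from your paste and a nodetool that still has the cf* command names (newer versions call them tablehistograms/tablestats):

    nodetool cfhistograms key_space_01 cf_01    # SSTables-per-read, latency, partition size, cell count percentiles
    nodetool cfstats key_space_01.cf_01         # per-table read/write counts, SSTable count, space used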
Whelp, I'm out of ideas from my "pay grade". Honestly, with AZ/racks you theoretically should have been able to take the nodes off simultaneously, but (disclaimer) I've never done that. ?Rolling restart? <-- definitely indicates I have no ideas :-)

On Thu, Feb 22, 2018 at 8:15 AM, Fd Habash <fmhab...@gmail.com> wrote:

> One more observation …
>
> When we compare read latencies between the non-prod cluster (where nodes
> were removed) and the prod cluster, even though the node load as measured
> by the size of the /data dir is similar, read latencies are 5 times slower
> in the downsized non-prod cluster.
>
> The only difference we see is that prod reads from 4 sstables whereas
> non-prod reads from 5, per cfhistograms.
>
> Non-prod /data size
> ---------------------------------
> Filesystem     Size  Used  Avail  Use%  Mounted on
> /dev/nvme0n1   885G  454G  432G   52%   /data
> /dev/nvme0n1   885G  439G  446G   50%   /data
> /dev/nvme0n1   885G  368G  518G   42%   /data
> /dev/nvme0n1   885G  431G  455G   49%   /data
> /dev/nvme0n1   885G  463G  423G   53%   /data
> /dev/nvme0n1   885G  406G  479G   46%   /data
> /dev/nvme0n1   885G  419G  466G   48%   /data
>
> Prod /data size
> ----------------------------
> Filesystem     Size  Used  Avail  Use%  Mounted on
> /dev/nvme0n1   885G  352G  534G   40%   /data
> /dev/nvme0n1   885G  423G  462G   48%   /data
> /dev/nvme0n1   885G  431G  454G   49%   /data
> /dev/nvme0n1   885G  442G  443G   50%   /data
> /dev/nvme0n1   885G  454G  431G   52%   /data
>
> Cfhistograms: comparing prod to non-prod
> ---------------------------------
>
> Non-prod
> --------------
> 08:21:38  Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
> 08:21:38                        (micros)       (micros)      (bytes)
> 08:21:38  50%         1.00      24.60          4055.27       11864           4
> 08:21:38  75%         2.00      35.43          14530.76      17084           4
> 08:21:38  95%         4.00      126.93         89970.66      35425           4
> 08:21:38  98%         5.00      219.34         155469.30     73457           4
> 08:21:38  99%         5.00      219.34         186563.16     105778          4
> 08:21:38  Min         0.00      5.72           17.09         87              3
> 08:21:38  Max         7.00      20924.30       1386179.89    14530764        4
>
> Prod
> -----------
> 07:41:42  Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
> 07:41:42                        (micros)       (micros)      (bytes)
> 07:41:42  50%         1.00      24.60          2346.80       11864           4
> 07:41:42  75%         2.00      29.52          4866.32       17084           4
> 07:41:42  95%         3.00      73.46          14530.76      29521           4
> 07:41:42  98%         4.00      182.79         25109.16      61214           4
> 07:41:42  99%         4.00      182.79         36157.19      88148           4
> 07:41:42  Min         0.00      9.89           20.50         87              0
> 07:41:42  Max         5.00      219.34         155469.30     12108970        4
>
> ----------------
> Thank you
>
> *From: *Fd Habash <fmhab...@gmail.com>
> *Sent: *Thursday, February 22, 2018 9:00 AM
> *To: *user@cassandra.apache.org
> *Subject: *RE: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster
>
> “data was allowed to fully rebalance/repair/drain before the next node was taken off?”
> --------------------------------------------------------------
> Judging by the messages, the decomm was healthy.
> As an example:
>
> StorageService.java:3425 - Announcing that I have left the ring for 30000ms
> …
> INFO [RMI TCP Connection(4)-127.0.0.1] 2016-01-07 06:00:52,662 StorageService.java:1191 - DECOMMISSIONED
>
> I do not believe repairs were run after each node removal. I'll double-check.
>
> I'm not sure what you mean by 'rebalance'. How do you check whether a node is balanced? Load/size of the data dir?
>
> As for the drain, there was no need to drain, and I believe it is not something you do as part of decommissioning a node.
>
> “did you take 1 off per rack/AZ?”
> --------------------------------------------------------------
> We removed 3 nodes, one from each AZ, in sequence.
>
> These are some of the cfhistograms metrics. Read latencies are high after the removal of the nodes.
> --------------------------------------------------------------
> You can see reads of 186 ms at the 99th percentile, from 5 sstables. These are awfully high numbers given that these metrics measure C* storage-layer read performance.
>
> Does this mean removing the nodes undersized the cluster?
>
> key_space_01/cf_01 histograms
> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>                       (micros)       (micros)      (bytes)
> 50%         1.00      24.60          4055.27       11864           4
> 75%         2.00      35.43          14530.76      17084           4
> 95%         4.00      126.93         89970.66      35425           4
> 98%         5.00      219.34         155469.30     73457           4
> 99%         5.00      219.34         186563.16     105778          4
> Min         0.00      5.72           17.09         87              3
> Max         7.00      20924.30       1386179.89    14530764        4
>
> key_space_01/cf_01 histograms
> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>                       (micros)       (micros)      (bytes)
> 50%         1.00      29.52          4055.27       11864           4
> 75%         2.00      42.51          10090.81      17084           4
> 95%         4.00      152.32         52066.35      35425           4
> 98%         4.00      219.34         89970.66      73457           4
> 99%         5.00      219.34         155469.30     88148           4
> Min         0.00      9.89           24.60         87              0
> Max         6.00      1955.67        557074.61     14530764        4
>
> ----------------
> Thank you
>
> *From: *Carl Mueller <carl.muel...@smartthings.com>
> *Sent: *Wednesday, February 21, 2018 4:33 PM
> *To: *user@cassandra.apache.org
> *Subject: *Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster
>
> Hm, nodetool decommission performs the streamout of the replicated data, and you said that was apparently without error...
>
> But if you dropped three nodes from one AZ/rack of five nodes with RF 3, then we have a missing RF factor unless NetworkTopologyStrategy fails over to another AZ. But that would also entail cross-AZ streaming and queries and repair.
>
> On Wed, Feb 21, 2018 at 3:30 PM, Carl Mueller <carl.muel...@smartthings.com> wrote:
>
> sorry for the idiot questions...
>
> data was allowed to fully rebalance/repair/drain before the next node was taken off?
>
> did you take 1 off per rack/AZ?
>
> On Wed, Feb 21, 2018 at 12:29 PM, Fred Habash <fmhab...@gmail.com> wrote:
>
> One node at a time
>
> On Feb 21, 2018 10:23 AM, "Carl Mueller" <carl.muel...@smartthings.com> wrote:
>
> What is your replication factor?
> Single datacenter, three availability zones, is that right?
> You removed one node at a time or three at once?
>
> On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash <fmhab...@gmail.com> wrote:
>
> We have had a 15-node cluster across three zones, and cluster repairs using 'nodetool repair -pr' took about 3 hours to finish. Lately, we shrunk the cluster to 12.
> Since then, the same repair job has taken up to 12 hours to finish, and most times it never does.
>
> More importantly, at some point during the repair cycle, we see read latencies jumping to 1-2 seconds, and applications immediately notice the impact.
>
> stream_throughput_outbound_megabits_per_sec is set at 200 and compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is around ~500GB at 44% usage.
>
> When shrinking the cluster, the 'nodetool decommission' runs were uneventful. They completed successfully with no issues.
>
> What could possibly cause repairs to have this impact following cluster downsizing? Taking three nodes out does not seem like it should have such a drastic effect on repair and read latency.
>
> Any expert insights will be appreciated.
>
> ----------------
> Thank you
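For what it's worth, a few standard nodetool commands can help narrow down whether the post-decommission data placement or the repair itself is what's hurting reads. This is only a minimal sketch, assuming a 2.x/3.x nodetool and the key_space_01 keyspace from the paste above; exact command names and the throttle values shown are illustrative, not recommendations:

    # confirm token ownership and load actually evened out after the decommissions
    nodetool status key_space_01

    # while latency spikes, see what repair is streaming and compacting
    nodetool netstats
    nodetool compactionstats

    # throttle streaming/compaction on the fly if repair is starving reads
    nodetool getstreamthroughput
    nodetool setstreamthroughput 100
    nodetool setcompactionthroughput 32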