Re: RE: Re: Re: High mutation stage in multi dc deployment
This is sufficiently atypical that many people aren't going to have enough intuition to figure it out without seeing your metrics / logs / debugging data (e.g. heap dumps). My only guess, and it's a pretty big guess, is that your write timeout is low enough (or network quality bad enough, though that's unlikely with GCP; they're usually very good at networking) that your coordinator is timing out waiting for India to ack the write, which causes it to write a hint. Delivery of that hint also times out (network, latency, or just the speed of light), so you're creating a death spiral: the write is slow, so it hints; hint replay means you're doing extra work, so it's even slower; so new writes also hint, and maybe hints re-deliver.

The other option would be that system_traces is only in India, and someone enabled tracing (either at the application level or probabilistically).

If it's neither of those, you're going to have to debug it for real. Do metrics show hints? Do you have table-level metrics for writes per table? Are they what you expect? Are they higher in India? Is one table only in India and getting lots of writes? Take a heap dump of one of the India machines and look at what the mutations are. Is it a table you recognize? Is the replication factor set the way you expect?

On Tue, Jul 20, 2021 at 10:20 AM MyWorld wrote:
> Kindly help in this regard. What could be the possible reason for the load and mutation spike in the India data center?
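The checks suggested above map to a few standard commands. A sketch follows; the live-cluster commands are shown as comments (they need a running Cassandra 3.11 node), and the tpstats line being parsed is a fabricated sample with made-up numbers, purely for illustration:

```shell
# Commands you would run on an India node (comments only; they need a live node):
#   nodetool tpstats                   # MutationStage active/pending/blocked
#   nodetool tablestats <keyspace>     # per-table "Local write count"
#   nodetool statushandoff             # is hinted handoff running?
#   nodetool settraceprobability 0     # rule out probabilistic tracing
#   jmap -dump:live,format=b,file=/tmp/india.hprof <cassandra-pid>

# Parse a (fabricated) tpstats line and flag a pending-mutation backlog.
# Column order in tpstats: Pool Name, Active, Pending, Completed, Blocked, All time blocked.
sample='MutationStage  6  1000000  123456789  0  0'
pending=$(echo "$sample" | awk '{print $3}')
if [ "$pending" -gt 10000 ]; then
  echo "MutationStage backlog: $pending pending"
fi
```

A sustained, large Pending count on MutationStage in one DC only (while the write rate is modest) is the signature of the hint/replay spiral described above.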
RE: RE: Re: Re: High mutation stage in multi dc deployment
Kindly help in this regard. What could be the possible reason for the load and mutation spike in the India data center?

On 2021/07/20 00:14:56 MyWorld wrote:
> Hi Arvinder,
> It's a separate cluster. The max partition size here is 32 MB.
RE: Re: Re: High mutation stage in multi dc deployment
Hi Arvinder,
It's a separate cluster. The max partition size here is 32 MB.

On 2021/07/19 23:57:27 Arvinder Dhillon wrote:
> Is this the same cluster with the 1 GB partition size?
Re: Re: High mutation stage in multi dc deployment
Is this the same cluster with the 1 GB partition size?

-Arvinder

On Mon, Jul 19, 2021, 4:51 PM MyWorld wrote:
> Hi daemeon,
> We have already tuned the TCP settings to improve the bandwidth. Earlier we had a lot of hint and mutation message drops, which went away after tuning TCP.
RE: Re: High mutation stage in multi dc deployment
Hi daemeon,
We have already tuned the TCP settings to improve the bandwidth. Earlier we had a lot of hint and mutation message drops, which went away after tuning TCP.
Moreover, we are writing with CL LOCAL_QUORUM on the US side, so the ack is taken from the local DC.
I'm still concerned about what could be the reason for the increased mutation count.

On 2021/07/19 19:55:52 daemeon reiydelle wrote:
> You may want to think about the latency impacts of a cluster that has one node "far away".
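One point worth spelling out here: LOCAL_QUORUM only controls which acks the client waits for; the coordinator still ships every mutation to all replicas in both DCs, so the India DC does the same write work regardless of consistency level. A sketch of the arithmetic (RF=3 per DC is an assumption; the thread never states the replication factor):

```shell
# Assumed topology: NetworkTopologyStrategy with RF=3 in each of 2 DCs.
rf_per_dc=3
dcs=2
# LOCAL_QUORUM waits for floor(RF/2)+1 acks from the coordinator's DC only...
local_quorum=$(( rf_per_dc / 2 + 1 ))
# ...but every replica in every DC still receives and applies the mutation.
total_replica_writes=$(( rf_per_dc * dcs ))
echo "acks awaited: $local_quorum; replica mutations per write: $total_replica_writes"
```

So a fast local ack is fully compatible with the remote DC falling behind and accumulating a mutation backlog.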
RE: Re: High mutation stage in multi dc deployment
Hi Patrick,
Currently we are using Apache Cassandra version 3.11.6. We are performing writes with CL LOCAL_QUORUM in the US-side DC. We have 4-5 tables, including supplier_details, supplier_prod_details, and supplier_rating. We also have a materialized view attached to the rating table.
For the batching part, I need to check with the ops team; however, they are re-syncing data supplier-wise in these tables.

On 2021/07/19 20:56:49 Patrick McFadin wrote:
> Can you give us some more detail? Specifically, the version of Cassandra, the data model, the consistency levels used, and how you are bulk loading.
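The materialized view mentioned above is worth a closer look: in Cassandra 3.x, each base-table write to a table with a materialized view incurs a read-before-write plus additional view mutations shipped to the view replicas, so a view can amplify write load well beyond the client-visible write rate. `nodetool tpstats` reports these in a separate ViewMutationStage pool. A sketch of checking it; the sample line and its numbers are fabricated for illustration:

```shell
# On a live 3.11 node: nodetool tpstats | grep ViewMutationStage
# Here we parse a fabricated sample line instead:
line='ViewMutationStage  2  540000  98765432  0  0'
pool=$(echo "$line" | awk '{print $1}')
pending=$(echo "$line" | awk '{print $3}')   # Pending column
echo "$pool pending=$pending"
```

If ViewMutationStage pending tracks the MutationStage spikes, the view (not the raw inserts) is the likely source of the extra mutations.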
Re: High mutation stage in multi dc deployment
Hi Ashish,

Can you give us some more detail? Specifically, the version of Cassandra, the data model, the consistency levels used, and how you are bulk loading. Is this a batch by any chance?

Patrick

On Mon, Jul 19, 2021 at 10:38 AM MyWorld wrote:
> Hi all,
> Currently we have a cluster with two DCs of three nodes each. One DC is in GCP-US while the other is in GCP-India.
Re: High mutation stage in multi dc deployment
You may want to think about the latency impacts of a cluster that has one node "far away". This is such a basic design flaw that you need to do some basic learning, and some basic understanding of networking and latency.

On Mon, Jul 19, 2021 at 10:38 AM MyWorld wrote:
> Hi all,
> Currently we have a cluster with two DCs of three nodes each. One DC is in GCP-US while the other is in GCP-India.
High mutation stage in multi dc deployment
Hi all,

Currently we have a cluster with two DCs of three nodes each. One DC is in GCP-US while the other is in GCP-India. Just to add here, the configuration of every node across both DCs is the same: 6 CPUs, 32 GB RAM, 8 GB heap.

We do all our writes on the US data center. While performing a bulk write on GCP-US, we observe a normal load of 1 on US, while the load at GCP-India spikes to 10.

On observing tpstats further in Grafana, we found the mutation stage at GCP-India intermittently going to 1 million, though our overall write rate is only about 300 per second per node. We don't know the reason, but whenever we have this spike, we have load issues.
Please help: what could be the possible reason for this?

Regards,
Ashish