Hi Ben, just catching up over the weekend.

The typical advice, per the link Sergio referenced, is an obvious starting point.
We use G1GC, and normally I'd treat 8 GB as the minimum starting point for a
heap.  What sometimes gets lost in the myriad of tuning options is that you need
a clear goal in mind for what you are tuning *for*.  You could be tuning for
throughput, or average latency, or p99 latency, etc.  How you tune varies quite
a lot according to that goal, and the more it is about latency, the more work
you have ahead of you.
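
To make that concrete, here is a rough sketch of the kind of G1 knobs each goal
pushes you toward.  The numbers are placeholders for illustration only, not
settings I'm recommending; anything like this has to be validated against your
own workload:

# Latency-leaning: cap pause times (trades away some throughput)
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1RSetUpdatingPauseTimePercent=5

# Throughput-leaning: relax the pause goal and delay concurrent marking
-XX:+UseG1GC
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=70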

I will suggest that, if your data footprint is going to stay low, you give
yourself permission to do some experimentation.  Since you're using K8s, and if
your usage is small enough, you can get 2x the bang for the buck on your servers
by sizing the pods to about 45% of server resources and using the C* rack
metaphor to ensure you don't co-locate replicas.
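
Purely as an illustration of that sizing idea (the server size and numbers below
are hypothetical, not yours), the relevant bits of the StatefulSet might look
roughly like this:

# Hypothetical 32Gi / 8-vCPU servers: two ~45% pods fit per machine, leaving
# headroom for the kubelet, OS and daemonsets.
resources:
  requests:
    cpu: "3600m"      # ~45% of 8 vCPUs
    memory: "14Gi"    # ~45% of 32Gi
  limits:
    cpu: "3600m"
    memory: "14Gi"
# Then map the C* rack to the physical server (cassandra-rackdc.properties, or a
# CASSANDRA_RACK-style env var fed from the downward API) so that
# NetworkTopologyStrategy never places two replicas of a range on one machine.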

For example, were I you, I'd start asking myself whether SSTable compression
mattered to me at all.  The reason I'd ask questions like that is that C* has
multiple uses of memory, and one of the balancing acts is between the chunk
cache and the O/S file cache.  If I could find a way to make my O/S file cache
act as a de facto C* cache, I'd roll up my shirt sleeves and see what kind of
performance numbers I could squeeze out with some creative tuning experiments.
Now, I'm not saying *do* that, because your write volume also plays a role, and
you said you're expecting a relatively even balance of reads and writes.  I'm
just saying, by way of example, that I'd weigh whether the advice I find online
is based on experience similar to my current circumstances or on ones that are
very different.
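
If it helps, the compression knob itself is per-table and easy to flip back and
forth while experimenting; a sketch, with a made-up keyspace and table name:

-- Disable SSTable compression so the O/S page cache holds the data in its
-- directly readable on-disk form:
ALTER TABLE my_keyspace.my_table
  WITH compression = {'enabled': 'false'};

-- Or keep compression but use smaller chunks to reduce read amplification:
ALTER TABLE my_keyspace.my_table
  WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': '16'};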

R

From: Ben Mills <b...@bitbrew.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, November 4, 2019 at 8:51 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Memory Recommendations for G1GC

Hi (yet again) Sergio,

Finally, note that we use this sidecar
<https://github.com/Stackdriver/stackdriver-prometheus-sidecar> for shipping
metrics to Stackdriver. It runs as a second container within our Prometheus
stateful set.
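
For context, a trimmed sketch of what that pod spec looks like (the image tags,
project and cluster names here are placeholders):

containers:
- name: prometheus
  image: prom/prometheus:v2.13.0
  volumeMounts:
  - name: prometheus-data
    mountPath: /data
- name: stackdriver-sidecar
  image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:0.7.3
  args:
  - --stackdriver.project-id=my-gcp-project
  - --stackdriver.kubernetes.location=us-east1
  - --stackdriver.kubernetes.cluster-name=my-cluster
  - --prometheus.wal-directory=/data/wal
  volumeMounts:
  - name: prometheus-data     # the sidecar tails Prometheus's WAL from the shared volume
    mountPath: /data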


On Mon, Nov 4, 2019 at 8:46 AM Ben Mills 
<b...@bitbrew.com> wrote:
Hi (again) Sergio,

I forgot to note that along with Prometheus, we use Grafana (with Prometheus as 
its data source) as well as Stackdriver for monitoring.

As Stackdriver is still maturing (it does not yet have all the features we
need), we tend to use it for the basics: monitoring and alerting on memory, CPU
and disk (PV) thresholds. More specifically, the Prometheus JMX exporter (noted
above) scrapes all the MBeans inside Cassandra and exports them in the
Prometheus data model. Its config map filters (allow-lists) our metrics of
interest, and those metrics are sent to our Grafana instances and to
Stackdriver. We use Grafana for more advanced metric configs that provide deeper
insight into Cassandra - e.g. read/write latencies and so forth. For memory
utilization, we monitor both at the pod level in Stackdriver (to avoid having a
Cassandra pod OOM-killed by the kubelet) and inside the JVM (heap space).
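
To illustrate the filtering step (the object names and the rule below are made
up for the example; the option names follow the jmx_exporter config format):

# Hypothetical jmx_exporter config: only the MBeans we care about get exported
lowercaseOutputName: true
whitelistObjectNames:
- "org.apache.cassandra.metrics:type=ClientRequest,*"
- "org.apache.cassandra.metrics:type=Table,keyspace=my_keyspace,*"
rules:
- pattern: org.apache.cassandra.metrics<type=(\w+), scope=(\w+), name=(\w+)><>(Count|Value)
  name: cassandra_$1_$3
  labels:
    scope: "$2"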

Hope this helps.

On Mon, Nov 4, 2019 at 8:26 AM Ben Mills 
<b...@bitbrew.com> wrote:
Hi Sergio,

Thanks for this and sorry for the slow reply.

We are indeed still running Java 8 and so it's very helpful.

This Cassandra cluster has been running reliably in Kubernetes for several 
years, and while we've had some repair-related issues, they are not related to 
container orchestration or the cloud environment. We don't use operators and 
have simply built the needed Kubernetes configs (YAML manifests) to handle 
deployment of new Docker images (when needed), and so forth. We have:

(1) ConfigMap - Cassandra environment variables
(2) ConfigMap - Prometheus configs for this JMX exporter
<https://github.com/prometheus/jmx_exporter>, which is built into the image and
runs as a Java agent
(3) PodDisruptionBudget - with minAvailable: 2 as the important setting (see the sketch after this list)
(4) Service - this is a headless service (clusterIP: None) which specifies the 
ports for cql, jmx, prometheus, intra-node
(5) StatefulSet - 3 replicas, ports, health checks, resources, etc - as you 
would expect
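
For reference, (3) and (4) are small objects; trimmed sketches, with the labels
and exporter port being illustrative rather than our exact values:

apiVersion: policy/v1beta1        # policy/v1 on newer clusters
kind: PodDisruptionBudget
metadata:
  name: cassandra
spec:
  minAvailable: 2                 # with RF=3, never voluntarily evict below quorum
  selector:
    matchLabels:
      app: cassandra
---
apiVersion: v1
kind: Service
metadata:
  name: cassandra
spec:
  clusterIP: None                 # headless: stable per-pod DNS, no load balancing
  selector:
    app: cassandra
  ports:
  - { name: cql, port: 9042 }
  - { name: intra-node, port: 7000 }
  - { name: jmx, port: 7199 }
  - { name: prometheus, port: 9404 }   # whatever port the jmx_exporter agent listens on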

We store data on persistent volumes using an SSD storage class, and use: an 
updateStrategy of OnDelete, some affinity rules to ensure an even spread of 
pods across our zones, Prometheus annotations for scraping the metrics port, a 
nodeSelector and tolerations to ensure the Cassandra pods run in their 
dedicated node pool, and a preStop hook that runs nodetool drain to help with 
graceful shutdown when a pod is rolled.
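
That preStop hook is the piece most worth copying; roughly (shell path and
grace period are whatever your image and roll time call for):

# On the Cassandra container: flush memtables and stop accepting traffic
# before the pod is terminated, so a roll doesn't drop in-flight writes.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "nodetool drain"]
# Pair it with a generous pod-level terminationGracePeriodSeconds so the
# drain has time to finish before the kubelet sends SIGKILL.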

I'm guessing your installation is much larger than ours and so operators may be 
a good way to go. For our needs the above has been very reliable as has GCP in 
general.

We are currently updating our backup/restore implementation to provide better
granularity with respect to restoring a specific keyspace and also exploring
Velero <https://github.com/vmware-tanzu/velero> for DR.

Hope this helps.


On Fri, Nov 1, 2019 at 5:34 PM Sergio 
<lapostadiser...@gmail.com> wrote:
Hi Ben,

Well, I had a similar question, and Jon Haddad preferred ParNew + CMS over
G1GC for Java 8:
https://lists.apache.org/thread.html/283547619b1dcdcddb80947a45e2178158394e317f3092b8959ba879@%3Cuser.cassandra.apache.org%3E
It depends on your JVM and in any case, I would test it based on your workload.

What's your experience running Cassandra in K8s? Are you using the Cassandra
Kubernetes Operator?

How do you monitor it and how do you perform disaster recovery backup?


Best,

Sergio

On Fri, Nov 1, 2019 at 2:14 PM Ben Mills
<b...@bitbrew.com> wrote:
Thanks Sergio - that's good advice and we have this built into the plan.
Have you heard a solid/consistent recommendation (or requirement) as to the
amount of heap memory G1GC needs?

On Fri, Nov 1, 2019 at 5:11 PM Sergio 
<lapostadiser...@gmail.com> wrote:
In any case, I would test any configuration with tlp-stress or the
cassandra-stress tool.
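
For example (the ratio, counts and node address below are placeholders), a mixed
read/write run with cassandra-stress might look like:

cassandra-stress mixed ratio\(write=1,read=1\) n=1000000 cl=LOCAL_QUORUM \
  -rate threads=50 -node 10.0.0.10

tlp-stress offers similar canned workloads (key-value, time series, etc.) if you
prefer its workload-first approach.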

Sergio

On Fri, Nov 1, 2019, 12:31 PM Ben Mills 
<b...@bitbrew.com> wrote:
Greetings,

We are planning a Cassandra upgrade from 3.7 to 3.11.5 and considering a change 
to the GC config.

What is the minimum amount of memory that needs to be allocated to heap space 
when using G1GC?

For GC, we currently use CMS. Along with the version upgrade, we'll be running 
the stateful set of Cassandra pods on new machine types in a new node pool with 
12Gi memory per node. Not a lot of memory but an improvement. We may be able to 
go up to 16Gi memory per node. We'd like to continue using these heap settings:

-XX:+UnlockExperimentalVMOptions
-XX:+UseCGroupMemoryLimitForHeap
-XX:MaxRAMFraction=2

which (if 12Gi per node) would provide 6Gi memory for heap (i.e. half of total 
available).
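
One thing we're aware of (an observation, not a plan): MaxRAMFraction only takes
whole-number divisors, so 2 means half, 3 means a third, and there's nothing in
between. If we ever needed a different split we'd either size the heap
explicitly or, on newer Java 8 builds (8u191+), use the percentage flag, e.g.:

# Explicit sizing (value shown only for illustration):
-Xms6G
-Xmx6G

# Or, on Java 8u191+ which added -XX:MaxRAMPercentage:
-XX:MaxRAMPercentage=50.0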

Here are some details on the environment and configs in the event that 
something is relevant.

Environment: Kubernetes
Environment Config: Stateful set of 3 replicas
Storage: Persistent Volumes
Storage Class: SSD
Node OS: Container-Optimized OS
Container OS: Ubuntu 16.04.3 LTS
Data Centers: 1
Racks: 3 (one per zone)
Nodes: 3
Tokens: 4
Replication Factor: 3
Replication Strategy: NetworkTopologyStrategy (all keyspaces)
Compaction Strategy: STCS (all tables)
Read/Write Requirements: Blend of both
Data Load: <1GB per node
gc_grace_seconds: default (10 days - all tables)

GC Settings: (CMS)

-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSWaitDuration=30000
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways

Any ideas are much appreciated.
