Re: cassandra_migration_wait

2020-01-13 Thread Ben Mills
Hi Reid,

Many thanks! Very helpful.

Will have a look at that source.

Ben



On Mon, Jan 13, 2020 at 2:08 PM Reid Pinchback 
wrote:

> I can’t find it anywhere either, but I’m looking at a 3.11.4 source
> image.  From the naming I’d bet that this is being used to feed the
> cassandra.migration_task_wait_in_seconds property.  It’s already coded to
> have a default of 1 second, which matches what you are seeing in the shell
> script var.  The relevant Java source is
> org.apache.cassandra.service.MigrationManager, line 62.
>
>
>
> *From: *Ben Mills 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Monday, January 13, 2020 at 1:59 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *cassandra_migration_wait
>
>
>
>
> Greetings,
>
>
>
> We are running Cassandra 3.11.2 in Kubernetes and use a run.sh to set some
> environment variables and a few other things.
>
>
>
> This script includes:
>
>
>
> CASSANDRA_MIGRATION_WAIT="${CASSANDRA_MIGRATION_WAIT:-1}"
>
>
>
> which sets this environment variable to "1" when it isn't already set. I
> looked for documentation on this but cannot seem to find it anywhere. Does
> anyone know what this configures and what the value implies?
>
>
>
> Thanks in advance for your help.
>
>
>
> Ben
>
>
>
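Putting Reid's pointer together with the shell-default syntax, the wiring inside run.sh plausibly looks like the sketch below. The hand-off to the JVM property is Reid's inference rather than confirmed source, and the surrounding script details are assumptions:

```shell
# ${VAR:-default} substitutes the default only when VAR is unset or empty.
unset CASSANDRA_MIGRATION_WAIT           # exercise the default path here
CASSANDRA_MIGRATION_WAIT="${CASSANDRA_MIGRATION_WAIT:-1}"

# Presumed hand-off to the property Reid identified (assumption):
JVM_OPTS="$JVM_OPTS -Dcassandra.migration_task_wait_in_seconds=${CASSANDRA_MIGRATION_WAIT}"
echo "$JVM_OPTS"
```

With nothing exported, the node would start with the 1-second wait matching the hard-coded default in MigrationManager; exporting CASSANDRA_MIGRATION_WAIT=30 before start-up would presumably raise it to 30 seconds.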


cassandra_migration_wait

2020-01-13 Thread Ben Mills
Greetings,

We are running Cassandra 3.11.2 in Kubernetes and use a run.sh to set some
environment variables and a few other things.

This script includes:

CASSANDRA_MIGRATION_WAIT="${CASSANDRA_MIGRATION_WAIT:-1}"

which sets this environment variable to "1" when it isn't already set. I
looked for documentation on this but cannot seem to find it anywhere. Does
anyone know what this configures and what the value implies?

Thanks in advance for your help.

Ben


Re: repair failed

2020-01-02 Thread Ben Mills
Hi Oliver,

I don't have a quick answer (or any answer yet), though we ran into a
similar issue and I'm wondering about your environment and some configs.

- Operating system?
- Cloud or on-premise?
- Version of Cassandra?
- Version of Java?
- Compaction strategy?
- Primarily read or primarily write (or a blend of both)?
- How much memory allocated to heap?
- How long do all the repair commands typically take per node?

nodetool repair -full -dcpar will stream data across data centers - is it
possible that the number of nodes, or the amount of data, or the number of
keyspaces has grown enough over time to cause streaming issues (and
timeouts)?

You wrote:

Is it problematic if the repair is started only on one node?

Are you asking whether it's ok to run -full repairs one node at a time (on
all nodes)? Or are you saying that you are only repairing one node in each
cluster or DC?

Thanks,
Ben
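If the question is whether full repairs can be run one node at a time, one common pattern (an assumption on my part, not something confirmed in this thread) is a full, primary-range repair on each node in turn, so only one node coordinates at a time. A dry-run sketch with hypothetical hostnames:

```shell
# Print (rather than execute) one repair command per node; -pr restricts each
# run to that node's primary token ranges so no range is repaired twice.
for host in cassandra-0 cassandra-1 cassandra-2; do   # hypothetical hostnames
  cmd="nodetool -h $host repair -full -pr"
  echo "$cmd"
done
```

Run sequentially across every node in the cluster, this covers all token ranges exactly once; `-full -dcpar` from a single node also covers everything, but streams across both data centers concurrently.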




On Sun, Dec 29, 2019 at 3:54 AM gloCalHelp.com 
wrote:

> To Oliver:
>    Maybe repair should be executed after all data in the memtables has
> been flushed to disk?
>
>
> Sincerely yours,
> Georgelin
> www_8ems_...@sina.com
> mobile:0086 180 5986 1565
>
>
> ----- Original Message -----
> From: Oliver Herrmann 
> To: user@cassandra.apache.org
> Subject: repair failed
> Date: 2019-12-28 23:15
>
> Hello,
>
> Today, for the second time, our weekly repair job failed after working for
> many months without a problem. We have multiple Cassandra nodes in two
> data centers.
>
> The repair command is started only on one node with the following
> parameters:
>
> nodetool repair -full -dcpar
>
> Is it problematic if the repair is started only on one node?
>
> The repair fails after one hour with the following error message:
>
>  failed with error Could not create snapshot at /192.168.13.232
> (progress: 0%)
> [2019-12-28 05:00:04,295] Some repair failed
> [2019-12-28 05:00:04,296] Repair command #1 finished in 1 hour 0 minutes 2
> seconds
> error: Repair job has failed with the error message: [2019-12-28
> 05:00:04,295] Some repair failed
> -- StackTrace --
> java.lang.RuntimeException: Repair job has failed with the error message:
> [2019-12-28 05:00:04,295] Some repair failed
> at
> org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
> at
> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(Unknown
> Source)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(Unknown
> Source)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(Unknown
> Source)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(Unknown
> Source)
>
> In the logfile on 192.168.13.232, which is in the second data center, I
> could find only the following messages in debug.log:
> DEBUG [COMMIT-LOG-ALLOCATOR] 2019-12-28 04:21:20,143
> AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating
> a fresh one
> DEBUG [MessagingService-Outgoing-192.168.13.120-Small] 2019-12-28
> 04:31:00,450 OutboundTcpConnection.java:410 - Socket to 192.168.13.120
>  closed
> DEBUG [MessagingService-Outgoing-192.168.13.120-Small] 2019-12-28
> 04:31:00,450 OutboundTcpConnection.java:349 - Error writing to 192.168
> .13.120
> java.io.IOException: Connection timed out
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> ~[na:1.8.0_111]
>
> We tried to run repair a few more times but it always failed with the same
> error. After restarting all nodes it was finally successful.
>
> Any idea what could be wrong?
>
> Regards
> Oliver
>


Re: ***UNCHECKED*** Re: Memory Recommendations for G1GC

2019-11-04 Thread Ben Mills
Thanks again Reid - another great response - you have pointed me in the
right direction.

On Mon, Nov 4, 2019 at 12:23 PM Reid Pinchback 
wrote:

> It’s not a setting I’ve played with at all.  I understand the gist of it
> though: essentially it’ll let you automatically adjust your JVM size
> relative to whatever you allocated to the cgroup.  Unfortunately I’m not a
> K8s developer (that may change shortly, but that’s the case atm).  What
> you need is a firm handle on where the memory for the O/S file cache
> lives, and whether that size is sufficient for your read/write activity.
> Bare metal and VM tuning I understand better, so I’ll have to defer to
> others who may have specific personal experience with the details, but the
> essence of the issue should remain the same.  You want a file cache that
> functions appropriately or you’ll get excessive stalls on either reading
> from disk or flushing dirty pages to disk.
>
>
>
>
>
> *From: *Ben Mills 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Monday, November 4, 2019 at 12:14 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: ***UNCHECKED*** Re: Memory Recommendations for G1GC
>
>
>
> CGroup
>


Re: ***UNCHECKED*** Re: Memory Recommendations for G1GC

2019-11-04 Thread Ben Mills
Hi Reid,

Many thanks for this thoughtful response - very helpful and much
appreciated.

No doubt some additional experimentation will pay off as you noted.

One additional question: we currently use this heap setting:

-XX:MaxRAMFraction=2

I realize every environment and its tuning goals are different; though -
just generally - what do you think of MaxRAMFraction=2 with Java 8?

If the stateful set is configured with 16Gi memory, that setting would
allocate roughly 8Gi to the heap and seems a safe balance between
heap/nonheap. No worries if you don't have enough information to answer (as
I haven't shared our tuning goals), but any feedback is, again, appreciated.
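For anyone following along, the arithmetic behind MaxRAMFraction is simply limit/fraction; the JVM computes it from the byte-exact cgroup limit, so the whole-Gi rounding below is only an illustration:

```shell
limit_gi=16   # example container memory limit from the stateful set
for frac in 1 2 4; do
  # MaxRAMFraction=N caps the heap at roughly 1/N of the container limit
  echo "MaxRAMFraction=$frac -> heap ceiling ~ $((limit_gi / frac))Gi"
done
```

MaxRAMFraction=1 would leave nothing for off-heap structures and the O/S page cache, which is why 2 reads as the safer heap/nonheap balance described above.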


On Mon, Nov 4, 2019 at 10:28 AM Reid Pinchback 
wrote:

> Hi Ben, just catching up over the weekend.
>
>
>
> The typical advice, per Sergio’s link reference, is an obvious starting
> point.  We use G1GC and normally I’d treat 8gig as the minimal starting
> point for a heap.  What sometimes doesn’t get talked about in the myriad of
> tunings, is that you have to have a clear goal in your mind on what you are
> tuning **for**. You could be tuning for throughput, or average latency,
> or 99’s latency, etc.  How you tune varies quite a lot according to your
> goal.  The more your goal is about latency, the more work you have ahead of
> you.
>
>
>
> I will suggest that, if your data footprint is going to stay low, you
> give yourself permission to do some experimentation.  As you’re using K8s,
> you are in a bit of a position where if your usage is small enough, you can
> get 2x bang for the buck on your servers by sizing the pods to about 45% of
> server resources and using the C* rack metaphor to ensure you don’t
> co-locate replicas.
>
>
>
> For example, were I you, I’d start asking myself if SSTable compression
> mattered to me at all.  The reason I’d start asking myself questions like
> that is C* has multiple uses of memory, and one of the balancing acts is
> the chunk cache and the O/S file cache.  If I could find a way to make my
> O/S file cache be a de facto C* cache, I’d roll up the shirt sleeves and
> see what kind of performance numbers I could squeeze out with some
> creative tuning experiments.  Now, I’m not saying **do** that, because
> your write volume also plays a role, and you said you’re expecting a
> relatively even balance in reads and writes.  I’m just saying, by way of
> example, I’d start weighing whether the advice I get online was based on
> experience similar to my current circumstance, or ones that were very
> different.
>
>
>
> R
>
>
>
> *From: *Ben Mills 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Monday, November 4, 2019 at 8:51 AM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: ***UNCHECKED*** Re: Memory Recommendations for G1GC
>
>
>
>
> Hi (yet again) Sergio,
>
>
>
> Finally, note that we use this sidecar
> <https://github.com/Stackdriver/stackdriver-prometheus-sidecar> for
> shipping metrics to Stackdriver. It runs as a second container within our
> Prometheus stateful set.
>
>
>
>
>
> On Mon, Nov 4, 2019 at 8:46 AM Ben Mills  wrote:
>
> Hi (again) Sergio,
>
>
>
> I forgot to note that along with Prometheus, we use Grafana (with
> Prometheus as its data source) as well as Stackdriver for monitoring.
>
>
>
> As Stackdriver is still developing (i.e. does not have all the features we
> need), we tend to use it for the basics (i.e. monitoring and alerting on
> memory, cpu and disk (PVs) thresholds). More specifically, the
> Prometheus JMX exporter (noted above) scrapes all the MBeans inside
> Cassandra, exporting in the Prometheus data model. Its config map filters
> (allows) our metrics of interest, and those metrics are sent to our Grafana
> instances and to Stackdriver. We use Grafana for more advanced metric
> configs that provide deeper insight in Cassandra - e.g. read/write
> latencies and so forth. For monitoring memory utilization, we monitor both
> pod-level in Stackdriver (i.e. to avoid having a Cassandra pod oomkilled by
> kubelet) as well as inside the JVM (heap space).
>
>
>
> Hope this helps.
>
>
>
> On Mon, Nov 4, 2019 at 8:26 AM Ben Mills  wrote:
>
> Hi Sergio,
>
>
>
> Thanks for this and sorry for the slow reply.
>
>
>
> We are indeed still running Java 8 and so it's very helpful.
>
>
>
> This Cassandra cluster has been running reliably in Kubernetes for …

Re: ***UNCHECKED*** Re: Memory Recommendations for G1GC

2019-11-04 Thread Ben Mills
Hi (yet again) Sergio,

Finally, note that we use this sidecar
<https://github.com/Stackdriver/stackdriver-prometheus-sidecar> for
shipping metrics to Stackdriver. It runs as a second container within our
Prometheus stateful set.


On Mon, Nov 4, 2019 at 8:46 AM Ben Mills  wrote:

> Hi (again) Sergio,
>
> I forgot to note that along with Prometheus, we use Grafana (with
> Prometheus as its data source) as well as Stackdriver for monitoring.
>
> As Stackdriver is still developing (i.e. does not have all the features we
> need), we tend to use it for the basics (i.e. monitoring and alerting on
> memory, cpu and disk (PVs) thresholds). More specifically, the
> Prometheus JMX exporter (noted above) scrapes all the MBeans inside
> Cassandra, exporting in the Prometheus data model. Its config map filters
> (allows) our metrics of interest, and those metrics are sent to our Grafana
> instances and to Stackdriver. We use Grafana for more advanced metric
> configs that provide deeper insight in Cassandra - e.g. read/write
> latencies and so forth. For monitoring memory utilization, we monitor both
> pod-level in Stackdriver (i.e. to avoid having a Cassandra pod oomkilled by
> kubelet) as well as inside the JVM (heap space).
>
> Hope this helps.
>
> On Mon, Nov 4, 2019 at 8:26 AM Ben Mills  wrote:
>
>> Hi Sergio,
>>
>> Thanks for this and sorry for the slow reply.
>>
>> We are indeed still running Java 8 and so it's very helpful.
>>
>> This Cassandra cluster has been running reliably in Kubernetes for
>> several years, and while we've had some repair-related issues, they are not
>> related to container orchestration or the cloud environment. We don't use
>> operators and have simply built the needed Kubernetes configs (YAML
>> manifests) to handle deployment of new Docker images (when needed), and so
>> forth. We have:
>>
>> (1) ConfigMap - Cassandra environment variables
>> (2) ConfigMap - Prometheus configs for this JMX exporter
>> <https://github.com/prometheus/jmx_exporter>, which is built into the
>> image and runs as a Java agent
>> (3) PodDisruptionBudget - with minAvailable: 2 as the important setting
>> (4) Service - this is a headless service (clusterIP: None) which
>> specifies the ports for cql, jmx, prometheus, intra-node
>> (5) StatefulSet - 3 replicas, ports, health checks, resources, etc - as
>> you would expect
>>
>> We store data on persistent volumes using an SSD storage class, and use:
>> an updateStrategy of OnDelete, some affinity rules to ensure an even
>> spread of pods across our zones, Prometheus annotations for scraping the
>> metrics port, a nodeSelector and tolerations to ensure the Cassandra pods
>> run in their dedicated node pool, and a preStop hook that runs nodetool
>> drain to help with graceful shutdown when a pod is rolled.
>>
>> I'm guessing your installation is much larger than ours and so operators
>> may be a good way to go. For our needs the above has been very reliable as
>> has GCP in general.
>>
>> We are currently updating our backup/restore implementation to provide
>> better granularity with respect to restoring a specific keyspace and also
>> exploring Velero <https://github.com/vmware-tanzu/velero> for DR.
>>
>> Hope this helps.
>>
>>
>> On Fri, Nov 1, 2019 at 5:34 PM Sergio  wrote:
>>
>>> Hi Ben,
>>>
>>> Well, I had a similar question and Jon Haddad was preferring ParNew +
>>> CMS over G1GC for java 8.
>>> https://lists.apache.org/thread.html/283547619b1dcdcddb80947a45e2178158394e317f3092b8959ba879@%3Cuser.cassandra.apache.org%3E
>>> It depends on your JVM and in any case, I would test it based on your
>>> workload.
>>>
>>> What's your experience of running Cassandra in k8s. Are you using the
>>> Cassandra Kubernetes Operator?
>>>
>>> How do you monitor it and how do you perform disaster recovery backup?
>>>
>>>
>>> Best,
>>>
>>> Sergio
>>>
>>> Il giorno ven 1 nov 2019 alle ore 14:14 Ben Mills  ha
>>> scritto:
>>>
>>>> Thanks Sergio - that's good advice and we have this built into the
>>>> plan.
>>>> Have you heard a solid/consistent recommendation/requirement as to the
>>>> amount of memory heap requires for G1GC?
>>>>
>>>> On Fri, Nov 1, 2019 at 5:11 PM Sergio 
>>>> wrote:
>>>>
>>>>> In any case I would test with tlp-stress or Cassandra stress tool any
>>>>> configuration
>>>>>

Re: ***UNCHECKED*** Re: Memory Recommendations for G1GC

2019-11-04 Thread Ben Mills
Hi (again) Sergio,

I forgot to note that along with Prometheus, we use Grafana (with
Prometheus as its data source) as well as Stackdriver for monitoring.

As Stackdriver is still developing (i.e. does not have all the features we
need), we tend to use it for the basics (i.e. monitoring and alerting on
memory, cpu and disk (PVs) thresholds). More specifically, the
Prometheus JMX exporter (noted above) scrapes all the MBeans inside
Cassandra, exporting in the Prometheus data model. Its config map filters
(allows) our metrics of interest, and those metrics are sent to our Grafana
instances and to Stackdriver. We use Grafana for more advanced metric
configs that provide deeper insight in Cassandra - e.g. read/write
latencies and so forth. For monitoring memory utilization, we monitor both
pod-level in Stackdriver (i.e. to avoid having a Cassandra pod oomkilled by
kubelet) as well as inside the JVM (heap space).

Hope this helps.

On Mon, Nov 4, 2019 at 8:26 AM Ben Mills  wrote:

> Hi Sergio,
>
> Thanks for this and sorry for the slow reply.
>
> We are indeed still running Java 8 and so it's very helpful.
>
> This Cassandra cluster has been running reliably in Kubernetes for several
> years, and while we've had some repair-related issues, they are not related
> to container orchestration or the cloud environment. We don't use operators
> and have simply built the needed Kubernetes configs (YAML manifests) to
> handle deployment of new Docker images (when needed), and so forth. We have:
>
> (1) ConfigMap - Cassandra environment variables
> (2) ConfigMap - Prometheus configs for this JMX exporter
> <https://github.com/prometheus/jmx_exporter>, which is built into the
> image and runs as a Java agent
> (3) PodDisruptionBudget - with minAvailable: 2 as the important setting
> (4) Service - this is a headless service (clusterIP: None) which specifies
> the ports for cql, jmx, prometheus, intra-node
> (5) StatefulSet - 3 replicas, ports, health checks, resources, etc - as
> you would expect
>
> We store data on persistent volumes using an SSD storage class, and use:
> an updateStrategy of OnDelete, some affinity rules to ensure an even
> spread of pods across our zones, Prometheus annotations for scraping the
> metrics port, a nodeSelector and tolerations to ensure the Cassandra pods
> run in their dedicated node pool, and a preStop hook that runs nodetool
> drain to help with graceful shutdown when a pod is rolled.
>
> I'm guessing your installation is much larger than ours and so operators
> may be a good way to go. For our needs the above has been very reliable as
> has GCP in general.
>
> We are currently updating our backup/restore implementation to provide
> better granularity with respect to restoring a specific keyspace and also
> exploring Velero <https://github.com/vmware-tanzu/velero> for DR.
>
> Hope this helps.
>
>
> On Fri, Nov 1, 2019 at 5:34 PM Sergio  wrote:
>
>> Hi Ben,
>>
>> Well, I had a similar question and Jon Haddad was preferring ParNew + CMS
>> over G1GC for java 8.
>> https://lists.apache.org/thread.html/283547619b1dcdcddb80947a45e2178158394e317f3092b8959ba879@%3Cuser.cassandra.apache.org%3E
>> It depends on your JVM and in any case, I would test it based on your
>> workload.
>>
>> What's your experience of running Cassandra in k8s. Are you using the
>> Cassandra Kubernetes Operator?
>>
>> How do you monitor it and how do you perform disaster recovery backup?
>>
>>
>> Best,
>>
>> Sergio
>>
>> Il giorno ven 1 nov 2019 alle ore 14:14 Ben Mills  ha
>> scritto:
>>
>>> Thanks Sergio - that's good advice and we have this built into the plan.
>>> Have you heard a solid/consistent recommendation/requirement as to the
>>> amount of memory heap requires for G1GC?
>>>
>>> On Fri, Nov 1, 2019 at 5:11 PM Sergio  wrote:
>>>
>>>> In any case I would test with tlp-stress or Cassandra stress tool any
>>>> configuration
>>>>
>>>> Sergio
>>>>
>>>> On Fri, Nov 1, 2019, 12:31 PM Ben Mills  wrote:
>>>>
>>>>> Greetings,
>>>>>
>>>>> We are planning a Cassandra upgrade from 3.7 to 3.11.5 and considering
>>>>> a change to the GC config.
>>>>>
>>>>> What is the minimum amount of memory that needs to be allocated to
>>>>> heap space when using G1GC?
>>>>>
>>>>> For GC, we currently use CMS. Along with the version upgrade, we'll be
>>>>> running the stateful set of Cassandra pods on new machine types in a new
> node pool with 12Gi memory per node …

Re: ***UNCHECKED*** Re: Memory Recommendations for G1GC

2019-11-04 Thread Ben Mills
Hi Sergio,

Thanks for this and sorry for the slow reply.

We are indeed still running Java 8 and so it's very helpful.

This Cassandra cluster has been running reliably in Kubernetes for several
years, and while we've had some repair-related issues, they are not related
to container orchestration or the cloud environment. We don't use operators
and have simply built the needed Kubernetes configs (YAML manifests) to
handle deployment of new Docker images (when needed), and so forth. We have:

(1) ConfigMap - Cassandra environment variables
(2) ConfigMap - Prometheus configs for this JMX exporter
<https://github.com/prometheus/jmx_exporter>, which is built into the image
and runs as a Java agent
(3) PodDisruptionBudget - with minAvailable: 2 as the important setting
(4) Service - this is a headless service (clusterIP: None) which specifies
the ports for cql, jmx, prometheus, intra-node
(5) StatefulSet - 3 replicas, ports, health checks, resources, etc - as you
would expect

We store data on persistent volumes using an SSD storage class, and use: an
updateStrategy of OnDelete, some affinity rules to ensure an even spread of
pods across our zones, Prometheus annotations for scraping the metrics
port, a nodeSelector and tolerations to ensure the Cassandra pods run in
their dedicated node pool, and a preStop hook that runs nodetool drain to
help with graceful shutdown when a pod is rolled.

I'm guessing your installation is much larger than ours and so operators
may be a good way to go. For our needs the above has been very reliable as
has GCP in general.

We are currently updating our backup/restore implementation to provide
better granularity with respect to restoring a specific keyspace and also
exploring Velero <https://github.com/vmware-tanzu/velero> for DR.

Hope this helps.


On Fri, Nov 1, 2019 at 5:34 PM Sergio  wrote:

> Hi Ben,
>
> Well, I had a similar question and Jon Haddad was preferring ParNew + CMS
> over G1GC for java 8.
> https://lists.apache.org/thread.html/283547619b1dcdcddb80947a45e2178158394e317f3092b8959ba879@%3Cuser.cassandra.apache.org%3E
> It depends on your JVM and in any case, I would test it based on your
> workload.
>
> What's your experience of running Cassandra in k8s. Are you using the
> Cassandra Kubernetes Operator?
>
> How do you monitor it and how do you perform disaster recovery backup?
>
>
> Best,
>
> Sergio
>
> Il giorno ven 1 nov 2019 alle ore 14:14 Ben Mills  ha
> scritto:
>
>> Thanks Sergio - that's good advice and we have this built into the plan.
>> Have you heard a solid/consistent recommendation/requirement as to the
>> amount of memory heap requires for G1GC?
>>
>> On Fri, Nov 1, 2019 at 5:11 PM Sergio  wrote:
>>
>>> In any case I would test with tlp-stress or Cassandra stress tool any
>>> configuration
>>>
>>> Sergio
>>>
>>> On Fri, Nov 1, 2019, 12:31 PM Ben Mills  wrote:
>>>
>>>> Greetings,
>>>>
>>>> We are planning a Cassandra upgrade from 3.7 to 3.11.5 and considering
>>>> a change to the GC config.
>>>>
>>>> What is the minimum amount of memory that needs to be allocated to heap
>>>> space when using G1GC?
>>>>
>>>> For GC, we currently use CMS. Along with the version upgrade, we'll be
>>>> running the stateful set of Cassandra pods on new machine types in a new
>>>> node pool with 12Gi memory per node. Not a lot of memory but an
>>>> improvement. We may be able to go up to 16Gi memory per node. We'd like to
>>>> continue using these heap settings:
>>>>
>>>> -XX:+UnlockExperimentalVMOptions
>>>> -XX:+UseCGroupMemoryLimitForHeap
>>>> -XX:MaxRAMFraction=2
>>>>
>>>> which (if 12Gi per node) would provide 6Gi memory for heap (i.e. half
>>>> of total available).
>>>>
>>>> Here are some details on the environment and configs in the event that
>>>> something is relevant.
>>>>
>>>> Environment: Kubernetes
>>>> Environment Config: Stateful set of 3 replicas
>>>> Storage: Persistent Volumes
>>>> Storage Class: SSD
>>>> Node OS: Container-Optimized OS
>>>> Container OS: Ubuntu 16.04.3 LTS
>>>> Data Centers: 1
>>>> Racks: 3 (one per zone)
>>>> Nodes: 3
>>>> Tokens: 4
>>>> Replication Factor: 3
>>>> Replication Strategy: NetworkTopologyStrategy (all keyspaces)
>>>> Compaction Strategy: STCS (all tables)
>>>> Read/Write Requirements: Blend of both
>>>> Data Load: <1GB per node
>>>> gc_grace_seconds: default (10 days - all tables)
>>>>
>>>> GC Settings: (CMS)
>>>>
>>>> -XX:+UseParNewGC
>>>> -XX:+UseConcMarkSweepGC
>>>> -XX:+CMSParallelRemarkEnabled
>>>> -XX:SurvivorRatio=8
>>>> -XX:MaxTenuringThreshold=1
>>>> -XX:CMSInitiatingOccupancyFraction=75
>>>> -XX:+UseCMSInitiatingOccupancyOnly
>>>> -XX:CMSWaitDuration=3
>>>> -XX:+CMSParallelInitialMarkEnabled
>>>> -XX:+CMSEdenChunksRecordAlways
>>>>
>>>> Any ideas are much appreciated.
>>>>
>>>


Re: Memory Recommendations for G1GC

2019-11-01 Thread Ben Mills
Thanks Sergio - that's good advice and we have this built into the plan.
Have you heard a solid/consistent recommendation/requirement as to the
amount of memory heap requires for G1GC?

On Fri, Nov 1, 2019 at 5:11 PM Sergio  wrote:

> In any case I would test with tlp-stress or Cassandra stress tool any
> configuration
>
> Sergio
>
> On Fri, Nov 1, 2019, 12:31 PM Ben Mills  wrote:
>
>> Greetings,
>>
>> We are planning a Cassandra upgrade from 3.7 to 3.11.5 and considering a
>> change to the GC config.
>>
>> What is the minimum amount of memory that needs to be allocated to heap
>> space when using G1GC?
>>
>> For GC, we currently use CMS. Along with the version upgrade, we'll be
>> running the stateful set of Cassandra pods on new machine types in a new
>> node pool with 12Gi memory per node. Not a lot of memory but an
>> improvement. We may be able to go up to 16Gi memory per node. We'd like to
>> continue using these heap settings:
>>
>> -XX:+UnlockExperimentalVMOptions
>> -XX:+UseCGroupMemoryLimitForHeap
>> -XX:MaxRAMFraction=2
>>
>> which (if 12Gi per node) would provide 6Gi memory for heap (i.e. half of
>> total available).
>>
>> Here are some details on the environment and configs in the event that
>> something is relevant.
>>
>> Environment: Kubernetes
>> Environment Config: Stateful set of 3 replicas
>> Storage: Persistent Volumes
>> Storage Class: SSD
>> Node OS: Container-Optimized OS
>> Container OS: Ubuntu 16.04.3 LTS
>> Data Centers: 1
>> Racks: 3 (one per zone)
>> Nodes: 3
>> Tokens: 4
>> Replication Factor: 3
>> Replication Strategy: NetworkTopologyStrategy (all keyspaces)
>> Compaction Strategy: STCS (all tables)
>> Read/Write Requirements: Blend of both
>> Data Load: <1GB per node
>> gc_grace_seconds: default (10 days - all tables)
>>
>> GC Settings: (CMS)
>>
>> -XX:+UseParNewGC
>> -XX:+UseConcMarkSweepGC
>> -XX:+CMSParallelRemarkEnabled
>> -XX:SurvivorRatio=8
>> -XX:MaxTenuringThreshold=1
>> -XX:CMSInitiatingOccupancyFraction=75
>> -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:CMSWaitDuration=3
>> -XX:+CMSParallelInitialMarkEnabled
>> -XX:+CMSEdenChunksRecordAlways
>>
>> Any ideas are much appreciated.
>>
>


Re: Memory Recommendations for G1GC

2019-11-01 Thread Ben Mills
Thanks Reid,

We currently only have ~1GB data per node with a replication factor of 3.
The amount of data will certainly grow, though I have no solid projections
at this time. The current memory and CPU resources are quite low (for
Cassandra) and so along with the upgrade we plan to increase both. This
seems to be the strong recommendation from this user group.

On Fri, Nov 1, 2019 at 4:52 PM Reid Pinchback 
wrote:

> Maybe I’m missing something.  You’re expecting less than 1 gig of data per
> node?  Unless this is some situation of super-high data churn/brief TTL, it
> sounds like you’ll end up with your entire database in memory.
>
>
>
> *From: *Ben Mills 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Friday, November 1, 2019 at 3:31 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Memory Recommendations for G1GC
>
>
>
>
> Greetings,
>
>
>
> We are planning a Cassandra upgrade from 3.7 to 3.11.5 and considering a
> change to the GC config.
>
>
>
> What is the minimum amount of memory that needs to be allocated to heap
> space when using G1GC?
>
>
>
> For GC, we currently use CMS. Along with the version upgrade, we'll be
> running the stateful set of Cassandra pods on new machine types in a new
> node pool with 12Gi memory per node. Not a lot of memory but an
> improvement. We may be able to go up to 16Gi memory per node. We'd like to
> continue using these heap settings:
>
>
> -XX:+UnlockExperimentalVMOptions
> -XX:+UseCGroupMemoryLimitForHeap
> -XX:MaxRAMFraction=2
>
>
>
> which (if 12Gi per node) would provide 6Gi memory for heap (i.e. half of
> total available).
>
>
>
> Here are some details on the environment and configs in the event that
> something is relevant.
>
>
>
> Environment: Kubernetes
> Environment Config: Stateful set of 3 replicas
> Storage: Persistent Volumes
> Storage Class: SSD
> Node OS: Container-Optimized OS
> Container OS: Ubuntu 16.04.3 LTS
> Data Centers: 1
> Racks: 3 (one per zone)
> Nodes: 3
> Tokens: 4
> Replication Factor: 3
> Replication Strategy: NetworkTopologyStrategy (all keyspaces)
> Compaction Strategy: STCS (all tables)
> Read/Write Requirements: Blend of both
> Data Load: <1GB per node
> gc_grace_seconds: default (10 days - all tables)
>
> GC Settings: (CMS)
>
> -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC
> -XX:+CMSParallelRemarkEnabled
> -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:CMSWaitDuration=3
> -XX:+CMSParallelInitialMarkEnabled
> -XX:+CMSEdenChunksRecordAlways
>
>
>
> Any ideas are much appreciated.
>


Memory Recommendations for G1GC

2019-11-01 Thread Ben Mills
Greetings,

We are planning a Cassandra upgrade from 3.7 to 3.11.5 and considering a
change to the GC config.

What is the minimum amount of memory that needs to be allocated to heap
space when using G1GC?

For GC, we currently use CMS. Along with the version upgrade, we'll be
running the stateful set of Cassandra pods on new machine types in a new
node pool with 12Gi memory per node. Not a lot of memory but an
improvement. We may be able to go up to 16Gi memory per node. We'd like to
continue using these heap settings:

-XX:+UnlockExperimentalVMOptions
-XX:+UseCGroupMemoryLimitForHeap
-XX:MaxRAMFraction=2

which (if 12Gi per node) would provide 6Gi memory for heap (i.e. half of
total available).
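In shell terms, those three settings would typically be appended to JVM_OPTS in cassandra-env.sh (the file name and placement are the usual convention, an assumption here):

```shell
# Append the container-aware heap flags (experimental in Java 8u131+).
JVM_OPTS="$JVM_OPTS -XX:+UnlockExperimentalVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseCGroupMemoryLimitForHeap"
JVM_OPTS="$JVM_OPTS -XX:MaxRAMFraction=2"
echo "$JVM_OPTS"
```

Worth noting that newer JDKs (8u191+ and 10+) supersede these experimental flags with -XX:+UseContainerSupport and the finer-grained -XX:MaxRAMPercentage.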

Here are some details on the environment and configs in the event that
something is relevant.

Environment: Kubernetes
Environment Config: Stateful set of 3 replicas
Storage: Persistent Volumes
Storage Class: SSD
Node OS: Container-Optimized OS
Container OS: Ubuntu 16.04.3 LTS
Data Centers: 1
Racks: 3 (one per zone)
Nodes: 3
Tokens: 4
Replication Factor: 3
Replication Strategy: NetworkTopologyStrategy (all keyspaces)
Compaction Strategy: STCS (all tables)
Read/Write Requirements: Blend of both
Data Load: <1GB per node
gc_grace_seconds: default (10 days - all tables)

GC Settings: (CMS)

-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSWaitDuration=3
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways

Any ideas are much appreciated.
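For comparison with the CMS flags above, a commonly cited G1 starting point is sketched below (roughly what ships commented-out in Cassandra 3.11's jvm.options; the values are baselines to load-test against your workload, not a recommendation):

```shell
# G1 baseline flags; the pause target and RSet percentage are starting
# points to tune, not fixed answers.
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
echo "$JVM_OPTS"
```

Unlike the CMS list, G1 generally needs no survivor/tenuring tuning; heap size (-Xms/-Xmx) would still be set separately.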


Re: Repair Issues

2019-10-26 Thread Ben Mills
Thanks Ghiyasi.

On Sat, Oct 26, 2019 at 9:17 AM Hossein Ghiyasi Mehr 
wrote:

> If the problem still exists and all nodes are up, reboot them one by one.
> Then try to repair one node. After that, repair the other nodes one by one.
>
> On Fri, Oct 25, 2019 at 12:56 AM Ben Mills  wrote:
>
>>
>> Thanks Jon!
>>
>> This is very helpful - allow me to follow-up and ask a question.
>>
>> (1) Yes, incremental repairs will never be used (unless it becomes viable
>> in Cassandra 4.x someday).
>> (2) I hear you on the JVM - will look into that.
>> (3) Been looking at Cassandra version 3.11.x though was unaware that 3.7
>> is considered non-viable for production use.
>>
>> For (4) - Question/Request:
>>
>> Note that with:
>>
>> -XX:MaxRAMFraction=2
>>
>> the actual amount of memory allocated for heap space is effectively 2Gi
>> (i.e. half of the 4Gi allocated on the machine type). We can definitely
>> increase memory (for heap and nonheap), though can you expand a bit on your
>> heap comment to help my understanding (as this is such a small cluster with
>> such a small amount of data at rest)?
>>
>> Thanks again.
>>
>> On Thu, Oct 24, 2019 at 5:11 PM Jon Haddad  wrote:
>>
>>> There's some major warning signs for me with your environment.  4GB heap
>>> is too low, and Cassandra 3.7 isn't something I would put into production.
>>>
>>> Your surface area for problems is massive right now.  Things I'd do:
>>>
>>> 1. Never use incremental repair.  Seems like you've already stopped
>>> doing them, but it's worth mentioning.
>>> 2. Upgrade to the latest JVM, that version's way out of date.
>>> 3. Upgrade to Cassandra 3.11.latest (we're voting on 3.11.5 right now).
>>> 4. Increase memory to 8GB minimum, preferably 12.
>>>
>>> I usually don't like making a bunch of changes without knowing the root
>>> cause of a problem, but in your case there's so many potential problems I
>>> don't think it's worth digging into, especially since the problem might be
>>> one of the 500 or so bugs that were fixed since this release.
>>>
>>> Once you've done those things it'll be easier to narrow down the problem.
>>>
>>> Jon
>>>
>>>
>>> On Thu, Oct 24, 2019 at 4:59 PM Ben Mills  wrote:
>>>
>>>> Hi Sergio,
>>>>
>>>> No, not at this time.
>>>>
>>>> It was in use with this cluster previously, and while there were no
>>>> reaper-specific issues, it was removed to help simplify investigation of
>>>> the underlying repair issues I've described.
>>>>
>>>> Thanks.
>>>>
>>>> On Thu, Oct 24, 2019 at 4:21 PM Sergio 
>>>> wrote:
>>>>
>>>>> Are you using Cassandra reaper?
>>>>>
>>>>> On Thu, Oct 24, 2019, 12:31 PM Ben Mills  wrote:
>>>>>
>>>>>> Greetings,
>>>>>>
>>>>>> Inherited a small Cassandra cluster with some repair issues and need
>>>>>> some advice on recommended next steps. Apologies in advance for a long
>>>>>> email.
>>>>>>
>>>>>> Issue:
>>>>>>
>>>>>> Intermittent repair failures on two non-system keyspaces.
>>>>>>
>>>>>> - platform_users
>>>>>> - platform_management
>>>>>>
>>>>>> Repair Type:
>>>>>>
>>>>>> Full, parallel repairs are run on each of the three nodes every five
>>>>>> days.
>>>>>>
>>>>>> Repair command output for a typical failure:
>>>>>>
>>>>>> [2019-10-18 00:22:09,109] Starting repair command #46, repairing
>>>>>> keyspace platform_users with repair options (parallelism: parallel, 
>>>>>> primary
>>>>>> range: false, incremental: false, job threads: 1, ColumnFamilies: [],
>>>>>> dataCenters: [], hosts: [], # of ranges: 12)
>>>>>> [2019-10-18 00:22:09,242] Repair session
>>>>>> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
>>>>>> [(-1890954128429545684,2847510199483651721],
>>>>>> (8249813014782655320,-8746483007209345011],
>>>>>> (4299912178579297893,6811748355903297393],
>>>>>> (-8746483007209345011,-8628999431140554276],

Re: Repair Issues

2019-10-24 Thread Ben Mills
Thanks Jon!

This is very helpful - allow me to follow-up and ask a question.

(1) Yes, incremental repairs will never be used (unless it becomes viable
in Cassandra 4.x someday).
(2) I hear you on the JVM - will look into that.
(3) Been looking at Cassandra version 3.11.x though was unaware that 3.7 is
considered non-viable for production use.

For (4) - Question/Request:

Note that with:

-XX:MaxRAMFraction=2

the actual amount of memory allocated for heap space is effectively 2Gi
(i.e. half of the 4Gi allocated on the machine type). We can definitely
increase memory (for heap and nonheap), though can you expand a bit on your
heap comment to help my understanding (as this is such a small cluster with
such a small amount of data at rest)?

Thanks again.

On Thu, Oct 24, 2019 at 5:11 PM Jon Haddad  wrote:

> There's some major warning signs for me with your environment.  4GB heap
> is too low, and Cassandra 3.7 isn't something I would put into production.
>
> Your surface area for problems is massive right now.  Things I'd do:
>
> 1. Never use incremental repair.  Seems like you've already stopped doing
> them, but it's worth mentioning.
> 2. Upgrade to the latest JVM, that version's way out of date.
> 3. Upgrade to Cassandra 3.11.latest (we're voting on 3.11.5 right now).
> 4. Increase memory to 8GB minimum, preferably 12.
>
> I usually don't like making a bunch of changes without knowing the root
> cause of a problem, but in your case there's so many potential problems I
> don't think it's worth digging into, especially since the problem might be
> one of the 500 or so bugs that were fixed since this release.
>
> Once you've done those things it'll be easier to narrow down the problem.
>
> Jon
>
>
> On Thu, Oct 24, 2019 at 4:59 PM Ben Mills  wrote:
>
>> Hi Sergio,
>>
>> No, not at this time.
>>
>> It was in use with this cluster previously, and while there were no
>> reaper-specific issues, it was removed to help simplify investigation of
>> the underlying repair issues I've described.
>>
>> Thanks.
>>
>> On Thu, Oct 24, 2019 at 4:21 PM Sergio  wrote:
>>
>>> Are you using Cassandra reaper?
>>>
>>> On Thu, Oct 24, 2019, 12:31 PM Ben Mills  wrote:
>>>
>>>> Greetings,
>>>>
>>>> Inherited a small Cassandra cluster with some repair issues and need
>>>> some advice on recommended next steps. Apologies in advance for a long
>>>> email.
>>>>
>>>> Issue:
>>>>
>>>> Intermittent repair failures on two non-system keyspaces.
>>>>
>>>> - platform_users
>>>> - platform_management
>>>>
>>>> Repair Type:
>>>>
>>>> Full, parallel repairs are run on each of the three nodes every five
>>>> days.
>>>>
>>>> Repair command output for a typical failure:
>>>>
>>>> [2019-10-18 00:22:09,109] Starting repair command #46, repairing
>>>> keyspace platform_users with repair options (parallelism: parallel, primary
>>>> range: false, incremental: false, job threads: 1, ColumnFamilies: [],
>>>> dataCenters: [], hosts: [], # of ranges: 12)
>>>> [2019-10-18 00:22:09,242] Repair session
>>>> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
>>>> [(-1890954128429545684,2847510199483651721],
>>>> (8249813014782655320,-8746483007209345011],
>>>> (4299912178579297893,6811748355903297393],
>>>> (-8746483007209345011,-8628999431140554276],
>>>> (-5865769407232506956,-4746990901966533744],
>>>> (-4470950459111056725,-1890954128429545684],
>>>> (4001531392883953257,4299912178579297893],
>>>> (6811748355903297393,6878104809564599690],
>>>> (6878104809564599690,8249813014782655320],
>>>> (-4746990901966533744,-4470950459111056725],
>>>> (-8628999431140554276,-5865769407232506956],
>>>> (2847510199483651721,4001531392883953257]] failed with error [repair
>>>> #5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
>>>> [(-1890954128429545684,2847510199483651721],
>>>> (8249813014782655320,-8746483007209345011],
>>>> (4299912178579297893,6811748355903297393],
>>>> (-8746483007209345011,-8628999431140554276],
>>>> (-5865769407232506956,-4746990901966533744],
>>>> (-4470950459111056725,-1890954128429545684],
>>>> (4001531392883953257,4299912178579297893],
>>>> (6811748355903297393,6878104809564599690],
>>>> (68781048095645

Re: Repair Issues

2019-10-24 Thread Ben Mills
Hi Reid,

Many thanks - I have seen that article though will definitely give it
another read.

Note that nodetool scrub has been tried (no effect) and sstablescrub cannot
currently be run with the Cassandra image in use (though certainly a new
image that allows the server to be stopped but keeps the operating
environment available to use this utility can be built - just haven't done
so yet). Note also that none of the logs indicate that a corrupt data
file (or files) is in play here - worth noting because the article
includes a solution where a specific data file is manually deleted and then
repairs are run to restore the file from a different node in the cluster.
Also, the way persistent volumes are mounted onto [Kubernetes] nodes
prevents this solution (manual deletion of an offending data file) from
being viable because the PV mount on the node's filesystem is detached when
the pods are down. This is a subtlety of running Cassandra in Kubernetes.

On Thu, Oct 24, 2019 at 4:24 PM Reid Pinchback 
wrote:

> Ben, you may find this helpful:
>
>
>
> https://blog.pythian.com/so-you-have-a-broken-cassandra-sstable-file/
>
>
>
>
>
> *From: *Ben Mills 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Thursday, October 24, 2019 at 3:31 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Repair Issues
>
>
>
>
> Greetings,
>
> Inherited a small Cassandra cluster with some repair issues and need some
> advice on recommended next steps. Apologies in advance for a long email.
>
> Issue:
>
> Intermittent repair failures on two non-system keyspaces.
>
> - platform_users
> - platform_management
>
> Repair Type:
>
> Full, parallel repairs are run on each of the three nodes every five days.
>
> Repair command output for a typical failure:
>
> [2019-10-18 00:22:09,109] Starting repair command #46, repairing keyspace
> platform_users with repair options (parallelism: parallel, primary range:
> false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters:
> [], hosts: [], # of ranges: 12)
> [2019-10-18 00:22:09,242] Repair session
> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
> [(-1890954128429545684,2847510199483651721],
> (8249813014782655320,-8746483007209345011],
> (4299912178579297893,6811748355903297393],
> (-8746483007209345011,-8628999431140554276],
> (-5865769407232506956,-4746990901966533744],
> (-4470950459111056725,-1890954128429545684],
> (4001531392883953257,4299912178579297893],
> (6811748355903297393,6878104809564599690],
> (6878104809564599690,8249813014782655320],
> (-4746990901966533744,-4470950459111056725],
> (-8628999431140554276,-5865769407232506956],
> (2847510199483651721,4001531392883953257]] failed with error [repair
> #5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
> [(-1890954128429545684,2847510199483651721],
> (8249813014782655320,-8746483007209345011],
> (4299912178579297893,6811748355903297393],
> (-8746483007209345011,-8628999431140554276],
> (-5865769407232506956,-4746990901966533744],
> (-4470950459111056725,-1890954128429545684],
> (4001531392883953257,4299912178579297893],
> (6811748355903297393,6878104809564599690],
> (6878104809564599690,8249813014782655320],
> (-4746990901966533744,-4470950459111056725],
> (-8628999431140554276,-5865769407232506956],
> (2847510199483651721,4001531392883953257]]] Validation failed in /10.x.x.x
> (progress: 26%)
> [2019-10-18 00:22:09,246] Some repair failed
> [2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds
>
> Additional Notes:
>
> Repairs encounter above failures more often than not. Sometimes on one
> node only, though occasionally on two. Sometimes just one of the two
> keyspaces, sometimes both. Apparently the previous repair schedule for
> this cluster included incremental repairs (script alternated between
> incremental and full repairs). After reading this TLP article:
>
>
> https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
>
> the repair script was replaced with cassandra-reaper (v1.4.0), which was
> run with its default configs. Reaper was fine but only obscured the ongoing
> issues (it did not resolve them) and complicated the debugging process and
> so was then removed. The current repair schedule is as described above
> under Repair Type.
>
> At

Re: Repair Issues

2019-10-24 Thread Ben Mills
Hi Sergio,

No, not at this time.

It was in use with this cluster previously, and while there were no
reaper-specific issues, it was removed to help simplify investigation of
the underlying repair issues I've described.

Thanks.

On Thu, Oct 24, 2019 at 4:21 PM Sergio  wrote:

> Are you using Cassandra reaper?
>
> On Thu, Oct 24, 2019, 12:31 PM Ben Mills  wrote:
>
>> Greetings,
>>
>> Inherited a small Cassandra cluster with some repair issues and need some
>> advice on recommended next steps. Apologies in advance for a long email.
>>
>> Issue:
>>
>> Intermittent repair failures on two non-system keyspaces.
>>
>> - platform_users
>> - platform_management
>>
>> Repair Type:
>>
>> Full, parallel repairs are run on each of the three nodes every five days.
>>
>> Repair command output for a typical failure:
>>
>> [2019-10-18 00:22:09,109] Starting repair command #46, repairing keyspace
>> platform_users with repair options (parallelism: parallel, primary range:
>> false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters:
>> [], hosts: [], # of ranges: 12)
>> [2019-10-18 00:22:09,242] Repair session
>> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
>> [(-1890954128429545684,2847510199483651721],
>> (8249813014782655320,-8746483007209345011],
>> (4299912178579297893,6811748355903297393],
>> (-8746483007209345011,-8628999431140554276],
>> (-5865769407232506956,-4746990901966533744],
>> (-4470950459111056725,-1890954128429545684],
>> (4001531392883953257,4299912178579297893],
>> (6811748355903297393,6878104809564599690],
>> (6878104809564599690,8249813014782655320],
>> (-4746990901966533744,-4470950459111056725],
>> (-8628999431140554276,-5865769407232506956],
>> (2847510199483651721,4001531392883953257]] failed with error [repair
>> #5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
>> [(-1890954128429545684,2847510199483651721],
>> (8249813014782655320,-8746483007209345011],
>> (4299912178579297893,6811748355903297393],
>> (-8746483007209345011,-8628999431140554276],
>> (-5865769407232506956,-4746990901966533744],
>> (-4470950459111056725,-1890954128429545684],
>> (4001531392883953257,4299912178579297893],
>> (6811748355903297393,6878104809564599690],
>> (6878104809564599690,8249813014782655320],
>> (-4746990901966533744,-4470950459111056725],
>> (-8628999431140554276,-5865769407232506956],
>> (2847510199483651721,4001531392883953257]]] Validation failed in /10.x.x.x
>> (progress: 26%)
>> [2019-10-18 00:22:09,246] Some repair failed
>> [2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds
>>
>> Additional Notes:
>>
>> Repairs encounter above failures more often than not. Sometimes on one
>> node only, though occasionally on two. Sometimes just one of the two
>> keyspaces, sometimes both. Apparently the previous repair schedule for
>> this cluster included incremental repairs (script alternated between
>> incremental and full repairs). After reading this TLP article:
>>
>>
>> https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
>>
>> the repair script was replaced with cassandra-reaper (v1.4.0), which was
>> run with its default configs. Reaper was fine but only obscured the ongoing
>> issues (it did not resolve them) and complicated the debugging process and
>> so was then removed. The current repair schedule is as described above
>> under Repair Type.
>>
>> Attempts at Resolution:
>>
>> (1) nodetool scrub was attempted on the offending keyspaces/tables to no
>> effect.
>>
>> (2) sstablescrub has not been attempted due to the current design of the
>> Docker image that runs Cassandra in each Kubernetes pod - i.e. there is no
>> way to stop the server to run this utility without killing the only pid
>> running in the container.
>>
>> Related Error:
>>
>> Not sure if this is related, though sometimes, when either:
>>
>> (a) Running nodetool snapshot, or
>> (b) Rolling a pod that runs a Cassandra node, which calls nodetool drain
>> prior to shutdown,
>>
>> the following error is thrown:
>>
>> -- StackTrace --
>> java.lang.RuntimeException: Last written key
>> DecoratedKey(10df3ba1-6eb2-4c8e-bddd-c0c7af586bda,
>> 10df3ba16eb24c8ebdddc0c7af586bda) >= current key
>> DecoratedKey(----,
>> 17343121887f480c9ba87c0e32206b74) writing into
>> /cassandra_data/data/platform_management/device_by_tenant_v2-e91529202ccf11e7ab96

Repair Issues

2019-10-24 Thread Ben Mills
Greetings,

Inherited a small Cassandra cluster with some repair issues and need some
advice on recommended next steps. Apologies in advance for a long email.

Issue:

Intermittent repair failures on two non-system keyspaces.

- platform_users
- platform_management

Repair Type:

Full, parallel repairs are run on each of the three nodes every five days.
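For reference, a sketch of the flags that correspond to those repair options (flag names assume 3.x nodetool, where parallel validation is the default and -full disables incremental repair; verify against your build):

```python
# Sketch only: assemble the nodetool arguments implied by the repair options
# reported in the failure output (parallelism: parallel, primary range: false,
# incremental: false). Not an official wrapper - flag names assume 3.x.
def repair_args(keyspace, full=True, primary_range=False, sequential=False):
    args = ["nodetool", "repair"]
    if full:
        args.append("-full")  # force a full (non-incremental) repair
    if primary_range:
        args.append("-pr")    # limit to this node's primary token ranges
    if sequential:
        args.append("-seq")   # sequential instead of parallel validation
    args.append(keyspace)
    return args

print(" ".join(repair_args("platform_users")))
# -> nodetool repair -full platform_users
```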

Repair command output for a typical failure:

[2019-10-18 00:22:09,109] Starting repair command #46, repairing keyspace
platform_users with repair options (parallelism: parallel, primary range:
false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters:
[], hosts: [], # of ranges: 12)
[2019-10-18 00:22:09,242] Repair session
5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
[(-1890954128429545684,2847510199483651721],
(8249813014782655320,-8746483007209345011],
(4299912178579297893,6811748355903297393],
(-8746483007209345011,-8628999431140554276],
(-5865769407232506956,-4746990901966533744],
(-4470950459111056725,-1890954128429545684],
(4001531392883953257,4299912178579297893],
(6811748355903297393,6878104809564599690],
(6878104809564599690,8249813014782655320],
(-4746990901966533744,-4470950459111056725],
(-8628999431140554276,-5865769407232506956],
(2847510199483651721,4001531392883953257]] failed with error [repair
#5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
[(-1890954128429545684,2847510199483651721],
(8249813014782655320,-8746483007209345011],
(4299912178579297893,6811748355903297393],
(-8746483007209345011,-8628999431140554276],
(-5865769407232506956,-4746990901966533744],
(-4470950459111056725,-1890954128429545684],
(4001531392883953257,4299912178579297893],
(6811748355903297393,6878104809564599690],
(6878104809564599690,8249813014782655320],
(-4746990901966533744,-4470950459111056725],
(-8628999431140554276,-5865769407232506956],
(2847510199483651721,4001531392883953257]]] Validation failed in /10.x.x.x
(progress: 26%)
[2019-10-18 00:22:09,246] Some repair failed
[2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds

Additional Notes:

Repairs encounter above failures more often than not. Sometimes on one node
only, though occasionally on two. Sometimes just one of the two keyspaces,
sometimes both. Apparently the previous repair schedule for this cluster
included incremental repairs (script alternated between incremental and
full repairs). After reading this TLP article:

https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html

the repair script was replaced with cassandra-reaper (v1.4.0), which was
run with its default configs. Reaper was fine but only obscured the ongoing
issues (it did not resolve them) and complicated the debugging process and
so was then removed. The current repair schedule is as described above
under Repair Type.

Attempts at Resolution:

(1) nodetool scrub was attempted on the offending keyspaces/tables to no
effect.

(2) sstablescrub has not been attempted due to the current design of the
Docker image that runs Cassandra in each Kubernetes pod - i.e. there is no
way to stop the server to run this utility without killing the only pid
running in the container.

Related Error:

Not sure if this is related, though sometimes, when either:

(a) Running nodetool snapshot, or
(b) Rolling a pod that runs a Cassandra node, which calls nodetool drain
prior to shutdown,

the following error is thrown:

-- StackTrace --
java.lang.RuntimeException: Last written key
DecoratedKey(10df3ba1-6eb2-4c8e-bddd-c0c7af586bda,
10df3ba16eb24c8ebdddc0c7af586bda) >= current key
DecoratedKey(----,
17343121887f480c9ba87c0e32206b74) writing into
/cassandra_data/data/platform_management/device_by_tenant_v2-e91529202ccf11e7ab96d5693708c583/.device_by_tenant_tags_idx/mb-45-big-Data.db
at
org.apache.cassandra.io.sstable.format.big.BigTableWriter.beforeAppend(BigTableWriter.java:114)
at
org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:153)
at
org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
at
org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:441)
at
org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:477)
at
org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:363)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
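That exception means the memtable flush handed the SSTable writer a partition key that does not sort after the previously written one. A minimal illustrative sketch of the invariant (not Cassandra's actual code - real DecoratedKeys compare by token, not by raw string):

```python
# Illustrative only: an SSTable writer requires partitions appended in
# strictly increasing key order; a key <= the last written key aborts the
# flush, as in the BigTableWriter.beforeAppend trace above.
class OrderedWriter:
    def __init__(self):
        self.last_key = None

    def append(self, key):
        if self.last_key is not None and key <= self.last_key:
            raise RuntimeError(
                "Last written key %r >= current key %r" % (self.last_key, key))
        self.last_key = key

w = OrderedWriter()
w.append("10df3ba16eb24c8ebdddc0c7af586bda")
try:
    w.append("17343121887f480c9ba87c0e32206b74")  # sorts after: accepted
    w.append("00000000000000000000000000000000")  # out of order: rejected
except RuntimeError as e:
    print("rejected:", e)
```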

Here are some details on the environment and configs in the event that
something is relevant.

Environment: Kubernetes
Environment Config: Stateful set of 3 replicas
Storage: Persistent Volumes
Storage Class: SSD
Node OS: Container-Optimized OS
Container OS: Ubuntu 16.