Hi Alain,

Thank you for chiming in!

I was planning to run the 'start_native_transport=false' test as well,
and indeed the issue does not show up. Starting a node with the native
transport disabled and letting it settle leads to no timeout exceptions
and no dropped messages, just a crystal-clean startup. Agreed, it is a
workaround.
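
For the record, the workaround boils down to one cassandra.yaml key,
plus (if one wants to re-open CQL afterwards without a restart) a
nodetool command; a sketch:

    # cassandra.yaml: boot with CQL clients locked out
    start_native_transport: false

    # once compactions/hints have settled, re-enable the native transport
    nodetool enablebinary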

# About upgrading:
Yes, I desperately want to upgrade, even though it is a long and slow
task. Just reviewing all the changes from 3.0.6 to 3.0.17 is going to
be a huge pain. Off the top of your head, is there any breaking change
I should absolutely review?
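
In case it helps, this is roughly how I plan to scope the review
(assuming a checkout of the Apache Cassandra git repo and its release
tags; treat this as a sketch):

    # diff the upgrade/release notes between the two versions
    git clone https://github.com/apache/cassandra.git && cd cassandra
    git diff cassandra-3.0.6..cassandra-3.0.17 -- NEWS.txt CHANGES.txt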

# describecluster output: YES they agree on the same schema version
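
(For reference, that was checked with a plain:

    nodetool describecluster

and all six nodes report the same schema UUID, matching the SCHEMA
entries in the gossipinfo output below.)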

# keyspaces:
system             WITH replication = {'class': 'LocalStrategy'}
system_schema      WITH replication = {'class': 'LocalStrategy'}
system_auth        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}
system_distributed WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}
system_traces      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'}

<custom1>          WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}
<custom2>          WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}

# Snitch
Ec2Snitch

## About Snitch and replication:
- We have the default DC and all nodes are in the same RACK
- We are planning to move to GossipingPropertyFileSnitch, configuring
cassandra-rackdc.properties accordingly (see the sketch after this
list).
-- This should be a transparent change, correct?

- Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy' with
'us-xxxx' DC and replica counts as before
- Then adding a new DC inside the VPC, but this is another story...
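
To make the plan concrete, a minimal sketch (the dc/rack values are
placeholders and must match exactly what Ec2Snitch reports today for
the switch to be transparent):

    # cassandra-rackdc.properties, identical on every node
    dc=<some-ec2-dc>
    rack=rr

    # later, per keyspace, same replica counts under NTS:
    ALTER KEYSPACE <custom1> WITH replication =
      {'class': 'NetworkTopologyStrategy', '<some-ec2-dc>': 3};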

Any concerns here?

# nodetool status <ks>
--  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.x.x.a  177 GB     256     50.3%             d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
UN  10.x.x.b  152.46 GB  256     51.8%             7888c077-346b-4e09-96b0-9f6376b8594f  rr
UN  10.x.x.c  159.59 GB  256     49.0%             329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
UN  10.x.x.d  162.44 GB  256     49.3%             07038c11-d200-46a0-9f6a-6e2465580fb1  rr
UN  10.x.x.e  174.9 GB   256     50.5%             c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
UN  10.x.x.f  194.71 GB  256     49.2%             f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr

# gossipinfo
/10.x.x.a
  STATUS:827:NORMAL,-1350078789194251746
  LOAD:289986:1.90078037902E11
  SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:290040:0.5934718251228333
  NET_VERSION:1:10
  HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
  RPC_READY:868:true
  TOKENS:826:<hidden>
/10.x.x.b
  STATUS:16:NORMAL,-1023229528754013265
  LOAD:7113:1.63730480619E11
  SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:7274:0.5988024473190308
  NET_VERSION:1:10
  HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
  TOKENS:15:<hidden>
/10.x.x.c
  STATUS:732:NORMAL,-1117172759238888547
  LOAD:245839:1.71409806942E11
  SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:245989:0.0
  NET_VERSION:1:10
  HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
  RPC_READY:763:true
  TOKENS:731:<hidden>
/10.x.x.d
  STATUS:14:NORMAL,-1004942496246544417
  LOAD:313125:1.74447964917E11
  SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:313215:0.25641027092933655
  NET_VERSION:1:10
  HOST_ID:2:07038c11-d200-46a0-9f6a-6e2465580fb1
  RPC_READY:56:true
  TOKENS:13:<hidden>
/10.x.x.e
  STATUS:520:NORMAL,-1058809960483771749
  LOAD:276118:1.87831573032E11
  SCHEMA:267327:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:276217:0.32786884903907776
  NET_VERSION:1:10
  HOST_ID:2:c35b5d51-2d14-4334-9ffc-726f9dd8a214
  RPC_READY:550:true
  TOKENS:519:<hidden>
/10.x.x.f
  STATUS:1081:NORMAL,-1039671799603495012
  LOAD:239114:2.09082017545E11
  SCHEMA:230229:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:239180:0.5665722489356995
  NET_VERSION:1:10
  HOST_ID:2:f20f7a87-d5d2-4f38-a963-21e24167b8ac
  RPC_READY:1118:true
  TOKENS:1080:<hidden>

## About load and tokens:
- While the load is pretty even, this does not apply to the tokens; I
guess we have some table with an uneven distribution. This should not
be the case for the high-load tables, as their partition keys are built
from 'id + <some time format>'.
- I was not able to find any documentation about the numbers printed
next to LOAD, SCHEMA, SEVERITY, RPC_READY ... Is there any doc around?

# Tombstones
No ERRORs, only WARNs about one very specific table that we are aware
of. It is an append-only table read by Spark from a batch job. (I guess
it is a read_repair chance or DTCS misconfiguration; see the sketch
below.)
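
If it does turn out to be read repair dragging in wide partitions,
these are the per-table knobs I would look at (table name is
hypothetical, just a sketch):

    ALTER TABLE <ks>.<append_only_table>
      WITH read_repair_chance = 0.0
      AND dclocal_read_repair_chance = 0.0;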

## Closing note!
We are on very old m1.xlarge instances: 4 vCPUs and RAID0 (stripe)
across the 4 spinning drives. Some changes to cassandra.yaml:

- dynamic_snitch: false
- concurrent_reads: 48
- concurrent_compactors: 1 (was 2)
- disk_optimization_strategy: spinning
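
In cassandra.yaml form, with the stock 3.0 defaults noted for context
(defaults taken from the shipped cassandra.yaml, worth double-checking):

    dynamic_snitch: false                  # default: true
    concurrent_reads: 48                   # default: 32
    concurrent_compactors: 1               # was 2; default: min(#disks, #cores)
    disk_optimization_strategy: spinning   # default: ssd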

I have some concerns about the concurrent_compactors value. What do you
think?

Thanks!
