Hi Alain,

Thank you for chiming in!

I was planning to run the 'start_native_transport=false' test as well, and indeed the issue does not show up. Starting a node with native transport disabled and letting it cool down leads to no timeout exceptions and no dropped messages, simply a crystal-clean startup. Agreed, it is a workaround.

# About upgrading:
Yes, I desperately want to upgrade, even though it is a long and slow task. Just reviewing all the changes from 3.0.6 to 3.0.17 is going to be a huge pain. Off the top of your head, is there any breaking change I should absolutely review?

# describecluster output:
Yes, all nodes agree on the same schema version.

# keyspaces:
system WITH replication = {'class': 'LocalStrategy'}
system_schema WITH replication = {'class': 'LocalStrategy'}
system_auth WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}
system_distributed WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}
system_traces WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'}
<custom1> WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}
<custom2> WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}

# Snitch
Ec2Snitch

## About Snitch and replication:
- We have the default DC and all nodes are in the same RACK
- We are planning to move to GossipingPropertyFileSnitch, configuring cassandra-rackdc.properties accordingly.
-- This should be a transparent change, correct?
- Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy' with the 'us-xxxx' DC and the same replica counts as before
- Then we will add a new DC inside the VPC, but that is another story...

Any concerns here?

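To make the planned strategy change concrete, here is a minimal sketch that generates the ALTER KEYSPACE statements for the SimpleStrategy keyspaces listed above, keeping the replica counts unchanged. The names `custom1` and `custom2` are placeholders for the redacted keyspaces, 'us-xxxx' is the DC name as written above, and the helper functions are purely illustrative, not part of any Cassandra tooling.

```python
from typing import Dict, List

# Replica counts copied from the keyspace listing above.
# custom1/custom2 are placeholders for the redacted keyspace names.
SIMPLE_STRATEGY_RF: Dict[str, int] = {
    "system_auth": 1,
    "system_distributed": 3,
    "system_traces": 2,
    "custom1": 3,
    "custom2": 3,
}


def alter_keyspace_cql(keyspace: str, dc: str, rf: int) -> str:
    """Build a CQL statement that switches one keyspace to
    NetworkTopologyStrategy with `rf` replicas in data center `dc`."""
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', '{dc}': '{rf}'}};"
    )


def migration_statements(dc: str, rfs: Dict[str, int] = SIMPLE_STRATEGY_RF) -> List[str]:
    """One ALTER KEYSPACE statement per keyspace, replica counts unchanged."""
    return [alter_keyspace_cql(ks, dc, rf) for ks, rf in rfs.items()]


if __name__ == "__main__":
    for statement in migration_statements("us-xxxx"):
        print(statement)
```

The DC name passed in has to match exactly what the snitch reports for the nodes, so the value would need to be checked against the DC field in gossipinfo before running anything like this.
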
# nodetool status <ks>
--  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.x.x.a  177 GB     256     50.3%             d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
UN  10.x.x.b  152.46 GB  256     51.8%             7888c077-346b-4e09-96b0-9f6376b8594f  rr
UN  10.x.x.c  159.59 GB  256     49.0%             329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
UN  10.x.x.d  162.44 GB  256     49.3%             07038c11-d200-46a0-9f6a-6e2465580fb1  rr
UN  10.x.x.e  174.9 GB   256     50.5%             c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
UN  10.x.x.f  194.71 GB  256     49.2%             f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr

# gossipinfo
/10.x.x.a
  STATUS:827:NORMAL,-1350078789194251746
  LOAD:289986:1.90078037902E11
  SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:290040:0.5934718251228333
  NET_VERSION:1:10
  HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
  RPC_READY:868:true
  TOKENS:826:<hidden>
/10.x.x.b
  STATUS:16:NORMAL,-1023229528754013265
  LOAD:7113:1.63730480619E11
  SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:7274:0.5988024473190308
  NET_VERSION:1:10
  HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
  TOKENS:15:<hidden>
/10.x.x.c
  STATUS:732:NORMAL,-1117172759238888547
  LOAD:245839:1.71409806942E11
  SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:245989:0.0
  NET_VERSION:1:10
  HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
  RPC_READY:763:true
  TOKENS:731:<hidden>
/10.x.x.d
  STATUS:14:NORMAL,-1004942496246544417
  LOAD:313125:1.74447964917E11
  SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:313215:0.25641027092933655
  NET_VERSION:1:10
  HOST_ID:2:07038c11-d200-46a0-9f6a-6e2465580fb1
  RPC_READY:56:true
  TOKENS:13:<hidden>
/10.x.x.e
  STATUS:520:NORMAL,-1058809960483771749
  LOAD:276118:1.87831573032E11
  SCHEMA:267327:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:276217:0.32786884903907776
  NET_VERSION:1:10
  HOST_ID:2:c35b5d51-2d14-4334-9ffc-726f9dd8a214
  RPC_READY:550:true
  TOKENS:519:<hidden>
/10.x.x.f
  STATUS:1081:NORMAL,-1039671799603495012
  LOAD:239114:2.09082017545E11
  SCHEMA:230229:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:239180:0.5665722489356995
  NET_VERSION:1:10
  HOST_ID:2:f20f7a87-d5d2-4f38-a963-21e24167b8ac
  RPC_READY:1118:true
  TOKENS:1080:<hidden>

## About load and tokens:
- While load is fairly even, the same does not apply to tokens; I guess we have some table with uneven distribution. This should not be the case for the high-load tables, as their partition keys are built as 'id + <some time format>'.
- I was not able to find any documentation about the numbers printed next to LOAD, SCHEMA, SEVERITY, RPC_READY ... Is there any doc around?

# Tombstones
No ERRORS, only WARNs about a very specific table that we are aware of. It is an append-only table read by Spark from a batch job. (I guess it is a read_repair chance or DTCS misconfig.)

## Closing note!
We are only on m1.xlarge instances (4 vCPU) with RAID 0 (stripe) across the 4 spinning drives. Some changes to cassandra.yaml:
- dynamic_snitch: false
- concurrent_reads: 48
- concurrent_compactors: 1 (was 2)
- disk_optimization_strategy: spinning

I have some concerns about the number of concurrent_compactors. What do you think?

Thanks!
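
P.S. For comparing the gossipinfo LOAD values with the Load column of nodetool status, a small sketch may help. It assumes each gossipinfo field has the shape NAME:NUMBER:VALUE, that the LOAD value is a byte count, and that the "GB" shown by nodetool status is 2^30 bytes; under those assumptions the two outputs above agree to within a fraction of a GB. The middle number is only parsed, not interpreted, and the function names are illustrative.

```python
from typing import Tuple


def parse_gossip_field(line: str) -> Tuple[str, int, str]:
    """Split a 'NAME:NUMBER:VALUE' gossipinfo field into its three parts.
    Only the first two colons are treated as separators, so values that
    contain commas or further colons are kept intact."""
    name, number, value = line.strip().split(":", 2)
    return name, int(number), value


def load_to_gib(load_value: str) -> float:
    """Convert a LOAD value (assumed to be bytes) to units of 2**30 bytes."""
    return float(load_value) / 2**30


if __name__ == "__main__":
    # Field copied from the /10.x.x.a entry above.
    name, number, value = parse_gossip_field("LOAD:289986:1.90078037902E11")
    print(name, number, f"{load_to_gib(value):.2f}")
```

For node 10.x.x.a this gives roughly 177.02, in line with the 177 GB reported by nodetool status.
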