Hi Alain
- Overusing the cluster was one thing I was already thinking about, and I have requested two new nodes (it was something already planned anyway). But the pattern of high CPU load is only visible on one or two of the nodes; the rest are working correctly. That makes me think that adding two new nodes may not help.

- Running the deletes at a slower, constant pace sounds good and I will definitely try that. However, I see similar errors during the weekly repair, even without the deletes running.

- Our cluster is an in-house one; each machine is used only as a Cassandra node.

- Logs are quite normal, even when the timeouts start to appear on the client.

- Upgrading Cassandra is a good point, but I am afraid that if I start the upgrade right now the timeout problems will appear again. Are compactions executed during an upgrade? If not, I think it is safe to upgrade the cluster.

Thanks for your comments.

From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
Sent: Monday 4 April 2016 18:35
To: user@cassandra.apache.org
Subject: Re: all the nost are not reacheable when running massive deletes

Hola Paco,

> the mutation stage pending column grows without stopping; could that be the problem?
> CPU (near 96%)

Yes, basically I think you are overusing this cluster.

> but two of them have a high CPU load, especially .232, because I am running a lot of deletes using cqlsh on that node.

Solutions would be to run the deletes at a slower & constant pace against all the nodes, using a balancing policy, or to add capacity if all the nodes are facing the issue and you can't slow the deletes down. You should also have a look at iowait and steal, to see if the CPUs are really at 100% or whether that is masking another issue (disks not answering fast enough, or a hardware / shared-instance issue). I had some noisy neighbours at some point while using Cassandra on AWS.

> I cannot find the reason that originates the timeouts.

I don't find that weird while some/all of the nodes are being overused.

> I have already increased the timeouts, but I do not think that is a solution because the timeouts indicate another type of error.

Any relevant logs on the Cassandra nodes (other than the dropped mutations INFO)?

> 7 nodes, version 2.0.17

Note: be aware that this Cassandra version is quite old and no longer supported, and you might be facing issues that have already been solved. I know that upgrading is not straightforward, but 2.0 --> 2.1 brings an amazing set of optimisations and some fixes too. You should try it out :-).

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com
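For what it's worth, Alain's suggestion of running the deletes at a slower, constant pace against all the nodes (instead of pushing everything through one cqlsh session on .232) could look roughly like the sketch below with the DataStax Java driver 2.x. This is only an illustration: the contact points list, keyspace, table, key column and the 500 ops/s rate are placeholders, not values from this thread.

import java.util.Collections;
import java.util.UUID;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.google.common.util.concurrent.RateLimiter; // Guava is already a driver dependency

public class PacedDeletes {
    public static void main(String[] args) {
        // RoundRobinPolicy spreads the coordinator role over all nodes,
        // instead of making a single node coordinate every delete.
        Cluster cluster = Cluster.builder()
                .addContactPoints("172.31.7.243", "172.31.7.244", "172.31.7.245")
                .withLoadBalancingPolicy(new RoundRobinPolicy())
                .build();
        Session session = cluster.connect("my_keyspace");              // placeholder keyspace

        // Reuse one prepared statement for every delete.
        PreparedStatement delete =
                session.prepare("DELETE FROM my_table WHERE id = ?");  // placeholder table/key

        // Throttle to a constant pace; 500 deletes/s is purely an example value.
        RateLimiter limiter = RateLimiter.create(500.0);
        for (UUID id : idsToDelete()) {
            limiter.acquire();
            session.execute(delete.bind(id));
        }
        cluster.close();
    }

    // Placeholder: supply the real keys to delete here.
    private static Iterable<UUID> idsToDelete() {
        return Collections.emptyList();
    }
}

Since cqlsh talks to a single node and uses it as coordinator for every statement, that node (.232 here) takes the brunt; a round-robin policy in the driver spreads that coordination load across the cluster.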
2016-04-04 14:33 GMT+02:00 Paco Trujillo <f.truji...@genetwister.nl>:

Hi everyone

We are having problems with our cluster (7 nodes, version 2.0.17) when running "massive deletes" on one of the nodes (via the cql command line). At the beginning everything is fine, but after a while we start getting a constant NoHostAvailableException from the DataStax driver:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.31.7.243:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.245:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.246:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3 hosts, use getErrors() for more details])

All the nodes are running:

UN  172.31.7.244  152.21 GB  256  14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
UN  172.31.7.245  168.4 GB   256  14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
UN  172.31.7.246  177.71 GB  256  13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
UN  172.31.7.247  158.57 GB  256  14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
UN  172.31.7.243  176.83 GB  256  14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
UN  172.31.7.233  159 GB     256  13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
UN  172.31.7.232  166.05 GB  256  15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1  RAC1

But two of them have a high CPU load, especially .232, because I am running a lot of deletes using cqlsh on that node. I know that deletes generate tombstones, but with 7 nodes in the cluster I do not think it is normal that all the hosts become inaccessible. We have a replication factor of 3, and for the deletes I am not setting any consistency level (so the default ONE is used).

I checked the nodes with high CPU (near 96%): GC activity remains at around 1.6% (using only 3 GB of the 10 GB assigned). But looking at the thread pool stats, the mutation stage pending column grows without stopping; could that be the problem?

I cannot find the reason that originates the timeouts. I have already increased the timeouts, but I do not think that is a solution, because the timeouts indicate another type of error. Does anyone have a tip to help determine where the problem is?

Thanks in advance
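Regarding the "you may want to increase the driver number of per-host connections" hint in the exception, and the implicit consistency level ONE mentioned above: a rough sketch of how both can be set explicitly when building the Cluster with the DataStax Java driver 2.x follows. The pool sizes, contact points and QUORUM below are illustrative values only, not tuned recommendations from this thread.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.QueryOptions;

public class TunedClusterFactory {
    // Enlarge the per-host connection pool the driver exception points at,
    // and set an explicit default consistency level instead of relying on ONE.
    public static Cluster build() {
        PoolingOptions pooling = new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL, 4)    // example value
                .setMaxConnectionsPerHost(HostDistance.LOCAL, 16);   // example value

        return Cluster.builder()
                .addContactPoints("172.31.7.243", "172.31.7.244", "172.31.7.245")
                .withPoolingOptions(pooling)
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.QUORUM)) // example, not a recommendation
                .build();
    }
}

Note that a bigger pool only changes how long the client waits for a free connection; if the nodes themselves are saturated and the pending mutation stage keeps growing, the timeouts will persist regardless. QUORUM is shown only to make the level explicit; with overloaded nodes a stronger level will, if anything, time out sooner.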