Hi Alain
- Overusing the cluster was one thing I was already thinking about, and I have requested two new nodes (it was something already planned anyway). But the pattern of high CPU load is only visible on one or two of the nodes; the rest are working correctly. That makes me think that adding two new nodes may not help.

- Running the deletes at a slower, constant pace sounds good and I will definitely try that. However, I see similar errors during the weekly repair, even without the deletes running.

- Our cluster is an in-house one; each machine is used only as a Cassandra node.

- Logs are quite normal, even when the timeouts start to appear on the client.

- Upgrading Cassandra is a good point, but I am afraid that if I start the upgrade right now the timeout problems will appear again. Are compactions executed during an upgrade? If not, I think it is safe to upgrade the cluster.

Thanks for your comments.

From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
Sent: Monday 4 April 2016 18:35
To: user@cassandra.apache.org
Subject: Re: all the nost are not reacheable when running massive deletes

Hola Paco,

> the mutation stage pending column grows without stopping; could that be the problem?
> CPU (near 96%)

Yes, basically I think you are overusing this cluster.

> but two of them have a high CPU load, especially .232, because I am running a lot of deletes using cqlsh on that node.

Solutions would be to run the deletes at a slower & constant pace against all the nodes, using a balancing policy, or to add capacity if all the nodes are facing the issue and you can't slow the deletes down. You should also have a look at iowait and steal, to see if the CPUs are really at 100% or whether that is masking another issue (disks not answering fast enough, or a hardware / shared-instance issue). I had some noisy neighbours at some point while using Cassandra on AWS.

> I cannot find the reason that originates the timeouts.

I don't find that weird while some/all of the nodes are being overused.

> I have already increased the timeouts, but I do not think that is a solution because the timeouts indicate another type of error.

Any relevant logs on the Cassandra nodes (other than the dropped mutations INFO)?

> 7 nodes, version 2.0.17

Note: be aware that this Cassandra version is quite old and no longer supported, and you might be facing issues that have already been solved. I know that upgrading is not straightforward, but 2.0 --> 2.1 brings an amazing set of optimisations and some fixes too. You should try it out :-).

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com
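For what it's worth, Alain's suggestion of running the deletes at a slower, constant pace against all the nodes (instead of pushing everything through one cqlsh session on .232) could look roughly like the sketch below with the DataStax Java driver 2.x. This is only an illustration: the contact points list, keyspace, table, key column and the 500 ops/s rate are placeholders, not values from this thread.

import java.util.Collections;
import java.util.UUID;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.google.common.util.concurrent.RateLimiter; // Guava is already a driver dependency

public class PacedDeletes {
    public static void main(String[] args) {
        // RoundRobinPolicy spreads the coordinator role over all nodes,
        // instead of making a single node coordinate every delete.
        Cluster cluster = Cluster.builder()
                .addContactPoints("172.31.7.243", "172.31.7.244", "172.31.7.245")
                .withLoadBalancingPolicy(new RoundRobinPolicy())
                .build();
        Session session = cluster.connect("my_keyspace");              // placeholder keyspace

        // Reuse one prepared statement for every delete.
        PreparedStatement delete =
                session.prepare("DELETE FROM my_table WHERE id = ?");  // placeholder table/key

        // Throttle to a constant pace; 500 deletes/s is purely an example value.
        RateLimiter limiter = RateLimiter.create(500.0);
        for (UUID id : idsToDelete()) {
            limiter.acquire();
            session.execute(delete.bind(id));
        }
        cluster.close();
    }

    // Placeholder: supply the real keys to delete here.
    private static Iterable<UUID> idsToDelete() {
        return Collections.emptyList();
    }
}

Since cqlsh talks to a single node and uses it as coordinator for every statement, that node (.232 here) takes the brunt; a round-robin policy in the driver spreads that coordination load across the cluster.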
2016-04-04 14:33 GMT+02:00 Paco Trujillo <f.truji...@genetwister.nl>:

Hi everyone

We are having problems with our cluster (7 nodes, version 2.0.17) when running "massive deletes" on one of the nodes (via the cql command line). At the beginning everything is fine, but after a while we start getting a constant NoHostAvailableException from the DataStax driver:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.31.7.243:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.245:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.246:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3 hosts, use getErrors() for more details])

All the nodes are running:

UN  172.31.7.244  152.21 GB  256  14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
UN  172.31.7.245  168.4 GB   256  14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
UN  172.31.7.246  177.71 GB  256  13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
UN  172.31.7.247  158.57 GB  256  14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
UN  172.31.7.243  176.83 GB  256  14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
UN  172.31.7.233  159 GB     256  13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
UN  172.31.7.232  166.05 GB  256  15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1  RAC1

But two of them have a high CPU load, especially .232, because I am running a lot of deletes using cqlsh on that node. I know that deletes generate tombstones, but with 7 nodes in the cluster I do not think it is normal that all the hosts become inaccessible. We have a replication factor of 3, and for the deletes I am not setting any consistency level (so the default ONE is used).

I checked the nodes with high CPU (near 96%): GC activity remains at around 1.6% (using only 3 GB of the 10 GB assigned). But looking at the thread pool stats, the mutation stage pending column grows without stopping; could that be the problem?

I cannot find the reason that originates the timeouts. I have already increased the timeouts, but I do not think that is a solution, because the timeouts indicate another type of error. Does anyone have a tip to help determine where the problem is?

Thanks in advance
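Regarding the "you may want to increase the driver number of per-host connections" hint in the exception, and the implicit consistency level ONE mentioned above: a rough sketch of how both can be set explicitly when building the Cluster with the DataStax Java driver 2.x follows. The pool sizes, contact points and QUORUM below are illustrative values only, not tuned recommendations from this thread.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.QueryOptions;

public class TunedClusterFactory {
    // Enlarge the per-host connection pool the driver exception points at,
    // and set an explicit default consistency level instead of relying on ONE.
    public static Cluster build() {
        PoolingOptions pooling = new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL, 4)    // example value
                .setMaxConnectionsPerHost(HostDistance.LOCAL, 16);   // example value

        return Cluster.builder()
                .addContactPoints("172.31.7.243", "172.31.7.244", "172.31.7.245")
                .withPoolingOptions(pooling)
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.QUORUM)) // example, not a recommendation
                .build();
    }
}

Note that a bigger pool only changes how long the client waits for a free connection; if the nodes themselves are saturated and the pending mutation stage keeps growing, the timeouts will persist regardless. QUORUM is shown only to make the level explicit; with overloaded nodes a stronger level will, if anything, time out sooner.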