Have you determined whether one specific query is the one timing out? The 
query or data model may not scale well, especially if you are doing something 
like a full table scan.

It is also possible that OS settings are limiting the number of connections 
to the host. Do you see any connections in the TIME_WAIT state in netstat? I 
agree that 5,000 connections per host seems on the high side. Each one requires 
resources, such as memory, so reducing connections is a good idea.
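
To make the netstat check concrete, here is a minimal sketch that tallies TCP 
states for one port from the text of `netstat -ant`. It assumes the Linux 
column layout and the default native transport port 9042; the function name 
count_tcp_states is made up for illustration.

```python
from collections import Counter


def count_tcp_states(netstat_output, port=9042):
    """Tally TCP states for sockets whose local address ends in :port.

    Assumes Linux `netstat -ant` rows of the form:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    suffix = ":%d" % port
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banner/header rows and anything that is not a TCP socket.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[3].endswith(suffix):
            counts[fields[5]] += 1
    return counts


if __name__ == "__main__":
    import subprocess

    text = subprocess.check_output(["netstat", "-ant"]).decode()
    for state, n in count_tcp_states(text).most_common():
        print(state, n)
```

Run on each Cassandra node, this should show at a glance how many of the 
connections are ESTABLISHED versus TIME_WAIT or other states.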


Sean Durity

-----Original Message-----
From: Max Campos [mailto:mc_cassan...@core43.com]
Sent: Thursday, December 14, 2017 3:18 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Lots of simultaneous connections?

Hi -

We’re finally putting our new application under load, and we’re starting to get 
this error message from the Python driver when under heavy load:

('Unable to connect to any servers', {'x.y.z.205': 
OperationTimedOut('errors=None, last_host=None',), 'x.y.z.204': 
OperationTimedOut('errors=None, last_host=None',), 'x.y.z.206': 
OperationTimedOut('errors=None, last_host=None',)})' (22.7s)

Our cluster is running 3.0.6, has 3 nodes and we use RF=3, CL=QUORUM 
reads/writes.  We have a few thousand machines which are each making 1-10 
connections to C* at once, but each of these connections only reads/writes a 
few records, waits several minutes, and then writes a few records — so while 
netstat reports ~5K connections per node, they’re generally idle.  Peak 
read/sec today was ~1500 per node, peak writes/sec was ~300 per node.  
Read/write latencies peaked at 2.5ms.

Some questions:
1) Is anyone else out there making this many simultaneous connections?  Any 
idea what a reasonable number of connections is, what is too many, etc?

2) Any thoughts on which JMX metrics I should look at to better understand what 
exactly is exploding?  Is there a “number of active connections” metric?  We 
currently look at:
- client reads/writes per sec
- read/write latency
- compaction tasks
- repair tasks
- disk used by node
- disk used by table
- avg partition size per table

3) Any other advice?

I think I’ll try an explicit disconnect during the waiting period of our 
application’s execution, so as to get the C* connection count down. Hopefully 
that will solve the timeout problem.
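
A minimal sketch of what that explicit disconnect might look like, assuming 
the Python driver's Cluster.connect() and Cluster.shutdown() calls. The helper 
name short_lived_session and the make_cluster factory argument are 
hypothetical, and the keyspace and table names in the usage comment are 
placeholders. Reconnecting on every burst has its own cost, so this trades 
idle connections for connection setup work.

```python
from contextlib import contextmanager


@contextmanager
def short_lived_session(make_cluster, keyspace=None):
    """Connect, yield a session, and always shut the cluster down afterwards.

    make_cluster is a hypothetical zero-argument factory, for example:
    lambda: cassandra.cluster.Cluster(["x.y.z.204", "x.y.z.205", "x.y.z.206"])
    """
    cluster = make_cluster()
    try:
        yield cluster.connect(keyspace)
    finally:
        # shutdown() closes the connections opened by this Cluster object,
        # so nothing is held open during the idle wait that follows.
        cluster.shutdown()


# Intended use (requires cassandra-driver; names are placeholders):
#
#   from cassandra.cluster import Cluster
#   with short_lived_session(lambda: Cluster(["x.y.z.204"]), "my_ks") as s:
#       s.execute("SELECT * FROM my_table LIMIT 1")
#   time.sleep(300)  # no C* connections are open while waiting
```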

Thanks for your help.

- Max

