Hello Riccardo

> How come a single node is impacting the whole cluster?

It does sound weird indeed: with RF=3 and reads at LOCAL_ONE, the other
replicas should normally be able to serve those reads while a single node
restarts.

> Is there a way to further delay the native transport startup?


You can set 'start_native_transport: false' in 'cassandra.yaml'
(https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L496),
then run 'nodetool enablebinary'
(http://cassandra.apache.org/doc/latest/tools/nodetool/enablebinary.html)
when you are ready for it.
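
For what it's worth, the full sequence could then look roughly like this
(just a sketch, assuming the usual init service wrapper and that 'nodetool'
is on the path, adapt it to your setup):

    # in cassandra.yaml, before restarting the node:
    start_native_transport: false

    # on the node:
    nodetool drain
    service cassandra restart
    nodetool compactionstats    # wait for pending compactions to settle
    nodetool enablebinary       # then open the client port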

I would consider this a workaround though, and it might not even work; I
hope it does :).

> Any hint on troubleshooting it further?

The version of Cassandra is quite an early 3.0 release. It is probably worth
considering a move to 3.0.17, if not to solve this issue, then to avoid
other issues that have been fixed since then.
To know if that would really help you, you can go through
https://github.com/apache/cassandra/blob/cassandra-3.0.17/CHANGES.txt
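(If you want to double check what a node is currently running before going
through the changelog, 'nodetool version' prints the release version.)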

I am not too sure about what is going on, but here are some other things I
would look at to try to understand it (there is a combined copy/paste
snippet after the list):

Are all the nodes agreeing on the schema?
'nodetool describecluster'

Are all the keyspaces using the 'NetworkTopologyStrategy' and a replication
factor of 2+?
'cqlsh -e "DESCRIBE KEYSPACES;" '

What snitch are you using (in cassandra.yaml)?

What does ownership look like?
'nodetool status <ks>'

What about gossip?
'nodetool gossipinfo' or 'nodetool gossipinfo | grep STATUS' maybe.

A tombstone issue?
https://support.datastax.com/hc/en-us/articles/204612559-ReadTimeoutException-seen-when-using-the-java-driver-caused-by-excessive-tombstones

Any ERROR or WARN in the logs after the restart on this node and on other
nodes (you would see the tombstone issue here)?
'grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log'

I hope one of those helps. Let us know if you need help interpreting some of
the outputs,

C*heers,
-----------------------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


On Wed, Sep 12, 2018 at 10:59, Riccardo Ferrari <ferra...@gmail.com> wrote:

> Hi list,
>
> We are seeing the following behaviour when performing a rolling restart:
>
> On the node I need to restart:
> * I run 'nodetool drain'
> * Then 'service cassandra restart'
>
> So far so good. The load increase on the other 5 nodes is negligible.
> The node is generally out of service just for the time of the restart (i.e. a
> cassandra.yaml update)
>
> When the node comes back up and switches on the native transport, I start
> seeing lots of read timeouts in our various services:
>
> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
> timeout during read query at consistency LOCAL_ONE (1 responses were
> required but only 0 replica responded)
>
> Indeed the restarting node has a huge peak in system load, because of
> hints and compactions; nevertheless I don't notice a load increase on the
> other 5 nodes.
>
> Specs:
> 6-node cluster on Cassandra 3.0.6
> - keyspace RF=3
>
> Java driver 3.5.1:
> - DefaultRetryPolicy
> - default LoadBalancingPolicy (that should be DCAwareRoundRobinPolicy)
>
> QUESTIONS:
> How come a single node is impacting the whole cluster?
> Is there a way to further delay the native transport startup?
> Any hint on troubleshooting it further?
>
> Thanks
>
