Re: Read timeouts when performing rolling restart

2018-09-18 Thread Riccardo Ferrari
> can be multiple things, but having an interactive view of the pending
> requests might lead you to the root cause of the issue.
>
> C*heers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com

Re: Read timeouts when performing rolling restart

2018-09-14 Thread Alain RODRIGUEZ

Re: Read timeouts when performing rolling restart

2018-09-13 Thread Riccardo Ferrari
Hi Shalom,

It happens almost at every restart, either a single node or a rolling one.
I do agree with you that it is good, at least on my setup, to wait a few
minutes to let the rebooted node cool down before moving to the next.
The more I look at it, the more I think it is something coming from hint
dispatching; maybe I should try something around hints throttling.
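If I go down that route, these are probably the knobs to play with (values
below are just examples to experiment with, not tuned recommendations):

  # cassandra.yaml
  hinted_handoff_throttle_in_kb: 512   # default 1024; lower = gentler hint replay
  max_hints_delivery_threads: 2        # default

  # or adjust at runtime on the nodes delivering the hints:
  nodetool sethintedhandoffthrottlekb 512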

Thanks!


Re: Read timeouts when performing rolling restart

2018-09-13 Thread shalom sagges
Hi Riccardo,

Does this issue occur when performing a single restart or after several
restarts during a rolling restart (as mentioned in your original post)?
We have a cluster where, when performing a rolling restart, we prefer to
wait ~10-15 minutes between each restart because we see an increase in GC
for a few minutes.
If we keep restarting the nodes quickly one after the other, the
applications experience timeouts (probably due to GC and hints).
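
Before moving on to the next node, we basically wait until the restarted one
has settled; roughly something like this (a sketch only, log wording and
paths may differ on your version):

  nodetool tpstats          # pending stages and dropped counts back to ~0
  nodetool compactionstats  # pending compactions draining
  grep -i "hinted handoff" /var/log/cassandra/system.log | tail -5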

Hope this helps!


Re: Read timeouts when performing rolling restart

2018-09-12 Thread Riccardo Ferrari
A little update on the progress.

First:
Thank you Thomas. I checked the code in the patch and briefly skimmed
through the 3.0.6 code. Yup it should be fixed.
Thank you Surbhi. At the moment we don't need authentication as the
instances are locked down.

Now:
- Unfortunately the start_native_transport trick does not always work. It
works on some nodes but not on others: I still experience timeouts and
dropped messages during startup.
- I realized that cutting concurrent_compactors to 1 was not really a good
idea; the minimum value should be 2, and I am currently testing 4 (that is
min(n_cores, n_disks)).
- After raising the compactors to 4 I still see some dropped HINT and
MUTATION messages during startup. The reason is "for internal timeout".
Maybe too many compactors?
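
For what it's worth, these are the obvious things to watch while the node
warms up (the throughput value is an example only, not a recommendation):

  nodetool tpstats | tail -20   # per-type dropped counts (MUTATION, HINT, ...)
  nodetool compactionstats      # pending compactions right after startup
  nodetool getcompactionthroughput
  nodetool setcompactionthroughput 8   # example only: temporarily cap compaction I/O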

Thanks!



Re: Read timeouts when performing rolling restart

2018-09-12 Thread Surbhi Gupta
Another thing to notice is :

system_auth WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '1'}

system_auth has a replication factor of 1, so even a single node being down
may impact the system because of that replication factor.
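
If authentication gets enabled later, the usual approach (a sketch only) is
to raise that replication factor and then repair the keyspace, e.g.:

  ALTER KEYSPACE system_auth WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': '3'};

followed by 'nodetool repair system_auth' on each node.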




RE: Read timeouts when performing rolling restart

2018-09-12 Thread Steinmaurer, Thomas
Hi,

I remember something about a client using the native protocol being notified
too early that Cassandra is ready, due to the following issue:
https://issues.apache.org/jira/browse/CASSANDRA-8236

which looks similar, but the above was marked as fixed in 2.2.
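
One rough way to see whether that is what is happening here (log wording may
differ between versions, so treat this as a sketch) is to compare when the
node announced the native transport with when the drops started:

  grep -i -e "Starting listening for CQL clients" \
          -e "messages were dropped" /var/log/cassandra/system.log | tail -20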

Thomas


Re: Read timeouts when performing rolling restart

2018-09-12 Thread Riccardo Ferrari
Hi Alain,

Thank you for chiming in!

I was thinking of performing the 'start_native_transport=false' test as well,
and indeed the issue is not showing up. Starting a node with native transport
disabled and letting it cool down leads to no timeout exceptions and no
dropped messages, simply a crystal-clean startup. Agreed, it is a workaround.

# About upgrading:
Yes, I desperately want to upgrade, despite it being a long and slow task.
Just reviewing all the changes from 3.0.6 to 3.0.17 is going to be a huge
pain. Off the top of your head, is there any breaking change I should
absolutely take care of reviewing?

# describecluster output: YES they agree on the same schema version

# keyspaces:
system WITH replication = {'class': 'LocalStrategy'}
system_schema WITH replication = {'class': 'LocalStrategy'}
system_auth WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '1'}
system_distributed WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '3'}
system_traces WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '2'}

 WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '3'}
  WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '3'}

# Snitch
Ec2Snitch

## About Snitch and replication:
- We have the default DC and all nodes are in the same RACK
- We are planning to move to GossipingPropertyFileSnitch, configuring
cassandra-rackdc.properties accordingly (a sketch follows below).
-- This should be a transparent change, correct?

- Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy' with
'us-' DC and replica counts as before
- Then adding a new DC inside the VPC, but this is another story...

Any concerns here?
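
For reference, this is roughly what I have in mind for
cassandra-rackdc.properties (placeholder values; as far as I understand they
must match exactly what Ec2Snitch reports today, e.g. DC "us-east" and rack
"1e" for a node in us-east-1e):

  # cassandra-rackdc.properties (placeholder values)
  dc=us-east
  rack=1e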

# nodetool status 
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.x.x.a   177 GB     256     50.3%             d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
UN  10.x.x.b   152.46 GB  256     51.8%             7888c077-346b-4e09-96b0-9f6376b8594f  rr
UN  10.x.x.c   159.59 GB  256     49.0%             329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
UN  10.x.x.d   162.44 GB  256     49.3%             07038c11-d200-46a0-9f6a-6e2465580fb1  rr
UN  10.x.x.e   174.9 GB   256     50.5%             c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
UN  10.x.x.f   194.71 GB  256     49.2%             f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr

# gossipinfo
/10.x.x.a
  STATUS:827:NORMAL,-1350078789194251746
  LOAD:289986:1.90078037902E11
  SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:290040:0.5934718251228333
  NET_VERSION:1:10
  HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
  RPC_READY:868:true
  TOKENS:826:
/10.x.x.b
  STATUS:16:NORMAL,-1023229528754013265
  LOAD:7113:1.63730480619E11
  SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:7274:0.5988024473190308
  NET_VERSION:1:10
  HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
  TOKENS:15:
/10.x.x.c
  STATUS:732:NORMAL,-111717275923547
  LOAD:245839:1.71409806942E11
  SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:245989:0.0
  NET_VERSION:1:10
  HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
  RPC_READY:763:true
  TOKENS:731:
/10.x.x.d
  STATUS:14:NORMAL,-1004942496246544417
  LOAD:313125:1.74447964917E11
  SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:313215:0.25641027092933655
  NET_VERSION:1:10
  HOST_ID:2:07038c11-d200-46a0-9f6a-6e2465580fb1
  RPC_READY:56:true
  TOKENS:13:
/10.x.x.e
  STATUS:520:NORMAL,-1058809960483771749
  LOAD:276118:1.87831573032E11
  SCHEMA:267327:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:276217:0.32786884903907776
  NET_VERSION:1:10
  HOST_ID:2:c35b5d51-2d14-4334-9ffc-726f9dd8a214
  RPC_READY:550:true
  TOKENS:519:
/10.x.x.f
  STATUS:1081:NORMAL,-1039671799603495012
  LOAD:239114:2.09082017545E11
  SCHEMA:230229:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:239180:0.5665722489356995
  NET_VERSION:1:10
  HOST_ID:2:f20f7a87-d5d2-4f38-a963-21e24167b8ac
  RPC_READY:1118:true
  TOKENS:1080:

## About load and tokens:
- While load is pretty even, this does not apply to tokens; I guess we have
some table with uneven distribution. This should not be the case for the
high-load tables, as partition keys are built with some 'id + '
- I was not able to find documentation about the numbers printed next to
LOAD, SCHEMA, SEVERITY, RPC_READY ... Is there any doc around?

# Tombstones
No ERRORs, only WARNs about a very specific table that we are aware of. It
is an append-only table read by Spark from a batch job. (I guess it is a
read_repair_chance or DTCS misconfiguration.)

## Closing note!
We are on old m1.xlarge instances (4 vCPUs, RAID0 stripe over the 4 spinning
drives), with some changes to cassandra.yaml:

- dynamic_snitch: false
- concurrent_reads: 48
- concurrent_compactors: 

Re: Read timeouts when performing rolling restart

2018-09-12 Thread Alain RODRIGUEZ
Hello Riccardo,

> How come that a single node is impacting the whole cluster?

It sounds weird indeed.

> Is there a way to further delay the native transport startup?


You can configure 'start_native_transport: false' in 'cassandra.yaml'. (
https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L496
)
Then 'nodetool enablebinary' (
http://cassandra.apache.org/doc/latest/tools/nodetool/enablebinary.html)
when you are ready for it.

But I would consider this a workaround, and it might not even work; I hope
it does though :).
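
Roughly, for each node, something like this (a sketch only; adjust to your
init system and to whatever checks you trust before re-enabling clients):

  nodetool drain
  sudo service cassandra restart   # with start_native_transport: false in cassandra.yaml
  # wait for hints/compactions to settle, then:
  nodetool enablebinary
  nodetool statusbinary            # should report 'running'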

> Any hint on troubleshooting it further?

Your version is quite an early Cassandra 3.0 release. It's probably worth
considering a move to 3.0.17, if not to solve this issue, then to avoid
other issues that have been fixed since.
To know if that would really help you, you can go through
https://github.com/apache/cassandra/blob/cassandra-3.0.17/CHANGES.txt

I am not too sure about what is going on, but here are some other things I
would look at to try to understand this:

Are all the nodes agreeing on the schema?
'nodetool describecluster'

Are all the keyspaces using the 'NetworkTopologyStrategy' and a replication
factor of 2+?
'cqlsh -e "DESCRIBE KEYSPACES;" '

What snitch are you using (in cassandra.yaml)?

What does ownership look like?
'nodetool status '

What about gossip?
'nodetool gossipinfo' or 'nodetool gossipinfo | grep STATUS' maybe.

A tombstone issue?
https://support.datastax.com/hc/en-us/articles/204612559-ReadTimeoutException-seen-when-using-the-java-driver-caused-by-excessive-tombstones

Any ERROR or WARN in the logs after the restart on this node and on other
nodes (you would see the tombstone issue here)?
'grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log'
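
If it helps, a quick way to gather most of this from every node at once (the
host list is a placeholder):

  for h in 10.0.0.1 10.0.0.2 10.0.0.3; do
    echo "== $h =="
    ssh "$h" 'nodetool describecluster; nodetool gossipinfo | grep -e STATUS -e RPC_READY'
  done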

I hope one of those will help; let us know if you need help interpreting
some of the outputs.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


On Wed, 12 Sep 2018 at 10:59, Riccardo Ferrari  wrote:

> Hi list,
>
> We are seeing the following behaviour when performing a rolling restart:
>
> On the node I need to restart:
> *  I run the 'nodetool drain'
> * Then 'service cassandra restart'
>
> So far so good. The load increase on the other 5 nodes is negligible.
> The node is generally out of service just for the time of the restart
> (i.e. a cassandra.yaml update).
>
> When the node comes back up and switches on the native transport, I start
> to see lots of read timeouts in our various services:
>
> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
> timeout during read query at consistency LOCAL_ONE (1 responses were
> required but only 0 replica responded)
>
> Indeed the restarting node has a huge peak in system load because of hints
> and compactions; nevertheless, I don't notice a load increase on the other
> 5 nodes.
>
> Specs:
> 6 nodes cluster on Cassandra 3.0.6
> - keyspace RF=3
>
> Java driver 3.5.1:
> - DefaultRetryPolicy
> - default LoadBalancingPolicy (that should be DCAwareRoundRobinPolicy)
>
> QUESTIONS:
> How come that a single node is impacting the whole cluster?
> Is there a way to further delay the native transport startup?
> Any hint on troubleshooting it further?
>
> Thanks
>