Re: AWS instance stop and start with EBS

2019-11-05 Thread daemeon reiydelle
10 minutes is 600 seconds, and there are several timeouts that are set to
that value, including the data center timeout, as I recall.
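
A quick way to check which settings are actually sitting at that value (just a
sketch; the config path assumes a package install and may differ on your hosts):

  # list timeout-related settings and their current values in the node's config
  grep -iE 'timeout' /etc/cassandra/cassandra.yaml | grep -v '^#'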

You may be forced to tcpdump the interface(s) to see where the chatter is.
Out of curiosity, when you restart the node, have you snapshotted the JVM's
memory to see whether, e.g., the heap is even in use?
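
For what it's worth, a rough sketch of both checks (the interface name, port and
process-match string below are illustrative; adjust them to your setup):

  # capture internode/gossip chatter -- 7000 is the default storage_port (7001 with SSL)
  sudo tcpdump -i eth0 -nn 'tcp port 7000' -w gossip.pcap

  # see whether the heap is actually in use on the restarted node
  nodetool info | grep -i heap                    # heap used / total as reported by Cassandra
  jstat -gc $(pgrep -f CassandraDaemon) 5000 3    # JVM-level heap/GC stats, 3 samples 5s apart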


On Tue, Nov 5, 2019 at 7:03 PM Rahul Reddy  wrote:

> Thanks Ben,
> Before stopping the EC2 instance I did run nodetool drain, so I ruled that
> out, and system.log also doesn't show commit logs being applied.
>
>
>
>
>
> On Tue, Nov 5, 2019, 7:51 PM Ben Slater 
> wrote:
>
>> The logs between first start and handshaking should give you a clue, but
>> my first guess would be replaying commit logs.
>>
>> Cheers
>> Ben
>>
>>
>>
>> On Wed, 6 Nov 2019 at 04:36, Rahul Reddy 
>> wrote:
>>
>>> I can reproduce the issue.
>>>
>>> I did drain the Cassandra node, then stopped and started the Cassandra
>>> instance. The instance comes up, but other nodes stay in DN state for
>>> around 10 minutes.
>>>
>>> I don't see errors in system.log
>>>
>>> DN  xx.xx.xx.59   420.85 MiB  256  48.2% id  2
>>> UN  xx.xx.xx.30   432.14 MiB  256  50.0% id  0
>>> UN  xx.xx.xx.79   447.33 MiB  256  51.1% id  4
>>> DN  xx.xx.xx.144  452.59 MiB  256  51.6% id  1
>>> DN  xx.xx.xx.19   431.7 MiB  256  50.1% id  5
>>> UN  xx.xx.xx.6    421.79 MiB  256  48.9%
>>>
>>> When I do nodetool status, 3 nodes are still showing down, and I don't see
>>> errors in system.log.
>>>
>>> And after 10 minutes it shows the other node is up as well.
>>>
>>>
>>> INFO  [HANDSHAKE-/10.72.100.156] 2019-11-05 15:05:09,133
>>> OutboundTcpConnection.java:561 - Handshaking version with /stopandstarted
>>> node
>>> INFO  [RequestResponseStage-7] 2019-11-05 15:16:27,166
>>> Gossiper.java:1019 - InetAddress /nodewhichitwasshowing down is now UP
>>>
>>> What is causing the 10-minute delay before it can say that node is reachable?
>>>
>>> On Wed, Oct 30, 2019, 8:37 AM Rahul Reddy 
>>> wrote:
>>>
 And also, an AWS EC2 stop and start comes back with a new instance with the
 same IP, and all our file systems are on EBS and mounted fine.  Does the new
 instance coming up with the same IP cause any gossip issues?

 On Tue, Oct 29, 2019, 6:16 PM Rahul Reddy 
 wrote:

> Thanks Alex. We have 6 nodes in each DC with RF=3 and CL LOCAL_QUORUM, and
> we stopped and started only one instance at a time. Though nodetool status
> says all nodes are UN and system.log says Cassandra started and began
> listening, the JMX exporter shows the instance stayed down longer. How do we
> determine what made Cassandra unavailable when the log says it started and
> is listening?
>
> On Tue, Oct 29, 2019, 4:44 PM Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
>> On Tue, Oct 29, 2019 at 9:34 PM Rahul Reddy 
>> wrote:
>>
>>>
>>> We have our infrastructure on AWS and we use EBS storage, and AWS was
>>> retiring one of the nodes. Since our storage was persistent, we did
>>> nodetool drain and stopped and started the instance. This caused 500
>>> errors in the service. We have LOCAL_QUORUM and RF=3; why does stopping
>>> one instance cause the application to have issues?
>>>
>>
>> Can you still look up what the underlying error from the Cassandra driver
>> was in the application logs?  Was it a request timeout or not enough
>> replicas?
>>
>> For example, if you only had 3 Cassandra nodes, restarting one of
>> them reduces your cluster capacity by 33% temporarily.
>>
>> Cheers,
>> --
>> Alex
>>
>>


Re: AWS instance stop and start with EBS

2019-11-05 Thread Rahul Reddy
Thanks Ben,
Before stopping the EC2 instance I did run nodetool drain, so I ruled that
out, and system.log also doesn't show commit logs being applied.
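
For completeness, this is roughly what I checked (the paths are the package
defaults; yours may differ):

  nodetool drain                                   # run before the stop; flushes memtables, stops accepting writes
  ls /var/lib/cassandra/commitlog/                 # close to empty after a clean drain
  grep -i replay /var/log/cassandra/system.log     # commit log replay messages would show up here on restart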





On Tue, Nov 5, 2019, 7:51 PM Ben Slater  wrote:

> The logs between first start and handshaking should give you a clue, but my
> first guess would be replaying commit logs.
>
> Cheers
> Ben
>
>
>
> On Wed, 6 Nov 2019 at 04:36, Rahul Reddy  wrote:
>
>> I can reproduce the issue.
>>
>> I did drain the Cassandra node, then stopped and started the Cassandra
>> instance. The instance comes up, but other nodes stay in DN state for
>> around 10 minutes.
>>
>> I don't see errors in system.log
>>
>> DN  xx.xx.xx.59   420.85 MiB  256  48.2% id  2
>> UN  xx.xx.xx.30   432.14 MiB  256  50.0% id  0
>> UN  xx.xx.xx.79   447.33 MiB  256  51.1% id  4
>> DN  xx.xx.xx.144  452.59 MiB  256  51.6% id  1
>> DN  xx.xx.xx.19   431.7 MiB  256  50.1% id  5
>> UN  xx.xx.xx.6    421.79 MiB  256  48.9%
>>
>> When I do nodetool status, 3 nodes are still showing down, and I don't see
>> errors in system.log.
>>
>> And after 10 minutes it shows the other node is up as well.
>>
>>
>> INFO  [HANDSHAKE-/10.72.100.156] 2019-11-05 15:05:09,133
>> OutboundTcpConnection.java:561 - Handshaking version with /stopandstarted
>> node
>> INFO  [RequestResponseStage-7] 2019-11-05 15:16:27,166 Gossiper.java:1019
>> - InetAddress /nodewhichitwasshowing down is now UP
>>
>> What is causing the 10-minute delay before it can say that node is reachable?
>>
>> On Wed, Oct 30, 2019, 8:37 AM Rahul Reddy 
>> wrote:
>>
>>> And also, an AWS EC2 stop and start comes back with a new instance with the
>>> same IP, and all our file systems are on EBS and mounted fine.  Does the new
>>> instance coming up with the same IP cause any gossip issues?
>>>
>>> On Tue, Oct 29, 2019, 6:16 PM Rahul Reddy 
>>> wrote:
>>>
 Thanks Alex. We have 6 nodes in each DC with RF=3 and CL LOCAL_QUORUM, and we
 stopped and started only one instance at a time. Though nodetool status says
 all nodes are UN and system.log says Cassandra started and began listening,
 the JMX exporter shows the instance stayed down longer. How do we determine
 what made Cassandra unavailable when the log says it started and is listening?

 On Tue, Oct 29, 2019, 4:44 PM Oleksandr Shulgin <
 oleksandr.shul...@zalando.de> wrote:

> On Tue, Oct 29, 2019 at 9:34 PM Rahul Reddy 
> wrote:
>
>>
>> We have our infrastructure on AWS and we use EBS storage, and AWS was
>> retiring one of the nodes. Since our storage was persistent, we did
>> nodetool drain and stopped and started the instance. This caused 500
>> errors in the service. We have LOCAL_QUORUM and RF=3; why does stopping
>> one instance cause the application to have issues?
>>
>
> Can you still look up what the underlying error from the Cassandra driver
> was in the application logs?  Was it a request timeout or not enough
> replicas?
>
> For example, if you only had 3 Cassandra nodes, restarting one of them
> reduces your cluster capacity by 33% temporarily.
>
> Cheers,
> --
> Alex
>
>


Re: AWS instance stop and start with EBS

2019-11-05 Thread Ben Slater
The logs between first start and handshaking should give you a clue, but my
first guess would be replaying commit logs.
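
Something like the following (a sketch assuming the default log location; the
exact message strings vary a little between versions) should help narrow down
that window:

  # pull startup, commit log and handshake related lines out of system.log
  grep -E 'CassandraDaemon|CommitLog|Handshaking' /var/log/cassandra/system.log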

Cheers
Ben



On Wed, 6 Nov 2019 at 04:36, Rahul Reddy  wrote:

> I can reproduce the issue.
>
> I did drain the Cassandra node, then stopped and started the Cassandra
> instance. The instance comes up, but other nodes stay in DN state for
> around 10 minutes.
>
> I don't see errors in system.log
>
> DN  xx.xx.xx.59   420.85 MiB  256  48.2% id  2
> UN  xx.xx.xx.30   432.14 MiB  256  50.0% id  0
> UN  xx.xx.xx.79   447.33 MiB  256  51.1% id  4
> DN  xx.xx.xx.144  452.59 MiB  256  51.6% id  1
> DN  xx.xx.xx.19   431.7 MiB  256  50.1% id  5
> UN  xx.xx.xx.6    421.79 MiB  256  48.9%
>
> When I do nodetool status, 3 nodes are still showing down, and I don't see
> errors in system.log.
>
> And after 10 minutes it shows the other node is up as well.
>
>
> INFO  [HANDSHAKE-/10.72.100.156] 2019-11-05 15:05:09,133
> OutboundTcpConnection.java:561 - Handshaking version with /stopandstarted
> node
> INFO  [RequestResponseStage-7] 2019-11-05 15:16:27,166 Gossiper.java:1019
> - InetAddress /nodewhichitwasshowing down is now UP
>
> What is causing the 10-minute delay before it can say that node is reachable?
>
> On Wed, Oct 30, 2019, 8:37 AM Rahul Reddy 
> wrote:
>
>> And also, an AWS EC2 stop and start comes back with a new instance with the
>> same IP, and all our file systems are on EBS and mounted fine.  Does the new
>> instance coming up with the same IP cause any gossip issues?
>>
>> On Tue, Oct 29, 2019, 6:16 PM Rahul Reddy 
>> wrote:
>>
>>> Thanks Alex. We have 6 nodes in each DC with RF=3 and CL LOCAL_QUORUM, and we
>>> stopped and started only one instance at a time. Though nodetool status says
>>> all nodes are UN and system.log says Cassandra started and began listening,
>>> the JMX exporter shows the instance stayed down longer. How do we determine
>>> what made Cassandra unavailable when the log says it started and is listening?
>>>
>>> On Tue, Oct 29, 2019, 4:44 PM Oleksandr Shulgin <
>>> oleksandr.shul...@zalando.de> wrote:
>>>
 On Tue, Oct 29, 2019 at 9:34 PM Rahul Reddy 
 wrote:

>
> We have our infrastructure on AWS and we use EBS storage, and AWS was
> retiring one of the nodes. Since our storage was persistent, we did nodetool
> drain and stopped and started the instance. This caused 500 errors in the
> service. We have LOCAL_QUORUM and RF=3; why does stopping one instance
> cause the application to have issues?
>

 Can you still look up what the underlying error from the Cassandra driver
 was in the application logs?  Was it a request timeout or not enough
 replicas?

 For example, if you only had 3 Cassandra nodes, restarting one of them
 reduces your cluster capacity by 33% temporarily.

 Cheers,
 --
 Alex




Re: AWS instance stop and start with EBS

2019-11-05 Thread Rahul Reddy
I can reproduce the issue.

I did drain the Cassandra node, then stopped and started the Cassandra
instance. The instance comes up, but other nodes stay in DN state for around
10 minutes.

I don't see errors in system.log

DN  xx.xx.xx.59   420.85 MiB  256  48.2% id  2
UN  xx.xx.xx.30   432.14 MiB  256  50.0% id  0
UN  xx.xx.xx.79   447.33 MiB  256  51.1% id  4
DN  xx.xx.xx.144  452.59 MiB  256  51.6% id  1
DN  xx.xx.xx.19   431.7 MiB  256  50.1% id  5
UN  xx.xx.xx.6    421.79 MiB  256  48.9%

When I do nodetool status, 3 nodes are still showing down, and I don't see
errors in system.log.

And after 10 minutes it shows the other node is up as well.


INFO  [HANDSHAKE-/10.72.100.156] 2019-11-05 15:05:09,133
OutboundTcpConnection.java:561 - Handshaking version with /stopandstarted
node
INFO  [RequestResponseStage-7] 2019-11-05 15:16:27,166 Gossiper.java:1019 -
InetAddress /nodewhichitwasshowing down is now UP

What is causing the 10-minute delay before it can say that node is reachable?
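
In case it helps, this is roughly how I'm watching it from a peer node (NODE_IP
below is just a placeholder for the restarted node's address):

  # how the peer sees the restarted node over time
  watch -n 10 'nodetool status | grep NODE_IP'
  # gossip generation, heartbeat and STATUS the peer holds for that endpoint
  nodetool gossipinfo | grep -A 6 NODE_IP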

On Wed, Oct 30, 2019, 8:37 AM Rahul Reddy  wrote:

> And also, an AWS EC2 stop and start comes back with a new instance with the
> same IP, and all our file systems are on EBS and mounted fine.  Does the new
> instance coming up with the same IP cause any gossip issues?
>
> On Tue, Oct 29, 2019, 6:16 PM Rahul Reddy 
> wrote:
>
>> Thanks Alex. We have 6 nodes in each DC with RF=3 and CL LOCAL_QUORUM, and we
>> stopped and started only one instance at a time. Though nodetool status says
>> all nodes are UN and system.log says Cassandra started and began listening,
>> the JMX exporter shows the instance stayed down longer. How do we determine
>> what made Cassandra unavailable when the log says it started and is listening?
>>
>> On Tue, Oct 29, 2019, 4:44 PM Oleksandr Shulgin <
>> oleksandr.shul...@zalando.de> wrote:
>>
>>> On Tue, Oct 29, 2019 at 9:34 PM Rahul Reddy 
>>> wrote:
>>>

 We have our infrastructure on AWS and we use EBS storage, and AWS was
 retiring one of the nodes. Since our storage was persistent, we did nodetool
 drain and stopped and started the instance. This caused 500 errors in the
 service. We have LOCAL_QUORUM and RF=3; why does stopping one instance cause
 the application to have issues?

>>>
>>> Can you still look up what the underlying error from the Cassandra driver
>>> was in the application logs?  Was it a request timeout or not enough
>>> replicas?
>>>
>>> For example, if you only had 3 Cassandra nodes, restarting one of them
>>> reduces your cluster capacity by 33% temporarily.
>>>
>>> Cheers,
>>> --
>>> Alex
>>>
>>>