It's worth checking connectivity on each node to see whether the gossip connections are established.
For example:

# netstat -ant | awk 'NR==2;/7001/'
Proto Recv-Q Send-Q Local Address          Foreign Address        State
tcp        0      0 172.31.10.93:7001      0.0.0.0:*              LISTEN
tcp        0      0 172.31.10.93:56771     172.31.10.93:7001      ESTABLISHED
tcp        0      0 172.31.10.93:7001      54.183.204.110:42231   ESTABLISHED
tcp        0      0 172.31.10.93:52031     54.183.204.110:7001    ESTABLISHED
tcp        0      0 172.31.10.93:50759     54.183.204.110:7001    ESTABLISHED
tcp        0      0 172.31.10.93:38986     172.31.10.93:7001      ESTABLISHED
tcp        0      0 172.31.10.93:7001      172.31.10.93:42408     ESTABLISHED
tcp        0      0 172.31.10.93:7001      172.31.10.93:38986     ESTABLISHED
tcp        0      0 172.31.10.93:42408     172.31.10.93:7001      ESTABLISHED
tcp        0      0 172.31.10.93:7001      172.31.10.93:56771     ESTABLISHED
tcp        0      0 172.31.10.93:7001      54.183.204.110:37491   ESTABLISHED

Note: I'm using 7001 here because my cluster uses SSL; you can use 7000 for the standard gossip port.

Thanks,
Mark

On 21 January 2016 at 14:08, Bernardino Mota <bernardino.m...@knowledgeworks.pt> wrote:

> Nothing strange in the logs, but "nodetool gossipinfo" seems OK:
>
> ./nodetool gossipinfo
> /192.168.1.10
> generation:1453316804
> heartbeat:206518
> STATUS:18:NORMAL,-1003341236369672970
> LOAD:206420:4.3533596E7
> SCHEMA:14:6f97097b-45ce-3479-8b2f-af2fef4967e7
> DC:8:DC2
> RACK:10:rack1
> RELEASE_VERSION:4:2.2.4
> INTERNAL_IP:6:192.168.1.10
> RPC_ADDRESS:3:127.0.0.1
> SEVERITY:206517:0.0
> NET_VERSION:1:9
> HOST_ID:2:51650afd-84dd-4e25-a6f0-13627858d5dc
> RPC_READY:49:true
> TOKENS:17:<hidden>
> /192.168.1.102
> generation:1453316986
> heartbeat:84622
> STATUS:28:NORMAL,-1085177681742913545
> LOAD:84535:1.2606418E7
> SCHEMA:14:6f97097b-45ce-3479-8b2f-af2fef4967e7
> DC:8:DC1
> RACK:10:rack1
> RELEASE_VERSION:4:2.2.4
> INTERNAL_IP:6:10.0.2.10
> RPC_ADDRESS:3:127.0.0.1
> SEVERITY:84624:0.0
> NET_VERSION:1:9
> HOST_ID:2:ff906882-8224-40ac-8cdb-98f5e725814d
> RPC_READY:98:true
> TOKENS:27:<hidden>
>
>
> On 21 Jan 2016, at 13:17, Adil <adil.cha...@gmail.com> wrote:
>
> Hi,
> do you see any message related to gossip info?
>
> 2016-01-21 14:09 GMT+01:00 Bernardino Mota <bernardino.m...@knowledgeworks.pt>:
>
>> Using Cassandra 2.2.4 on Ubuntu.
>>
>> We have a cluster with two nodes that failed to connect with each other
>> for several hours due to network problems. The database continued to be
>> used on one of the nodes, with writes being stored in the hints file as
>> expected.
>>
>> But now that the network is OK again and each machine can communicate,
>> each node indicates the other is DOWN and does not replicate.
>>
>> When the network came up we started to see "Convicting /192.168.1.102
>> with status NORMAL - alive false" in the log files.
>>
>> It seems each node convicts the other and later fails to reconnect.
>>
>> Is there some configuration that we might be missing? Any help would be
>> much appreciated.
>>
>>
>> - NODE 192.168.1.10 - "nodetool status"
>>
>> Datacenter: DC1
>> ===============
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address        Load      Tokens  Owns  Host ID                               Rack
>> DN  192.168.1.102  12.02 MB  256     ?     ff906882-8224-40ac-8cdb-98f5e725814d  rack1
>>
>> Datacenter: DC2
>> ===============
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address        Load      Tokens  Owns  Host ID                               Rack
>> UN  192.168.1.10   41.87 MB  256     ?     51650afd-84dd-4e25-a6f0-13627858d5dc  rack1
>>
>>
>> - NODE 192.168.1.102 - "nodetool status"
>>
>> Datacenter: DC1
>> ===============
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address        Load      Tokens  Owns  Host ID                               Rack
>> UN  192.168.1.102  12.4 MB   256     ?     ff906882-8224-40ac-8cdb-98f5e725814d  rack1
>>
>> Datacenter: DC2
>> ===============
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address        Load      Tokens  Owns  Host ID                               Rack
>> DN  192.168.1.10   26.31 MB  256     ?     51650afd-84dd-4e25-a6f0-13627858d5dc  rack1
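P.S. The netstat check at the top of this message can be scripted if you want to run it across several nodes. The sketch below is only illustrative: the `count_gossip` helper name and the sample lines are hypothetical, and it assumes POSIX sh/awk and the `netstat -ant` column layout shown above (state in column 6, foreign address in column 5).

```shell
#!/bin/sh
# Count ESTABLISHED connections on the gossip port, per peer, from
# "netstat -ant" style output on stdin. 7001 is used here because this
# cluster runs SSL; pass 7000 for the standard gossip port.
# Real usage would be: netstat -ant | count_gossip 7001
count_gossip() {
  awk -v port="$1" '
    $1 ~ /^tcp/ && $6 == "ESTABLISHED" &&
    ($4 ~ (":" port "$") || $5 ~ (":" port "$")) {
      peer = $5                       # foreign address, e.g. 1.2.3.4:42231
      sub(/:[0-9]+$/, "", peer)       # strip the port suffix
      n[peer]++
    }
    END { for (p in n) print p, n[p] }'
}

# Hypothetical sample (three lines modelled on the output above):
count_gossip 7001 <<'EOF'
tcp 0 0 172.31.10.93:52031 54.183.204.110:7001 ESTABLISHED
tcp 0 0 172.31.10.93:7001 54.183.204.110:42231 ESTABLISHED
tcp 0 0 172.31.10.93:7001 0.0.0.0:* LISTEN
EOF
# prints: 54.183.204.110 2
```

A node with zero established gossip-port connections to a peer that "nodetool status" claims is up is a good hint that the two sides disagree, as in the Convicting messages above.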