Re: Dead node still being pinged

2012-08-06 Thread Alain RODRIGUEZ
ip info to 
>>>> see if the other nodes tell it about the old ones again.
>>>
>>> You meant -Dcassandra.load_ring_state=false, right?
>>>
>>> In that case, nothing changed.
>>>
>>>> Sorry, gossip can be tricky to diagnose over email.
>>>
>>> No worries, I really appreciate you taking the time to look into my issues.
>>>
>>> Maybe I could open a JIRA ticket about my issue? There may have been a
>>> configuration mess on my part at some point, i.e. the unsynchronized clocks
>>> on my machines, but I think it would be nice if Cassandra could recover
>>> from that inconsistent state by itself.
>>>
>>> Nicolas
>>>
>>>>
>>>>
>>>>
>>>>
>>>> -
>>>> Aaron Morton
>>>> Freelance Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 12/06/2012, at 10:33 PM, Nicolas Lalevée wrote:
>>>>
>>>>> I have one dirty solution to try: bring data-2 and data-4 back up and
>>>>> down again. Is there any way I can tell Cassandra not to fetch any data,
>>>>> so that when I bring my old node up, no streaming starts?
>>>>>
>>>>> cheers,
>>>>> Nicolas
>>>>>

Re: Dead node still being pinged

2012-06-14 Thread Nicolas Lalevée

Re: Dead node still being pinged

2012-06-13 Thread Nicolas Lalevée

Re: Dead node still being pinged

2012-06-13 Thread aaron morton

Re: Dead node still being pinged

2012-06-12 Thread Nicolas Lalevée
I have one dirty solution to try: bring data-2 and data-4 back up and down
again. Is there any way I can tell Cassandra not to fetch any data, so that
when I bring my old node up, no streaming starts?

cheers,
Nicolas

On 12 June 2012, at 12:25, Nicolas Lalevée wrote:

> On 12 June 2012, at 11:03, aaron morton wrote:
> 
>> Try purging the hints for 10.10.0.24 using the HintedHandOffManager MBean.
> 
> As far as I could tell, there were no hinted handoffs to be delivered.
> Nevertheless, I called "deleteHintsForEndpoint" on every node for the two
> nodes that are supposed to be out.
> Nothing changed; I still see packets being sent to these old nodes.
> 
> I looked more closely at the ResponsePendingTasks of MessagingService.
> The numbers actually change, between 0 and about 4, so tasks do complete
> but new ones arrive right after.
> 
> Nicolas

Re: Dead node still being pinged

2012-06-12 Thread Nicolas Lalevée
On 12 June 2012, at 11:03, aaron morton wrote:

> Try purging the hints for 10.10.0.24 using the HintedHandOffManager MBean.

As far as I could tell, there were no hinted handoffs to be delivered.
Nevertheless, I called "deleteHintsForEndpoint" on every node for the two
nodes that are supposed to be out.
Nothing changed; I still see packets being sent to these old nodes.

I looked more closely at the ResponsePendingTasks of MessagingService.
The numbers actually change, between 0 and about 4, so tasks do complete
but new ones arrive right after.

Nicolas
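One way to confirm that the traffic to the old endpoints is continuous, rather than a backlog draining, is to sample ResponsePendingTasks repeatedly and keep the peak per endpoint. The Java sketch below assumes the MBean name `org.apache.cassandra.net:type=MessagingService` and that the attribute comes back as a `Map<String, Integer>` keyed by IP, as the dumps in this thread suggest; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

/** Hypothetical helper: samples a per-endpoint counter map and keeps the peak value seen. */
final class PendingTaskSampler {

    // Assumed MBean name in Cassandra 1.0.x; confirm it in jconsole first.
    static final String MESSAGING_SERVICE = "org.apache.cassandra.net:type=MessagingService";

    /** Calls sample `rounds` times, sleeping between calls, and keeps the max count per endpoint. */
    static Map<String, Integer> peakPerEndpoint(Supplier<Map<String, Integer>> sample,
                                                int rounds, long sleepMillis) {
        Map<String, Integer> peaks = new HashMap<>();
        for (int i = 0; i < rounds; i++) {
            sample.get().forEach((ip, n) -> peaks.merge(ip, n, Integer::max));
            if (i + 1 < rounds) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag and stop early
                    break;
                }
            }
        }
        return peaks;
    }

    /** Reads ResponsePendingTasks over JMX; the Map<String, Integer> type is an assumption. */
    @SuppressWarnings("unchecked")
    static Map<String, Integer> readResponsePending(MBeanServerConnection mbs) {
        try {
            return (Map<String, Integer>) mbs.getAttribute(
                    new ObjectName(MESSAGING_SERVICE), "ResponsePendingTasks");
        } catch (Exception e) {
            throw new IllegalStateException("could not read ResponsePendingTasks", e);
        }
    }
}
```

For example, `peakPerEndpoint(() -> readResponsePending(mbs), 60, 1000)` would sample once a second for a minute over an already-open connection `mbs`. Keeping the peak rather than a single reading matters here because the counts bounce between 0 and about 4.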

Re: Dead node still being pinged

2012-06-12 Thread aaron morton
Try purging the hints for 10.10.0.24 using the HintedHandOffManager MBean.

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
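The hint purge suggested above can also be scripted instead of clicked through in jconsole. A minimal Java sketch, assuming the MBean name `org.apache.cassandra.db:type=HintedHandoffManager`, the `deleteHintsForEndpoint(String)` operation that Nicolas reports calling, and Cassandra's default JMX port 7199; `HintPurge` itself is a hypothetical name.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

/** Hypothetical CLI: java HintPurge <host> <endpoint-ip> [jmx-port] */
final class HintPurge {

    // Assumed ObjectName of the hinted handoff manager in Cassandra 1.0.x.
    static final String HINTS_MBEAN = "org.apache.cassandra.db:type=HintedHandoffManager";

    /** Builds the usual RMI-based JMX service URL for host:port. */
    static JMXServiceURL jmxUrl(String host, int port) {
        try {
            return new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        } catch (java.net.MalformedURLException e) {
            throw new IllegalArgumentException("bad JMX address " + host + ":" + port, e);
        }
    }

    /** Invokes deleteHintsForEndpoint(String) on the hints MBean. */
    static void purgeHints(MBeanServerConnection mbs, String endpoint) throws Exception {
        mbs.invoke(new ObjectName(HINTS_MBEAN), "deleteHintsForEndpoint",
                new Object[] {endpoint}, new String[] {String.class.getName()});
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: HintPurge <host> <endpoint-ip> [jmx-port]");
            return;
        }
        int port = args.length > 2 ? Integer.parseInt(args[2]) : 7199; // 7199 assumed default
        try (JMXConnector connector = JMXConnectorFactory.connect(jmxUrl(args[0], port))) {
            purgeHints(connector.getMBeanServerConnection(), args[1]);
            System.out.println("deleteHintsForEndpoint(" + args[1] + ") sent to " + args[0]);
        }
    }
}
```

It would be run once per live node, e.g. `java HintPurge data-5 10.10.0.24`. As the replies in this thread show, purging hints did not stop the traffic in this case, so it is a diagnostic step rather than a fix.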

On 12/06/2012, at 3:33 AM, Nicolas Lalevée wrote:

> Finally, thanks to the Groovy JMX builder, it was not that hard.

Re: Dead node still being pinged

2012-06-11 Thread Nicolas Lalevée
Finally, thanks to the Groovy JMX builder, it was not that hard.


On 11 June 2012, at 12:12, Samuel CARRIERE wrote:

> If I were you, I would connect (through JMX, with jconsole) to one of the
> nodes that is sending messages to an old node, and would have a look at these
> MBeans:
> - org.apache.net.FailureDetector: does SimpleStates look good? (Or do you see
> the IP of an old node?)

SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP, 
/10.10.0.25:UP, /10.10.0.27:UP]

> - org.apache.net.MessagingService: do you see one of the old IPs in any of
> the attributes?

data-5:
CommandCompletedTasks:
[10.10.0.22:2, 10.10.0.26:6147307, 10.10.0.27:6084684, 10.10.0.24:2]
CommandPendingTasks:
[10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
ResponseCompletedTasks:
[10.10.0.22:1487, 10.10.0.26:6187204, 10.10.0.27:6062890, 10.10.0.24:1495]
ResponsePendingTasks:
[10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]

data-6:
CommandCompletedTasks:
[10.10.0.22:2, 10.10.0.27:6064992, 10.10.0.24:2, 10.10.0.25:6308102]
CommandPendingTasks:
[10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:0, 10.10.0.25:0]
ResponseCompletedTasks:
[10.10.0.22:1463, 10.10.0.27:6067943, 10.10.0.24:1474, 10.10.0.25:6367692]
ResponsePendingTasks:
[10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:2, 10.10.0.25:0]

data-7:
CommandCompletedTasks:
[10.10.0.22:2, 10.10.0.26:6043653, 10.10.0.24:2, 10.10.0.25:5964168]
CommandPendingTasks:
[10.10.0.22:0, 10.10.0.26:0, 10.10.0.24:0, 10.10.0.25:0]
ResponseCompletedTasks:
[10.10.0.22:1424, 10.10.0.26:6090251, 10.10.0.24:1431, 10.10.0.25:6094954]
ResponsePendingTasks:
[10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]
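Comparing these dumps by eye is error-prone. A minimal Java sketch that parses one attribute line in the `[ip:count, ...]` form printed by the Groovy script (this text format is an assumption, not something Cassandra guarantees) and keeps only the endpoints considered removed, here 10.10.0.22 and 10.10.0.24; the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for the "[ip:count, ...]" lines printed in this thread. */
final class TaskDumpParser {

    /** Parses "[10.10.0.22:4, 10.10.0.26:0]" into an insertion-ordered map of IP to count. */
    static Map<String, Long> parse(String line) {
        String body = line.trim();
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String item : body.split(",")) {
            String entry = item.trim();
            if (entry.isEmpty()) {
                continue;
            }
            int colon = entry.lastIndexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("no ':' in " + entry);
            }
            counts.put(entry.substring(0, colon), Long.parseLong(entry.substring(colon + 1)));
        }
        return counts;
    }

    /** Keeps only the removed endpoints whose counter is non-zero. */
    static Map<String, Long> activeRemoved(Map<String, Long> counts, Set<String> removed) {
        Map<String, Long> hits = new LinkedHashMap<>();
        counts.forEach((ip, n) -> {
            if (removed.contains(ip) && n > 0) {
                hits.put(ip, n);
            }
        });
        return hits;
    }
}
```

Fed data-7's ResponsePendingTasks line, `activeRemoved` returns both old IPs, with counts 4 and 1.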

> - org.apache.net.StreamingService: do you see an old IP in StreamSources
> or StreamDestinations?

Nothing is streaming on the 3 nodes; nodetool netstats confirmed that.

> - org.apache.internal.HintedHandoff: are there non-zero ActiveCount,
> CurrentlyBlockedTasks, PendingTasks, TotalBlockedTasks?

On the 3 nodes, all at 0.

I don't really know what I'm looking at, but it seems that some
ResponsePendingTasks need to complete.

Nicolas

> 
> Samuel
> 
> Nicolas Lalevée
> 08/06/2012 21:03
> Please reply to
> user@cassandra.apache.org
> 
> To
> user@cassandra.apache.org
> cc
> Subject
> Re: Dead node still being pinged
> 
> On 8 June 2012, at 20:02, Samuel CARRIERE wrote:
> 
> > I'm on the train, but just a guess: maybe it's hinted handoff. A look at
> > the logs of the new nodes could confirm that: look for the IP of an old
> > node, and maybe you'll find hinted-handoff-related messages.
> 
> I grepped the logs of every node for every old node and found nothing since
> the "crash".
> 
> If it can be of some help, here are some grepped log lines from the crash:
> 
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
> 00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
> will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
> 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
> will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
> 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
> will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
> 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
> will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
> 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and 
> will not receive data for re-replication of /10.10.0.22
> system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 Gossiper.java 
> (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java 
> (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 
> HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 
> StorageService.java (line 1157) Removing token 
> 127605887595351923798765477786913079296 for /10.10.0.24
> system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java 
> (line 818) InetAddress /10.10.0.24 is now dead.
> 
> 
> Maybe it's the way I removed the nodes? AFAIR I didn't use the decommission
> command. For each node, I took the node down and then issued a remove token
> command.
> Here is what I can find in the log abo

Re: Dead node still being pinged

2012-06-11 Thread Nicolas Lalevée

On 11 June 2012, at 12:12, Samuel CARRIERE wrote:

> 
> Well, I don't see anything special in the logs. "Remove token" seems to have
> done its job: according to the logs, old stored hints have been deleted.
> 
> If I were you, I would connect (through JMX, with jconsole) to one of the
> nodes that is sending messages to an old node, and would have a look at these
> MBeans:
> - org.apache.net.FailureDetector: does SimpleStates look good? (Or do you see
> the IP of an old node?)
> - org.apache.net.MessagingService: do you see one of the old IPs in any of
> the attributes?
> - org.apache.net.StreamingService: do you see an old IP in StreamSources
> or StreamDestinations?
> - org.apache.internal.HintedHandoff: are there non-zero ActiveCount,
> CurrentlyBlockedTasks, PendingTasks, TotalBlockedTasks?

I feared I would have to do such lookups... JMX sucks when there is SSH 
tunneling to do. I'll find time to look into those. Thanks.

By the way, here is some possibly interesting info (the same on every node):
root@data-5 ~ # nodetool -h data-local gossipinfo
/10.10.0.27
  LOAD:2.34205351889E11
  SCHEMA:21099fc0-978c-11e1--bc70eee231ef
  RPC_ADDRESS:10.10.0.27
  STATUS:NORMAL,113427455640312814857969558651062452224
  RELEASE_VERSION:1.0.9
/10.10.0.26
  LOAD:2.64617657147E11
  SCHEMA:21099fc0-978c-11e1--bc70eee231ef
  RPC_ADDRESS:10.10.0.26
  STATUS:NORMAL,56713727820156407428984779325531226112
  RELEASE_VERSION:1.0.9
/10.10.0.25
  LOAD:2.34154095981E11
  SCHEMA:21099fc0-978c-11e1--bc70eee231ef
  RPC_ADDRESS:10.10.0.25
  STATUS:NORMAL,0
  RELEASE_VERSION:1.0.9
/10.10.0.24
  STATUS:removed,127605887595351923798765477786913079296,1336530323263
  REMOVAL_COORDINATOR:REMOVER,0
/10.10.0.22
  STATUS:removed,42535295865117307932921825928971026432,1336529659203
  REMOVAL_COORDINATOR:REMOVER,113427455640312814857969558651062452224
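The `gossipinfo` dump above can be checked mechanically for endpoints that a node still holds gossip state for. The following is a minimal Python sketch, not part of any Cassandra tool: it assumes the text layout shown above (an endpoint line starting with `/`, followed by indented `KEY:VALUE` lines) and reports every endpoint whose `STATUS` is not `NORMAL`. The function names `parse_gossipinfo` and `non_normal_endpoints` are hypothetical.

```python
def parse_gossipinfo(text):
    """Parse `nodetool gossipinfo` output into {endpoint: {key: value}}.

    Assumes the layout seen in this thread: a line such as "/10.10.0.24"
    followed by indented "KEY:VALUE" lines describing that endpoint.
    """
    states = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/"):
            current = line[1:]  # drop the leading slash
            states[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            states[current][key] = value
    return states


def non_normal_endpoints(states):
    """Map each endpoint whose STATUS is not NORMAL to its status word."""
    report = {}
    for endpoint, fields in states.items():
        status = fields.get("STATUS", "").split(",", 1)[0]
        if status != "NORMAL":
            report[endpoint] = status
    return report


# Excerpt of the output posted above (LOAD and SCHEMA lines omitted).
SAMPLE = """\
/10.10.0.25
  RPC_ADDRESS:10.10.0.25
  STATUS:NORMAL,0
  RELEASE_VERSION:1.0.9
/10.10.0.24
  STATUS:removed,127605887595351923798765477786913079296,1336530323263
  REMOVAL_COORDINATOR:REMOVER,0
/10.10.0.22
  STATUS:removed,42535295865117307932921825928971026432,1336529659203
  REMOVAL_COORDINATOR:REMOVER,113427455640312814857969558651062452224
"""

if __name__ == "__main__":
    print(non_normal_endpoints(parse_gossipinfo(SAMPLE)))
```

On this excerpt the sketch flags 10.10.0.22 and 10.10.0.24 as `removed`, i.e. exactly the two old machines that the new nodes keep contacting.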


Nicolas



Re: Dead node still being pinged

2012-06-11 Thread Samuel CARRIERE
Well, I don't see anything special in the logs. "Remove token" seems to 
have done its job: according to the logs, old stored hints have been 
deleted.

If I were you, I would connect (through JMX, with jconsole) to one of the 
nodes that is sending messages to an old node, and would have a look at 
these MBeans:
   - org.apache.cassandra.net:type=FailureDetector: does SimpleStates look 
good? (Or do you see the IP of an old node?)
   - org.apache.cassandra.net:type=MessagingService: do you see one of the 
old IPs in any of the attributes?
   - org.apache.cassandra.net:type=StreamingService: do you see an old IP in 
StreamSources or StreamDestinations?
   - org.apache.cassandra.internal:type=HintedHandoff: are any of 
ActiveCount, CurrentlyBlockedTasks, PendingTasks or TotalBlockedTasks 
non-zero?
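Samuel's checklist amounts to two checks: look for an old IP in a few MBean attributes, and confirm the HintedHandoff counters are zero. Reading the attributes still needs a JMX client such as jconsole; the Python sketch below only covers the second half, assuming the values have already been copied out by hand into a dict. The helpers `mentions` and `audit` are hypothetical, every value in `snapshot` is made up for illustration, and the shape of `ResponsePendingTasks` (a map keyed by host) is an assumption.

```python
import re

# The two removed nodes discussed in this thread.
OLD_IPS = {"10.10.0.22", "10.10.0.24"}

# HintedHandoff thread-pool counters from the checklist; all should be 0.
HINT_COUNTERS = ("ActiveCount", "CurrentlyBlockedTasks",
                 "PendingTasks", "TotalBlockedTasks")


def mentions(value, ips):
    """Return the subset of `ips` found anywhere in `value`.

    Walks dicts (keys and values), lists, tuples and sets recursively and
    matches IPs on digit boundaries, so 10.10.0.24 does not match 10.10.0.240.
    """
    if isinstance(value, dict):
        found = set()
        for key, item in value.items():
            found |= mentions(key, ips) | mentions(item, ips)
        return found
    if isinstance(value, (list, tuple, set)):
        found = set()
        for item in value:
            found |= mentions(item, ips)
        return found
    text = str(value)
    return {ip for ip in ips
            if re.search(r"(?<![\d.])" + re.escape(ip) + r"(?!\d)", text)}


def audit(snapshot, old_ips):
    """List suspicious findings in a {mbean: {attribute: value}} snapshot."""
    findings = []
    for mbean, attributes in snapshot.items():
        for name, value in attributes.items():
            hits = mentions(value, old_ips)
            if hits:
                findings.append(f"{mbean} {name} mentions {', '.join(sorted(hits))}")
            if (mbean.endswith("type=HintedHandoff")
                    and name in HINT_COUNTERS and value):
                findings.append(f"{mbean} {name} = {value}")
    return findings


# Hypothetical values, as they might be copied from jconsole.
snapshot = {
    "org.apache.cassandra.net:type=FailureDetector": {
        "SimpleStates": {"/10.10.0.25": "UP", "/10.10.0.26": "UP",
                         "/10.10.0.27": "UP"},
    },
    "org.apache.cassandra.net:type=MessagingService": {
        "ResponsePendingTasks": {"10.10.0.24": 3},
    },
    "org.apache.cassandra.net:type=StreamingService": {
        "StreamSources": [], "StreamDestinations": [],
    },
    "org.apache.cassandra.internal:type=HintedHandoff": {
        "ActiveCount": 0, "CurrentlyBlockedTasks": 0,
        "PendingTasks": 0, "TotalBlockedTasks": 0,
    },
}

for finding in audit(snapshot, OLD_IPS):
    print(finding)
```

With these made-up values, only the MessagingService entry is reported; on a real node, any line mentioning 10.10.0.22 or 10.10.0.24 would point at the subsystem still holding a reference to an old machine.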

Samuel




Nicolas Lalevée  
08/06/2012 21:03
Please respond to user@cassandra.apache.org
To: user@cassandra.apache.org
cc:
Subject: Re: Dead node still being pinged

On 8 June 2012, at 20:02, Samuel CARRIERE wrote:

> I'm on the train, but just a guess: maybe it's hinted handoff. A look in 
> the logs of the new nodes could confirm that: look for the IP of an old 
> node and maybe you'll find hinted-handoff-related messages.

I grepped the logs on every node for every old node; I found nothing since 
the "crash".

In case it helps, here are some grepped log lines from the crash:

system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
and will not receive data for re-replication of /10.10.0.22
system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
and will not receive data for re-replication of /10.10.0.22
system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
and will not receive data for re-replication of /10.10.0.22
system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
and will not receive data for re-replication of /10.10.0.22
system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
and will not receive data for re-replication of /10.10.0.22
system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 Gossiper.java 
(line 818) InetAddress /10.10.0.24 is now dead.
system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java 
(line 818) InetAddress /10.10.0.24 is now dead.
system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 
HintedHandOffManager.java (line 179) Deleting any stored hints for 
/10.10.0.24
system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 
StorageService.java (line 1157) Removing token 
127605887595351923798765477786913079296 for /10.10.0.24
system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java 
(line 818) InetAddress /10.10.0.24 is now dead.


Maybe it's the way I removed the nodes? AFAIR I didn't use the 
decommission command. For each node, I took the node down and then issued a 
remove token command.
Here is what I can find in the log about when I removed one of them:

system.log.1: INFO [GossipTasks:1] 2012-05-02 17:21:10,281 Gossiper.java 
(line 818) InetAddress /10.10.0.24 is now dead.
system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:21:21,496 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [GossipStage:1] 2012-05-02 17:21:59,307 Gossiper.java 
(line 818) InetAddress /10.10.0.24 is now dead.
system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:31:20,336 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:41:06,177 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:51:18,148 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:00:31,709 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:11:02,521 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:20:38,282 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:31:09,513 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:40:31,565 
HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
delivery, aborting
system.log.1

Re: Dead node still being pinged

2012-06-08 Thread Nicolas Lalevée
On 8 June 2012, at 20:50, aaron morton wrote:

> Are the old machines listed in the seed list on the new ones?

No, they aren't.

The first of my old nodes was, while I was "migrating", but not anymore.

Nicolas





Re: Dead node still being pinged

2012-06-08 Thread Nicolas Lalevée
ervice.java 
(line 1157) Removing token 145835300108973619103103718265651724288 for 
/10.10.0.24


Nicolas





Re: Dead node still being pinged

2012-06-08 Thread aaron morton
Are the old machines listed in the seed list on the new ones?

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
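Aaron's question above can be answered per node by reading the `seeds` parameter of the seed provider in each node's cassandra.yaml; in 1.0 that value is a single comma-separated string. The sketch below assumes those strings have already been collected per node (for example by hand). `stale_seeds` and `audit_seeds` are hypothetical helpers, the seed strings are invented, and hostnames would need resolving to IPs before the comparison, which this sketch does not do.

```python
# In cassandra.yaml (1.0) the seeds live under:
#   seed_provider:
#       - class_name: org.apache.cassandra.locator.SimpleSeedProvider
#         parameters:
#             - seeds: "<ip1>,<ip2>"

OLD_IPS = {"10.10.0.22", "10.10.0.24"}


def stale_seeds(seeds, old_ips):
    """Return the entries of a comma-separated `seeds` string that are old nodes."""
    entries = [entry.strip() for entry in seeds.split(",")]
    return [entry for entry in entries if entry in old_ips]


def audit_seeds(seeds_by_node, old_ips):
    """Map each node to the old IPs in its seed list, omitting clean nodes."""
    report = {}
    for node, seeds in seeds_by_node.items():
        stale = stale_seeds(seeds, old_ips)
        if stale:
            report[node] = stale
    return report


# Hypothetical seed strings; the third one is deliberately misconfigured.
seeds_by_node = {
    "10.10.0.25": "10.10.0.25",
    "10.10.0.26": "10.10.0.25",
    "10.10.0.27": "10.10.0.25, 10.10.0.22",
}

print(audit_seeds(seeds_by_node, OLD_IPS))
```

Going by Nicolas's answer (every new node points to the first new node), the real configuration would produce an empty report.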




Re: Dead node still being pinged

2012-06-08 Thread Samuel CARRIERE
I'm on the train, but just a guess: maybe it's hinted handoff. A look in the 
logs of the new nodes could confirm that: look for the IP of an old node and 
maybe you'll find hinted-handoff-related messages.
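The grep suggested above can be taken one step further: instead of only finding lines that mention an old IP, group them by the Java source file that logged them, so hinted-handoff activity (HintedHandOffManager.java) stands apart from gossip (Gossiper.java) and ring changes (StorageService.java). This is a minimal Python sketch assuming the `LEVEL [thread] date time File.java` layout of the log lines quoted in this thread; `summarize` is a hypothetical helper, and the sample lines are copied from the thread with the email line wrapping undone.

```python
import re
from collections import Counter

# LEVEL [thread] yyyy-mm-dd hh:mm:ss,SSS Source.java, as in the quoted logs.
HEADER = re.compile(
    r"(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d{3})\s+(?P<source>\w+\.java)"
)


def summarize(lines, ip):
    """Count lines mentioning `ip`, grouped by the source file that logged them.

    Also returns the timestamp of the last matching line, assuming the
    lines are in chronological order.
    """
    ip_re = re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?!\d)")
    counts = Counter()
    last_seen = None
    for line in lines:
        if not ip_re.search(line):
            continue
        header = HEADER.search(line)
        if header is None:
            continue  # not a log entry we recognise
        counts[header["source"]] += 1
        last_seen = header["ts"]
    return counts, last_seen


LINES = [
    "system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 "
    "00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is "
    "down and will not receive data for re-replication of /10.10.0.22",
    "system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java "
    "(line 818) InetAddress /10.10.0.24 is now dead.",
    "system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 "
    "HintedHandOffManager.java (line 179) Deleting any stored hints for "
    "/10.10.0.24",
    "system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 "
    "StorageService.java (line 1157) Removing token "
    "127605887595351923798765477786913079296 for /10.10.0.24",
    "system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java "
    "(line 818) InetAddress /10.10.0.24 is now dead.",
]

counts, last_seen = summarize(LINES, "10.10.0.24")
print(dict(counts), last_seen)
```

Run over each new node's current `system.log`, an empty count for both old IPs would match what Nicolas reports later in the thread: nothing since the "crash".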


- Original message -
From: Nicolas Lalevée [nicolas.lale...@hibnet.org]
Sent: 08/06/2012 19:26 ZE2
To: user@cassandra.apache.org
Subject: Re: Dead node still being pinged

On 8 June 2012, at 15:17, Samuel CARRIERE wrote:

> What does nodetool ring say? (Ask every node.)

Currently, each of the new nodes sees only the tokens of the new nodes.

> Have you checked that the list of seeds in every yaml is correct?

Yes, it is correct: every one of my new nodes points to the first of my new 
nodes.

> What version of Cassandra are you using?

Sorry, I should have written this in my first mail.
I use 1.0.9.

Nicolas




Re: Dead node still being pinged

2012-06-08 Thread Nicolas Lalevée
On 8 June 2012, at 15:17, Samuel CARRIERE wrote:

> What does nodetool ring say? (Ask every node.)

Currently, each of the new nodes sees only the tokens of the new nodes.

> Have you checked that the list of seeds in every yaml is correct?

Yes, it is correct: every one of my new nodes points to the first of my new 
nodes.

> What version of Cassandra are you using?

Sorry, I should have written this in my first mail.
I use 1.0.9.

Nicolas




RE Dead node still being pinged

2012-06-08 Thread Samuel CARRIERE
Hi Nicolas,

What does nodetool ring say? (Ask every node.)
Have you checked that the list of seeds in every yaml is correct?
What version of Cassandra are you using?

Samuel




Nicolas Lalevée  
08/06/2012 14:10
Please respond to user@cassandra.apache.org
To: user@cassandra.apache.org
cc:
Subject: Dead node still being pinged

I had a configuration with 4 nodes, data-1 to data-4. We then bought 3 
bigger machines, data-5 to data-7, and moved all the data from the old 
nodes to the new ones.
To move the data without interrupting service, I added the new nodes one 
at a time, and then removed the old machines one by one via "remove 
token".

Everything was working fine until there was an unexpected load on our 
cluster: the machines started to swap and became unresponsive. We fixed the 
unexpected load and restarted the three new machines. After that, the new 
Cassandra nodes were reporting that some old tokens were not assigned, 
namely those of data-2 and data-4. To fix this, I issued some "remove 
token" commands again.

Everything seems to be back to normal, but on the network I still see some 
packets from the new cluster to the old machines, on port 7000.
How can I tell Cassandra to completely forget about the old machines?

Nicolas