Re: Error when bringing up nodes during failure testing

Jonathan Ellis Tue, 08 Mar 2011 07:57:02 -0800

Is he trying to bootstrap?  What does that have to do with failure
recovery?  Doesn't make sense to me.


On Tue, Mar 8, 2011 at 2:33 AM, aaron morton <aa...@thelastpickle.com> wrote:
> It looks like the node is sending out its application state and waiting the
> required time, after which it expects to know about all other nodes in the
> cluster.
>
>> INFO [main] 2011-03-07 17:04:06,660 StorageService.java (line 399) Joining: 
>> sleeping 30000 ms for pending range setup
> For some reason it cannot see them. This could be a config thing or a 
> networking thing.
>
> I was a bit off in my analysis before. When bootstrapping, the node is smart
> enough to wait for gossip to kick in and tell it about the others in the
> cluster.
>
> Try the following:
> - check network connectivity between the problem node and the others, and
> check they have the same config
> - try to bring up the problem node with auto_bootstrap off. If it starts,
> check its view of the cluster with nodetool ring
> - if that fails, turn on TRACE logging on all nodes and try to bring up the
> problem node. This will log a lot of messages about what Gossip is doing.
>
> Aaron
>
> On 8/03/2011, at 2:49 PM, mcasandra wrote:
>
>>
>> aaron morton wrote:
>>>
>>> 2) um, not sure. The nodetool output below looks like there are only 2
>>> nodes in that cluster, i.e. there are no down nodes.
>>>
>> There are actually 3 nodes. Not sure why it's not showing the other node,
>> which is currently down, in the output. The error I am getting is from the
>> 3rd node, the one that is currently down.
>>
>> Here are the logs, which show it tried to talk to the other 2 nodes:
>>
>> ---
>> INFO [HintedHandoff:1] 2011-03-07 17:02:36,463 HintedHandOffManager.java
>> (line 248) Finished hinted handoff of 0 rows to endpoint /181.116.206.179
>> INFO [GossipStage:1] 2011-03-07 17:02:36,463 StorageService.java (line 606)
>> Node /181.116.208.68 state jump to normal
>> INFO [HintedHandoff:1] 2011-03-07 17:02:36,463 HintedHandOffManager.java
>> (line 192) Started hinted handoff for endpoint /181.116.208.68
>> INFO [HintedHandoff:1] 2011-03-07 17:02:36,464 HintedHandOffManager.java
>> (line 248) Finished hinted handoff of 0 rows to endpoint /181.116.208.68
>> INFO [main] 2011-03-07 17:04:06,424 StorageService.java (line 399) Joining:
>> getting bootstrap token
>> INFO [main] 2011-03-07 17:04:06,426 ColumnFamilyStore.java (line 648)
>> switching in a fresh Memtable for LocationInfo at
>> CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1299546155643.log',
>> position=296)
>> INFO [main] 2011-03-07 17:04:06,426 ColumnFamilyStore.java (line 952)
>> Enqueuing flush of Memtable-LocationInfo@1367996500(36 bytes, 1 operations)
>> INFO [FlushWriter:1] 2011-03-07 17:04:06,427 Memtable.java (line 155)
>> Writing Memtable-LocationInfo@1367996500(36 bytes, 1 operations)
>> INFO [FlushWriter:1] 2011-03-07 17:04:06,659 Memtable.java (line 162)
>> Completed flushing /var/lib/cassandra/data/system/LocationInfo-e-80-Data.db
>> (156 bytes)
>> INFO [CompactionExecutor:1] 2011-03-07 17:04:06,660 CompactionManager.java
>> (line 272) Compacting
>> [org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-77-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-78-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-79-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-80-Data.db')]
>> INFO [main] 2011-03-07 17:04:06,660 StorageService.java (line 399) Joining:
>> sleeping 30000 ms for pending range setup
>> INFO [CompactionExecutor:1] 2011-03-07 17:04:06,849 CompactionManager.java
>> (line 354) Compacted to
>> /var/lib/cassandra/data/system/LocationInfo-tmp-e-81-Data.db.  1,293 to 832
>> (~64% of original) bytes for 4 keys.  Time: 185ms.
>> INFO [main] 2011-03-07 17:04:36,667 StorageService.java (line 399)
>> Bootstrapping
>> ERROR [main] 2011-03-07 17:04:36,677 AbstractCassandraDaemon.java (line 234)
>> Exception encountered during startup.
>> java.lang.IllegalStateException: replication factor (3) exceeds number of
>> endpoints (2)
>> ----
>>
>>
>> --
>> View this message in context: 
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Exception-when-bringing-up-nodes-during-failure-testing-tp6085692p6099853.html
>> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
>> Nabble.com.
>
>
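
For the first step in the list quoted above (checking network connectivity
from the problem node to the others), a minimal sketch might look like the
following. The two addresses are the ones that appear in the quoted log; the
port is an assumption (7000 is the default storage_port, so use whatever your
config actually sets), and the function names are just illustrative. A
successful TCP connect only shows the port is reachable; it does not prove
gossip is working.

```python
import socket

# Addresses taken from the quoted log excerpt.
PEERS = ["181.116.206.179", "181.116.208.68"]
# Assumption: the cluster uses the default storage_port (7000).
STORAGE_PORT = 7000


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_peers(peers, port, timeout=3.0):
    """Map each peer address to whether its port accepted a TCP connection."""
    return {peer: can_connect(peer, port, timeout) for peer in peers}
```

Calling check_peers(PEERS, STORAGE_PORT) on the problem node and printing the
result shows which of the other nodes it can reach at the TCP level. If both
come back True, the config comparison and the auto_bootstrap / TRACE steps
are the next things to look at.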



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
