Is he trying to bootstrap? What does that have to do with failure recovery? Doesn't make sense to me.
On Tue, Mar 8, 2011 at 2:33 AM, aaron morton <aa...@thelastpickle.com> wrote:
> It looks like the node is sending out its application state and waiting the
> required time after which it expects to know about all other nodes in the
> cluster.
>
>> INFO [main] 2011-03-07 17:04:06,660 StorageService.java (line 399) Joining:
>> sleeping 30000 ms for pending range setup
> For some reason it cannot see them. This could be a config thing or a
> networking thing.
>
> I was a bit off in my analysis before. When bootstrapping it's smart enough
> to wait for gossip to kick in and tell the node about the others in the
> cluster.
>
> Try the following:
> - check network connectivity between the problem node and the others, and
> check they have the same config
> - try to bring up the problem node with auto_bootstrap off. If it can
> start, check its view of the cluster with nodetool ring
> - if that fails turn on TRACE logging on all nodes, and try to bring up the
> problem node. This will log a lot of messages about what Gossip is doing.
>
> Aaron
>
> On 8/03/2011, at 2:49 PM, mcasandra wrote:
>
>>
>> aaron morton wrote:
>>>
>>> 2) um, not sure. The nodetool output below looks like there are only 2
>>> nodes in that cluster, i.e. there are no down nodes.
>>>
>> There are actually 3 nodes. Not sure why it's not showing the other node in
>> the output which is currently down. The error I am getting is from the
>> 3rd node that is currently down.
>>
>> Here are the logs which show it tried to talk to the other 2 nodes:
>>
>> ---
>> INFO [HintedHandoff:1] 2011-03-07 17:02:36,463 HintedHandOffManager.java
>> (line 248) Finished hinted handoff of 0 rows to endpoint /181.116.206.179
>> INFO [GossipStage:1] 2011-03-07 17:02:36,463 StorageService.java (line 606)
>> Node /181.116.208.68 state jump to normal
>> INFO [HintedHandoff:1] 2011-03-07 17:02:36,463 HintedHandOffManager.java
>> (line 192) Started hinted handoff for endpoint /181.116.208.68
>> INFO [HintedHandoff:1] 2011-03-07 17:02:36,464 HintedHandOffManager.java
>> (line 248) Finished hinted handoff of 0 rows to endpoint /181.116.208.68
>> INFO [main] 2011-03-07 17:04:06,424 StorageService.java (line 399) Joining:
>> getting bootstrap token
>> INFO [main] 2011-03-07 17:04:06,426 ColumnFamilyStore.java (line 648)
>> switching in a fresh Memtable for LocationInfo at
>> CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1299546155643.log',
>> position=296)
>> INFO [main] 2011-03-07 17:04:06,426 ColumnFamilyStore.java (line 952)
>> Enqueuing flush of Memtable-LocationInfo@1367996500(36 bytes, 1 operations)
>> INFO [FlushWriter:1] 2011-03-07 17:04:06,427 Memtable.java (line 155)
>> Writing Memtable-LocationInfo@1367996500(36 bytes, 1 operations)
>> INFO [FlushWriter:1] 2011-03-07 17:04:06,659 Memtable.java (line 162)
>> Completed flushing /var/lib/cassandra/data/system/LocationInfo-e-80-Data.db
>> (156 bytes)
>> INFO [CompactionExecutor:1] 2011-03-07 17:04:06,660 CompactionManager.java
>> (line 272) Compacting
>> [org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-77-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-78-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-79-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-80-Data.db')]
>> INFO [main] 2011-03-07 17:04:06,660 StorageService.java (line 399) Joining:
>> sleeping 30000 ms for pending range setup
>> INFO [CompactionExecutor:1] 2011-03-07 17:04:06,849 CompactionManager.java
>> (line 354) Compacted to
>> /var/lib/cassandra/data/system/LocationInfo-tmp-e-81-Data.db. 1,293 to 832
>> (~64% of original) bytes for 4 keys. Time: 185ms.
>> INFO [main] 2011-03-07 17:04:36,667 StorageService.java (line 399)
>> Bootstrapping
>> ERROR [main] 2011-03-07 17:04:36,677 AbstractCassandraDaemon.java (line 234)
>> Exception encountered during startup.
>> java.lang.IllegalStateException: replication factor (3) exceeds number of
>> endpoints (2)
>> ----
>>
>>
>> --
>> View this message in context:
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Exception-when-bringing-up-nodes-during-failure-testing-tp6085692p6099853.html
>> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
>> Nabble.com.
>
>

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
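
For anyone reading the archive, here is a rough sketch of the kind of check that could produce the IllegalStateException quoted above. It is not Cassandra's actual code: pick_replicas and seen_via_gossip are hypothetical names, and it assumes the "2" in the error counts the two peers named in the log, with the joining node itself not yet counted.

```python
def pick_replicas(known_endpoints, replication_factor):
    """Return the first `replication_factor` distinct endpoints, or fail.

    Hypothetical helper for illustration only; not part of Cassandra.
    """
    # De-duplicate while keeping the original order.
    distinct = list(dict.fromkeys(known_endpoints))
    if replication_factor > len(distinct):
        raise ValueError(
            "replication factor (%d) exceeds number of endpoints (%d)"
            % (replication_factor, len(distinct))
        )
    return distinct[:replication_factor]


# The two peers named in the quoted log. Assumption: these are the only
# endpoints counted while the third node is still joining.
seen_via_gossip = ["181.116.206.179", "181.116.208.68"]

try:
    pick_replicas(seen_via_gossip, 3)
except ValueError as err:
    print(err)
```

With a replication factor of 3 and only two counted endpoints the check fails with the same message text as in the log; with three endpoints it would return all three.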