Re: Error when bringing up nodes during failure testing

2011-03-08 Thread aaron morton
It looks like the node is sending out its application state and waiting the 
required time, after which it expects to know about all the other nodes in the 
cluster. 

 INFO [main] 2011-03-07 17:04:06,660 StorageService.java (line 399) Joining: 
 sleeping 3 ms for pending range setup
For some reason it cannot see them. This could be a config thing or a 
networking thing. 

I was a bit off in my analysis before. When bootstrapping, the node is smart 
enough to wait for gossip to kick in and tell it about the others in the cluster. 

Try the following:
- check network connectivity between the problem node and the others, and check 
that they all have the same config
- try to bring up the problem node with auto_bootstrap off. If it starts, check 
its view of the cluster with nodetool ring
- if that fails, turn on TRACE logging on all nodes and try to bring up the 
problem node. This will log a lot of messages about what Gossip is doing.
 
Aaron
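
The connectivity check in the first suggestion above can be scripted. What follows is a minimal sketch, assuming Python is available on the problem node and that the cluster uses Cassandra's default storage port of 7000 (change it if storage_port is set differently in the config). The helper names can_connect and check_peers are illustrative and not part of any Cassandra tooling.

```python
import socket

STORAGE_PORT = 7000  # Cassandra's default storage_port; an assumption about this cluster


def can_connect(host, port=STORAGE_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def check_peers(peers, port=STORAGE_PORT):
    """Map each peer address to whether its storage port accepted a connection."""
    return {peer: can_connect(peer, port) for peer in peers}
```

Running check_peers(["181.116.206.179", "181.116.208.68"]) on the problem node, using the two peer addresses that appear in the logs in this thread, shows whether the storage port is reachable at the TCP level. A successful connect only rules out basic network problems; it does not prove that gossip is exchanging state.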

On 8/03/2011, at 2:49 PM, mcasandra wrote:

 
 aaron morton wrote:
 
 2) um, not sure. The nodetool output below looks like there are only 2
 nodes in that cluster, i.e. there are no down nodes. 
 
 There are actually 3 nodes. Not sure why it's not showing the other node,
 which is currently down, in the output. The error I am getting is from the
 3rd node, the one that is currently down.
 
 Here are the logs, which show it tried to talk to the other 2 nodes:
 
 ---
 INFO [HintedHandoff:1] 2011-03-07 17:02:36,463 HintedHandOffManager.java
 (line 248) Finished hinted handoff of 0 rows to endpoint /181.116.206.179
 INFO [GossipStage:1] 2011-03-07 17:02:36,463 StorageService.java (line 606)
 Node /181.116.208.68 state jump to normal
 INFO [HintedHandoff:1] 2011-03-07 17:02:36,463 HintedHandOffManager.java
 (line 192) Started hinted handoff for endpoint /181.116.208.68
 INFO [HintedHandoff:1] 2011-03-07 17:02:36,464 HintedHandOffManager.java
 (line 248) Finished hinted handoff of 0 rows to endpoint /181.116.208.68
 INFO [main] 2011-03-07 17:04:06,424 StorageService.java (line 399) Joining:
 getting bootstrap token
 INFO [main] 2011-03-07 17:04:06,426 ColumnFamilyStore.java (line 648)
 switching in a fresh Memtable for LocationInfo at
 CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1299546155643.log',
 position=296)
 INFO [main] 2011-03-07 17:04:06,426 ColumnFamilyStore.java (line 952)
 Enqueuing flush of Memtable-LocationInfo@1367996500(36 bytes, 1 operations)
 INFO [FlushWriter:1] 2011-03-07 17:04:06,427 Memtable.java (line 155)
 Writing Memtable-LocationInfo@1367996500(36 bytes, 1 operations)
 INFO [FlushWriter:1] 2011-03-07 17:04:06,659 Memtable.java (line 162)
 Completed flushing /var/lib/cassandra/data/system/LocationInfo-e-80-Data.db
 (156 bytes)
 INFO [CompactionExecutor:1] 2011-03-07 17:04:06,660 CompactionManager.java
 (line 272) Compacting
 [org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-77-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-78-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-79-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-80-Data.db')]
 INFO [main] 2011-03-07 17:04:06,660 StorageService.java (line 399) Joining:
 sleeping 3 ms for pending range setup
 INFO [CompactionExecutor:1] 2011-03-07 17:04:06,849 CompactionManager.java
 (line 354) Compacted to
 /var/lib/cassandra/data/system/LocationInfo-tmp-e-81-Data.db.  1,293 to 832
 (~64% of original) bytes for 4 keys.  Time: 185ms.
 INFO [main] 2011-03-07 17:04:36,667 StorageService.java (line 399)
 Bootstrapping
 ERROR [main] 2011-03-07 17:04:36,677 AbstractCassandraDaemon.java (line 234)
 Exception encountered during startup.
 java.lang.IllegalStateException: replication factor (3) exceeds number of
 endpoints (2)
 
 
 
 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Exception-when-bringing-up-nodes-during-failure-testing-tp6085692p6099853.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.



Re: Error when bringing up nodes during failure testing

2011-03-08 Thread Jonathan Ellis
Is he trying to bootstrap?  What does that have to do with failure
recovery?  Doesn't make sense to me.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Error when bringing up nodes during failure testing

2011-03-08 Thread mcasandra
I turned auto_bootstrap off and it worked fine. I don't think it's a
connectivity or network issue at all. I am very confused about what's
going on here. Can you please let me know if this is a bug that I am facing?


Also, what are the disadvantages of turning off auto bootstrap? Do I need to
do anything after the fact?

I don't see any join option in nodetool, as was suggested previously.



Re: Error when bringing up nodes during failure testing

2011-03-08 Thread Peter Schuller
 Also, what are the disadvantages of turning off auto bootstrap? Do I need to
 do anything after the fact?

Inserting a new node into a ring without auto_bootstrap means that
it will join the ring but will not contain any of the data for which it
is supposedly responsible. A 'nodetool repair' should cause that data to
be replicated to it. Until that is done, expect the node to return
inconsistent results.

So, turning off auto_bootstrap probably just hid or changed the symptom
of the problem you're seeing rather than fixing it.

-- 
/ Peter Schuller
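
To make the point about an empty, non-bootstrapped node concrete, here is a toy model. It is not Cassandra's read path: replicas are plain dictionaries, values carry a timestamp, and the read and repair functions are hypothetical stand-ins invented for this sketch. It only shows why a read that happens to land on the empty node can miss data until a repair has copied it over.

```python
def read(replicas, key, consistency_level):
    """Toy read: ask the first `consistency_level` replicas, return the newest value found."""
    answers = [replica.get(key) for replica in replicas[:consistency_level]]
    found = [answer for answer in answers if answer is not None]
    if not found:
        return None
    timestamp, value = max(found)  # (timestamp, value) tuples compare by timestamp first
    return value


def repair(target, sources):
    """Toy stand-in for 'nodetool repair': copy the newest version of every key to target."""
    for source in sources:
        for key, versioned in source.items():
            if key not in target or target[key] < versioned:
                target[key] = versioned


node_a = {"user:1": (100, "alice")}
node_b = {"user:1": (100, "alice")}
new_node = {}  # joined with auto_bootstrap off, so it starts empty

replicas = [new_node, node_a, node_b]  # the new node happens to be contacted first
before_one = read(replicas, "user:1", consistency_level=1)     # misses the row
before_quorum = read(replicas, "user:1", consistency_level=2)  # an old replica answers too
repair(new_node, [node_a, node_b])
after_one = read(replicas, "user:1", consistency_level=1)      # the row is now present
```

In this model the single-replica read returns nothing before the repair and the stored value afterwards, while the two-replica read succeeds throughout because an old replica is always among those asked.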


Re: Error when bringing up nodes during failure testing

2011-03-08 Thread Peter Schuller
 2) When I brought 2 nodes down (out of 3), I was able to start one node
 (with 66 % load below) even though auto_bootstrap is set to true. Shouldn't
 it have failed for the same reason?

This is a good point/question. As far as I can tell, a node being
bootstrapped would need to receive data from enough replicas to
satisfy the maximum consistency level that the application(s) use, in
order to avoid violating the consistency guarantees that clients
expect. Not knowing what the application expects, that would imply a
quorum of nodes.

I just checked the code, and my reading (untested) is that the intent
is to receive data from all nodes responsible for the part of the ring
that is being taken over. Meaning, it satisfies the above requirement.

However, that reading is inconsistent with your test which suggests
you were able to bootstrap with two nodes missing out of three.

Is your nodetool output from the new node or the pre-existing online
node? It only lists two nodes, rather than 3 or 4 (with some being
Down). If the only remaining node doesn't know about the other two
that are down, that may explain it.

I may be mis-reading the code because it's suddenly unclear to me how
this is supposed to work with respect to nodes being down (supposing
it's truly down, forever, and needs to be replaced).

Anyone?

-- 
/ Peter Schuller


Re: Error when bringing up nodes during failure testing

2011-03-08 Thread mcasandra
I am as clear as mud with what is happening here :)

But with some suggestions I can try to start my test from scratch and post
results in that order.



Re: Error when bringing up nodes during failure testing

2011-03-07 Thread aaron morton
It's failing because when the node bootstraps it does not know about enough 
nodes to support the RF...

 replication factor (3) exceeds number of
 endpoints (2)

I *think* the normal workaround is to disable auto_bootstrap, bring the nodes 
up, then run nodetool join or call StorageService.joinRing() via JConsole.

I have not tested this, but from reading the code it looks OK. Can you try it 
out and let me know how it goes?

Aaron
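
The failure described above can be reproduced with a simplified model of ring-walking replica placement. This is a sketch written for illustration, not the code in SimpleStrategy.calculateNaturalEndpoints: the function name natural_endpoints, the dictionary-based ring and the use of RuntimeError are all assumptions made here. The ring holds only the endpoints the starting node knows about.

```python
from bisect import bisect_left


def natural_endpoints(token, ring, replication_factor):
    """Walk the ring clockwise from `token`, collecting distinct endpoints.

    `ring` maps token -> endpoint address for the nodes this node knows about.
    """
    tokens = sorted(ring)
    endpoints = []
    if not tokens:
        return endpoints
    # First known token at or after `token`, wrapping to the start of the ring.
    start = bisect_left(tokens, token) % len(tokens)
    for i in range(len(tokens)):
        endpoint = ring[tokens[(start + i) % len(tokens)]]
        if endpoint not in endpoints:
            endpoints.append(endpoint)
        if len(endpoints) == replication_factor:
            return endpoints
    # Ran out of known endpoints before reaching the replication factor.
    raise RuntimeError(
        "replication factor (%d) exceeds number of endpoints (%d)"
        % (replication_factor, len(endpoints))
    )
```

With a ring that contains only the two endpoints shown in the nodetool output in this thread, asking for 2 replicas succeeds and asking for 3 raises an error with the same wording as the one in the log. The model says nothing about why the third endpoint is missing from the node's view.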


On 8/03/2011, at 7:09 AM, mcasandra wrote:

 
 aaron morton wrote:
 
 Can you include the full error stack ? 
 
 
 
 Please find the complete stack trace. Can't really move forward with it not
 knowing the cause:
 
 
 ERROR [main] 2011-03-02 16:28:23,923 AbstractCassandraDaemon.java (line 234)
 Exception encountered during startup.
 java.lang.IllegalStateException: replication factor (3) exceeds number of
 endpoints (2)
     at org.apache.cassandra.locator.SimpleStrategy.calculateNaturalEndpoints(SimpleStrategy.java:60)
     at org.apache.cassandra.locator.AbstractReplicationStrategy.getRangeAddresses(AbstractReplicationStrategy.java:204)
     at org.apache.cassandra.dht.BootStrapper.getRangesWithSources(BootStrapper.java:198)
     at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:83)
     at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:417)
     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:361)
     at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:161)
     at org.apache.cassandra.thrift.CassandraDaemon.setup(CassandraDaemon.java:55)
     at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:217)
     at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:134)
 
 



Re: Error when bringing up nodes during failure testing

2011-03-07 Thread aaron morton
1) yes

2) um, not sure. The nodetool output below looks like there are only 2 nodes in 
that cluster, i.e. there are no down nodes. 

Aaron

On 8/03/2011, at 2:11 PM, mcasandra wrote:

 
 aaron morton wrote:
 
 It's failing because when the node bootstraps it does not know about
 enough nodes to support the RF...
 
 replication factor (3) exceeds number of
 endpoints (2)
 
 I *think* the normal workaround is to disable auto_bootstrap, bring the
 nodes up, then run nodetool join or call StorageService.joinRing() via
 JConsole.
 
 I have not tested this, but from reading the code it looks OK. Can you try
 it out and let me know how it goes?
 
 Aaron
 
 
 I am getting confused about the behaviour:
 
 1) Out of 3 nodes I have 2 nodes up, and I am trying to start the third,
 which keeps failing. Is it expected that, even though 2 nodes are up, this
 node will fail every time with the "replication factor (3) exceeds .."
 message?
 
 2) When I brought 2 nodes down (out of 3), I was able to start one node
 (with 66% load below) even though auto_bootstrap is set to true. Shouldn't
 it have failed for the same reason?
 
 $ nodetool -h `hostname` ring
 Address          Status State   Load        Owns    Token
                                                     113427455640312821154458202477256070484
 181.116.206.179  Up     Normal  645.13 KB   33.33%  0
 181.116.208.68   Up     Normal  640.16 KB   66.67%  113427455640312821154458202477256070484
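
The Owns column in the ring output above can be recomputed from the two tokens. This is a minimal sketch, assuming the cluster uses RandomPartitioner (token space of size 2**127) and that each node owns the span from the previous token up to its own; the ownership helper is written for this example.

```python
RING_SIZE = 2 ** 127  # RandomPartitioner token space; assumed to be the partitioner in use


def ownership(tokens):
    """Return {token: fraction of the ring owned}, each node owning (previous token, own token]."""
    ordered = sorted(tokens)
    owned = {}
    for i, token in enumerate(ordered):
        previous = ordered[i - 1]  # index -1 wraps to the last token for the first node
        # A single node owns the whole ring, so a zero-width span counts as RING_SIZE.
        owned[token] = ((token - previous) % RING_SIZE or RING_SIZE) / RING_SIZE
    return owned


tokens = [0, 113427455640312821154458202477256070484]
for token, fraction in ownership(tokens).items():
    print(f"{token}: {fraction:.2%}")
# prints:
# 0: 33.33%
# 113427455640312821154458202477256070484: 66.67%
```

The larger token equals 2 * (2**127 // 3), which is what an evenly spaced three-node layout would assign to its third node. The 66.67% share is therefore consistent with a middle token being absent from this node's view of the ring, although that is an inference from the numbers and not something the output states.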
 
 



Re: Error when bringing up nodes during failure testing

2011-03-07 Thread mcasandra

aaron morton wrote:
 
 2) um, not sure. The nodetool output below looks like there are only 2
 nodes in that cluster, i.e. there are no down nodes. 
 
There are actually 3 nodes. Not sure why it's not showing the other node,
which is currently down, in the output. The error I am getting is from the
3rd node, the one that is currently down.

Here are the logs, which show it tried to talk to the other 2 nodes:

---
 INFO [HintedHandoff:1] 2011-03-07 17:02:36,463 HintedHandOffManager.java
(line 248) Finished hinted handoff of 0 rows to endpoint /181.116.206.179
 INFO [GossipStage:1] 2011-03-07 17:02:36,463 StorageService.java (line 606)
Node /181.116.208.68 state jump to normal
 INFO [HintedHandoff:1] 2011-03-07 17:02:36,463 HintedHandOffManager.java
(line 192) Started hinted handoff for endpoint /181.116.208.68
 INFO [HintedHandoff:1] 2011-03-07 17:02:36,464 HintedHandOffManager.java
(line 248) Finished hinted handoff of 0 rows to endpoint /181.116.208.68
 INFO [main] 2011-03-07 17:04:06,424 StorageService.java (line 399) Joining:
getting bootstrap token
 INFO [main] 2011-03-07 17:04:06,426 ColumnFamilyStore.java (line 648)
switching in a fresh Memtable for LocationInfo at
CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1299546155643.log',
position=296)
 INFO [main] 2011-03-07 17:04:06,426 ColumnFamilyStore.java (line 952)
Enqueuing flush of Memtable-LocationInfo@1367996500(36 bytes, 1 operations)
 INFO [FlushWriter:1] 2011-03-07 17:04:06,427 Memtable.java (line 155)
Writing Memtable-LocationInfo@1367996500(36 bytes, 1 operations)
 INFO [FlushWriter:1] 2011-03-07 17:04:06,659 Memtable.java (line 162)
Completed flushing /var/lib/cassandra/data/system/LocationInfo-e-80-Data.db
(156 bytes)
 INFO [CompactionExecutor:1] 2011-03-07 17:04:06,660 CompactionManager.java
(line 272) Compacting
[org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-77-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-78-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-79-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-80-Data.db')]
 INFO [main] 2011-03-07 17:04:06,660 StorageService.java (line 399) Joining:
sleeping 3 ms for pending range setup
 INFO [CompactionExecutor:1] 2011-03-07 17:04:06,849 CompactionManager.java
(line 354) Compacted to
/var/lib/cassandra/data/system/LocationInfo-tmp-e-81-Data.db.  1,293 to 832
(~64% of original) bytes for 4 keys.  Time: 185ms.
 INFO [main] 2011-03-07 17:04:36,667 StorageService.java (line 399)
Bootstrapping
ERROR [main] 2011-03-07 17:04:36,677 AbstractCassandraDaemon.java (line 234)
Exception encountered during startup.
java.lang.IllegalStateException: replication factor (3) exceeds number of
endpoints (2)





Re: Error when bringing up nodes during failure testing

2011-03-05 Thread aaron morton
Can you include the full error stack ? 

It's failing because of the reason stated. But I need some more info to 
understand what part of the startup process it's stuck at. 

Aaron
  
On 4/03/2011, at 6:39 AM, mcasandra wrote:

 Whenever I do failure testing I see this error message and then the cassandra
 process exits. This is what I am doing:
 
 
 1. 3 node cluster. CF with RF=3, W=QUORUM and R=QUORUM
 2. Run client code that just reads data from the CF in a while loop.
 3. Bring one node down (Node C). Everything ok. Client is happy (as
 expected)
 4. Bring one more node down (Node A). Client throws error (as expected)
 5. Bring one node up; I then receive the following error message in cassandra,
 and cassandra exits at this point.
 
 Please help. Sometimes, when I bring some other node up first (Node C)
 and then bring up this node (A), it works. Not sure what's going on here.
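
The client behaviour in the steps that take Node C and then Node A down follows from quorum arithmetic. A short sketch, with helper names chosen for this example:

```python
def quorum(replication_factor):
    """Number of replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def quorum_available(replication_factor, live_replicas):
    """True if enough replicas are up to satisfy QUORUM."""
    return live_replicas >= quorum(replication_factor)


RF = 3
all_up = quorum_available(RF, live_replicas=3)    # True
one_down = quorum_available(RF, live_replicas=2)  # True: client still succeeds
two_down = quorum_available(RF, live_replicas=1)  # False: client errors, as observed
```

This accounts for the client-side errors only. The startup exception on the restarted node is a separate check against the number of endpoints that node knows about.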
 
 Error in Cassandra logs:
 
 ERROR 15:36:55,153 Exception encountered during startup. 
 java.lang.IllegalStateException: replication factor (3) exceeds number of
 endpoints (2) 
     at org.apache.cassandra.locator.SimpleStrategy.calculateNaturalEndpoints(SimpleStrategy.java:60)
     at org.apache.cassandra.locator.AbstractReplicationStrategy.getRangeAddresses(AbstractReplicationStrategy.java:204)
 
 



Re: Error when bringing up nodes during failure testing

2011-03-05 Thread mcasandra

aaron morton wrote:
 
 Can you include the full error stack ? 
 
 It's failing because of the reason stated. But I need some more info to
 understand what part of the startup process it's stuck at. 
 
 
Thanks for responding! I'll send it as soon as I can get on my network. But
you mentioned that someone already stated the reason but I can't find it in
this thread. Did I miss it?
