[jira] [Updated] (CASSANDRA-4583) Some nodes forget schema when 1 node fails

2012-08-29 Thread Edward Sargisson (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Sargisson updated CASSANDRA-4583:


Description: 
At present we do not have a complete reproduction for this defect, but we are 
raising it as requested by Aaron Morton. We will update the issue as we find 
out more, and if any additional logging or tests are requested we will run 
them if we can.

We have experienced two failures ascribed to this defect. On the Cassandra 
user mailing list, Peter Schuller (2012-08-28) describes an additional failure.

Reproduction steps as currently known:
1. Set up a cluster with 6 nodes (call them #1 through #6).
2. Have #5 fail completely. One failure occurred when the node was stopped to 
replace the battery in the hard disk cache; the second occurred when hardware 
monitoring recorded a problem: CPU usage was increasing without explanation 
and the server console was frozen, so the machine was restarted.
3. Bring #5 back online.

Expected behaviour:
* #5 should rejoin the ring.
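
A quick way to verify this expected state (not part of the original report; 
the host name and JMX port below are placeholders) is to check ring membership 
and compare the schema UUID that each endpoint gossips:

    # Is #5 back to Up/Normal? (run against any live node)
    nodetool -h node1 -p 7199 ring

    # A node's gossip view lists a SCHEMA UUID per endpoint; when the
    # cluster agrees on schema, every endpoint shows the same UUID.
    nodetool -h node1 -p 7199 gossipinfo | grep -E '^/|SCHEMA'

In the incidents described here, a node that "forgot" its column families 
would be expected to gossip a different SCHEMA UUID from the rest.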

Actual behaviour (based on the incident we saw yesterday):
* #5 didn't rejoin the ring.
* We stopped all nodes and started them one by one.
* Nodes #2, #4 and #6 had forgotten most of their column families. They had 
the keyspace but with only one column family instead of the usual 9 or so.
* We ran nodetool resetlocalschema on #2, #4 and #6.
* We ran nodetool repair -pr on #2, #4, #5 and #6.
* On #2, nodetool repair appeared to crash, in that there were no messages in 
the logs from it for 10+ minutes; nodetool compactionstats and nodetool 
netstats showed no activity.
* Restarting nodetool repair -pr fixed the problem, and the repair ran to 
completion.
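
For reference, the recovery steps above map onto roughly the following 
commands (a sketch only; host names are placeholders, and the commands can 
equally be run locally on each node without -h):

    # Drop the divergent local schema on the affected nodes so they pull
    # a fresh copy from the rest of the cluster.
    for h in node2 node4 node6; do
        nodetool -h "$h" resetlocalschema
    done

    # Repair only the primary range of each node that was touched.
    for h in node2 node4 node5 node6; do
        nodetool -h "$h" repair -pr
    done

    # While a repair is running, these should show validation compactions
    # and streaming; both were idle during the stalled repair on #2.
    nodetool -h node2 compactionstats
    nodetool -h node2 netstats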



  was:
At present we do not have a complete reproduction for this defect, but we are 
raising it as requested by Aaron Morton. We will update the issue as we find 
out more, and if any additional logging or tests are requested we will run 
them if we can.

We have experienced two failures ascribed to this defect. On the Cassandra 
user mailing list, Peter Schuller (2012-08-28) describes an additional failure.

Reproduction steps as currently known:
1. Set up a cluster with 6 nodes (call them #1 through #6).
2. Have #5 fail completely. One failure occurred when the node was stopped to 
replace the battery in the hard disk cache; the second occurred when hardware 
monitoring recorded a problem: CPU usage was increasing without explanation 
and the server console was frozen, so the machine was restarted.
3. Bring #5 back online.

Expected behaviour:
* #5 should rejoin the ring.

Actual behaviour (based on the incident we saw yesterday):
* #5 didn't rejoin the ring.
* We stopped all nodes and started them one by one.
* Nodes #2, #4 and #6 had forgotten most of their column families. They had 
the keyspace but with only one column family instead of the usual 9 or so.
* We ran nodetool resetlocalschema on #2, #4 and #6.
* We ran nodetool repair -pr on #2, #4, #5 and #6.
* On one of these nodes, nodetool repair appeared to crash, in that there were 
no messages in the logs from it for 10+ minutes; nodetool compactionstats and 
nodetool netstats showed no activity.
* Restarting nodetool repair -pr fixed the problem, and the repair ran to 
completion.




 Some nodes forget schema when 1 node fails
 --

 Key: CASSANDRA-4583
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4583
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.1.2
 Environment: CentOS release 6.3 (Final)
Reporter: Edward Sargisson


[jira] [Updated] (CASSANDRA-4583) Some nodes forget schema when 1 node fails

2012-08-29 Thread Edward Sargisson (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Sargisson updated CASSANDRA-4583:


Attachment: cass-4583-5-system.log, cass-4583-2-system.log

cass-4583-5-system.log is an extract from #5 from the time of the incident.
Similarly, cass-4583-2-system.log is from #2.

#2 is 10.30.11.40
#5 is 10.30.11.43
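
For anyone reading the attachments, the schema-related activity is the 
interesting part. One way to pull it out of a node's full log (an assumption 
about the log location, not something stated in this ticket) is:

    # Default packaged-install log path for Cassandra 1.1; adjust per install.
    grep -iE 'schema|migration' /var/log/cassandra/system.log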

 Some nodes forget schema when 1 node fails
 --

 Key: CASSANDRA-4583
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4583
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.1.2
 Environment: CentOS release 6.3 (Final)
Reporter: Edward Sargisson
 Attachments: cass-4583-2-system.log, cass-4583-5-system.log


