Hi, I have several times seen a problem with a write request on one side of my DB cluster during a reboot of the other side. In some cases Sequoia comes up, joins the group and starts without any error being detected, but the data in the backends is not the same: the write was done on the 2nd machine but is not recovered on the 1st, rebooting machine. I have tried hard to find a way to reproduce this behaviour deterministically, but so far without success. It currently happens in around 10% of reboots. The main problem for us is that Sequoia starts up as if nothing were wrong. If it set the backends to the disabled state, we could do a manual resync.
Maybe you can give me some hints? Could it be a process that is killed too early,
or a communication race condition?
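Until the cause is known, the divergence could be detected outside of Sequoia by comparing the contents of a table on both backends. The following is only a minimal sketch under assumptions: the rows are fetched directly from each PostgreSQL backend with whatever driver is at hand (not shown), both sides return the same Python types, and the function names are hypothetical.

```python
import hashlib


def table_digest(rows):
    """Return an order-independent digest of an iterable of row tuples."""
    # Hash each row separately, then sort the hashes so that the row order
    # returned by each backend does not influence the result.
    row_hashes = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256()
    for row_hash in row_hashes:
        combined.update(row_hash.encode("ascii"))
    return combined.hexdigest()


def backends_match(rows_a, rows_b):
    """True if both backends hold the same multiset of rows."""
    return table_digest(rows_a) == table_digest(rows_b)
```

If the digests differ after a reboot, the rebooted backend could be disabled by hand and resynced.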
My general environment:
Debian etch, Java 1.5, 1 backend PostgreSQL 8.1.4, Sequoia 2.10.4, Appia 3.2.4,
Hedera 1.5.6-cvs03.01.2007, "base view" setup of Appia
RAIDb-1 setup with 2 machines, each with 1 controller and 1 single backend
Some strange log entries from the rebooting machine:
2007-01-12 15:43:20,374 ERROR sequoia.controller.scheduler Unexpected negative
suspendedWrites in AbstractScheduler.resumeWrites()
2007-01-12 15:43:20,374 ERROR sequoia.controller.scheduler Unexpected negative
suspendedTransactions in AbstractScheduler.resumeNewTransactions()
2007-01-12 15:43:20,374 ERROR sequoia.controller.scheduler Unexpected negative
suspendedPersistentConnections in
AbstractScheduler.resumeNewPersistentConnections()
or
2007-01-12 15:43:48,486 ERROR sequoia.controller.loadbalancer Request was not
found in total order queue, posting out of order (UPDATE bot_event_cfg_table SET
event_name='X509_OCSP_RESPONDER_UNREACHABLE',event_priority='P2',event_is_active='1'
where bot_event_cfg_id='X509_OCSP_RESPONDER_UNREACHABLE'/)
Where could these entries come from?
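Since these entries seem to be the only visible sign of the problem, a small script could scan the controller log for them after each reboot. This is a minimal sketch, assuming the log format of the excerpts above (timestamp, level, logger, message, possibly wrapped over several lines); the pattern labels and the function name are hypothetical.

```python
import re

# Message fragments taken from the log excerpts above. They are matched after
# whitespace is collapsed, because entries may be wrapped across lines.
SUSPICIOUS_PATTERNS = {
    "negative_suspend_counter": re.compile(
        r"Unexpected negative suspended\w+ in AbstractScheduler\.resume\w+\(\)"
    ),
    "out_of_order_request": re.compile(
        r"Request was not found in total order queue, posting out of order"
    ),
}

# Start of a log entry: "YYYY-MM-DD HH:MM:SS,mmm LEVEL "
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} [A-Z]+ ")


def find_suspicious_entries(log_text):
    """Return (label, entry) pairs for log entries matching a known pattern."""
    entries = []
    for line in log_text.splitlines():
        if ENTRY_START.match(line) or not entries:
            entries.append(line.strip())
        else:
            # Continuation of a wrapped entry (or a stack trace line).
            entries[-1] += " " + line.strip()

    hits = []
    for entry in entries:
        normalized = " ".join(entry.split())
        for label, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(normalized):
                hits.append((label, normalized))
    return hits
```

A non-empty result after a reboot would be a reason to compare the backends before trusting the rejoined controller.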
Just FYI:
Sometimes I get the following error, which I can reproduce by using iptables to
close the Appia port for a moment. In this case Sequoia correctly sets the
backend to the disabled state and the backend is then restored:
2007-01-12 15:45:09,758 ERROR controller.recoverylog.RecoverThread Unable to get
checkpoint from recovery log.
java.sql.SQLException: Unable to get checkpoint disable
botdb1_00304874867C-10.10.10.1:25322-20070112154347718+0100 from the recovery
log (Checkpoint disable
botdb1_00304874867C-10.10.10.1:25322-20070112154347718+0100 does not exist in
recovery log)
at org.continuent.sequoia.controller.recoverylog.events.GetCheckpointLogIdEvent.execute(GetCheckpointLogIdEvent.java:96)
at org.continuent.sequoia.controller.recoverylog.LoggerThread.run(LoggerThread.java:732)
Thanx and Greetz,
)ngo
