[Sequoia] Backend inconsistency after reboot (restart of controller)

Ingo Kampe Mon, 15 Jan 2007 04:53:06 -0800

Hi,

I have repeatedly seen a problem when a write request arrives on one side of my
db cluster while the other side is rebooting. In some cases sequoia comes up,
joins the group and starts without detecting any error, but the data in the
backends is not the same: the write was done on the 2nd machine but was not
recovered on the 1st (rebooting) machine. I tried hard to find a specific
trigger that makes this behaviour deterministically reproducible, but so far
without success. It currently happens in around 10% of reboots. The main problem
for us is that sequoia starts up as if nothing were wrong. If it set the
backends to the disabled state, we could do a manual resync.
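
Since the controller does not flag the divergence itself, one possible stopgap is to compare the two backends directly after each restart. The following is only a minimal sketch under stated assumptions: the function names are hypothetical, the connections are plain Python DB-API connections opened straight to each backend (not through sequoia), and an in-memory SQLite database stands in for PostgreSQL so the example is self-contained. It reads whole tables, so it is only practical for small ones.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table):
    """Return (row_count, sha256 hex digest) over all rows of `table`.

    `table` is interpolated into the SQL text, so pass trusted names only.
    """
    cur = conn.cursor()
    cur.execute(f"SELECT * FROM {table}")
    # Sort on the client so the result does not depend on storage order.
    rows = sorted(repr(row) for row in cur.fetchall())
    digest = hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()
    return len(rows), digest


def diverging_tables(conn_a, conn_b, tables):
    """List the tables whose fingerprints differ between the two backends."""
    return [t for t in tables
            if table_fingerprint(conn_a, t) != table_fingerprint(conn_b, t)]


def demo_backend(priority):
    """Hypothetical stand-in for one backend, holding a single row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bot_event_cfg_table "
                 "(bot_event_cfg_id TEXT, event_priority TEXT)")
    conn.execute("INSERT INTO bot_event_cfg_table VALUES (?, ?)",
                 ("X509_OCSP_RESPONDER_UNREACHABLE", priority))
    return conn


if __name__ == "__main__":
    # One backend saw the UPDATE, the other one missed it.
    print(diverging_tables(demo_backend("P2"), demo_backend("P1"),
                           ["bot_event_cfg_table"]))
```

A non-empty result would be the signal to disable the stale backend and do the manual resync.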


Maybe you can give me some hints? Could it be a process that is killed too
early, or a communication race condition?

My general environment:
debian etch, java 1.5, 1 backend postgresql 8.1.4, sequoia 2.10.4, appia 3.2.4,
hedera 1.5.6-cvs03.01.2007, "base view" setup of appia
RAIDb-1 setup with 2 machines, each running 1 controller with a single backend


Some strange log entries from the rebooting machine:

2007-01-12 15:43:20,374 ERROR sequoia.controller.scheduler Unexpected negative
suspendedWrites in AbstractScheduler.resumeWrites()
2007-01-12 15:43:20,374 ERROR sequoia.controller.scheduler Unexpected negative
suspendedTransactions in AbstractScheduler.resumeNewTransactions()
2007-01-12 15:43:20,374 ERROR sequoia.controller.scheduler Unexpected negative
suspendedPersistentConnections in 
AbstractScheduler.resumeNewPersistentConnections()

or

2007-01-12 15:43:48,486 ERROR sequoia.controller.loadbalancer Request was not
found in total order queue, posting out of order (UPDATE bot_event_cfg_table SET
event_name='X509_OCSP_RESPONDER_UNREACHABLE',event_priority='P2',event_is_active='1'
where bot_event_cfg_id='X509_OCSP_RESPONDER_UNREACHABLE'/)

Where could these entries come from?
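
Until the cause is known, the controller log could at least be scanned for these two kinds of entries after each restart. This is a rough sketch only: it assumes each log entry sits on a single line in the layout shown above (timestamp, level, logger, message), the function and pattern names are made up for illustration, and whether these entries really coincide with the inconsistency is not confirmed.

```python
import re

# Message fragments taken from the log excerpts above.
SUSPICIOUS = (
    re.compile(r"Unexpected negative suspended\w+"),
    re.compile(r"Request was not found in total order queue"),
)

# Assumed layout: "<date> <time>,<millis> ERROR <logger> <message>"
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ERROR (\S+) (.*)$"
)


def suspicious_entries(lines):
    """Return (timestamp, logger) for every ERROR line matching a pattern."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and any(p.search(match.group(3)) for p in SUSPICIOUS):
            hits.append((match.group(1), match.group(2)))
    return hits
```

Passing an open log file to `suspicious_entries` yields one tuple per matching entry; any hit after a reboot would be a reason to check the backends by hand.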


Just FYI:
Sometimes I get the following error, which I can reproduce by using iptables to
close the appia port for a moment. In this case sequoia correctly sets the
backend to the disabled state and it is then restored:
2007-01-12 15:45:09,758 ERROR controller.recoverylog.RecoverThread Unable to get
checkpoint from recovery log.
java.sql.SQLException: Unable to get checkpoint disable
botdb1_00304874867C-10.10.10.1:25322-20070112154347718+0100 from the recovery
log (Checkpoint disable
botdb1_00304874867C-10.10.10.1:25322-20070112154347718+0100 does not exist in
recovery log)
        at org.continuent.sequoia.controller.recoverylog.events.GetCheckpointLogIdEvent.execute(GetCheckpointLogIdEvent.java:96)
        at org.continuent.sequoia.controller.recoverylog.LoggerThread.run(LoggerThread.java:732)


Thanx and Greetz,
)ngo


_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia
