Hi Ingo,
Could you please describe how you are doing the reboot? I mean from
Sequoia's point of view: which commands are you using to stop the
vdb / controller? Could you also provide the controller logs that show the
issue (from both controllers)?
I will try to investigate this. If you can provide more information, it
will be very welcome.
Thanks a lot,
Stephane
Ingo Kampe wrote:
Hi,
I have several times seen a problem with write requests on one side of my db
cluster while the other side is rebooting. In some cases Sequoia comes up,
joins the group and starts without any error being detected, but the data in
the backends is not the same. The write was done on the 2nd machine but is not
recovered on the 1st, rebooting machine. I tried hard to find a specific trigger
that makes this behaviour deterministically reproducible, but so far without
success. It currently happens in around 10% of reboots. The main problem for us
is that Sequoia starts up without reporting a problem. If it set the backends
to the disabled state, we could do a manual resync.
Maybe you can give me some hints? A process killed too early, or a communication
race condition? ...
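As a stopgap for detecting this kind of silent divergence, the two backends
could be compared directly after a reboot. The following is only a minimal
sketch in Python and not part of Sequoia: table_digest, backends_match and
make_demo_backend are hypothetical helper names, and in-memory SQLite
databases stand in for the two PostgreSQL backends so that the example runs
anywhere. The table and column names are taken from the UPDATE statement in
the log excerpt further down; that bot_event_cfg_id is the key column is an
assumption.

```python
import hashlib
import sqlite3


def table_digest(conn, table, key_column):
    """Return (row_count, sha256 hex digest) over all rows, ordered by key."""
    cur = conn.cursor()
    # Table and column names are interpolated into the SQL text,
    # so only pass trusted, hard-coded names here.
    cur.execute(f"SELECT * FROM {table} ORDER BY {key_column}")
    digest = hashlib.sha256()
    count = 0
    for row in cur:
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def backends_match(conn_a, conn_b, table, key_column):
    """True if both connections return identical rows for the table."""
    return (table_digest(conn_a, table, key_column)
            == table_digest(conn_b, table, key_column))


def make_demo_backend(rows):
    """Hypothetical stand-in for one backend: an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bot_event_cfg_table ("
        "bot_event_cfg_id TEXT PRIMARY KEY, event_name TEXT, "
        "event_priority TEXT, event_is_active TEXT)"
    )
    conn.executemany(
        "INSERT INTO bot_event_cfg_table VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    event = "X509_OCSP_RESPONDER_UNREACHABLE"
    # Backend on the machine that stayed up: the write was applied.
    up_to_date = make_demo_backend([(event, event, "P2", "1")])
    # Backend on the rebooted machine: the write was missed (made-up old value).
    stale = make_demo_backend([(event, event, "P1", "1")])
    print(backends_match(up_to_date, stale,
                         "bot_event_cfg_table", "bot_event_cfg_id"))
```

With real backends, the two connections would presumably come from a
PostgreSQL driver connected directly to each database rather than through the
controller, and the comparison is only meaningful while no writes are in
flight.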
My general environment:
debian etch, java 1.5, 1 backend postgresql 8.1.4, sequoia 2.10.4, appia 3.2.4,
hedera 1.5.6-cvs03.01.2007, "base view" setup of appia
RAIDb-1 setup with 2 machines, each with 1 controller and 1 single backend
Some strange log entries from the rebooting machine:
2007-01-12 15:43:20,374 ERROR sequoia.controller.scheduler Unexpected negative
suspendedWrites in AbstractScheduler.resumeWrites()
2007-01-12 15:43:20,374 ERROR sequoia.controller.scheduler Unexpected negative
suspendedTransactions in AbstractScheduler.resumeNewTransactions()
2007-01-12 15:43:20,374 ERROR sequoia.controller.scheduler Unexpected negative
suspendedPersistentConnections in
AbstractScheduler.resumeNewPersistentConnections()
or
2007-01-12 15:43:48,486 ERROR sequoia.controller.loadbalancer Request was not
found in total order queue, posting out of order (UPDATE bot_event_cfg_table SET
event_name='X509_OCSP_RESPONDER_UNREACHABLE',event_priority='P2',event_is_active='1'
where bot_event_cfg_id='X509_OCSP_RESPONDER_UNREACHABLE'/)
Where could these entries come from?
Just FYI:
Sometimes I get the following error, which I can reproduce by using iptables to
close the appia port for a moment. After this, Sequoia correctly sets the
backend to the disabled state and it is restored:
2007-01-12 15:45:09,758 ERROR controller.recoverylog.RecoverThread Unable to get
checkpoint from recovery log.
java.sql.SQLException: Unable to get checkpoint disable
botdb1_00304874867C-10.10.10.1:25322-20070112154347718+0100 from the recovery
log (Checkpoint disable
botdb1_00304874867C-10.10.10.1:25322-20070112154347718+0100 does not exist in
recovery log)
at org.continuent.sequoia.controller.recoverylog.events.GetCheckpointLogIdEvent.execute(GetCheckpointLogIdEvent.java:96)
at org.continuent.sequoia.controller.recoverylog.LoggerThread.run(LoggerThread.java:732)
Thanx and Greetz,
)ngo
------------------------------------------------------------------------
_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia