I found the issue and fixed it, although it wasn't where I expected.
I'd like to call attention to the JIRA I just created
(https://forge.continuent.org/jira/browse/SEQUOIA-1053), in case the
bug is biting anyone else as hard as it bit me.
Symptoms:
The backend gets stuck in an infinite loop trying to recover. We saw
this when finishing up a backup and also when disabling/enabling a
backend. It's possible for the entire VDB to get stuck in a suspended
state. Depending on whether you've updated to pre-release source code,
you may see messages like "Recovery log entry marked as still executing".
Steps to reproduce:
1. Two controllers, one VDB. Each controller has one backend.
2. Disable backend #2.
3. Backend #1 performs a read-only transaction (connect, set readonly,
   begin, select, commit).
4. Backend #1 performs a non-read-only transaction (connect, begin,
   update, commit).
5. Try to enable backend #2.
Result: recovery of backend #2 gets stuck spinning forever.
Fix:
Add "isReadOnly = false;" to the reset() method of
VirtualDatabaseWorkerThread (see the JIRA for the explanation).
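
For anyone who wants the shape of the bug without reading the JIRA, here is a minimal sketch of how a flag that survives reset() can leak from one connection into the next. It assumes the worker object is reused across client connections. The Worker class, its log list, and the rule that read-only statements are skipped are all hypothetical stand-ins, and the sketch does not model how Sequoia actually writes its recovery log.

```java
import java.util.ArrayList;
import java.util.List;

public class StaleReadOnlyFlagSketch {

    /** Hypothetical stand-in for a worker that is reused across connections. */
    static class Worker {
        private boolean isReadOnly = false;
        private final boolean resetClearsFlag;
        final List<String> recoveryLog = new ArrayList<>();

        Worker(boolean resetClearsFlag) {
            this.resetClearsFlag = resetClearsFlag;
        }

        void setReadOnly(boolean readOnly) {
            isReadOnly = readOnly;
        }

        void execute(String statement) {
            // Assumption for this sketch: read-only work is not logged.
            if (!isReadOnly) {
                recoveryLog.add(statement);
            }
        }

        /** Called when a connection ends and the worker is handed to the next one. */
        void reset() {
            if (resetClearsFlag) {
                isReadOnly = false; // the one-line fix
            }
        }
    }

    /** Runs a read-only transaction, then a write transaction, on the same worker. */
    static List<String> run(boolean resetClearsFlag) {
        Worker worker = new Worker(resetClearsFlag);

        // Connection 1: connect, set readonly, begin, select, commit.
        worker.setReadOnly(true);
        worker.execute("begin");
        worker.execute("select");
        worker.execute("commit");
        worker.reset();

        // Connection 2: connect, begin, update, commit (never touches the flag).
        worker.execute("begin");
        worker.execute("update");
        worker.execute("commit");
        worker.reset();

        return worker.recoveryLog;
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + run(false)); // prints "without fix: []"
        System.out.println("with fix:    " + run(true));  // prints "with fix:    [begin, update, commit]"
    }
}
```

Without the reset, the second connection inherits the read-only flag and its write is treated like a read. With the reset, each connection starts from a clean state.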
Depending on the order and timing of transactions, it's possible to
get the entire VDB stuck. If those two transactions happen at the
right point in the recovery process, backend #2 can be spinning while
the VDB is suspended.
Hope this helps someone.
-Chris
On Feb 26, 2008, at 2:28 PM, Christopher Ekberg wrote:
We've been having sporadic problems in RecoverThread, both when
finishing up a backup and when bringing up a new node. Sequoia
2.10.8 embedded, 1 vdb (RAIDb-1), 2 machines (host1, host2), lots of
client traffic going on.
Here's an example of what we see happening:
1. We tell host1 to back up (vdb.backupBackend(...)).
2. This calls requestManager.backupBackend(...).
3. This calls requestManager.disableBackendWithCheckpoint.
4. The backup happens.
5. requestManager.enableBackendFromCheckpoint(...) is called, which
   creates a RecoverThread.
6. RecoverThread has everyone stop (requestManager.suspendActivity())
   so host1 can catch up by replaying the recovery log of statements
   that came in during the backup.
7. host1 gets stuck in an infinite loop waiting for a task to complete.
   It never does, so RecoverThread never calls
   requestManager.resumeActivity() to wake up host2.
8. host2 is still paused, so clients can't make requests.
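
One way to keep host2 from staying paused when recovery fails is to pair the suspend with a resume in a finally block. This is a minimal sketch, not Sequoia's code: RequestManagerStub is a hypothetical stand-in, and only the two method names suspendActivity and resumeActivity come from the sequence above.

```java
import java.util.ArrayList;
import java.util.List;

public class ResumeInFinallySketch {

    /** Hypothetical stand-in that just records what was called. */
    static class RequestManagerStub {
        final List<String> events = new ArrayList<>();

        void suspendActivity() {
            events.add("suspend");
        }

        void resumeActivity() {
            events.add("resume");
        }
    }

    /** Suspends activity, runs the replay step, and always resumes afterwards. */
    static void recover(RequestManagerStub requestManager, Runnable replay) {
        requestManager.suspendActivity();
        try {
            replay.run();
        } finally {
            // Runs whether replay returned normally or threw.
            requestManager.resumeActivity();
        }
    }

    /** Simulates a replay that gives up, and returns the recorded events. */
    static List<String> demo() {
        RequestManagerStub requestManager = new RequestManagerStub();
        try {
            recover(requestManager, () -> {
                throw new IllegalStateException("replay gave up");
            });
        } catch (IllegalStateException e) {
            requestManager.events.add("failed: " + e.getMessage());
        }
        return requestManager.events;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "[suspend, resume, failed: replay gave up]"
    }
}
```

This only helps if the replay step itself terminates, by returning or throwing. A replay that spins forever never reaches the finally block. Whether it is safe to resume clients after a failed recovery is a separate question that the sketch does not answer.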
host1 is stuck in an infinite loop in RecoverThread run(), here (I've
added logging, plus the ability to break out of the loop that I saw in
the unreleased Sequoia code):
    // Play the remaining writes that were pending and which have been logged
    boolean replayedAllLog = false;
    do
    { // Loop until the whole recovery log has been replayed
      // Or stop if the activity is resumed by force
      try
      {
        logger.info("RecoverThread about to replay new recovery log tasks");
        logIdx = recover(logIdx, pendingRecoveryTasks);
        // The status update for the last request (probably a commit/rollback)
        // may not be there yet. Wait for it to be flushed to the log and
        // retry.
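
The excerpt is cut off there, but the general shape (replay, wait for the last status to be flushed, retry, with a way out) can be sketched on its own. This is a minimal illustration, not the Sequoia code. replayUntilDone, its two callbacks, and the attempt limit are hypothetical names and parameters chosen for the example.

```java
import java.util.function.BooleanSupplier;

public class BoundedReplayLoopSketch {

    /**
     * Retries a replay pass until it reports success, the activity is
     * resumed by force, or the attempt limit is reached.
     *
     * @return true if the log was fully replayed, false if the loop gave up
     */
    static boolean replayUntilDone(BooleanSupplier replayOnce,
                                   BooleanSupplier resumedByForce,
                                   int maxAttempts,
                                   long waitMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (resumedByForce.getAsBoolean()) {
                return false; // someone broke us out of the loop
            }
            if (replayOnce.getAsBoolean()) {
                return true; // whole log replayed
            }
            try {
                // Give the pending status update time to reach the log.
                Thread.sleep(waitMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // limit reached: report failure instead of spinning
    }

    public static void main(String[] args) {
        int[] passes = {0};
        boolean finished = replayUntilDone(() -> ++passes[0] >= 3, () -> false, 10, 1L);
        System.out.println("finished=" + finished + " after " + passes[0] + " passes");

        boolean stuck = replayUntilDone(() -> false, () -> false, 5, 1L);
        System.out.println("finished=" + stuck);
    }
}
```

The first call succeeds on its third pass. The second never succeeds and returns false after five attempts, which at least gives the caller a chance to log the problem.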
Is it possible that host1 spins because it's waiting for a transaction
that can't complete until resumeActivity is called? For example: a
connection is open with autocommit off, a statement is executed, host1
tells host2 to suspendActivity so the commit is blocked, and host1 then
waits for a transaction to finish whose commit it will never get. That
would imply that the "stop new connections and wait for existing
transactions to finish" step, which I assume happens, isn't working
properly, but it's probably something else.
Anyone else seen this? Is this a known issue? Is there a reliable
workaround or fix?
-Chris
----
Chris Ekberg
Jackpot Rewards, Inc.
275 Grove Street, Suite 3-120
Newton, MA 02466-2274
617-795-2850, x. 2313
www.JackpotRewards.com
**Note that as of Feb. 20, my email address has changed. Please update
your contact information for me.