We've been having sporadic problems in RecoverThread, both when
finishing up a backup and when bringing up a new node. Our setup:
Sequoia 2.10.8 embedded, 1 vdb (RAIDb-1), 2 machines (host1, host2),
with lots of client traffic going on.
Here's an example of what we see happening:
1. We tell host1 to back up (vdb.backupBackend(...)).
2. This calls requestManager.backupBackend(...).
3. That calls requestManager.disableBackendWithCheckpoint.
4. The backup happens.
5. requestManager.enableBackendFromCheckpoint(...) is called, which
   creates a RecoverThread.
6. RecoverThread has everyone stop (requestManager.suspendActivity())
   so host1 can catch up by replaying the recovery log of statements
   that came in during the backup.
7. host1 gets stuck in an infinite loop waiting for a task to
   complete. It never does, so RecoverThread never calls
   requestManager.resumeActivity() to wake up host2.
8. host2 is still paused, so clients can't make requests.
host1 is stuck in an infinite loop in RecoverThread run() here (I've
added logging, plus the ability to break out of the loop, which I saw
in unreleased Sequoia code):
    // Play the remaining writes that were pending and which have been logged
    boolean replayedAllLog = false;
    do
    { // Loop until the whole recovery log has been replayed
      // Or stop if the activity is resumed by force
      try
      {
        logger.info("RecoverThread about to replay new recovery log tasks");
        logIdx = recover(logIdx, pendingRecoveryTasks);
        // The status update for the last request (probably a commit/rollback)
        // is not be there yet. Wait for it to be flushed to the log and
        // retry.
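To illustrate the kind of escape hatch mentioned above, here is a minimal sketch of a replay loop that gives up after a fixed number of attempts and always resumes activity on the way out. This is not Sequoia code: the Activity interface, CountingActivity, replayWithBound, and the attempt/pause parameters are all made up for the example, and the BooleanSupplier merely stands in for whatever tells the loop that the log has been fully replayed.

```java
import java.util.function.BooleanSupplier;

class BoundedRecoverySketch {
    /** Hypothetical stand-in for a suspend/resume pair; not a Sequoia type. */
    interface Activity {
        void suspend();
        void resume();
    }

    /** Counts calls so the example can show that resume() always runs. */
    static class CountingActivity implements Activity {
        int suspends;
        int resumes;
        public void suspend() { suspends++; }
        public void resume() { resumes++; }
    }

    /**
     * Suspends activity, then polls replayedAllLog up to maxAttempts times,
     * pausing pauseMillis between attempts. resume() is called in a finally
     * block, so it runs whether or not the replay ever completes.
     * Returns true only if the supplier reported a complete replay.
     */
    static boolean replayWithBound(Activity activity, BooleanSupplier replayedAllLog,
                                   int maxAttempts, long pauseMillis) {
        activity.suspend();
        try {
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                if (replayedAllLog.getAsBoolean()) {
                    return true;
                }
                try {
                    Thread.sleep(pauseMillis); // wait before retrying
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return false; // gave up: the log never looked complete
        } finally {
            activity.resume(); // never leave the other side paused
        }
    }

    public static void main(String[] args) {
        CountingActivity activity = new CountingActivity();
        // Simulate a log whose last status update never shows up.
        boolean done = replayWithBound(activity, () -> false, 5, 10L);
        System.out.println("replayed=" + done + " resumes=" + activity.resumes);
    }
}
```

A bound like this only limits how long clients stay blocked. It does not explain why the last status update never arrives, and giving up may leave the backend not fully caught up.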
Is it possible that host1 spins because it's waiting for a transaction
to complete that is itself blocked until resumeActivity is called? For
example: a connection is open in autocommit=off mode and a statement
is executed; host1 tells host2 to suspendActivity, so the commit
statement is blocked; host1 wants the transaction to finish, but it
will never get the commit. That would imply that the "stop new
connections and wait for existing transactions to finish" step I
assume happens isn't happening properly, but it's probably something
else.
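The suspected circular wait can be modelled in a few lines. The sketch below is a toy, not Sequoia code, and it does not show that this is what actually happens in RecoverThread: the class, the method recoverySeesCommit, and both latches are hypothetical. One latch plays the role of the suspend gate that a commit has to pass, the other signals that the commit reached the log, and the "recovery" side waits for the commit with a timeout so the example terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class SuspendDeadlockSketch {
    /**
     * Models one open transaction whose commit must pass a suspend gate,
     * and a recovery step that waits for that commit to be logged.
     * Returns true if recovery saw the commit within waitMillis.
     */
    static boolean recoverySeesCommit(boolean suspendBeforeCommit, long waitMillis) {
        // Gate is closed (count 1) while activity is suspended, open (count 0) otherwise.
        CountDownLatch resumeGate = new CountDownLatch(suspendBeforeCommit ? 1 : 0);
        CountDownLatch commitLogged = new CountDownLatch(1);

        Thread client = new Thread(() -> {
            try {
                resumeGate.await();       // the commit is held while suspended
                commitLogged.countDown(); // the commit reaches the log
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        client.setDaemon(true);
        client.start();

        boolean seen = false;
        try {
            // Recovery waits for the commit while the gate is still closed.
            seen = commitLogged.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            resumeGate.countDown(); // release the client so its thread can exit
        }
        try {
            client.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("suspended before commit: " + recoverySeesCommit(true, 200));
        System.out.println("commit before suspend:   " + recoverySeesCommit(false, 2000));
    }
}
```

With the gate closed before the commit, the recovery side can only time out, because the thing it is waiting for sits behind the gate it is holding. Without the timeout, that wait would never end.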
Anyone else seen this? Is this a known issue? Is there a reliable
workaround or fix?
-Chris