Hi Chris,

Excellent catch!
In the meantime, I recommend disabling driver connection pooling until that fix gets in (reset is only used with driver connection pooling). To disable connection pooling, use a URL of the form: 'jdbc:sequoia://host/db?connectionPooling=false'
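
For anyone who wants to see the workaround in client code, here is a minimal sketch of opening a connection with that URL. It assumes the Sequoia driver jar is on the classpath; "host" and "db" are the placeholders from the URL above, and the class name, user, and password are made up for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class PoolingDisabledExample {
    // URL form taken from the message above; "host" and "db" are placeholders.
    static final String URL = "jdbc:sequoia://host/db?connectionPooling=false";

    // Opens a connection with driver-side connection pooling turned off.
    // The credentials passed in are hypothetical; substitute your own.
    static Connection open(String user, String password) throws SQLException {
        return DriverManager.getConnection(URL, user, password);
    }

    public static void main(String[] args) {
        try (Connection c = open("user", "secret")) {
            System.out.println("connected, closed=" + c.isClosed());
        } catch (SQLException e) {
            // Expected if no controller is reachable or the driver is missing.
            System.err.println("could not connect: " + e.getMessage());
        }
    }
}
```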

Thanks again for the fix,
Emmanuel

I found the issue and fixed it, although it wasn't where I expected. I'd like to call attention to the JIRA I just created (https://forge.continuent.org/jira/browse/SEQUOIA-1053), in case the bug is biting anyone else as hard as it bit me.

Symptoms:
  Backend gets stuck in infinite loop trying to recover. We saw this when finishing up a backup and also when disabling/enabling a backend.
  It's possible for entire VDB to get stuck in suspended state.
  Depending on whether you've updated to pre-release source code, you may see messages like "Recovery log entry marked as still executing".

Steps to reproduce:
  Two controllers, one VDB.  Each controller has one backend.
  Disable backend #2.
  Backend #1 performs a readonly transaction (connect, set readonly, begin, select, commit)
  Backend #1 performs a non-readonly transaction (connect, begin, update, commit)
  Try to enable backend #2
  Recovery of backend #2 gets stuck spinning forever

Fix:
Add "isReadOnly = false;" to the VirtualDatabaseWorkerThread reset() function (see the JIRA for the explanation).
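
To illustrate the kind of state leak this fix addresses, here is a simplified model. It is not the Sequoia source: WorkerState, resetWithoutFix(), resetWithFix(), and the autoCommit field are hypothetical names, and the sketch assumes the problem is that a read-only flag set by one client survives reset() when a pooled connection is reused by the next client.

```java
public class WorkerResetSketch {
    /** Hypothetical stand-in for the per-connection state a worker thread keeps. */
    static class WorkerState {
        boolean isReadOnly = false;
        boolean autoCommit = true;

        void setReadOnly(boolean readOnly) {
            isReadOnly = readOnly;
        }

        /** Reset as it behaved before the fix in this model: the flag is not cleared. */
        void resetWithoutFix() {
            autoCommit = true;
        }

        /** Reset with the one-line fix applied. */
        void resetWithFix() {
            autoCommit = true;
            isReadOnly = false;
        }
    }

    public static void main(String[] args) {
        WorkerState before = new WorkerState();
        before.setReadOnly(true);   // first client runs a read-only transaction
        before.resetWithoutFix();   // connection goes back to the pool
        System.out.println("without fix, next client sees isReadOnly=" + before.isReadOnly);

        WorkerState after = new WorkerState();
        after.setReadOnly(true);
        after.resetWithFix();
        System.out.println("with fix, next client sees isReadOnly=" + after.isReadOnly);
    }
}
```

In this model the second client's update would run on a worker that still believes it is in a read-only transaction, which matches the read-only-then-write sequence in the steps to reproduce.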


Depending on the order and timing of transactions, it's possible to get the entire VDB stuck. If those two transactions happen at the right point in the recovery process, backend #2 can be spinning while the VDB is suspended.

Hope this helps someone.
-Chris


On Feb 26, 2008, at 2:28 PM, Christopher Ekberg wrote:

We've been having sporadic problems in RecoverThread, both when finishing up a backup and when bringing up a new node. Sequoia 2.10.8 embedded, 1 vdb (RAIDb-1), 2 machines (host1, host2), lots of client traffic going on.


Here's an example of what we see happening:
we tell host1 to back up (vdb.backupBackend(...))
this calls requestManager.backupBackend(...)
this calls requestManager.disableBackendWithCheckpoint
backup happens
requestManager.enableBackendFromCheckpoint(...) is called, which creates a RecoverThread

RecoverThread has everyone stop (requestManager.suspendActivity()) so host1 can catch up playing recovery log of statements coming in during backup

host1 gets stuck in infinite loop waiting for a task to complete. it never does, so RecoverThread never calls requestManager.resumeActivity() to wake up host2.

host2 is still paused so clients can't make requests.

host1 is stuck in an infinite loop in RecoverThread run() here (I've added logging, and the ability to break out of the loop I see in unreleased sequoia code):

     // Play the remaining writes that were pending and which have been logged
     boolean replayedAllLog = false;
     do
     { // Loop until the whole recovery log has been replayed
         // Or stop if the activity is resumed by force
         try
         {
             logger.info("RecoverThread about to replay new recovery log tasks");
             logIdx = recover(logIdx, pendingRecoveryTasks);
             // The status update for the last request (probably a commit/rollback)
             // is not be there yet. Wait for it to be flushed to the log and
             // retry.

Is it possible that host1 spins because it's waiting for a transaction to complete that is blocked until resumeActivity is called? I.e., a connection is open in autocommit=off mode, a statement is executed, host1 tells host2 to suspendActivity so the commit statement is blocked, and host1 wants the transaction to be finished but it'll never get the commit. That would imply that the "stop new connections and wait for existing transactions to finish" step I assume exists isn't working properly, but it's probably something else.

Anyone else seen this? Is this a known issue? Is there a reliable workaround or fix?

-Chris



----
Chris Ekberg
Jackpot Rewards, Inc.
275 Grove Street, Suite 3-120
Newton, MA 02466-2274
617-795-2850, x. 2313
[EMAIL PROTECTED]
www.JackpotRewards.com
**Note that as of Feb. 20, my email address has changed. Please update your contact information for me.

_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia



--
Emmanuel Cecchet - Research scientist
EPFL - LABOS/DSLAB - IN.N 317
Phone: +41-21-693-7558
