Hi Robert,
I was looking at the code that generates this message. It occurs in
org.continuent.sequoia.controller.scheduler.AbstractScheduler.begin(). One
thing I notice in this code is that we have a call to wait() on Object
transactionSync. If you look at the code carefully it appears we can get a
spurious timeout in begin if this thread misses a notifyAll() call on
transactionSync.
I don't see how a thread can miss a notifyAll. If it is blocked on
wait() then it will be unblocked by a notifyAll() (or the timeout in
that case). How many notifyAll occured is irrelevant here.
So, if another thread manages to get the monitor and send a notify(), it
looks to me as if you would simply miss it. In this case the transaction
would cycle through and just get stuck in the wait state. The code I'm
looking at is shown below. The message Stuart received was clearly from the
line marked as having a possible race condition.
No this is not correct. Even if there were multiple notify before the
thread wakes up, the only thing it does when it wakes up is to check if
the timeout has expired. If it has indeed expired then an error is
thrown. If the timeout has not expired, then we check again the
condition if (suspendTransactions>0) which will be set to 0 if we were
notified of the completion.
It's not clear to me that this is the root cause of all the behavior Stuart
described, but there does seem to be at least one problem here.
What is strange is that we have a suspendTransaction>0 while
pendingTransactions=0.
public final void begin(TransactionMetaData tm, boolean isLazyStart,
AbstractRequest request) throws SQLException
{
// Check if transactions are suspended
boolean retry;
do
{
retry = false;
synchronized (transactionsSync)
{
if ((suspendTransactions > 0) && !isLazyStart
&& !tm.isPersistentConnection())
{
addSuspendedRequest(request);
try
{
// Wait on transactionSync
long timeout = tm.getTimeout();
if (timeout > 0)
{
long start = System.currentTimeMillis();
transactionsSync.wait(timeout); // RACE CONDITION HERE???
No race condition there, if you wake up and the timeout has not expired
you loop and re-enter the synchronized block and start by testing if
suspendTransactions>0 which is correct.
long end = System.currentTimeMillis();
long remaining = timeout - (end - start);
if (remaining > 0)
tm.setTimeout(remaining);
else
{
We could double-check here is suspendTransaction is now 0 in which case
we could continue and avoid throwing an exception. But it is not
incorrect to throw an exception either since we have waited at least for
the timeout that was set.
String msg = Translate.get(
"scheduler.begin.timeout.transactionSync",
pendingTransactions);
logger.warn(msg);
throw new SQLException(msg);
}
}
else
transactionsSync.wait();
}
catch (InterruptedException e)
{
String msg = Translate.get(
"scheduler.begin.timeout.transactionSync",
pendingTransactions)
+ " (" + e + ")";
logger.error(msg);
throw new SQLException(msg);
}
}
}
// <SNIP>
}
while (retry);
}
My analysis of what happened is as follows:
Backup calls first disable backend that triggers a suspendActivity (DRM
case).
2008-05-20 01:45:09,624 INFO controller.RequestManager.XXXX
Suspending activity for XXXX
2008-05-20 01:45:09,629 INFO controller.RequestManager.XXXX All
activity is suspended for XXXX
The code in DisableBackendsAndSetCheckpoint generates this:
line 139: recoveryLog.storeCheckpoint()
2008-05-20 01:45:09,632 INFO controller.virtualdatabase.XXXX Storing
checkpoint backup 20080520-172.30.20.250:25322-20080520014509624+0100
line 173: db.setState(DISABLING)
2008-05-20 01:45:09,672 INFO controller.virtualdatabase.XXXX Disabling
backend ldb4, waiting for current transaction to complete on this
backend
line 197: call to loadBalancer.disableBackend()
RAIDb1.disableBackend class terminateThreadsAndConnections()
Last thing is finalizeConnections()
2008-05-20 01:45:09,682 INFO sequoia.controller.connection 5
connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"
2008-05-20 01:45:10,062 INFO sequoia.controller.connection 20
connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"
2008-05-20 01:45:10,434 INFO sequoia.controller.connection 20
connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"
All connection pools should be closed in turn but it looks like this
operation does not complete (since this is the last one of the backend
disabling operation). As all transactions are suspended it is normal
that all begin operation timeout. What is not normal is that the disable
operation does not complete.
2008-05-20 01:45:17,351 sequoia.controller.scheduler Timeout in begin,
still waiting for 0 pending transactions to complete
2008-05-20 01:51:23,631 WARN controller.virtualdatabase.XXXX Begin
failed (Timeout in begin, still waiting for 0 pending transactions to
complete) 2008-05-20 01:51:23,631 INFO
virtualdatabase.VirtualDatabaseWorkerThread.XXXX Error during command
execution (Timeout in begin, still waiting for 0 pending transactions
to complete)
The last 2 messages were repeated at random for the next 1 3/4hrs
looks as though the checkpoint was not stored for till 3:30am?
The strange thing here is that we don't get an additional message about
the completion of the connection pool termination. Could you re-post
your connection pools configuration for this backend?
2008-05-20 03:30:38,825 WARN sequoia.controller.recoverylog Checkpoint
backup 20080520-172.30.20.250:25322-20080520014509624+0100 was stored
2008-05-20 03:30:38,836 INFO controller.virtualdatabase.XXXX
Backend ldb4 is now disabled
2008-05-20 03:30:38,837 INFO controller.virtualdatabase.XXXX Backend
ldb4 disabled on controller Member(address=/X.X.X.X:50850,
uid=172.30.20.80:50850)
2008-05-20 03:30:38,840 INFO controller.RequestManager.XXXX Resuming
activity for XXXX
2008-05-20 03:30:38,845 INFO controller.RequestManager.XXXX All
activity is now resumed for XXXX
2008-05-20 03:30:38,846 INFO controller.RequestManager.XXXX Backend
ldb4 is now disabled
2008-05-20 03:30:38,846 INFO controller.RequestManager.XXXX Starting
backup of backend ldb4
All the remaining part is normal as soon as the backend has been
successfully disabled. We have to isolate the issue in the local backend
disabling code in RAIDb1 (or in the connection pool termination code).
Let me know if you find any flaw in my reasoning.
Thanks for helping us tracking down the problem.
Emmanuel
--
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: [EMAIL PROTECTED]
Skype: emmanuel_cecchet
_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia