Hi Guys,
On Thu, 22 May 2008 03:53:38 +0200
Emmanuel Cecchet <[EMAIL PROTECTED]> wrote:
> Hi Robert,
> > I was looking at the code that generates this message. It occurs in
> > org.continuent.sequoia.controller.scheduler.AbstractScheduler.begin().
> > One thing I notice in this code is that we have a call to wait() on
> > Object transactionSync. If you look at the code carefully it
> > appears we can get a spurious timeout in begin if this thread
> > misses a notifyAll() call on transactionSync.
> >
> I don't see how a thread can miss a notifyAll. If it is blocked on
> wait() then it will be unblocked by a notifyAll() (or the timeout in
> that case). How many notifyAll occured is irrelevant here.
> > So, if another thread manages to get the monitor and send a
> > notify(), it looks to me as if you would simply miss it. In this
> > case the transaction would cycle through and just get stuck in the
> > wait state. The code I'm looking at is shown below. The message
> > Stuart received was clearly from the line marked as having a
> > possible race condition.
> No this is not correct. Even if there were multiple notify before the
> thread wakes up, the only thing it does when it wakes up is to check
> if the timeout has expired. If it has indeed expired then an error is
> thrown. If the timeout has not expired, then we check again the
> condition if (suspendTransactions>0) which will be set to 0 if we
> were notified of the completion.
> > It's not clear to me that this is the root cause of all the
> > behavior Stuart described, but there does seem to be at least one
> > problem here.
> What is strange is that we have a suspendTransaction>0 while
> pendingTransactions=0.
> > public final void begin(TransactionMetaData tm, boolean
> > isLazyStart, AbstractRequest request) throws SQLException
> > {
> > // Check if transactions are suspended
> > boolean retry;
> > do
> > {
> > retry = false;
> > synchronized (transactionsSync)
> > {
> > if ((suspendTransactions > 0) && !isLazyStart
> > && !tm.isPersistentConnection())
> > {
> > addSuspendedRequest(request);
> > try
> > {
> > // Wait on transactionSync
> > long timeout = tm.getTimeout();
> > if (timeout > 0)
> > {
> > long start = System.currentTimeMillis();
> > transactionsSync.wait(timeout); // RACE CONDITION
> > HERE???
> No race condition there, if you wake up and the timeout has not
> expired you loop and re-enter the synchronized block and start by
> testing if
> suspendTransactions>0 which is correct.
> > long end = System.currentTimeMillis();
> > long remaining = timeout - (end - start);
> > if (remaining > 0)
> > tm.setTimeout(remaining);
> > else
> > {
> >
> We could double-check here is suspendTransaction is now 0 in which
> case we could continue and avoid throwing an exception. But it is not
> incorrect to throw an exception either since we have waited at least
> for the timeout that was set.
> > String msg = Translate.get(
> > "scheduler.begin.timeout.transactionSync",
> > pendingTransactions);
> > logger.warn(msg);
> > throw new SQLException(msg);
> > }
> > }
> > else
> > transactionsSync.wait();
> > }
> > catch (InterruptedException e)
> > {
> > String msg = Translate.get(
> > "scheduler.begin.timeout.transactionSync",
> > pendingTransactions)
> > + " (" + e + ")";
> > logger.error(msg);
> > throw new SQLException(msg);
> > }
> > }
> > }
> > // <SNIP>
> > }
> > while (retry);
> > }
> >
> My analysis of what happened is as follows:
> Backup calls first disable backend that triggers a suspendActivity
> (DRM case).
>
> 2008-05-20 01:45:09,624 INFO controller.RequestManager.XXXX
> Suspending activity for XXXX
> 2008-05-20 01:45:09,629 INFO controller.RequestManager.XXXX All
> activity is suspended for XXXX
>
>
> The code in DisableBackendsAndSetCheckpoint generates this:
> line 139: recoveryLog.storeCheckpoint()
>
> 2008-05-20 01:45:09,632 INFO controller.virtualdatabase.XXXX Storing
> checkpoint backup 20080520-172.30.20.250:25322-20080520014509624+0100
>
>
> line 173: db.setState(DISABLING)
>
> 2008-05-20 01:45:09,672 INFO controller.virtualdatabase.XXXX
> Disabling backend ldb4, waiting for current transaction to complete
> on this backend
>
>
> line 197: call to loadBalancer.disableBackend()
> RAIDb1.disableBackend class terminateThreadsAndConnections()
> Last thing is finalizeConnections()
>
> 2008-05-20 01:45:09,682 INFO sequoia.controller.connection 5
> connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"
> 2008-05-20 01:45:10,062 INFO sequoia.controller.connection 20
> connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"
> 2008-05-20 01:45:10,434 INFO sequoia.controller.connection 20
> connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"
>
>
> All connection pools should be closed in turn but it looks like this
> operation does not complete (since this is the last one of the
> backend disabling operation). As all transactions are suspended it is
> normal that all begin operation timeout. What is not normal is that
> the disable operation does not complete.
>
> 2008-05-20 01:45:17,351 sequoia.controller.scheduler Timeout in begin,
> still waiting for 0 pending transactions to complete
> 2008-05-20 01:51:23,631 WARN controller.virtualdatabase.XXXX Begin
> failed (Timeout in begin, still waiting for 0 pending transactions to
> complete) 2008-05-20 01:51:23,631 INFO
> virtualdatabase.VirtualDatabaseWorkerThread.XXXX Error during command
> execution (Timeout in begin, still waiting for 0 pending transactions
> to complete)
>
> The last 2 messages were repeated at random for the next 1 3/4hrs
>
> looks as though the checkpoint was not stored for till 3:30am?
>
>
> The strange thing here is that we don't get an additional message
> about the completion of the connection pool termination. Could you
> re-post your connection pools configuration for this backend?
<DatabaseBackend name="ldb4" driver="org.postgresql.Driver"
url="jdbc:postgresql://xxxxx:5432/xxxx" connectionTestStatement="select
now()">
<ConnectionManager vLogin="xxxx" rLogin="xxxx" rPassword="xxx">
<VariablePoolConnectionManager initPoolSize="50" minPoolSize="20"
maxPoolSize="300" idleTime out="60" waitTimeout="30"/>
</ConnectionManager>
<ConnectionManager vLogin="XXXX1" rLogin="XXXX1" rPassword="XXXX1">
<VariablePoolConnectionManager initPoolSize="50" minPoolSize="20"
maxPoolSize="100" idleTime out="60" waitTimeout="30"/>
</ConnectionManager>
</DatabaseBackend>
We have two users (applications) who access this database, our backend
databases are configured to have a maximum of 500 connections. The
second user XXXX1 which is our jdbc c3p0 hibernate web application is
the one that appears to create the issue, as and when i restart these
web applications, the controller then proceeds with the backup
operation within the next minute or so.
Thanks again for looking into this,
Stuart
_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia