Hi Robert,
I was looking at the code that generates this message.  It occurs in
org.continuent.sequoia.controller.scheduler.AbstractScheduler.begin().  One
thing I notice in this code is that we have a call to wait() on Object
transactionSync.  If you look at the code carefully it appears we can get a
spurious timeout in begin if this thread misses a notifyAll() call on
transactionSync.
I don't see how a thread can miss a notifyAll. If it is blocked on wait() then it will be unblocked by a notifyAll() (or the timeout in that case). How many notifyAll occured is irrelevant here.
So, if another thread manages to get the monitor and send a notify(), it
looks to me as if you would simply miss it.  In this case the transaction
would cycle through and just get stuck in the wait state.  The code I'm
looking at is shown below.  The message Stuart received was clearly from the
line marked as having a possible race condition.
No this is not correct. Even if there were multiple notify before the thread wakes up, the only thing it does when it wakes up is to check if the timeout has expired. If it has indeed expired then an error is thrown. If the timeout has not expired, then we check again the condition if (suspendTransactions>0) which will be set to 0 if we were notified of the completion.
It's not clear to me that this is the root cause of all the behavior Stuart
described, but there does seem to be at least one problem here.
What is strange is that we have a suspendTransaction>0 while pendingTransactions=0.
  public final void begin(TransactionMetaData tm, boolean isLazyStart,
      AbstractRequest request) throws SQLException
  {
    // Check if transactions are suspended
    boolean retry;
    do
    {
      retry = false;
      synchronized (transactionsSync)
      {
        if ((suspendTransactions > 0) && !isLazyStart
            && !tm.isPersistentConnection())
        {
          addSuspendedRequest(request);
          try
          {
            // Wait on transactionSync
            long timeout = tm.getTimeout();
            if (timeout > 0)
            {
              long start = System.currentTimeMillis();
              transactionsSync.wait(timeout);  // RACE CONDITION HERE???
No race condition there, if you wake up and the timeout has not expired you loop and re-enter the synchronized block and start by testing if suspendTransactions>0 which is correct.
              long end = System.currentTimeMillis();
              long remaining = timeout - (end - start);
              if (remaining > 0)
                tm.setTimeout(remaining);
              else
              {
We could double-check here is suspendTransaction is now 0 in which case we could continue and avoid throwing an exception. But it is not incorrect to throw an exception either since we have waited at least for the timeout that was set.
                String msg = Translate.get(
                    "scheduler.begin.timeout.transactionSync",
                    pendingTransactions);
                logger.warn(msg);
                throw new SQLException(msg);
              }
            }
            else
              transactionsSync.wait();
          }
          catch (InterruptedException e)
          {
            String msg = Translate.get(
                "scheduler.begin.timeout.transactionSync",
pendingTransactions)
                + " (" + e + ")";
            logger.error(msg);
            throw new SQLException(msg);
          }
        }
       }
      // <SNIP>
    }
    while (retry);
  }
My analysis of what happened is as follows:
Backup calls first disable backend that triggers a suspendActivity (DRM case).

2008-05-20 01:45:09,624 INFO  controller.RequestManager.XXXX
Suspending activity for XXXX 2008-05-20 01:45:09,629 INFO controller.RequestManager.XXXX All activity is suspended for XXXX

The code in DisableBackendsAndSetCheckpoint generates this:
line 139: recoveryLog.storeCheckpoint()

2008-05-20 01:45:09,632 INFO controller.virtualdatabase.XXXX Storing
checkpoint backup 20080520-172.30.20.250:25322-20080520014509624+0100


line 173: db.setState(DISABLING)

2008-05-20 01:45:09,672 INFO  controller.virtualdatabase.XXXX Disabling
backend ldb4, waiting for current transaction to complete on this
backend

line 197: call to loadBalancer.disableBackend()
RAIDb1.disableBackend class terminateThreadsAndConnections()
Last thing is finalizeConnections()

2008-05-20 01:45:09,682 INFO  sequoia.controller.connection 5
connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"
2008-05-20 01:45:10,062 INFO  sequoia.controller.connection 20
connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"
2008-05-20 01:45:10,434 INFO  sequoia.controller.connection 20
connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"

All connection pools should be closed in turn but it looks like this operation does not complete (since this is the last one of the backend disabling operation). As all transactions are suspended it is normal that all begin operation timeout. What is not normal is that the disable operation does not complete.

2008-05-20 01:45:17,351 sequoia.controller.scheduler Timeout in begin,
still waiting for 0 pending transactions to complete 2008-05-20 01:51:23,631 WARN controller.virtualdatabase.XXXX Begin
failed (Timeout in begin, still waiting for 0 pending transactions to
complete) 2008-05-20 01:51:23,631 INFO
virtualdatabase.VirtualDatabaseWorkerThread.XXXX Error during command
execution (Timeout in begin, still waiting for 0 pending transactions
to complete)

The last 2 messages were repeated at random for the next 1 3/4hrs

looks as though the checkpoint was not stored for till 3:30am?


The strange thing here is that we don't get an additional message about the completion of the connection pool termination. Could you re-post your connection pools configuration for this backend?

2008-05-20 03:30:38,825 WARN  sequoia.controller.recoverylog Checkpoint
backup 20080520-172.30.20.250:25322-20080520014509624+0100 was stored
2008-05-20 03:30:38,836 INFO  controller.virtualdatabase.XXXX
Backend ldb4 is now disabled 2008-05-20 03:30:38,837 INFO controller.virtualdatabase.XXXX Backend
ldb4 disabled on controller Member(address=/X.X.X.X:50850,
uid=172.30.20.80:50850) 2008-05-20 03:30:38,840 INFO controller.RequestManager.XXXX Resuming activity for XXXX 2008-05-20 03:30:38,845 INFO controller.RequestManager.XXXX All activity is now resumed for XXXX 2008-05-20 03:30:38,846 INFO controller.RequestManager.XXXX Backend ldb4 is now disabled 2008-05-20 03:30:38,846 INFO controller.RequestManager.XXXX Starting backup of backend ldb4 All the remaining part is normal as soon as the backend has been successfully disabled. We have to isolate the issue in the local backend disabling code in RAIDb1 (or in the connection pool termination code).

Let me know if you find any flaw in my reasoning.
Thanks for helping us tracking down the problem.
Emmanuel

--
Emmanuel Cecchet
FTO @ Frog Thinker Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: [EMAIL PROTECTED]
Skype: emmanuel_cecchet

_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia

Reply via email to