Re: [Sequoia] Time in begin

Emmanuel Cecchet Wed, 21 May 2008 18:59:32 -0700

Hi Robert,

I was looking at the code that generates this message.  It occurs in
org.continuent.sequoia.controller.scheduler.AbstractScheduler.begin().  One
thing I notice in this code is that we have a call to wait() on Object
transactionSync.  If you look at the code carefully it appears we can get a
spurious timeout in begin if this thread misses a notifyAll() call on
transactionSync.

I don't see how a thread can miss a notifyAll. If it is blocked onwait() then it will be unblocked by a notifyAll() (or the timeout inthat case). How many notifyAll occured is irrelevant here.

So, if another thread manages to get the monitor and send a notify(), it
looks to me as if you would simply miss it.  In this case the transaction
would cycle through and just get stuck in the wait state.  The code I'm
looking at is shown below.  The message Stuart received was clearly from the
line marked as having a possible race condition.

No this is not correct. Even if there were multiple notify before thethread wakes up, the only thing it does when it wakes up is to check ifthe timeout has expired. If it has indeed expired then an error isthrown. If the timeout has not expired, then we check again thecondition if (suspendTransactions>0) which will be set to 0 if we werenotified of the completion.

It's not clear to me that this is the root cause of all the behavior Stuart
described, but there does seem to be at least one problem here.

What is strange is that we have a suspendTransaction>0 whilependingTransactions=0.

  public final void begin(TransactionMetaData tm, boolean isLazyStart,
      AbstractRequest request) throws SQLException
  {
    // Check if transactions are suspended
    boolean retry;
    do
    {
      retry = false;
      synchronized (transactionsSync)
      {
        if ((suspendTransactions > 0) && !isLazyStart
            && !tm.isPersistentConnection())
        {
          addSuspendedRequest(request);
          try
          {
            // Wait on transactionSync
            long timeout = tm.getTimeout();
            if (timeout > 0)
            {
              long start = System.currentTimeMillis();
              transactionsSync.wait(timeout);  // RACE CONDITION HERE???

No race condition there, if you wake up and the timeout has not expiredyou loop and re-enter the synchronized block and start by testing ifsuspendTransactions>0 which is correct.

              long end = System.currentTimeMillis();
              long remaining = timeout - (end - start);
              if (remaining > 0)
                tm.setTimeout(remaining);
              else
              {

We could double-check here is suspendTransaction is now 0 in which casewe could continue and avoid throwing an exception. But it is notincorrect to throw an exception either since we have waited at least forthe timeout that was set.

                String msg = Translate.get(
                    "scheduler.begin.timeout.transactionSync",
                    pendingTransactions);
                logger.warn(msg);
                throw new SQLException(msg);
              }
            }
            else
              transactionsSync.wait();
          }
          catch (InterruptedException e)
          {
            String msg = Translate.get(
                "scheduler.begin.timeout.transactionSync",
pendingTransactions)
                + " (" + e + ")";
            logger.error(msg);
            throw new SQLException(msg);
          }
        }
       }
      // <SNIP>
    }
    while (retry);
  }

My analysis of what happened is as follows:

Backup calls first disable backend that triggers a suspendActivity (DRMcase).


2008-05-20 01:45:09,624 INFO  controller.RequestManager.XXXX

Suspending activity for XXXX2008-05-20 01:45:09,629 INFO controller.RequestManager.XXXX Allactivity is suspended for XXXX


The code in DisableBackendsAndSetCheckpoint generates this:
line 139: recoveryLog.storeCheckpoint()

2008-05-20 01:45:09,632 INFO controller.virtualdatabase.XXXX Storing
checkpoint backup 20080520-172.30.20.250:25322-20080520014509624+0100


line 173: db.setState(DISABLING)

2008-05-20 01:45:09,672 INFO  controller.virtualdatabase.XXXX Disabling
backend ldb4, waiting for current transaction to complete on this

backend


line 197: call to loadBalancer.disableBackend()
RAIDb1.disableBackend class terminateThreadsAndConnections()
Last thing is finalizeConnections()

2008-05-20 01:45:09,682 INFO  sequoia.controller.connection 5
connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"
2008-05-20 01:45:10,062 INFO  sequoia.controller.connection 20
connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"
2008-05-20 01:45:10,434 INFO  sequoia.controller.connection 20

connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"

All connection pools should be closed in turn but it looks like thisoperation does not complete (since this is the last one of the backenddisabling operation). As all transactions are suspended it is normalthat all begin operation timeout. What is not normal is that the disableoperation does not complete.


2008-05-20 01:45:17,351 sequoia.controller.scheduler Timeout in begin,

still waiting for 0 pending transactions to complete2008-05-20 01:51:23,631 WARN controller.virtualdatabase.XXXX Begin

failed (Timeout in begin, still waiting for 0 pending transactions to
complete) 2008-05-20 01:51:23,631 INFO
virtualdatabase.VirtualDatabaseWorkerThread.XXXX Error during command
execution (Timeout in begin, still waiting for 0 pending transactions
to complete)

The last 2 messages were repeated at random for the next 1 3/4hrs

looks as though the checkpoint was not stored for till 3:30am?

The strange thing here is that we don't get an additional message aboutthe completion of the connection pool termination. Could you re-postyour connection pools configuration for this backend?


2008-05-20 03:30:38,825 WARN  sequoia.controller.recoverylog Checkpoint
backup 20080520-172.30.20.250:25322-20080520014509624+0100 was stored
2008-05-20 03:30:38,836 INFO  controller.virtualdatabase.XXXX

Backend ldb4 is now disabled2008-05-20 03:30:38,837 INFO controller.virtualdatabase.XXXX Backend

ldb4 disabled on controller Member(address=/X.X.X.X:50850,

uid=172.30.20.80:50850)2008-05-20 03:30:38,840 INFO controller.RequestManager.XXXX Resumingactivity for XXXX2008-05-20 03:30:38,845 INFO controller.RequestManager.XXXX Allactivity is now resumed for XXXX2008-05-20 03:30:38,846 INFO controller.RequestManager.XXXX Backendldb4 is now disabled2008-05-20 03:30:38,846 INFO controller.RequestManager.XXXX Startingbackup of backend ldb4All the remaining part is normal as soon as the backend has beensuccessfully disabled. We have to isolate the issue in the local backenddisabling code in RAIDb1 (or in the connection pool termination code).


Let me know if you find any flaw in my reasoning.
Thanks for helping us tracking down the problem.
Emmanuel

--
Emmanuel Cecchet

FTO @ Frog ThinkerOpen Source Development & Consulting

--
Web: http://www.frogthinker.org
email: [EMAIL PROTECTED]
Skype: emmanuel_cecchet

_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia

Re: [Sequoia] Time in begin

Reply via email to