Hi Guys,

On Thu, 22 May 2008 03:53:38 +0200
Emmanuel Cecchet <[EMAIL PROTECTED]> wrote:

> Hi Robert,
> > I was looking at the code that generates this message.  It occurs in
> > org.continuent.sequoia.controller.scheduler.AbstractScheduler.begin().
> > One thing I notice in this code is that we have a call to wait() on
> > Object transactionSync.  If you look at the code carefully it
> > appears we can get a spurious timeout in begin if this thread
> > misses a notifyAll() call on transactionSync.
> >   
> I don't see how a thread can miss a notifyAll. If it is blocked on 
> wait() then it will be unblocked by a notifyAll() (or the timeout in 
> that case). How many notifyAll occured is irrelevant here.
> > So, if another thread manages to get the monitor and send a
> > notify(), it looks to me as if you would simply miss it.  In this
> > case the transaction would cycle through and just get stuck in the
> > wait state.  The code I'm looking at is shown below.  The message
> > Stuart received was clearly from the line marked as having a
> > possible race condition. 
> No this is not correct. Even if there were multiple notify before the 
> thread wakes up, the only thing it does when it wakes up is to check
> if the timeout has expired. If it has indeed expired then an error is 
> thrown. If the timeout has not expired, then we check again the 
> condition if (suspendTransactions>0) which will be set to 0 if we
> were notified of the completion.
> > It's not clear to me that this is the root cause of all the
> > behavior Stuart described, but there does seem to be at least one
> > problem here. 
> What is strange is that we have a suspendTransaction>0 while 
> pendingTransactions=0.
> >   public final void begin(TransactionMetaData tm, boolean
> > isLazyStart, AbstractRequest request) throws SQLException
> >   {
> >     // Check if transactions are suspended
> >     boolean retry;
> >     do
> >     {
> >       retry = false;
> >       synchronized (transactionsSync)
> >       {
> >         if ((suspendTransactions > 0) && !isLazyStart
> >             && !tm.isPersistentConnection())
> >         {
> >           addSuspendedRequest(request);
> >           try
> >           {
> >             // Wait on transactionSync
> >             long timeout = tm.getTimeout();
> >             if (timeout > 0)
> >             {
> >               long start = System.currentTimeMillis();
> >               transactionsSync.wait(timeout);  // RACE CONDITION
> > HERE??? 
> No race condition there, if you wake up and the timeout has not
> expired you loop and re-enter the synchronized block and start by
> testing if 
> suspendTransactions>0 which is correct.
> >               long end = System.currentTimeMillis();
> >               long remaining = timeout - (end - start);
> >               if (remaining > 0)
> >                 tm.setTimeout(remaining);
> >               else
> >               {
> >   
> We could double-check here is suspendTransaction is now 0 in which
> case we could continue and avoid throwing an exception. But it is not 
> incorrect to throw an exception either since we have waited at least
> for the timeout that was set.
> >                 String msg = Translate.get(
> >                     "scheduler.begin.timeout.transactionSync",
> >                     pendingTransactions);
> >                 logger.warn(msg);
> >                 throw new SQLException(msg);
> >               }
> >             }
> >             else
> >               transactionsSync.wait();
> >           }
> >           catch (InterruptedException e)
> >           {
> >             String msg = Translate.get(
> >                 "scheduler.begin.timeout.transactionSync",
> > pendingTransactions)
> >                 + " (" + e + ")";
> >             logger.error(msg);
> >             throw new SQLException(msg);
> >           }
> >         }
> >        }
> >       // <SNIP>
> >     }
> >     while (retry);
> >   }
> >   
> My analysis of what happened is as follows:
> Backup calls first disable backend that triggers a suspendActivity
> (DRM case).
> 
> 2008-05-20 01:45:09,624 INFO  controller.RequestManager.XXXX
> Suspending activity for XXXX 
> 2008-05-20 01:45:09,629 INFO controller.RequestManager.XXXX All
> activity is suspended for XXXX 
> 
> 
> The code in DisableBackendsAndSetCheckpoint generates this:
> line 139: recoveryLog.storeCheckpoint()
> 
> 2008-05-20 01:45:09,632 INFO controller.virtualdatabase.XXXX Storing
> checkpoint backup 20080520-172.30.20.250:25322-20080520014509624+0100
> 
> 
> line 173: db.setState(DISABLING)
> 
> 2008-05-20 01:45:09,672 INFO  controller.virtualdatabase.XXXX
> Disabling backend ldb4, waiting for current transaction to complete
> on this backend 
> 
> 
> line 197: call to loadBalancer.disableBackend()
> RAIDb1.disableBackend class terminateThreadsAndConnections()
> Last thing is finalizeConnections()
> 
> 2008-05-20 01:45:09,682 INFO  sequoia.controller.connection 5
> connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"
> 2008-05-20 01:45:10,062 INFO  sequoia.controller.connection 20
> connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX"
> 2008-05-20 01:45:10,434 INFO  sequoia.controller.connection 20
> connections freed on "jdbc:postgresql://X.X.X.X:5432/XXXX" 
> 
> 
> All connection pools should be closed in turn but it looks like this 
> operation does not complete (since this is the last one of the
> backend disabling operation). As all transactions are suspended it is
> normal that all begin operation timeout. What is not normal is that
> the disable operation does not complete.
> 
> 2008-05-20 01:45:17,351 sequoia.controller.scheduler Timeout in begin,
> still waiting for 0 pending transactions to complete 
> 2008-05-20 01:51:23,631 WARN controller.virtualdatabase.XXXX Begin
> failed (Timeout in begin, still waiting for 0 pending transactions to
> complete) 2008-05-20 01:51:23,631 INFO
> virtualdatabase.VirtualDatabaseWorkerThread.XXXX Error during command
> execution (Timeout in begin, still waiting for 0 pending transactions
> to complete)
> 
> The last 2 messages were repeated at random for the next 1 3/4hrs
> 
> looks as though the checkpoint was not stored for till 3:30am?
> 
> 
> The strange thing here is that we don't get an additional message
> about the completion of the connection pool termination. Could you
> re-post your connection pools configuration for this backend?


<DatabaseBackend name="ldb4" driver="org.postgresql.Driver"
url="jdbc:postgresql://xxxxx:5432/xxxx" connectionTestStatement="select
now()"> 

<ConnectionManager vLogin="xxxx" rLogin="xxxx" rPassword="xxx">
<VariablePoolConnectionManager initPoolSize="50" minPoolSize="20"
maxPoolSize="300" idleTime out="60" waitTimeout="30"/>
</ConnectionManager>

<ConnectionManager vLogin="XXXX1" rLogin="XXXX1" rPassword="XXXX1">
<VariablePoolConnectionManager initPoolSize="50" minPoolSize="20"
maxPoolSize="100" idleTime out="60" waitTimeout="30"/>
</ConnectionManager>
                </DatabaseBackend>


We have two users (applications) who access this database, our backend
databases are configured to have a maximum of 500 connections. The
second user XXXX1 which is our jdbc c3p0 hibernate web application is
the one that appears to create the issue, as and when i restart these
web applications, the controller then proceeds with the backup
operation within the next minute or so.

Thanks again for looking into this,

Stuart
_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia

Reply via email to