Hi everyone,

I was looking at the code that generates this message.  It occurs in
org.continuent.sequoia.controller.scheduler.AbstractScheduler.begin().  One
thing I notice in this code is that we have a call to wait() on Object
transactionSync.  If you look at the code carefully it appears we can get a
spurious timeout in begin if this thread misses a notifyAll() call on
transactionSync.

Let¹s say you get notification because a transaction completes and does a
notifyAll().  Here¹s what the JDK 1.5 Java doc says about returning from
wait():

³The thread T is then removed from the wait set for this object and
re-enabled for thread scheduling. It then competes in the usual manner with
other threads for the right to synchronize on the object; once it has gained
control of the object, all its synchronization claims on the object are
restored to the status quo ante - that is, to the situation as of the time
that the wait method was invoked. Thread T then returns from the invocation
of the wait method. Thus, on return from the wait method, the
synchronization state of the object and of thread T is exactly as it was
when the wait method was invoked.²

So, if another thread manages to get the monitor and send a notify(), it
looks to me as if you would simply miss it.  In this case the transaction
would cycle through and just get stuck in the wait state.  The code I'm
looking at is shown below.  The message Stuart received was clearly from the
line marked as having a possible race condition.

It's not clear to me that this is the root cause of all the behavior Stuart
described, but there does seem to be at least one problem here.

Incidentally, once you miss the notify call it's possible some other bad
things then happen.  I will try to debug in more detail later this afternoon
as this problem has been one of the more annoying issues in 2.10.

Cheers, Robert

  public final void begin(TransactionMetaData tm, boolean isLazyStart,
      AbstractRequest request) throws SQLException
  {
    // Check if transactions are suspended
    boolean retry;
    do
    {
      retry = false;
      synchronized (transactionsSync)
      {
        if ((suspendTransactions > 0) && !isLazyStart
            && !tm.isPersistentConnection())
        {
          addSuspendedRequest(request);
          try
          {
            // Wait on transactionSync
            long timeout = tm.getTimeout();
            if (timeout > 0)
            {
              long start = System.currentTimeMillis();
              transactionsSync.wait(timeout);  // RACE CONDITION HERE???
              long end = System.currentTimeMillis();
              long remaining = timeout - (end - start);
              if (remaining > 0)
                tm.setTimeout(remaining);
              else
              {
                String msg = Translate.get(
                    "scheduler.begin.timeout.transactionSync",
                    pendingTransactions);
                logger.warn(msg);
                throw new SQLException(msg);
              }
            }
            else
              transactionsSync.wait();
          }
          catch (InterruptedException e)
          {
            String msg = Translate.get(
                "scheduler.begin.timeout.transactionSync",
pendingTransactions)
                + " (" + e + ")";
            logger.error(msg);
            throw new SQLException(msg);
          }
        }
       }
      // <SNIP>
    }
    while (retry);
  }


On 5/20/08 8:16 AM, "Emmanuel Cecchet" <[EMAIL PROTECTED]> wrote:

> Hi Stuart,
>
>> We experienced the below after an attempted automated backup operation,
>> the pg_dump command did not actually begin till 3:30. Why is there a
>> timeout on 0 pending transactions?
>>
> Actually, begin handling is very complex in Sequoia depending on the
> nature of the transaction. There is an optimization that starts
> transactions lazily to prevent starting read-only transactions on
> multiple nodes.
> So a couple of questions about your configuration since each case has a
> different code path:
> - are you using multiple controllers?
> - was the client starting the transaction connected to the controller
> where the timeout occurred?
> - was the query following the begin a SELECT or an update (INSERT,
> UPDATE, DELETE...)?
> - are you using persistent connections?
>> 2008-05-20 01:51:23,631 WARN  sequoia.controller.scheduler Timeout in
>> begin, still waiting for 0 pending transactions to complete
>>
>> 2008-05-20 01:51:23,631 WARN  controller.virtualdatabase.seccard Begin
>> failed (Timeout in begin, still waiting for 0 pending transactions to
>> complete)
>>
>
> Would you have a complete log to see when the backup operation was
> triggered and what were the subsequent errors if any?
>
> Thanks for your feedback,
> Emmanuel
>
> --
> Emmanuel Cecchet
> FTO @ Frog Thinker
> Open Source Development & Consulting
> --
> Web: http://www.frogthinker.org
> email: [EMAIL PROTECTED]
> Skype: emmanuel_cecchet
>
> _______________________________________________
> Sequoia mailing list
> [email protected]
> https://forge.continuent.org/mailman/listinfo/sequoia
>


--
Robert Hodges, CTO, Continuent, Inc.
Email:  [EMAIL PROTECTED]
Mobile:  +1-510-501-3728  Skype:  hodgesrm


_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia

Reply via email to