Re: [HACKERS] An example of bugs for Hot Standby

2010-01-21 Thread Hiroyuki Yamada
Deadlock bug was prevented by stop-gap measure in December commit.

Full resolution patch attached for Startup process waits on buffer pins.

Startup process sets SIGALRM when waiting on a buffer pin. If woken by
alarm we send SIGUSR1 to all backends requesting that they check to see
if they are blocking Startup process. If so, they throw ERROR/FATAL as
for other conflict resolutions. Deadlock stop gap removed.
max_standby_delay = -1 option removed to prevent deadlock.

Reviews welcome, otherwise commit at end of week.


I think the patch has two problems.

 * disable_standby_sig_alarm() does not clear the standby_timeout_active flag
   when it succeeds in disabling the alarm.

 * The assertion check in HoldingBufferPinThatDelaysRecovery() can fail
   in the following scenario.

   1. Two transactions, xact A and xact B, are running on a Hot Standby server.
   2. Xact A holds a pin on buffer X.
   3. The startup process calls LockBufferForCleanup() for buffer X,
      sets ProcGlobal->startupBufferPinWaitBufId = X,
      sends the PROCSIG_RECOVERY_CONFLICT_BUFFERPIN signal to both transactions,
      and sleeps.
   4. Xact A handles the signal,
      aborts itself,
      releases the pin on buffer X,
      and wakes the startup process.
   5. The startup process wakes up
      and sets ProcGlobal->startupBufferPinWaitBufId = -1.
   6. Xact B handles the signal,
      checks ProcGlobal->startupBufferPinWaitBufId,
      and fails the assertion check in HoldingBufferPinThatDelaysRecovery().
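
The race in steps 3 to 6 can be replayed with a small single-threaded model.
This is only a sketch in Python, not PostgreSQL code: SharedState,
check_strict, check_tolerant and replay are hypothetical names, and the
tolerant check is just one conceivable way to handle a late signal, not
necessarily what the patch should do.

```python
NO_WAIT = -1  # stands in for startupBufferPinWaitBufId == -1


class SharedState:
    """Toy model of the shared state touched in the scenario."""

    def __init__(self):
        self.wait_buf_id = NO_WAIT  # buffer the startup process waits on
        self.pins = {}              # buffer id -> set of backends holding a pin


def check_strict(shared, backend):
    """Assumes the startup process is still waiting (models the assertion)."""
    if shared.wait_buf_id < 0:
        raise AssertionError("startup process is no longer waiting")
    return backend in shared.pins.get(shared.wait_buf_id, set())


def check_tolerant(shared, backend):
    """Hypothetical alternative: a late signal simply finds nothing to do."""
    if shared.wait_buf_id < 0:
        return False
    return backend in shared.pins.get(shared.wait_buf_id, set())


def replay(check):
    shared = SharedState()
    buf_x = 7                        # arbitrary buffer id for "buffer X"
    shared.pins[buf_x] = {"A"}       # step 2: xact A pins buffer X
    shared.wait_buf_id = buf_x       # step 3: startup process starts waiting
    pending_signals = ["A", "B"]     # step 3: both backends are signalled

    log = []
    for backend in pending_signals:  # steps 4 and 6: signals handled in turn
        if check(shared, backend):
            # step 4: the backend aborts and releases its pin
            shared.pins[shared.wait_buf_id].discard(backend)
            # step 5: the startup process wakes up and resets the wait marker
            shared.wait_buf_id = NO_WAIT
            log.append(backend + " cancelled")
        else:
            log.append(backend + " not blocking")
    return log
```

Calling replay(check_strict) raises AssertionError when xact B handles its
signal, while replay(check_tolerant) lets xact B conclude that it is not
blocking anything.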


regards,

--
  Hiroyuki YAMADA
  Kokolink Corporation
  yam...@kokolink.net

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] An example of bugs for Hot Standby

2009-12-21 Thread Hiroyuki Yamada

The following question may be redundant. It is just a confirmation.

The deadlock example is catastrophic, though it is a rather rare event.
On the other hand, LockBufferForCleanup() can cause another
problem.

 * One idle pin-holder backend can freeze the startup process.

This problem is not catastrophic, but it seems similar to the problem
that StandbyAcquireAccessExclusiveLock() tries to avoid.

...Is this the problem you call the general problem above?



Here is a typical scenario in which the startup process freezes until the end
of a certain transaction.

 1. Consider a table A, which has pages with HOT chain tuples old enough to be
    vacuumed.
 2. Xact 1 on the standby node declares a cursor for table A, fetches the page
    that contains the HOT chain, and becomes idle for some reason.
 3. Xact 2 on the active node reads table A and calls heap_page_prune()
    for HOT pruning, which creates an XLOG_HEAP2_CLEAN record.
 4. The startup process tries to redo the XLOG_HEAP2_CLEAN record, calls
    LockBufferForCleanup(), and freezes until xact 1 ends.

Note that with HOT pruning we do not need a VACUUM command, so most tables
with a long history of updates can be table A.
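
The freeze follows from the cleanup-lock rule: the lock is granted only when
the caller holds the sole pin on the buffer. A minimal Python sketch of that
rule, with hypothetical names (Buffer, can_get_cleanup_lock) that do not
correspond to PostgreSQL functions:

```python
class Buffer:
    """Toy buffer that only tracks who holds a pin on it."""

    def __init__(self):
        self.pinned_by = set()

    def pin(self, who):
        self.pinned_by.add(who)

    def unpin(self, who):
        self.pinned_by.discard(who)


def can_get_cleanup_lock(buf, who):
    """A cleanup lock needs the caller to be the only pin holder."""
    return buf.pinned_by == {who}


page = Buffer()
page.pin("xact 1")    # step 2: the cursor fetched the page, then went idle
page.pin("startup")   # step 4: redo of XLOG_HEAP2_CLEAN pins the same page

blocked_while_idle = not can_get_cleanup_lock(page, "startup")

page.unpin("xact 1")  # xact 1 finally ends and drops its pin
granted_after_end = can_get_cleanup_lock(page, "startup")
```

In this model nothing the startup process does can change the outcome; only
the end of xact 1 does, which is the point of the scenario.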


--
  Hiroyuki YAMADA
  Kokolink Corporation
  yam...@kokolink.net



Re: [HACKERS] alpha3 release schedule?

2009-12-21 Thread Hiroyuki Yamada

The problem you mention here has been documented and very accessible for
months and not a single person mentioned it up to now. What's more, the
equivalent problem happens in the latest production version of Postgres
- users can delay VACUUM endlessly in just the same way, yet I've not
seen this raised as an issue in many years of using Postgres. Similarly,
there are some ways that Postgres can deadlock that it need not, yet
those negative behaviours are accepted and nobody is rushing to fix
them, nor demanding that they should be. Few things are theoretically
perfect on their first release.



First of all, sorry for annoying you.

Well, this is certainly a well-known problem, but the cursor example
(or deadlock example) reveals that the problem is more severe than
previously thought, I guess.


The following comments in backup.sgml (which are now replaced by the deadlock
example)

   Waits for buffer cleanup locks do not currently result in query
   cancellation. Long waits are uncommon, though can happen in some cases
   with long running nested loop joins.

...referred only to the example where the startup process has to wait
until the end of one query. And long waits are assumed to be uncommon.

The cursor example shows, however, that the waits can be as long as one
transaction and can occur in ordinary use. FYI, I wrote a typical freeze
scenario in the mail posted in the original deadlock example thread.

So the startup process may have to wait until the end of a transaction,
and we cannot predict when the pin-holder transaction will end.


Also, you mentioned the VACUUM case in the production version, but the
following two problems have different impacts.

 * One VACUUM process freezes until the end of a certain transaction.
 * The startup process (and the whole recovery work) freezes until the end of
   a certain transaction.

The startup process is the last process we want to freeze. So I guess
this problem may become must-fix.


Anyway, the patch is committed and alpha 3 is about to be released.
Do you think this problem is must-fix for the final release?


regards,

--
  Hiroyuki YAMADA
  Kokolink Corporation
  yam...@kokolink.net



Re: [HACKERS] alpha3 release schedule?

2009-12-19 Thread Hiroyuki Yamada

Do people want more time to play with hot standby?  Otherwise alpha3
should go out on Monday or Tuesday.


Well, I want to know whether the problem I referred to
in http://archives.postgresql.org/pgsql-hackers/2009-12/msg01641.php
is must-fix or not.

This problem is a corollary of the deadlock problem. It is less catastrophic
but more likely to happen.

If this problem is left unfixed, then, for example, any long-running
transaction holding a cursor on any table can freeze the whole recovery
work on the Hot Standby node until the transaction commits.


regards,

--
  Hiroyuki YAMADA
  Kokolink Corporation
  yam...@kokolink.net



Re: [HACKERS] alpha3 release schedule?

2009-12-19 Thread Hiroyuki Yamada

Hiroyuki Yamada yam...@kokolink.net writes:
 Well, I want to know whether the problem I referred to
 in http://archives.postgresql.org/pgsql-hackers/2009-12/msg01641.php
 is must-fix or not.

 This problem is a corollary of the deadlock problem. It is less catastrophic
 but more likely to happen.

 If this problem is left unfixed, then, for example, any long-running
 transaction holding a cursor on any table can freeze the whole recovery
 work on the Hot Standby node until the transaction commits.

Seems like something we should fix ASAP, but I do not see why it need
hold up an alpha release.  Alpha releases are expected to have bugs,
and this one doesn't look like it would stop people from finding
other bugs.


At the beginning of this commit fest, Heikki said in
http://archives.postgresql.org/pgsql-hackers/2009-11/msg00914.php

Of course there should be several phases! We've *already* punted a lot
of stuff from this first increment we're currently working on. The
criteria for getting this first phase committed is: could we release
with no further changes?

And other patches seem to be checked against similar criteria, as far as
I can tell from the mails on this list. So I wanted to know whether the
problem is must-fix, and if it is, why the criteria have been changed during
the commit fest.

Anyway, thanks for answering my question.


regards,

--
  Hiroyuki YAMADA
  Kokolink Corporation
  yam...@kokolink.net



Re: [HACKERS] alpha3 release schedule?

2009-12-19 Thread Hiroyuki Yamada

Well, that was the criteria I used to decide whether to commit or not.
Not everyone agreed to begin with, and the reason I used that criteria
was a selfish one: I didn't want to be forced to fix loose ends after
the commitfest myself. The big reason for that was that I didn't know
how much time I would have for that. I have no complaints about Simon's
commit. Knowing that I'm not on the hook to close the loose ends, I'm
very happy that it's finally in. (That doesn't mean that I'll stop
paying attention to this patch; I will do as much as I have time to.)

Regarding the bugs you found, I put them on the TODO list at
https://wiki.postgresql.org/wiki/Hot_Standby_TODO, under the must-fix
category. I think they need to be fixed before final release, but
there's no need to delay the alpha release for them.


I don't think it's selfish at all. But I see. Thanks for your kind reply.


regards,

--
  Hiroyuki YAMADA
  Kokolink Corporation
  yam...@kokolink.net



Re: [HACKERS] An example of bugs for Hot Standby

2009-12-18 Thread Hiroyuki Yamada

This way we only cancel direct deadlocks.

It doesn't solve general problem of buffer waits, but they may be
solvable by different mechanism.


The following question may be redundant. It is just a confirmation.

The deadlock example is catastrophic, though it is a rather rare event.
On the other hand, LockBufferForCleanup() can cause another
problem.

 * One idle pin-holder backend can freeze the startup process.

This problem is not catastrophic, but it seems similar to the problem
that StandbyAcquireAccessExclusiveLock() tries to avoid.

...Is this the problem you call the general problem above?


regards,


--
  Hiroyuki YAMADA
  Kokolink Corporation
  yam...@kokolink.net



Re: [HACKERS] Hot Standby and prepared transactions

2009-12-17 Thread Hiroyuki Yamada

On Wed, 2009-12-16 at 19:35 +0900, Hiroyuki Yamada wrote:

  * There is a window between gathering lock information in
    GetRunningTransactionLocks() and writing WAL in LogAccessExclusiveLocks().
  * In the current lock redo algorithm, locks are released when the
    transaction holding the lock is committed or aborted.

 ... then what happens if any transaction holding an ACCESS EXCLUSIVE lock
 commits in the window?

Yes, was a problem in that code. Fixed in git.

We were doing it for prepared transactions but not for normal xacts.
I will look again at that code.

Thanks very much for reading the code. Any more?!?


Well, I've read some more and have a question.

The implementation assumes that every transaction writes a COMMIT/ABORT WAL
record when it ends, but no ABORT record seems to be written on immediate
shutdown. So,

1. acquire an ACCESS EXCLUSIVE lock on table A in xact 1
2. execute an immediate shutdown of the active node
3. restart it
4. acquire an ACCESS EXCLUSIVE lock on table A in xact 2

...then duplicate lock acquisition by two different transactions can occur on
the standby node.
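
The four steps can be replayed against a toy model of the standby's lock
bookkeeping. This is a Python sketch under the assumption stated above (locks
are released only when a COMMIT or ABORT record is replayed); the record
tuples and replay() are hypothetical and are not PostgreSQL's WAL format.

```python
from collections import defaultdict


def replay(wal):
    """Replay (kind, xid, table) records; return table -> set of lock holders."""
    holders = defaultdict(set)
    for kind, xid, table in wal:
        if kind == "LOCK":
            holders[table].add(xid)
        elif kind in ("COMMIT", "ABORT"):
            # End of transaction: drop every lock held by this xid.
            for xids in holders.values():
                xids.discard(xid)
    return {table: xids for table, xids in holders.items() if xids}


wal = [
    ("LOCK", 1, "A"),  # step 1: xact 1 takes ACCESS EXCLUSIVE on table A
    # steps 2 and 3: immediate shutdown and restart, no ABORT record for xact 1
    ("LOCK", 2, "A"),  # step 4: xact 2 takes the same lock after the restart
]

holders_on_standby = replay(wal)
```

In the model, table A ends up with both xact 1 and xact 2 as holders. Adding
an ("ABORT", 1, None) record between the two LOCK records leaves only xact 2.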

Am I missing something ? Or is this already reported ?


regards,

--
  Hiroyuki YAMADA
  Kokolink Corporation
  yam...@kokolink.net



Re: [HACKERS] Hot Standby and prepared transactions

2009-12-16 Thread Hiroyuki Yamada

That fixes or explains all known issues, from me. Are there any other
things you know about that I haven't responded to? Do you think we have
addressed every issue, except deferred items?

I will be looking to commit to CVS later today; waiting on any
objections.


Has the following problem been reported or fixed?

-
1. configure with the --enable-cassert option, then make, make install
2. initdb, enable WAL archiving
3. run the server
4. run pgbench -i, with a scaling factor of 10 or more
5. the server dies with the following backtrace

(gdb) backtrace
#0  0x009e17a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00a22815 in raise () from /lib/tls/libc.so.6
#2  0x00a24279 in abort () from /lib/tls/libc.so.6
#3  0x082dbf98 in ExceptionalCondition (conditionName=0x84201d4 "!(lock->nGranted == 1)", errorType=0x8308dd4 "FailedAssertion", fileName=0x8420fb2 "lock.c", lineNumber=2296) at assert.c:57
#4  0x08231127 in GetRunningTransactionLocks (nlocks=0x0) at lock.c:2296
#5  0x0822c110 in LogStandbySnapshot (oldestActiveXid=0x0, nextXid=0x0) at standby.c:578
#6  0x080cc13f in CreateCheckPoint (flags=32) at xlog.c:6826
#7  0x08204cf6 in BackgroundWriterMain () at bgwriter.c:490
#8  0x080ec291 in AuxiliaryProcessMain (argc=2, argv=0xbff25cc4) at bootstrap.c:413
#9  0x0820b0af in StartChildProcess (type=Variable "type" is not available.
) at postmaster.c:4218
#10 0x0820c722 in reaper (postgres_signal_arg=17) at postmaster.c:2322
#11 signal handler called
#12 0x009e17a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#13 0x00abcbbd in ___newselect_nocancel () from /lib/tls/libc.so.6
#14 0x0820b2b8 in ServerLoop () at postmaster.c:1360
#15 0x0820d59e in PostmasterMain (argc=3, argv=0x8579860) at postmaster.c:1065
#16 0x081b78f8 in main (argc=3, argv=0x8579860) at main.c:188
-

Also, is the problem reported in 
http://archives.postgresql.org/pgsql-hackers/2009-12/msg01324.php
fixed or deferred ?


regards,


--
  Hiroyuki YAMADA
  Kokolink Corporation
  yam...@kokolink.net



Re: [HACKERS] Hot Standby and prepared transactions

2009-12-16 Thread Hiroyuki Yamada

On Wed, 2009-12-16 at 18:08 +0900, Hiroyuki Yamada wrote:
 That fixes or explains all known issues, from me. Are there any other
 things you know about that I haven't responded to? Do you think we have
 addressed every issue, except deferred items?
 
 I will be looking to commit to CVS later today; waiting on any
 objections.
 
 
 Has the following problem been reported or fixed?

That is fixed, as of a couple of days ago. Thanks for your vigilance.


I tested a somewhat older patch (the RC patch on this mailing list). Sorry for
annoying you.


By the way, reading LogStandbySnapshot() and GetRunningTransactionLocks()
raised the following questions.

 * There is a window between gathering lock information in
   GetRunningTransactionLocks() and writing WAL in LogAccessExclusiveLocks().
 * In the current lock redo algorithm, locks are released when the
   transaction holding the lock is committed or aborted.

... then what happens if any transaction holding an ACCESS EXCLUSIVE lock
commits in the window?

Similarly,

 * There is a window between writing the COMMIT WAL record in
   RecordTransactionCommit() and releasing locks in ResourceOwnerRelease().

... then what happens when GetRunningTransactionLocks() gathers ACCESS
EXCLUSIVE locks whose holder has already written the COMMIT WAL record?


Is there any way to release locks for which no COMMIT WAL record will arrive
to release them?
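
Both windows lead to the same record order in WAL: the COMMIT record of the
lock holder comes first, and the lock list that still names it comes later.
The Python sketch below replays that order on a toy standby. The record
shapes and replay() are hypothetical and only model the release-on-commit
rule described above.

```python
def replay(wal):
    """Replay toy WAL records; return the set of (xid, table) locks still held."""
    held = set()
    for kind, payload in wal:
        if kind == "RUNNING_LOCKS":
            held |= set(payload)  # payload: list of (xid, table) pairs
        elif kind == "COMMIT":
            # payload: xid of the committing transaction
            held = {(xid, table) for xid, table in held if xid != payload}
    return held


# Primary side: xact 10 holds an ACCESS EXCLUSIVE lock on table T.
gathered = [(10, "T")]                   # lock list gathered, not yet logged
wal = [("COMMIT", 10)]                   # xact 10 commits inside the window
wal.append(("RUNNING_LOCKS", gathered))  # the stale list is logged afterwards

leftover = replay(wal)
```

In this model the standby keeps the lock of xact 10 because no later record
releases it. With the opposite record order the lock is released normally.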


regards,

--
  Hiroyuki YAMADA
  Kokolink Corporation
  yam...@kokolink.net



[HACKERS] An example of bugs for Hot Standby

2009-12-15 Thread Hiroyuki Yamada
A Hot Standby node can freeze when the startup process calls
LockBufferForCleanup(). This bug can be reproduced by the following procedure.


0. start Hot Standby, with one active node (node A) and one standby node (node B)
1. create table X and table Y on node A
2. insert several rows into table X on node A
3. delete one row from table X on node A
4. begin xact 1 on node A, execute the following command, and leave xact 1 open
4.1 LOCK table Y IN ACCESS EXCLUSIVE MODE
5. wait until the WAL records for the above actions are applied on node B
6. begin xact 2 on node B, and execute the following commands
6.1 DECLARE test_cursor CURSOR FOR SELECT * FROM table X;
6.2 FETCH test_cursor;
6.3 SELECT * FROM table Y;
7. execute VACUUM FREEZE table X on node A
8. commit xact 1 on node A

...then the following deadlock situation, which is not detected by the
deadlock check, occurs on node B.
 * the startup process waits for xact 2 to release buffers of table X (in
   LockBufferForCleanup())
 * xact 2 waits for the startup process to release the ACCESS EXCLUSIVE lock
   on table Y

This situation can occur when
 a) a transaction on the standby node tries to acquire an ACCESS SHARE lock
    while holding some buffers
 b) the startup process calls LockBufferForCleanup() for any of the buffers
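
One way to see why the deadlock check misses this is to draw the two waits as
a wait-for graph in which each edge has a kind. The Python sketch below is a
model only: has_cycle() and the edge labels are hypothetical, and it assumes
that the deadlock check follows heavyweight-lock waits but not buffer-pin
waits.

```python
def has_cycle(edges, visible_kinds):
    """Depth-first search for a cycle, following only edges of visible kinds."""
    graph = {}
    for waiter, holder, kind in edges:
        if kind in visible_kinds:
            graph.setdefault(waiter, []).append(holder)

    def visit(node, path):
        if node in path:
            return True
        return any(visit(nxt, path | {node}) for nxt in graph.get(node, []))

    return any(visit(start, frozenset()) for start in graph)


edges = [
    # the startup process waits in LockBufferForCleanup() for a table X buffer
    ("startup", "xact 2", "buffer pin"),
    # xact 2 waits for the ACCESS EXCLUSIVE lock on table Y
    ("xact 2", "startup", "heavyweight lock"),
]

seen_by_lock_check = has_cycle(edges, {"heavyweight lock"})
actual_deadlock = has_cycle(edges, {"heavyweight lock", "buffer pin"})
```

With only the heavyweight-lock edge visible there is no cycle to find, yet the
full graph contains one, which matches the undetected deadlock described
above.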


regards,

--
  Hiroyuki YAMADA
  Kokolink Corporation
  yam...@kokolink.net


