Re: [HACKERS] An example of bugs for Hot Standby
> Deadlock bug was prevented by stop-gap measure in December commit. Full resolution patch attached for Startup process waits on buffer pins.
>
> Startup process sets SIGALRM when waiting on a buffer pin. If woken by alarm we send SIGUSR1 to all backends requesting that they check to see if they are blocking Startup process. If so, they throw ERROR/FATAL as for other conflict resolutions. Deadlock stop gap removed. max_standby_delay = -1 option removed to prevent deadlock.
>
> Reviews welcome, otherwise commit at end of week.

I think the patch has two problems.

* disable_standby_sig_alarm() does not clear the standby_timeout_active flag when it succeeds in disabling the alarm.

* The assertion check in HoldingBufferPinThatDelaysRecovery() can fail in the following scenario.

  1. Two transactions, xact A and xact B, are running in a Hot Standby server.
  2. Xact A holds a pin on buffer X.
  3. Startup process calls LockBufferForCleanup() for buffer X, sets ProcGlobal->startupBufferPinWaitBufId = X, sends the PROCSIG_RECOVERY_CONFLICT_BUFFERPIN signal to both transactions, and sleeps.
  4. Xact A handles the signal, aborts itself, releases the pin on buffer X, and wakes the startup process.
  5. Startup process wakes up and sets ProcGlobal->startupBufferPinWaitBufId = -1.
  6. Xact B handles the signal, checks ProcGlobal->startupBufferPinWaitBufId, and fails the assertion check in HoldingBufferPinThatDelaysRecovery().

regards,

--
Hiroyuki YAMADA
Kokolink Corporation
yam...@kokolink.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
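The race in steps 1 to 6 can be sketched as a small deterministic model. This is only an illustration in Python, not PostgreSQL code: `SharedState`, `holding_pin_strict`, `holding_pin_tolerant` and the buffer id 7 are hypothetical names chosen for the sketch, and the "tolerant" variant is one possible way to treat a late signal, not necessarily what the patch should do.

```python
NO_WAIT = -1  # stands in for startupBufferPinWaitBufId == -1


class SharedState:
    def __init__(self):
        self.wait_buf_id = NO_WAIT  # buffer the startup process is waiting on
        self.pins = {}              # backend name -> set of pinned buffer ids


def holding_pin_strict(state, backend):
    """Models the reported behaviour: assumes a wait is always in progress."""
    if state.wait_buf_id == NO_WAIT:
        raise AssertionError("startup process is no longer waiting")
    return state.wait_buf_id in state.pins.get(backend, set())


def holding_pin_tolerant(state, backend):
    """Hypothetical alternative: a late signal just means 'not blocking'."""
    if state.wait_buf_id == NO_WAIT:
        return False
    return state.wait_buf_id in state.pins.get(backend, set())


def run_scenario(check):
    state = SharedState()
    state.pins = {"A": {7}, "B": set()}  # step 2: xact A pins buffer X (id 7)
    state.wait_buf_id = 7                # step 3: startup waits, signals A and B
    if check(state, "A"):                # step 4: A finds it is blocking, aborts,
        state.pins["A"].clear()          #         and releases its pin
    state.wait_buf_id = NO_WAIT          # step 5: startup wakes up and resets
    return check(state, "B")             # step 6: B handles its signal late
```

With the strict check, `run_scenario(holding_pin_strict)` raises `AssertionError` at step 6; with the tolerant check it returns `False`, i.e. xact B concludes that it is not blocking anything.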
Re: [HACKERS] An example of bugs for Hot Standby
The following question may be redundant; it is just a confirmation.

The deadlock example is catastrophic, though it is a rather rare event. On the other hand, LockBufferForCleanup() can cause another problem.

* One idle pin-holder backend can freeze the startup process.

This problem is not catastrophic, but it seems similar to the problem which StandbyAcquireAccessExclusiveLock() tries to avoid.

...Is this the problem you call the general problem above?

Here is a typical scenario in which the startup process freezes until the end of a certain transaction.

1. Consider a table A, which has pages with HOT chain tuples old enough to be vacuumed.
2. Xact 1 in the standby node declares a cursor for table A, fetches the page which contains the HOT chain, and becomes idle for some reason.
3. Xact 2 in the active node reads table A and calls heap_page_prune() for HOT pruning, which creates an XLOG_HEAP2_CLEAN record.
4. Startup process tries to redo the XLOG_HEAP2_CLEAN record, calls LockBufferForCleanup(), and freezes until Xact 1 ends.

Note that with HOT pruning no VACUUM command is needed, so almost any table with a long history of updates can be table A.

--
Hiroyuki YAMADA
Kokolink Corporation
yam...@kokolink.net
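The effect of step 4 on the rest of recovery can be shown with a toy replay loop. This is a sketch under simplifying assumptions, not the real redo code: the record kinds, the `replay` function and the pin-count dictionary are hypothetical, and a real startup process would sleep rather than return.

```python
def replay(wal, pin_counts):
    """Apply WAL records in order.

    Stops at the first cleanup record whose buffer is still pinned by some
    backend, which is where the startup process would go to sleep.
    Returns (applied_records, record_blocked_on or None).
    """
    applied = []
    for kind, buf in wal:
        if kind == "CLEAN" and pin_counts.get(buf, 0) > 0:
            return applied, (kind, buf)
        applied.append((kind, buf))
    return applied, None


# Buffer 1 holds the HOT chain; buffer 2 belongs to an unrelated table.
DEMO_WAL = [("INSERT", 1), ("CLEAN", 1), ("INSERT", 2)]
```

While the idle cursor keeps its pin, `replay(DEMO_WAL, {1: 1})` applies only the first record and reports that it is blocked on `("CLEAN", 1)`; the unrelated insert on buffer 2 is not replayed either. Once the pin is gone, `replay(DEMO_WAL, {})` applies all three records.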
Re: [HACKERS] alpha3 release schedule?
> The problem you mention here has been documented and very accessible for months and not a single person mentioned it up to now. What's more, the equivalent problem happens in the latest production version of Postgres - users can delay VACUUM endlessly in just the same way, yet I've not seen this raised as an issue in many years of using Postgres. Similarly, there are some ways that Postgres can deadlock that it need not, yet those negative behaviours are accepted and nobody is rushing to fix them, nor demanding that they should be. Few things are theoretically perfect on their first release.

First of all, sorry for annoying you.

Well, this is certainly a well-known problem, but I think the cursor example (or the deadlock example) reveals that the problem is more severe than was previously considered.

The following comments in backup.sgml (which have now been replaced by the deadlock example)

    Waits for buffer cleanup locks do not currently result in query cancellation. Long waits are uncommon, though can happen in some cases with long running nested loop joins.

...referred only to the case where the startup process has to wait until the end of one query, and long waits were assumed to be uncommon.

The cursor example shows, however, that the wait can last as long as a whole transaction, and that it occurs in ordinary use.

FYI, I wrote a typical freeze scenario in the mail posted in the original deadlock example thread.

In that case the startup process may have to wait until the end of the transaction, and we cannot predict when the pin-holder transaction will end.

Also, you mentioned the VACUUM case in the production version, but the following two problems have different impacts.

* One VACUUM process freezes until the end of a certain transaction.
* The startup process (and the whole recovery work) freezes until the end of a certain transaction.

The startup process is the last process we would want to freeze. So I guess this problem may become must-fix.

Anyway, the patch has been committed and alpha 3 is about to be released.

Do you think this problem is must-fix for the final release?

regards,

--
Hiroyuki YAMADA
Kokolink Corporation
yam...@kokolink.net
Re: [HACKERS] alpha3 release schedule?
> Do people want more time to play with hot standby? Otherwise alpha3 should go out on Monday or Tuesday.

Well, I want to know whether the problem I referred to in http://archives.postgresql.org/pgsql-hackers/2009-12/msg01641.php is must-fix or not.

This problem is a corollary of the deadlock problem. It is less catastrophic but more likely to happen.

If this problem is left unfixed, then, for example, any long-running transaction holding a cursor on any table can freeze the whole recovery work in the Hot Standby node until the transaction commits.

regards,

--
Hiroyuki YAMADA
Kokolink Corporation
yam...@kokolink.net
Re: [HACKERS] alpha3 release schedule?
> Hiroyuki Yamada yam...@kokolink.net writes:
>> Well, I want to know whether the problem I referred to in http://archives.postgresql.org/pgsql-hackers/2009-12/msg01641.php is must-fix or not.
>>
>> This problem is a corollary of the deadlock problem. It is less catastrophic but more likely to happen.
>>
>> If this problem is left unfixed, then, for example, any long-running transaction holding a cursor on any table can freeze the whole recovery work in the Hot Standby node until the transaction commits.
>
> Seems like something we should fix ASAP, but I do not see why it need hold up an alpha release. Alpha releases are expected to have bugs, and this one doesn't look like it would stop people from finding other bugs.

At the beginning of this commit fest, Heikki said in http://archives.postgresql.org/pgsql-hackers/2009-11/msg00914.php

> Of course there should be several phases! We've *already* punted a lot of stuff from this first increment we're currently working on. The criteria for getting this first phase committed is: could we release with no further changes?

Other patches also seem to be checked against similar criteria, as far as I can tell from the mails on this list.

So I wanted to know whether the problem is must-fix, and if it is, why the criteria were changed during the commit fest.

Anyway, thanks for answering my question.

regards,

--
Hiroyuki YAMADA
Kokolink Corporation
yam...@kokolink.net
Re: [HACKERS] alpha3 release schedule?
> Well, that was the criteria I used to decide whether to commit or not. Not everyone agreed to begin with, and the reason I used that criteria was a selfish one: I didn't want to be forced to fix loose ends after the commitfest myself. The big reason for that was that I didn't know how much time I would have for that.
>
> I have no complaints about Simon's commit. Knowing that I'm not on the hook to close the loose ends, I'm very happy that it's finally in. (That doesn't mean that I'll stop paying attention to this patch; I will do as much as I have time to.)
>
> Regarding the bugs you found, I put them on the TODO list at https://wiki.postgresql.org/wiki/Hot_Standby_TODO, under the must-fix category. I think they need to be fixed before final release, but there's no need to delay the alpha release for them.

I don't think that is selfish at all. But I see.

Thanks for your kind reply.

regards,

--
Hiroyuki YAMADA
Kokolink Corporation
yam...@kokolink.net
Re: [HACKERS] An example of bugs for Hot Standby
> This way we only cancel direct deadlocks. It doesn't solve general problem of buffer waits, but they may be solvable by different mechanism.

The following question may be redundant; it is just a confirmation.

The deadlock example is catastrophic, though it is a rather rare event. On the other hand, LockBufferForCleanup() can cause another problem.

* One idle pin-holder backend can freeze the startup process.

This problem is not catastrophic, but it seems similar to the problem which StandbyAcquireAccessExclusiveLock() tries to avoid.

...Is this the problem you call the general problem above?

regards,

--
Hiroyuki YAMADA
Kokolink Corporation
yam...@kokolink.net
Re: [HACKERS] Hot Standby and prepared transactions
> On Wed, 2009-12-16 at 19:35 +0900, Hiroyuki Yamada wrote:
>> * There is a window between gathering lock information in GetRunningTransactionLocks() and writing WAL in LogAccessExclusiveLocks().
>> * In the current lock redo algorithm, locks are released when the transaction holding them is committed or aborted.
>>
>> ... then what happens if a transaction holding an ACCESS EXCLUSIVE lock commits in the window?
>
> Yes, was a problem in that code. Fixed in git. We were doing it for prepared transactions but not for normal xacts. I will look again at that code.
>
> Thanks very much for reading the code. Any more?!?

Well, I've read some more and have a question.

The implementation assumes that every transaction writes a COMMIT/ABORT WAL record when it ends, but no ABORT WAL record seems to be written on immediate shutdown.

So,

1. acquire an ACCESS EXCLUSIVE lock on table A in xact 1
2. execute an immediate shutdown of the active node
3. restart it
4. acquire an ACCESS EXCLUSIVE lock on table A in xact 2

...then duplicate lock acquisition by two different transactions can occur in the standby node.

Am I missing something? Or has this already been reported?

regards,

--
Hiroyuki YAMADA
Kokolink Corporation
yam...@kokolink.net
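The four steps above can be replayed against a toy standby lock table. This is a minimal sketch, assuming (as the mail describes) that the standby releases a lock only when it sees the holder's COMMIT or ABORT record; the record tuples and `replay_locks` are hypothetical and do not correspond to real WAL record layouts.

```python
from collections import defaultdict


def replay_locks(wal):
    """Return {relation: set of xids} still holding ACCESS EXCLUSIVE locks
    on the standby after replaying the given records."""
    holders = defaultdict(set)
    for record in wal:
        kind, xid = record[0], record[1]
        if kind == "LOCK":
            holders[record[2]].add(xid)
        elif kind in ("COMMIT", "ABORT"):
            # Locks are dropped only when the holder's end record is seen.
            for xids in holders.values():
                xids.discard(xid)
    return {rel: xids for rel, xids in holders.items() if xids}


# Steps 1-4: xact 1 locks table A, the primary crashes (no ABORT record),
# and after restart xact 2 locks the same table.
CRASH_WAL = [("LOCK", 1, "A"), ("LOCK", 2, "A")]
# Same history, but with an ABORT record written for xact 1.
CLEAN_WAL = [("LOCK", 1, "A"), ("ABORT", 1), ("LOCK", 2, "A")]
```

In this model `replay_locks(CRASH_WAL)` leaves both xact 1 and xact 2 holding the lock on table A, while `replay_locks(CLEAN_WAL)` leaves only xact 2.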
Re: [HACKERS] Hot Standby and prepared transactions
> That fixes or explains all known issues, from me. Are there any other things you know about that I haven't responded to? Do you think we have addressed every issue, except deferred items?
>
> I will be looking to commit to CVS later today; waiting on any objections.

Has the following problem been reported or fixed?

1. configure with the --enable-cassert option, then make, make install
2. initdb, enable WAL archiving
3. run the server
4. run pgbench -i, with scaling factor 10 or more
5. server dies with the following backtrace

    (gdb) backtrace
    #0  0x009e17a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
    #1  0x00a22815 in raise () from /lib/tls/libc.so.6
    #2  0x00a24279 in abort () from /lib/tls/libc.so.6
    #3  0x082dbf98 in ExceptionalCondition (conditionName=0x84201d4 "!(lock->nGranted == 1)", errorType=0x8308dd4 "FailedAssertion", fileName=0x8420fb2 "lock.c", lineNumber=2296) at assert.c:57
    #4  0x08231127 in GetRunningTransactionLocks (nlocks=0x0) at lock.c:2296
    #5  0x0822c110 in LogStandbySnapshot (oldestActiveXid=0x0, nextXid=0x0) at standby.c:578
    #6  0x080cc13f in CreateCheckPoint (flags=32) at xlog.c:6826
    #7  0x08204cf6 in BackgroundWriterMain () at bgwriter.c:490
    #8  0x080ec291 in AuxiliaryProcessMain (argc=2, argv=0xbff25cc4) at bootstrap.c:413
    #9  0x0820b0af in StartChildProcess (type=Variable "type" is not available.
    ) at postmaster.c:4218
    #10 0x0820c722 in reaper (postgres_signal_arg=17) at postmaster.c:2322
    #11 <signal handler called>
    #12 0x009e17a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
    #13 0x00abcbbd in ___newselect_nocancel () from /lib/tls/libc.so.6
    #14 0x0820b2b8 in ServerLoop () at postmaster.c:1360
    #15 0x0820d59e in PostmasterMain (argc=3, argv=0x8579860) at postmaster.c:1065
    #16 0x081b78f8 in main (argc=3, argv=0x8579860) at main.c:188

Also, has the problem reported in http://archives.postgresql.org/pgsql-hackers/2009-12/msg01324.php been fixed or deferred?

regards,

--
Hiroyuki YAMADA
Kokolink Corporation
yam...@kokolink.net
Re: [HACKERS] Hot Standby and prepared transactions
> On Wed, 2009-12-16 at 18:08 +0900, Hiroyuki Yamada wrote:
>>> That fixes or explains all known issues, from me. Are there any other things you know about that I haven't responded to? Do you think we have addressed every issue, except deferred items?
>>>
>>> I will be looking to commit to CVS later today; waiting on any objections.
>>
>> Is following problem reported or fixed ?
>
> That is fixed, as of a couple of days ago. Thanks for your vigilance.

I tested a somewhat older patch (the RC patch posted on this mailing list). Sorry for annoying you.

By the way, reading LogStandbySnapshot() and GetRunningTransactionLocks() raised the following questions.

* There is a window between gathering lock information in GetRunningTransactionLocks() and writing WAL in LogAccessExclusiveLocks().
* In the current lock redo algorithm, locks are released when the transaction holding them is committed or aborted.

... then what happens if a transaction holding an ACCESS EXCLUSIVE lock commits in the window?

Similarly,

* There is a window between writing the COMMIT WAL record in RecordTransactionCommit() and releasing locks in ResourceOwnerRelease().

... then what happens when GetRunningTransactionLocks() gathers ACCESS EXCLUSIVE locks whose holder has already written its COMMIT WAL record?

Is there any chance of ending up with locks that have no COMMIT WAL record to release them?

regards,

--
Hiroyuki YAMADA
Kokolink Corporation
yam...@kokolink.net
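The first window can be made concrete with a toy interleaving. This is only a sketch of the ordering problem under the assumptions stated in the mail: `primary_wal`, `standby_locks`, xid 100 and the record shapes are all hypothetical, and the real functions do far more than this.

```python
def primary_wal(commit_in_window):
    """Build the WAL sequence the primary would emit for one lock holder."""
    wal = []
    held = {100: "A"}                # xid 100 holds ACCESS EXCLUSIVE on "A"
    snapshot = dict(held)            # gather (GetRunningTransactionLocks)
    if commit_in_window:
        wal.append(("COMMIT", 100))  # xact commits between gather and log
        held.pop(100)
    wal.append(("LOCKS", snapshot))  # log (LogAccessExclusiveLocks)
    if not commit_in_window:
        wal.append(("COMMIT", 100))  # normal case: commit comes afterwards
    return wal


def standby_locks(wal):
    """Replay: a snapshot re-acquires locks, an end record releases them."""
    held = {}
    for kind, payload in wal:
        if kind == "LOCKS":
            held.update(payload)
        elif kind == "COMMIT":
            held.pop(payload, None)
    return held
```

If the commit falls inside the window, `standby_locks(primary_wal(True))` still contains the lock of xid 100, because its COMMIT record was replayed before the snapshot that re-acquires the lock. In the normal ordering, `standby_locks(primary_wal(False))` is empty.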
[HACKERS] An example of bugs for Hot Standby
A Hot Standby node can freeze when the startup process calls LockBufferForCleanup(). This bug can be reproduced by the following procedure.

0. start Hot Standby, with one active node (node A) and one standby node (node B)
1. create table X and table Y in node A
2. insert several rows into table X in node A
3. delete one row from table X in node A
4. begin xact 1 in node A, execute the following command, and leave xact 1 open
   4.1 LOCK TABLE Y IN ACCESS EXCLUSIVE MODE;
5. wait until the WAL records for the above actions are applied in node B
6. begin xact 2 in node B, and execute the following commands
   6.1 DECLARE test_cursor CURSOR FOR SELECT * FROM X;
   6.2 FETCH test_cursor;
   6.3 SELECT * FROM Y;
7. execute VACUUM FREEZE X; in node A
8. commit xact 1 in node A

...then the following deadlock situation occurs in node B, and it is not detected by the deadlock check.

* startup process waits for xact 2 to release buffers in table X (in LockBufferForCleanup())
* xact 2 waits for startup process to release the ACCESS EXCLUSIVE lock on table Y

This situation can occur when

a) a transaction in the standby node tries to acquire an ACCESS SHARE lock while holding some buffers
b) startup process calls LockBufferForCleanup() for any of the buffers

regards,

--
Hiroyuki YAMADA
Kokolink Corporation
yam...@kokolink.net
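One way to see why the deadlock check misses this situation is to draw the two waits as edges of a wait-for graph. The sketch below is an illustration, not the PostgreSQL deadlock detector: `find_cycle` is a generic depth-first cycle search, and the assumption (taken from the description above) is that only heavyweight-lock waits are visible to the check while the buffer pin wait is not.

```python
def find_cycle(edges):
    """Return True if the directed wait-for graph contains a cycle."""
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, []).append(holder)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # reached a node already on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, []))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))


# Visible to the lock manager: xact 2 waits for the ACCESS EXCLUSIVE lock
# on table Y, which the startup process holds on behalf of xact 1.
lock_edges = [("xact2", "startup")]
# Not visible to it: startup waits for xact 2 to release its buffer pin.
pin_edges = [("startup", "xact2")]
```

`find_cycle(lock_edges)` is `False`, so a check that sees only lock waits reports nothing, while `find_cycle(lock_edges + pin_edges)` is `True` once the pin wait is included.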