Re: [HACKERS] Hot Standby WAL reply uses heavyweight session locks, but doesn't have enough infrastructure set up
On 2015-02-03 14:18:02 +0900, Michael Paquier wrote:
> -     "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u",
> -     lock->xid, lock->dbOid, lock->relOid);
> +     "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u",
> +     lock->xid);
> This patch is making the information provided less verbose, and I think that it is useful to have some information not only about the lock held, but as well about the database and the relation.

It's debug4 or impossible stuff that lock.c already warned about - I doubt anybody has ever actually looked at it in a long while, if ever. If we really want to provide something more we can use something like LOCK_PRINT() - but I really doubt it's worth either the notational or the verbosity overhead.

> Also, ISTM that StandbyAcquireLock should still use a database OID and a relation OID instead of only a LOCKTAG, and SET_LOCKTAG_RELATION should be set in StandbyAcquireLock while ResolveRecoveryConflictWithLock is extended only with the lock mode as new argument. (Patch 2 adds many calls to SET_LOCKTAG_RELATION btw, justifying keeping the API changes minimal.)

But there are now callers acquiring other locks than relation locks, like dbase_redo() acquiring an object lock. And we need to acquire those via the standby mechanism to avoid races around release. We could add a separate wrapper for relation locks, but imo the locktag move to the callers saved about as many lines in some places as it cost in others.

> In patch 2, isn't it necessary to bump XLOG_PAGE_MAGIC?

I don't think so, there's no incompatible change.

Thanks for having a look!

Andres

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby WAL reply uses heavyweight session locks, but doesn't have enough infrastructure set up
On Sat, Jan 31, 2015 at 5:34 AM, Andres Freund and...@2ndquadrant.com wrote:
> On 2015-01-29 11:01:51 -0500, Robert Haas wrote:
>> On Wed, Jan 28, 2015 at 2:41 AM, Michael Paquier michael.paqu...@gmail.com wrote:
>>> Andres Freund wrote:
>>>> I think this isn't particularly pretty, but it seems to be working well enough, and changing it would be pretty invasive. So keeping in line with all that code seems to be easier.
>>> OK, I'm convinced with this part to remove the call of LockSharedObjectForSession that uses dontWait and replace it by a loop in ResolveRecoveryConflictWithDatabase.
>> That seems right to me, too.
> It's slightly more complicated than that. The lock conflict should actually be resolved using ResolveRecoveryConflictWithLock()... That, combined with the race of connecting to an actually already deleted database (see the XXXs I removed), seems to make the approach in here necessary.
>
> Attached are two patches:
> 1) Infrastructure for attaching more kinds of locks on the startup process.
> 2) Use that infrastructure for database locks during replay.
>
> I'm not sure 2) alone would be sufficient justification for 1), but the nearby thread about basebackups also requires similar infrastructure...

Some comments about patch 1:

-	 * No locking is required here because we already acquired
-	 * AccessExclusiveLock. Anybody trying to connect while we do this will
-	 * block during InitPostgres() and then disconnect when they see the
-	 * database has been removed.
+	 * No locking is required here because we already acquired a
+	 * AccessExclusiveLock on the database in dbase_redo(). Anybody trying to
+	 * connect while we do this will block during InitPostgres() and then
+	 * disconnect when they see the database has been removed.
	 */
This change looks unnecessary, I'd rather leave it as-is.
-	"RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u",
-	lock->xid, lock->dbOid, lock->relOid);
+	"RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u",
+	lock->xid);
This patch is making the information provided less verbose, and I think that it is useful to have some information not only about the lock held, but as well about the database and the relation.

Also, ISTM that StandbyAcquireLock should still use a database OID and a relation OID instead of only a LOCKTAG, and SET_LOCKTAG_RELATION should be set in StandbyAcquireLock while ResolveRecoveryConflictWithLock is extended only with the lock mode as new argument. (Patch 2 adds many calls to SET_LOCKTAG_RELATION btw, justifying keeping the API changes minimal.)

There are some typos in the commit message:
s/shanges/changes
s/exlusive/exclusive

In patch 2, isn't it necessary to bump XLOG_PAGE_MAGIC?
--
Michael
Re: [HACKERS] Hot Standby WAL reply uses heavyweight session locks, but doesn't have enough infrastructure set up
On 2015-01-29 11:01:51 -0500, Robert Haas wrote:
> On Wed, Jan 28, 2015 at 2:41 AM, Michael Paquier michael.paqu...@gmail.com wrote:
>> Andres Freund wrote:
>>> I think this isn't particularly pretty, but it seems to be working well enough, and changing it would be pretty invasive. So keeping in line with all that code seems to be easier.
>> OK, I'm convinced with this part to remove the call of LockSharedObjectForSession that uses dontWait and replace it by a loop in ResolveRecoveryConflictWithDatabase.
> That seems right to me, too.

It's slightly more complicated than that. The lock conflict should actually be resolved using ResolveRecoveryConflictWithLock()... That, combined with the race of connecting to an actually already deleted database (see the XXXs I removed), seems to make the approach in here necessary.

Attached are two patches:
1) Infrastructure for attaching more kinds of locks on the startup process.
2) Use that infrastructure for database locks during replay.

I'm not sure 2) alone would be sufficient justification for 1), but the nearby thread about basebackups also requires similar infrastructure...

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From 8531ca4d200d24ee45265774f7ead613563adca4 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Fri, 30 Jan 2015 09:28:17 +0100
Subject: [PATCH 1/4] Allow recovery lock infrastructure to not only hold
 relation locks.

This is a preparatory commit to fix several bugs, separated out for
easier review.

The startup process could so far only properly acquire access exlusive
locks on transactions. As it does not setup enough state to do lock
queuing - primarily to be able to issue recovery conflict interrupts,
and to release them when the primary releases locks - it has it's own
infrastructure to manage locks. That infrastructure so far assumed that
all locks are access exlusive locks on relations.
Unfortunately some code in the startup process has to acquire other locks
than what's supported by the aforementioned infrastructure in standby.c.
Namely dbase_redo() has to acquire locks on the database objects. Also
further such locks will be added soon, to fix another bug.

So this patch shanges the infrastructure to be able to acquire locks of
different modes and locktags. Additionally allow acquiring more
heavyweight relation locks on the standby than RowExclusive when
acquired in session mode.

Discussion: 20150120152819.gc24...@alap3.anarazel.de

Backpatch all the way.
---
 src/backend/storage/ipc/standby.c | 120 ++
 src/backend/storage/lmgr/lock.c   |   7 +--
 src/include/storage/standby.h     |   2 +-
 3 files changed, 61 insertions(+), 68 deletions(-)

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 292bed5..0502aab 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -38,10 +38,16 @@
 int			max_standby_archive_delay = 30 * 1000;
 int			max_standby_streaming_delay = 30 * 1000;
 
 static List *RecoveryLockList;
 
+typedef struct RecoveryLockListEntry
+{
+	TransactionId xid;
+	LOCKMODE	lockmode;
+	LOCKTAG		locktag;
+} RecoveryLockListEntry;
+
 static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
 					   ProcSignalReason reason);
-static void ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid);
+static void ResolveRecoveryConflictWithLock(LOCKTAG *tag, LOCKMODE mode);
 static void SendRecoveryConflictWithBufferPin(ProcSignalReason reason);
 static XLogRecPtr LogCurrentRunningXacts(RunningTransactions CurrRunningXacts);
 static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks);
@@ -320,10 +326,10 @@ ResolveRecoveryConflictWithDatabase(Oid dbid)
 	 * us. This is rare enough that we do this as simply as possible: no wait,
 	 * just force them off immediately.
 	 *
-	 * No locking is required here because we already acquired
-	 * AccessExclusiveLock. Anybody trying to connect while we do this will
-	 * block during InitPostgres() and then disconnect when they see the
-	 * database has been removed.
+	 * No locking is required here because we already acquired a
+	 * AccessExclusiveLock on the database in dbase_redo(). Anybody trying to
+	 * connect while we do this will block during InitPostgres() and then
+	 * disconnect when they see the database has been removed.
 	 */
 	while (CountDBBackends(dbid) > 0)
 	{
@@ -338,14 +344,11 @@ ResolveRecoveryConflictWithDatabase(Oid dbid)
 }
 
 static void
-ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
+ResolveRecoveryConflictWithLock(LOCKTAG *locktag, LOCKMODE mode)
 {
 	VirtualTransactionId *backends;
 	bool		lock_acquired = false;
 	int			num_attempts = 0;
-	LOCKTAG		locktag;
-
-	SET_LOCKTAG_RELATION(locktag, dbOid, relOid);
 
 	/*
 	 * If blowing away everybody with conflicting locks doesn't work, after
@@ -358,7 +361,7
Re: [HACKERS] Hot Standby WAL reply uses heavyweight session locks, but doesn't have enough infrastructure set up
On Wed, Jan 28, 2015 at 2:41 AM, Michael Paquier michael.paqu...@gmail.com wrote:
> Andres Freund wrote:
>> I think this isn't particularly pretty, but it seems to be working well enough, and changing it would be pretty invasive. So keeping in line with all that code seems to be easier.
> OK, I'm convinced with this part to remove the call of LockSharedObjectForSession that uses dontWait and replace it by a loop in ResolveRecoveryConflictWithDatabase.

That seems right to me, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Hot Standby WAL reply uses heavyweight session locks, but doesn't have enough infrastructure set up
On 2015-01-27 16:23:53 +0900, Michael Paquier wrote:
> On Tue, Jan 27, 2015 at 6:24 AM, Andres Freund and...@2ndquadrant.com wrote:
>> Unfortunately that Assert()s when there's a lock conflict because e.g. another backend is currently connecting. That's because ProcSleep() does an enable_timeout_after(DEADLOCK_TIMEOUT, DeadlockTimeout) - and there's no deadlock timeout (or lock timeout) handler registered.
> Yes, that could logically happen if there is a lock conflicting, as RowExclusiveLock or lower locks can be taken in recovery.

I don't think this specific lock (it's an object, not a relation, lock) can easily be taken directly by a user except during authentication.

>> [...] afaics, that should work? Not pretty, but probably easier than starting to reason about the deadlock detector in the startup process.
> Wouldn't it be cleaner to simply register a dedicated handler in StartupXlog instead, obviously something else than DEADLOCK_TIMEOUT as it is reserved for backend operations? For back-branches, we may even consider using DEADLOCK_TIMEOUT..

What would that timeout handler actually do? Two problems:

a) We really don't want the startup process to be killed/error out as that likely means a postmaster restart. So the default deadlock detector strategy is something that's really useless for us.

b) Pretty much all other (if not actually all other) heavyweight lock acquisitions in the startup process acquire locks using dontWait = true - which means that the deadlock detector isn't actually run. That's essentially fine because we simply kill everything in our way. C.f. StandbyAcquireAccessExclusiveLock() et al. There's a dedicated 'deadlock detector'-like infrastructure around ResolveRecoveryConflictWithBufferPin(), but it deals with a class of deadlocks that's not handled in the deadlock detector anyway.

I think this isn't particularly pretty, but it seems to be working well enough, and changing it would be pretty invasive. So keeping in line with all that code seems to be easier.
>> We probably should also add an Assert(!InRecovery || sessionLock) to LockAcquireExtended() - these kinds of problems are otherwise hard to find in a developer setting.
> So this means that locks other than session ones cannot be taken while a node is in recovery, but RowExclusiveLock can be taken while in recovery. Don't we have a problem with this assertion then?

Note that InRecovery doesn't mean what you probably think it means:

/*
 * Are we doing recovery from XLOG?
 *
 * This is only ever true in the startup process; it should be read as meaning
 * "this process is replaying WAL records", rather than "the system is in
 * recovery mode".  It should be examined primarily by functions that need
 * to act differently when called from a WAL redo function (e.g., to skip WAL
 * logging).  To check whether the system is in recovery regardless of which
 * process you're running in, use RecoveryInProgress() but only after shared
 * memory startup and lock initialization.
 */
bool		InRecovery = false;

The assertion actually should be even stricter:
Assert(!InRecovery || (sessionLock && dontWait));
i.e. we never acquire a heavyweight lock in the startup process unless it's a session lock (as we don't have resource managers/a xact to track locks) and we don't wait (as we don't have the deadlock detector infrastructure set up).

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Hot Standby WAL reply uses heavyweight session locks, but doesn't have enough infrastructure set up
Andres Freund wrote:
> I think this isn't particularly pretty, but it seems to be working well enough, and changing it would be pretty invasive. So keeping in line with all that code seems to be easier.

OK, I'm convinced with this part to remove the call of LockSharedObjectForSession that uses dontWait and replace it by a loop in ResolveRecoveryConflictWithDatabase.

> Note that InRecovery doesn't mean what you probably think it means:
> [stuff]
> bool		InRecovery = false;

Yes, right. I misunderstood it as RecoveryInProgress().

> The assertion actually should be even stricter:
> Assert(!InRecovery || (sessionLock && dontWait));
> i.e. we never acquire a heavyweight lock in the startup process unless it's a session lock (as we don't have resource managers/a xact to track locks) and we don't wait (as we don't have the deadlock detector infrastructure set up).

No problems with this assertion here.
--
Michael
Re: [HACKERS] Hot Standby WAL reply uses heavyweight session locks, but doesn't have enough infrastructure set up
On Tue, Jan 27, 2015 at 6:24 AM, Andres Freund and...@2ndquadrant.com wrote:
> Unfortunately that Assert()s when there's a lock conflict because e.g. another backend is currently connecting. That's because ProcSleep() does an enable_timeout_after(DEADLOCK_TIMEOUT, DeadlockTimeout) - and there's no deadlock timeout (or lock timeout) handler registered.

Yes, that could logically happen if there is a lock conflicting, as RowExclusiveLock or lower locks can be taken in recovery.

> [...] afaics, that should work? Not pretty, but probably easier than starting to reason about the deadlock detector in the startup process.

Wouldn't it be cleaner to simply register a dedicated handler in StartupXlog instead, obviously something else than DEADLOCK_TIMEOUT as it is reserved for backend operations? For back-branches, we may even consider using DEADLOCK_TIMEOUT..

> We probably should also add an Assert(!InRecovery || sessionLock) to LockAcquireExtended() - these kinds of problems are otherwise hard to find in a developer setting.

So this means that locks other than session ones cannot be taken while a node is in recovery, but RowExclusiveLock can be taken while in recovery. Don't we have a problem with this assertion then?
--
Michael
Re: [HACKERS] Hot standby doesn't come up on some situation.
Hello,

>> After all, I have confirmed that this fixes the problem on crash recovery of hot-standby both for 9.3 and HEAD and no problem was found except unreadability :(
> Ok, committed. Thanks!

Thank you.

> Any concrete suggestions about the readability? Is there some particular spot that needs clarifying?

Well, although I came to see no problem there after I understood how it works :-), I'll write about two points where I had difficulties to understand.

| * (or the checkpoint record itself, if it's a shutdown checkpoint).
| */
| if (checkPoint.redo < RecPtr)

First, it was a bit tough work to confirm the equivalence between (redo == RecPtr) and that the checkpoint is a shutdown checkpoint. Although finally I was convinced that it surely holds, that is actually not the point. The point here is in the first half of the phrase. The comment might be less perplexing if it were as following, even if only a shutdown checkpoint satisfies the condition. But it would raise another question in readers' minds.

| * (or the checkpoint record itself, e.g. if it's a shutdown checkpoint).

Second, the added code depends on the assumption that RecPtr points to the checkpoint record and EndRecPtr points to the next record there. It would be better for understandability and stability (against modifying code) to explicitly declare the precondition, like this:

| Here RecPtr points to the checkpoint record and EndRecPtr points to the
| place for the record just after.

>> Surely this is the consequence of illegal operation but I think it is also not an issue of assertion - which fires on something wrong in design or quite rare cases (this case?).
> Ah, I see. Yes, that's definitely a bug. If you don't hit the assertion, because the oldestActiveXID is set in the checkpoint record even though wal_level is 'archive', or if you simply have assertions disabled, the system will start up in hot standby mode even though it's not safe.

>> So it might be better to show a message as below in that case.
>> | FATAL: Checkpoint doesn't have valid oldest active transaction id
>> | HINT: Reading WAL might have been written under insufficient
>> | wal_level.
> Agreed.
>
> Hmm. When I test that with 9.2, oldestActiveXID is not 0, even though wal_level is 'archive'. So the above patch doesn't fix the whole problem. The real bug here is that CheckRequiredParameterValues() tests for InArchiveRecovery, when it should be testing for ArchiveRecoveryRequested. Otherwise, the checks are not performed when going through crash recovery followed by archive recovery. I should've changed that as part of the commit that added the crash-recovery-then-archive-recovery behavior. Fixed, thanks for pointing it out!

It's my pleasure.

regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Hot standby doesn't come up on some situation.
Hello,

After all, I have confirmed that this fixes the problem on crash recovery of hot-standby both for 9.3 and HEAD and no problem was found except unreadability :(

By the way, I moderately want to fix an assertion message to an ordinary one. Details are below.

The server stops with the following message during restarting after a crash requesting archive recovery, when the WAL has been produced with a wal_level below WAL_LEVEL_HOT_STANDBY.

| TRAP: FailedAssertion("!(((oldestActiveXID) != ((TransactionId) 0)))", File: "xlog.c", Line: 6799)
| LOG: startup process (PID 7270) was terminated by signal 6: Aborted

Surely this is the consequence of illegal operation but I think it is also not an issue of assertion - which fires on something wrong in design or quite rare cases (this case?). So it might be better to show a message as below in that case.

| FATAL: Checkpoint doesn't have valid oldest active transaction id
| HINT: Reading WAL might have been written under insufficient wal_level.

This could be done in this way:

==========
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e3d5e10..bb6922a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6789,7 +6789,13 @@ StartupXLOG(void)
 	if (wasShutdown)
 		oldestActiveXID = PrescanPreparedTransactions(xids, nxids);
 	else
+	{
 		oldestActiveXID = checkPoint.oldestActiveXid;
+		if (!TransactionIdIsValid(oldestActiveXID))
+			ereport(FATAL,
+					(errmsg("Checkpoint doesn't have valid oldest active transaction id"),
+					 errhint("Reading WAL might have been written under insufficient wal_level.")));
+	}
 	Assert(TransactionIdIsValid(oldestActiveXID));
 
 	/* Tell procarray about the range of xids it has to deal with */
==========

What do you think about this? Feel free to dump this if you feel negative about it.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Hot standby doesn't come up on some situation.
On 03/05/2014 10:51 AM, Kyotaro HORIGUCHI wrote:
> Hello,
> After all, I have confirmed that this fixes the problem on crash recovery of hot-standby both for 9.3 and HEAD and no problem was found except unreadability :(

Ok, committed. Thanks!

Any concrete suggestions about the readability? Is there some particular spot that needs clarifying?

> By the way, I moderately want to fix an assertion message to an ordinary one. Details are below.
>
> The server stops with the following message during restarting after a crash requesting archive recovery, when the WAL has been produced with a wal_level below WAL_LEVEL_HOT_STANDBY.
>
> | TRAP: FailedAssertion("!(((oldestActiveXID) != ((TransactionId) 0)))", File: "xlog.c", Line: 6799)
> | LOG: startup process (PID 7270) was terminated by signal 6: Aborted
>
> Surely this is the consequence of illegal operation but I think it is also not an issue of assertion - which fires on something wrong in design or quite rare cases (this case?).

Ah, I see. Yes, that's definitely a bug. If you don't hit the assertion, because the oldestActiveXID is set in the checkpoint record even though wal_level is 'archive', or if you simply have assertions disabled, the system will start up in hot standby mode even though it's not safe.

> So it might be better to show a message as below in that case.
>
> | FATAL: Checkpoint doesn't have valid oldest active transaction id
> | HINT: Reading WAL might have been written under insufficient wal_level.
> This could be done in this way:
>
> ==========
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index e3d5e10..bb6922a 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -6789,7 +6789,13 @@ StartupXLOG(void)
>  	if (wasShutdown)
>  		oldestActiveXID = PrescanPreparedTransactions(xids, nxids);
>  	else
> +	{
>  		oldestActiveXID = checkPoint.oldestActiveXid;
> +		if (!TransactionIdIsValid(oldestActiveXID))
> +			ereport(FATAL,
> +					(errmsg("Checkpoint doesn't have valid oldest active transaction id"),
> +					 errhint("Reading WAL might have been written under insufficient wal_level.")));
> +	}
>  	Assert(TransactionIdIsValid(oldestActiveXID));
>  
>  	/* Tell procarray about the range of xids it has to deal with */
> ==========
>
> What do you think about this? Feel free to dump this if you feel negative about it.

Hmm. When I test that with 9.2, oldestActiveXID is not 0, even though wal_level is 'archive'. So the above patch doesn't fix the whole problem. The real bug here is that CheckRequiredParameterValues() tests for InArchiveRecovery, when it should be testing for ArchiveRecoveryRequested. Otherwise, the checks are not performed when going through crash recovery followed by archive recovery. I should've changed that as part of the commit that added the crash-recovery-then-archive-recovery behavior. Fixed, thanks for pointing it out!

- Heikki
Re: [HACKERS] Hot standby doesn't come up on some situation.
Hello,

At Fri, 28 Feb 2014 14:45:58 +0200, Heikki Linnakangas hlinnakan...@vmware.com wrote in 53108506.2010...@vmware.com
>>> Yes, but the same situation could be made by restarting a crashed secondary.
>> Yeah. I have no idea about the scenario in which this behavior was regarded as undesirable, but anyway I think that the secondary should start accepting clients just after crash recovery is completed.
> Agreed, this is a bug. I don't think your patch is the right fix for this though. Setting minRecoveryPoint to EndRecPtr is the right thing to do; EndRecPtr points to the end of the last read and replayed record. What's wrong in this case is lastReplayedEndRecptr. At the beginning of recovery, it's initialized to the REDO point, but with a shutdown checkpoint, that's not quite right. When starting from a shutdown checkpoint, REDO points to the beginning of the shutdown record, but we've already effectively replayed it. The next record we replay is the one after the checkpoint.

That's more reasonable. I felt unclear about that, but I forgot to doubt the correctness of lastReplayedEndRecptr then. Surely the shutdown record itself was effectively already replayed when the record was inserted.
> To see that, I added some elog(LOG) calls:
>
> ~/pgsql.93stable$ bin/postmaster -D data
> LOG: database system was shut down at 2014-02-28 14:06:18 EET
> LOG: ReadCheckpointRecord: 0/16479C98
> LOG: database system is ready to accept connections
> LOG: autovacuum launcher started
> ^CLOG: received fast shutdown request
> LOG: aborting any active transactions
> LOG: autovacuum launcher shutting down
> LOG: shutting down
> LOG: INSERT @ 0/16479D00: prev 0/16479C98; xid 0; len 72: XLOG - checkpoint: redo 0/16479D00; tli 1; prev tli 1; fpw true; xid 0/793393; oid 24988; multi 655288; offset 1356722; oldest xid 687 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
> LOG: xlog flush request 0/16479D68; write 0/0; flush 0/0
> LOG: database system is shut down
>
> ~/pgsql.93stable$ bin/postmaster -D data
> LOG: database system was shut down at 2014-02-28 14:06:23 EET
> LOG: ReadCheckpointRecord: 0/16479D00
> LOG: database system is ready to accept connections
> LOG: autovacuum launcher started
> Killed
>
> At this point, the last record is the shutdown checkpoint, beginning at 16479D00, and the server has been killed (immediate shutdown).
>
> ~/pgsql.93stable$ cp recovery.conf data/recovery.conf
> ~/pgsql.93stable$ bin/postmaster -D data
> LOG: database system was interrupted; last known up at 2014-02-28 14:06:29 EET
> LOG: entering standby mode
> LOG: ReadCheckpointRecord: 0/16479D00
> LOG: database system was not properly shut down; automatic recovery in progress
> LOG: record with zero length at 0/16479D68
> LOG: reached end of WAL in pg_xlog, entering archive recovery
> LOG: EndRecPtr: 0/16479D68 lastReplayedEndRecPtr: 0/16479D00
> FATAL: could not connect to the primary server: could not connect to server: Connection refused
> ...
>
> Recovery starts from the checkpoint record, but lastReplayedEndRecPtr is set to the *beginning* of the checkpoint record, even though the checkpoint record has already been effectively replayed, by the feat of starting recovery from it.
> EndRecPtr correctly points to the end of the checkpoint record. Because of the incorrect lastReplayedEndRecPtr value, the CheckRecoveryConsistency() call concludes that it's not consistent.

I completely understood the behavior thanks to your detailed explanation. (And how to use log messages effectively :-) I agree that the fix is appropriate.

> I believe the attached fix is the right way to fix this.

It also worked for me. Thank you.

regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Hot standby doesn't come up on some situation.
Ouch! It brought another bug.

>> I completely understood the behavior thanks to your detailed explanation. (And how to use log messages effectively :-)

Sorry, I just found that it's wrong, and found another problem brought by your patch.

>> I agree that the fix is appropriate.
>>> I believe the attached fix is the right way to fix this.
>> It also worked for me. Thank you.

| * as if we had just replayed the record before the REDO location
| * (or the checkpoint record itself, if it's a shutdown checkpoint).

The following test script raises an assertion failure. It adds a 'non-shutdown' checkpoint just before shutting down immediately. Starting the server aborts with the following message.

| LOG: database system was not properly shut down; automatic recovery in progress
| TRAP: FailedAssertion("!(((oldestActiveXID) != ((TransactionId) 0)))", File: "xlog.c", Line: 6771)
| LOG: startup process (PID 28561) was terminated by signal 6: Aborted

regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center

=========================
#! /bin/sh
killall postgres
rm -rf $PGDATA/*
initdb
pg_ctl start -w
sleep 1
psql postgres -c 'checkpoint'
pg_ctl stop -m i
cat > $PGDATA/recovery.conf <<EOF
standby_mode = 'on'
primary_conninfo = 'host=localhost port= user=repuser application_name=pm01 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
#restore_command = '/bin/true'
recovery_target_timeline = 'latest'
EOF
cat >> $PGDATA/postgresql.conf <<EOF
#log_min_messages = debug5
hot_standby = on
EOF
pg_ctl start
Re: [HACKERS] Hot standby doesn't come up on some situation.
Correcting one point of my last mail.

> Ouch! It brought another bug.

My patch also did.

regards,

>>> I completely understood the behavior thanks to your detailed explanation. (And how to use log messages effectively :-)
> Sorry, I just found that it's wrong, and found another problem brought by your patch.
>>> I agree that the fix is appropriate.
>>>> I believe the attached fix is the right way to fix this.
>>> It also worked for me. Thank you.
>
> | * as if we had just replayed the record before the REDO location
> | * (or the checkpoint record itself, if it's a shutdown checkpoint).
>
> The following test script raises an assertion failure. It adds a 'non-shutdown' checkpoint just before shutting down immediately. Starting the server aborts with the following message.
>
> | LOG: database system was not properly shut down; automatic recovery in progress
> | TRAP: FailedAssertion("!(((oldestActiveXID) != ((TransactionId) 0)))", File: "xlog.c", Line: 6771)
> | LOG: startup process (PID 28561) was terminated by signal 6: Aborted
>
> =========================
> #! /bin/sh
> killall postgres
> rm -rf $PGDATA/*
> initdb
> pg_ctl start -w
> sleep 1
> psql postgres -c 'checkpoint'
> pg_ctl stop -m i
> cat > $PGDATA/recovery.conf <<EOF
> standby_mode = 'on'
> primary_conninfo = 'host=localhost port= user=repuser application_name=pm01 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
> #restore_command = '/bin/true'
> recovery_target_timeline = 'latest'
> EOF
> cat >> $PGDATA/postgresql.conf <<EOF
> #log_min_messages = debug5
> hot_standby = on
> EOF
> pg_ctl start

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Hot standby doesn't come up on some situation.
Hello,

| * as if we had just replayed the record before the REDO location
| * (or the checkpoint record itself, if it's a shutdown checkpoint).

The following test script raises an assertion failure. It takes a non-shutdown checkpoint just before an immediate shutdown. Starting the server then aborts with the following messages.

| LOG: database system was not properly shut down; automatic recovery in progress
| TRAP: FailedAssertion(!(((oldestActiveXID) != ((TransactionId) 0))), File: xlog.c, Line: 6771)
| LOG: startup process (PID 28561) was terminated by signal 6: Aborted

This is because the checkpoint was done with 'wal_level = minimal'. The server restarts correctly when it is first started with 'wal_level = hot_standby'. It looks like a mere mis-operation on my part. The log message looks unfriendly, but I'm uncertain whether it is necessary to change it.

regards,

-- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] Hot standby doesn't come up on some situation.
Ouch, this is the same issue as in the mail below:

http://www.postgresql.org/message-id/53104595.6060...@lab.ntt.co.jp

regards,

-- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] Hot standby doesn't come up on some situation.
On 2014-02-28 17:55:21 +0900, Kyotaro HORIGUCHI wrote:

The recovery process stays in 'inconsistent' state forever when the server has crashed before any WAL record is inserted after the last checkpoint.

# killall postgres
# rm -rf $PGDATA/*
initdb
pg_ctl start -w
sleep 1
pg_ctl stop -m i
cat > $PGDATA/recovery.conf <<EOF
standby_mode = 'on'
primary_conninfo = 'host=localhost port= user=repuser application_name=pm01 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
#restore_command = '/bin/true'
recovery_target_timeline = 'latest'
EOF
cat >> $PGDATA/postgresql.conf <<EOF
#log_min_messages = debug5
hot_standby = on
EOF
pg_ctl start

Uh. So, if I understand correctly, what you did is to convert a normal live pg into a replica that doesn't have an upstream node, right? Normally the primary will just do an additional write shortly afterwards, resolving this situation?

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hot standby doesn't come up on some situation.
Hello,

2014/02/28 18:07 Andres Freund :

On 2014-02-28 17:55:21 +0900, Kyotaro HORIGUCHI wrote:

The recovery process stays in 'inconsistent' state forever when the server has crashed before any WAL record is inserted after the last checkpoint.

# killall postgres
# rm -rf $PGDATA/*
initdb
pg_ctl start -w
sleep 1
pg_ctl stop -m i
cat > $PGDATA/recovery.conf <<EOF
standby_mode = 'on'
primary_conninfo = 'host=localhost port= user=repuser application_name=pm01 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
#restore_command = '/bin/true'
recovery_target_timeline = 'latest'
EOF
cat >> $PGDATA/postgresql.conf <<EOF
#log_min_messages = debug5
hot_standby = on
EOF
pg_ctl start

Uh. So, if I understand correctly, what you did is to convert a normal live pg into a replica that doesn't have an upstream node, right?

Yes, but the same situation could be made by restarting a crashed secondary. I haven't tried it since my console is far behind now... Can a restart of the secondary, after it crashes just after the completion of a restartpoint with no further records, make the same situation?

I have no idea about the scenario in which this behavior was regarded as undesirable, but anyway I think that the secondary should start accepting clients just after crash recovery is completed.

Normally the primary will just do an additional write shortly afterwards, resolving this situation?

Maybe so. I haven't tried though.

regards,

-- Kyotaro Horiguchi NTT Opensource Software Center
Re: [HACKERS] Hot standby doesn't come up on some situation.
On 02/28/2014 11:51 AM, Kyotaro HORIGUCHI wrote:

Hello,

2014/02/28 18:07 Andres Freund :

On 2014-02-28 17:55:21 +0900, Kyotaro HORIGUCHI wrote:

The recovery process stays in 'inconsistent' state forever when the server has crashed before any WAL record is inserted after the last checkpoint.

# killall postgres
# rm -rf $PGDATA/*
initdb
pg_ctl start -w
sleep 1
pg_ctl stop -m i
cat > $PGDATA/recovery.conf <<EOF
standby_mode = 'on'
primary_conninfo = 'host=localhost port= user=repuser application_name=pm01 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
#restore_command = '/bin/true'
recovery_target_timeline = 'latest'
EOF
cat >> $PGDATA/postgresql.conf <<EOF
#log_min_messages = debug5
hot_standby = on
EOF
pg_ctl start

Uh. So, if I understand correctly, what you did is to convert a normal live pg into a replica that doesn't have an upstream node, right?

Yes, but the same situation could be made by restarting a crashed secondary.

Yeah.

I have no idea about the scenario in which this behavior was regarded as undesirable, but anyway I think that the secondary should start accepting clients just after crash recovery is completed.

Agreed, this is a bug. I don't think your patch is the right fix for this, though. Setting minRecoveryPoint to EndRecPtr is the right thing to do; EndRecPtr points to the end of the last read and replayed record. What's wrong in this case is lastReplayedEndRecPtr. At the beginning of recovery, it's initialized to the REDO point, but with a shutdown checkpoint, that's not quite right. When starting from a shutdown checkpoint, REDO points to the beginning of the shutdown record, but we've already effectively replayed it. The next record we replay is the one after the checkpoint.
To see that, I added some elog(LOG) calls:

~/pgsql.93stable$ bin/postmaster -D data
LOG: database system was shut down at 2014-02-28 14:06:18 EET
LOG: ReadCheckpointRecord: 0/16479C98
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
^CLOG: received fast shutdown request
LOG: aborting any active transactions
LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: INSERT @ 0/16479D00: prev 0/16479C98; xid 0; len 72: XLOG - checkpoint: redo 0/16479D00; tli 1; prev tli 1; fpw true; xid 0/793393; oid 24988; multi 655288; offset 1356722; oldest xid 687 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
LOG: xlog flush request 0/16479D68; write 0/0; flush 0/0
LOG: database system is shut down

~/pgsql.93stable$ bin/postmaster -D data
LOG: database system was shut down at 2014-02-28 14:06:23 EET
LOG: ReadCheckpointRecord: 0/16479D00
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
Killed

At this point, the last record is the shutdown checkpoint, beginning at 16479D00, and the server has been killed (immediate shutdown).

~/pgsql.93stable$ cp recovery.conf data/recovery.conf
~/pgsql.93stable$ bin/postmaster -D data
LOG: database system was interrupted; last known up at 2014-02-28 14:06:29 EET
LOG: entering standby mode
LOG: ReadCheckpointRecord: 0/16479D00
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 0/16479D68
LOG: reached end of WAL in pg_xlog, entering archive recovery
LOG: EndRecPtr: 0/16479D68 lastReplayedEndRecPtr: 0/16479D00
FATAL: could not connect to the primary server: could not connect to server: Connection refused
...

Recovery starts from the checkpoint record, but lastReplayedEndRecPtr is set to the *beginning* of the checkpoint record, even though the checkpoint record has already been effectively replayed, by the feat of starting recovery from it.
EndRecPtr correctly points to the end of the checkpoint record. Because of the incorrect lastReplayedEndRecPtr value, the CheckRecoveryConsistency() call concludes that it's not consistent.

I believe the attached fix is the right way to fix this.

- Heikki

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7399fd4..a36354f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5457,13 +5457,17 @@ StartupXLOG(void)
 	/*
 	 * Initialize shared variables for tracking progress of WAL replay,
-	 * as if we had just replayed the record before the REDO location.
+	 * as if we had just replayed the record before the REDO location
+	 * (or the checkpoint record itself, if it's a shutdown checkpoint).
 	 */
 	SpinLockAcquire(&xlogctl->info_lck);
-	xlogctl->replayEndRecPtr = checkPoint.redo;
+	if (checkPoint.redo < RecPtr)
+		xlogctl->replayEndRecPtr = checkPoint.redo;
+	else
+		xlogctl->replayEndRecPtr = EndRecPtr;
 	xlogctl->replayEndTLI = ThisTimeLineID;
-	xlogctl->lastReplayedEndRecPtr = checkPoint.redo;
-	xlogctl->lastReplayedTLI = ThisTimeLineID;
+	xlogctl->lastReplayedEndRecPtr = xlogctl->replayEndRecPtr;
+	xlogctl->lastReplayedTLI = xlogctl->replayEndTLI;
Re: [HACKERS] Hot Standby conflict resolution handling
On 2013-01-17 01:38:31 -0500, Tom Lane wrote:

But having said that ... are we sure this code is not actually broken? ISTM that if we dare not interrupt for fear of confusing OpenSSL, we cannot safely attempt to send an error message to the client either; but ereport(FATAL) will try exactly that.

You're absolutely right. ISTM, to fix it we would have to either provide a COMMERROR-like facility for FATAL errors, or just remove the requirement of raising an error exactly there. If I remember what I tried before correctly, the latter seems to involve setting OpenSSL into nonblocking mode, checking for the saved error in the EINTR loop in be-secure, and re-raising the error from there.

Do we want to backport either? There hasn't been any report that I could link to it right now, but on the other hand it could possibly cause rather ugly problems (data leakage, segfaults, code execution aren't all that improbable).

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hot Standby conflict resolution handling
Pavan Deolasee pavan.deola...@gmail.com writes: On Thu, Jan 17, 2013 at 12:08 PM, Tom Lane t...@sss.pgh.pa.us wrote: ISTM that if we dare not interrupt for fear of confusing OpenSSL, we cannot safely attempt to send an error message to the client either; but ereport(FATAL) will try exactly that. I thought since FATAL will force the backend to exit, we don't care much about corrupted OpenSSL state. I even thought that's why we raise ERROR to FATAL so that the backend can start in a clean state. But clearly I'm missing a point here because you don't think that way. If we were to simply exit(1), leaving the kernel to close the client socket, it'd be safe enough because control would never have returned to OpenSSL. But this code doesn't do that. What we're looking at is that we've interrupted OpenSSL at some arbitrary point, and now we're going to make fresh calls to it to try to pump the FATAL error message out to the client. It seems fairly unlikely that that's safe. I'm not sure I credit Andres' worry of arbitrary code execution, but I do fear that OpenSSL could get confused to the point of freezing up, or even more likely that it would transmit garbage to the client, which rather defeats the purpose. Don't see a nice fix. The COMMERROR approach (ie, don't try to send anything to the client, only the log) is not nice at all since the client would get the impression that the server crashed. On the other hand, anything else requires waiting till we get control back from OpenSSL, which might be a long time, and meanwhile we're still holding locks that prevent WAL recovery from proceeding. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby conflict resolution handling
On 2013-01-17 10:19:23 -0500, Tom Lane wrote:

Pavan Deolasee pavan.deola...@gmail.com writes:

On Thu, Jan 17, 2013 at 12:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:

ISTM that if we dare not interrupt for fear of confusing OpenSSL, we cannot safely attempt to send an error message to the client either; but ereport(FATAL) will try exactly that.

I thought since FATAL will force the backend to exit, we don't care much about corrupted OpenSSL state. I even thought that's why we raise ERROR to FATAL so that the backend can start in a clean state. But clearly I'm missing a point here because you don't think that way.

If we were to simply exit(1), leaving the kernel to close the client socket, it'd be safe enough because control would never have returned to OpenSSL. But this code doesn't do that. What we're looking at is that we've interrupted OpenSSL at some arbitrary point, and now we're going to make fresh calls to it to try to pump the FATAL error message out to the client. It seems fairly unlikely that that's safe. I'm not sure I credit Andres' worry of arbitrary code execution, but I do fear that OpenSSL could get confused to the point of freezing up, or even more likely that it would transmit garbage to the client, which rather defeats the purpose.

I don't think it's likely either; I seem to remember it copying around function pointers, though, so it seems possible with some bad luck.

Don't see a nice fix. The COMMERROR approach (ie, don't try to send anything to the client, only the log) is not nice at all since the client would get the impression that the server crashed. On the other hand, anything else requires waiting till we get control back from OpenSSL, which might be a long time, and meanwhile we're still holding locks that prevent WAL recovery from proceeding.

I think we can make OpenSSL return pretty much immediately if we assume recv() can reliably be interrupted by signals, possibly by setting the socket to nonblocking in the signal handler.
We just need to tell openssl not to retry immediately and we should be fine. Given that quite some people use openssl with nonblocking sockets, that code path should be reasonably safe. That still requires ugliness around saving the error and reraising it after returning from openssl though... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby conflict resolution handling
At 2012-12-29 14:23:45 -0500, sfr...@snowman.net wrote:

Regarding the actual comment, here's the wording that I'd use:

Sorry for nitpicking, but "we can't long jumps" made me cringe. Here's a slightly more condensed version:

/*
 * We can't use ereport(ERROR) here, because any longjmps
 * in DoingCommandRead state run the risk of violating our
 * protocol or the SSL protocol, by interrupting OpenSSL in
 * the middle of changing its internal state.
 *
 * Currently, the only option is to promote ERROR to FATAL
 * until we figure out a better way to handle errors in this
 * state.
 */

Patch along these lines attached, which also removes trailing whitespace from the original patch.

-- Abhijit

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 407c548..5b952d4 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2908,6 +2908,17 @@ ProcessInterrupts(void)
 		DisableNotifyInterrupt();
 		DisableCatchupInterrupt();
 		pgstat_report_recovery_conflict(RecoveryConflictReason);
+
+		/*
+		 * We can't use ereport(ERROR) here, because any longjmps
+		 * in DoingCommandRead state run the risk of violating our
+		 * protocol or the SSL protocol, by interrupting OpenSSL in
+		 * the middle of changing its internal state.
+		 *
+		 * Currently, the only option is to promote ERROR to FATAL
+		 * until we figure out a better way to handle errors in this
+		 * state.
+		 */
 		if (DoingCommandRead)
 			ereport(FATAL,
 					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
Re: [HACKERS] Hot Standby conflict resolution handling
On Thu, Jan 17, 2013 at 9:26 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote:

At 2012-12-29 14:23:45 -0500, sfr...@snowman.net wrote:

Regarding the actual comment, here's the wording that I'd use:

Sorry for nitpicking, but "we can't long jumps" made me cringe. Here's a slightly more condensed version:

/*
 * We can't use ereport(ERROR) here, because any longjmps
 * in DoingCommandRead state run the risk of violating our
 * protocol or the SSL protocol, by interrupting OpenSSL in
 * the middle of changing its internal state.
 *
 * Currently, the only option is to promote ERROR to FATAL
 * until we figure out a better way to handle errors in this
 * state.
 */

Patch along these lines attached, which also removes trailing whitespace from the original patch.

Thanks Stephen and Abhijit for improving the comments. I like this wording, so +1 from my side. Abhijit, do you want to add the patch and change the CF status appropriately?

Thanks, Pavan

-- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
Re: [HACKERS] Hot Standby conflict resolution handling
Abhijit Menon-Sen a...@2ndquadrant.com writes:

Sorry for nitpicking, but "we can't long jumps" made me cringe.

Agreed :-(

Here's a slightly more condensed version:

I find this still not quite right, because where it's placed, it applies to both the DoingCommandRead and !DoingCommandRead branches of the if/else statement. The wording would be okay if placed inside the first if branch, but I think visually that would look ugly. I'm inclined to suggest wording it like

 * If we're in DoingCommandRead state, we can't use ereport(ERROR),
 * because any longjmp would risk interrupting OpenSSL operations
 * and thus confusing that library and/or violating wire protocol.

plus your second para as-is.

But having said that ... are we sure this code is not actually broken? ISTM that if we dare not interrupt for fear of confusing OpenSSL, we cannot safely attempt to send an error message to the client either; but ereport(FATAL) will try exactly that.

regards, tom lane
Re: [HACKERS] Hot Standby conflict resolution handling
On Thu, Jan 17, 2013 at 12:08 PM, Tom Lane t...@sss.pgh.pa.us wrote: But having said that ... are we sure this code is not actually broken? I'm not. ISTM that if we dare not interrupt for fear of confusing OpenSSL, we cannot safely attempt to send an error message to the client either; but ereport(FATAL) will try exactly that. I thought since FATAL will force the backend to exit, we don't care much about corrupted OpenSSL state. I even thought that's why we raise ERROR to FATAL so that the backend can start in a clean state. But clearly I'm missing a point here because you don't think that way. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
Re: [HACKERS] Hot Standby conflict resolution handling
Pavan,

* Pavan Deolasee (pavan.deola...@gmail.com) wrote:

On Tue, Dec 4, 2012 at 6:01 PM, Pavan Deolasee pavan.deola...@gmail.com wrote:

Thanks Andres. I also read the original thread and I now understand why we are using FATAL here, at least until we have a better solution. Obviously the connection reset is no good either because, as someone commented in the original discussion, I thought that I'm seeing a server crash while it was not.

How about the attached comment to be added at the appropriate place? I've extracted this mostly from Tom's explanation in the original thread.

I was hoping to see an update to the actual error messages returned in this patch. I agree that it's good to add the comments, but that doesn't do anything to help the user out in this case.

Regarding the actual comment, here's the wording that I'd use:

---
If we are in DoingCommandRead state, we can't use ereport(ERROR) because we can't long jumps in this state. If we attempt to longjmps in this state, we not only risk breaking protocol at our level, but also risk leaving openssl in an inconsistent state, either violating the ssl protocol or having its internal state trashed because it was interrupted while in the middle of changing that state. Currently, the only option is to promote ERROR to FATAL until we figure out a way to handle errors more effectively while in this state.
---

If you agree with that wording update, can you update the patch?

Thanks, Stephen

signature.asc Description: Digital signature
Re: [HACKERS] Hot Standby conflict resolution handling
Hi,

On 2012-12-04 12:30:43 +0530, Pavan Deolasee wrote:

I was trying some simple queries on a Hot Standby with streaming replication. On standby, I do this:

postgres=# begin transaction isolation level repeatable read;
BEGIN
postgres=# explain verbose select count(b) from test WHERE a > 10;

On master, I insert a bunch of tuples in the table and run VACUUM ANALYZE.

postgres=# INSERT INTO test VALUES (generate_series(110001,12), 'foo', 1);
INSERT 0 1
postgres=# VACUUM ANALYZE test;
VACUUM

After max_standby_streaming_delay, the standby starts cancelling the queries. I get an error like this on the standby:

postgres=# explain verbose select count(b) from test WHERE a > 10;
FATAL: terminating connection due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

So I've a couple of questions/concerns here.

1. Why throw a FATAL error here? A plain ERROR should be enough to abort the transaction. There are four places in ProcessInterrupts() where we throw these kinds of errors, and three of them are FATAL.

The problem here is that we're in IDLE IN TRANSACTION in this case, which currently cannot be cancelled (i.e. pg_cancel_backend() just won't do anything). There are two problems making this non-trivial. For one, while we're in IDLE IN TXN the client doesn't expect a response on a protocol level, so we can't simply ereport() at that time. For another, when we're in IDLE IN TXN we're potentially inside openssl, so we can't jump out of there anyway because that would quite likely corrupt the internal state of openssl.

I tried to fix this before (c.f.
"Idle in transaction cancellation" or similar) but while I had some kind of fix for the first issue (I saved the error and reported it later, when the protocol state allows it) I missed the jumping-out-of-openssl bit. I think it's not that hard to solve though. I remember having something preliminary, but I never had the time to finish it. If I remember correctly, the trick was to set openssl into non-blocking mode temporarily and return to the caller inside be-secure.c:my_sock_read. At that location ProcessInterrupts can run safely, error out silently, and reraise the error once we're in the right protocol state.

else if (RecoveryConflictPending && RecoveryConflictRetryable)
{
	pgstat_report_recovery_conflict(RecoveryConflictReason);
	ereport(FATAL,
			(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
			 errmsg("terminating connection due to conflict with recovery"),
			 errdetail_recovery_conflict()));
}
else if (RecoveryConflictPending)
{
	/* Currently there is only one non-retryable recovery conflict */
	Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
	pgstat_report_recovery_conflict(RecoveryConflictReason);
	ereport(FATAL,
			(errcode(ERRCODE_DATABASE_DROPPED),
			 errmsg("terminating connection due to conflict with recovery"),
			 errdetail_recovery_conflict()));
}

AFAICS the first of these should be ereport(ERROR). Otherwise, irrespective of whether RecoveryConflictRetryable is true or false, we will always ereport(FATAL).

Which is fine, because we're below if (ProcDiePending). Note there's a separate path for QueryCancelPending. We go on to kill connections once the normal conflict handling has tried several times.

2. For my test, the error message itself looks wrong because I did not actually remove any rows on the master. VACUUM probably marked a bunch of pages as all-visible, and that should have triggered a conflict on the standby in order to support index-only scans.
IMHO we should improve the error message to avoid any confusion. Or we can add a new ProcSignalReason to differentiate between a cancel due to cleanup vs. a visibilitymap_set() operation.

I think we desperately need to improve *all* of these messages with significantly more detail (cause for cancellation, relation, current xid, conflicting xid, current/last query).

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hot Standby conflict resolution handling
On Tue, Dec 4, 2012 at 1:44 PM, Andres Freund and...@2ndquadrant.com wrote:

After max_standby_streaming_delay, the standby starts cancelling the queries. I get an error like this on the standby:

postgres=# explain verbose select count(b) from test WHERE a > 10;
FATAL: terminating connection due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

So I've a couple of questions/concerns here.

1. Why throw a FATAL error here? A plain ERROR should be enough to abort the transaction. There are four places in ProcessInterrupts() where we throw these kinds of errors, and three of them are FATAL.

The problem here is that we're in IDLE IN TRANSACTION in this case, which currently cannot be cancelled (i.e. pg_cancel_backend() just won't do anything). There are two problems making this non-trivial. For one, while we're in IDLE IN TXN the client doesn't expect a response on a protocol level, so we can't simply ereport() at that time. For another, when we're in IDLE IN TXN we're potentially inside openssl, so we can't jump out of there anyway because that would quite likely corrupt the internal state of openssl. I tried to fix this before (c.f. "Idle in transaction cancellation" or similar) but while I had some kind of fix for the first issue (I saved the error and reported it later, when the protocol state allows it) I missed the jumping-out-of-openssl bit. I think it's not that hard to solve though. I remember having something preliminary, but I never had the time to finish it. If I remember correctly, the trick was to set openssl into non-blocking mode temporarily and return to the caller inside be-secure.c:my_sock_read.

Thanks Andres.
I also read the original thread and I now understand why we are using FATAL here, at least until we have a better solution. Obviously the connection reset is no good either because, as someone commented in the original discussion, I thought that I'm seeing a server crash while it was not.

AFAICS the first of these should be ereport(ERROR). Otherwise, irrespective of whether RecoveryConflictRetryable is true or false, we will always ereport(FATAL).

Which is fine, because we're below if (ProcDiePending). Note there's a separate path for QueryCancelPending. We go on to kill connections once the normal conflict handling has tried several times.

Ok. Understood. I now see that every path below if (ProcDiePending) will call FATAL, albeit with different error codes. That explains the current code.

I think we desperately need to improve *all* of these messages with significantly more detail (cause for cancellation, relation, current xid, conflicting xid, current/last query).

I agree.

Thanks, Pavan

-- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
Re: [HACKERS] Hot Standby conflict resolution handling
On Tue, Dec 4, 2012 at 6:01 PM, Pavan Deolasee pavan.deola...@gmail.com wrote:

Thanks Andres. I also read the original thread and I now understand why we are using FATAL here, at least until we have a better solution. Obviously the connection reset is no good either because, as someone commented in the original discussion, I thought that I'm seeing a server crash while it was not.

How about the attached comment to be added at the appropriate place? I've extracted this mostly from Tom's explanation in the original thread.

Thanks, Pavan

-- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee

clarify-fatal-error-in-conflict-resolve.patch Description: Binary data
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On Fri, Nov 30, 2012 at 6:41 PM, Andres Freund and...@2ndquadrant.com wrote: While we're talking about changing defaults, how about changing the default value of the recovery.conf parameter 'standby_mode' to on? Not sure about anybody else, but I never want it any other way. Hm. But only if there is a recovery.conf I guess? Yeah, wouldn't make much sense otherwise. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
Magnus Hagander wrote:

On 30.11.2012 21:02, Andres Freund wrote:

There are workloads where it's detrimental, but in general having it default to on improves the experience tremendously, because getting conflicts because of vacuum is rather confusing. In the workloads where it might not be a good idea (very long queries on the standby, many dead tuples on the primary) you need to think very carefully about the strategy of avoiding conflicts anyway, and explicit configuration is required as well. Does anybody have an argument against changing the default value?

-1. By default, I would expect a standby server to not have any meaningful impact on the performance of the master. With hot standby feedback, you can bloat the master very badly if you're not careful.

I'm with Heikki on the -1 on this. It's certainly unexpected to have the slave affect the master by default - people will expect the master to be independent.

I agree.

+1. Having your reporting query time out *shows you* the problem. Having the master bloat for you won't show the problem until later - when it's much bigger, and it's much more pain to recover from.

I couldn't agree more. There are different requirements, and there will always be people who need to change the defaults, but the way it is is the safest in my opinion.

Yours, Laurenz Albe
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On 30 November 2012 19:02, Andres Freund and...@2ndquadrant.com wrote:

The subject says it all. There are workloads where it's detrimental, but in general having it default to on improves the experience tremendously, because getting conflicts because of vacuum is rather confusing. In the workloads where it might not be a good idea (very long queries on the standby, many dead tuples on the primary) you need to think very carefully about the strategy of avoiding conflicts anyway, and explicit configuration is required as well. Does anybody have an argument against changing the default value?

I don't see a technical objection; perhaps others do.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On Fri, Nov 30, 2012 at 2:02 PM, Andres Freund and...@2ndquadrant.com wrote:
> Does anybody have an argument against changing the default value?

Well, the disadvantage of it is that the standby can bloat the master, which might be surprising to some people, too. But I don't really have a lot of skin in this game.

While we're talking about changing defaults, how about changing the default value of the recovery.conf parameter 'standby_mode' to on? Not sure about anybody else, but I never want it any other way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
> In the workloads where it might not be a good idea (very long queries on the standby, many dead tuples on the primary) you need to think very carefully about the strategy for avoiding conflicts anyway, and explicit configuration is required as well. Does anybody have an argument against changing the default value?

On balance, I think it's a good idea. It's easier for new users, conceptually, to deal with table bloat than query cancel.

Have we done testing on how much query cancel it actually eliminates?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On 30.11.2012 21:02, Andres Freund wrote:
> Hi, The subject says it all. There are workloads where it's detrimental, but in general having it default to on improves the experience tremendously, because getting conflicts because of vacuum is rather confusing. In the workloads where it might not be a good idea (very long queries on the standby, many dead tuples on the primary) you need to think very carefully about the strategy for avoiding conflicts anyway, and explicit configuration is required as well. Does anybody have an argument against changing the default value?

-1. By default, I would expect a standby server to not have any meaningful impact on the performance of the master. With hot standby feedback, you can bloat the master very badly if you're not careful.

Think of someone setting up a test server, by setting it up as a standby from the master. Now, when someone holds a transaction open in the test server, you get bloat in the master. Or if you set up a standby for reporting purposes - a very common use case - you would not expect a long-running ad-hoc query in the standby to bloat the master. That's precisely why you set up such a standby in the first place. You could of course still turn it off, but you would have to know about it in the first place.

I think it's a reasonable assumption that a standby does *not* affect the master (aside from the bandwidth and disk space required to retain/ship the WAL). If you have to remember to explicitly set a GUC to get that behavior, that's a pretty big gotcha.

- Heikki
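[For readers following along: the knobs being debated all live in the standby's postgresql.conf. A minimal sketch of the 9.2-era situation - values shown are for illustration only, not recommendations:]

```
# standby postgresql.conf (illustrative sketch)
hot_standby = on                    # allow read-only queries during recovery
hot_standby_feedback = off          # the default Andres proposes flipping to on;
                                    # on = the standby's xmin is reported back to
                                    # the master, so vacuum keeps rows its queries
                                    # still need (at the cost of master bloat)
max_standby_streaming_delay = 30s   # how long WAL replay waits before canceling
max_standby_archive_delay = 30s     # conflicting standby queries
```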
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On Fri, Nov 30, 2012 at 5:46 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> Think of someone setting up a test server, by setting it up as a standby from the master. Now, when someone holds a transaction open in the test server, you get bloat in the master. Or if you set up a standby for reporting purposes - a very common use case - you would not expect a long running ad-hoc query in the standby to bloat the master. That's precisely why you set up such a standby in the first place.

Without hot standby feedback, reporting queries are impossible. I've experienced it. Cancellations make it impossible to finish any decently complex reporting query.
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
Robert Haas robertmh...@gmail.com writes:
> While we're talking about changing defaults, how about changing the default value of the recovery.conf parameter 'standby_mode' to on? Not sure about anybody else, but I never want it any other way.

Dunno, it's been only a couple of days since there was a thread about somebody who had turned it on and not gotten the results he wanted (because he was only trying to do a point-in-time recovery, not create a standby). There's enough other configuration needed to set up a standby node that I'm not sure flipping this default helps the case much.

But having said that, would it be practical to get rid of the explicit standby_mode parameter altogether? I'm thinking we could assume standby mode is wanted if primary_conninfo has a nonempty value. There remains the case of a standby being fed solely from a WAL archive without primary_conninfo, but that's a pretty darn corner-y corner case, and I doubt it has to be easy to set up. One possibility for it is to allow primary_conninfo to be set to "none", which would still trigger standby mode but could be coded to not enable connection attempts.

Mind you, I'm not sure that such a design is easier to understand or document. But it would be one less parameter.

regards, tom lane
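[Concretely, Tom's suggestion would let a 9.x-era recovery.conf like the following drop its first line, with standby mode inferred from the presence of primary_conninfo. Hostname and paths are hypothetical:]

```
# recovery.conf (illustrative sketch)
standby_mode = 'on'
primary_conninfo = 'host=master.example.com port=5432 user=replicator'
restore_command = 'cp /mnt/archive/%f "%p"'   # an archive-only standby sets just
                                              # this, without primary_conninfo -
                                              # the corner case Tom mentions
```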
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On 30.11.2012 22:49, Claudio Freire wrote:
> On Fri, Nov 30, 2012 at 5:46 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
>> Think of someone setting up a test server, by setting it up as a standby from the master. Now, when someone holds a transaction open in the test server, you get bloat in the master. Or if you set up a standby for reporting purposes - a very common use case - you would not expect a long running ad-hoc query in the standby to bloat the master. That's precisely why you set up such a standby in the first place.
> Without hot standby feedback, reporting queries are impossible. I've experienced it. Cancellations make it impossible to finish any decently complex reporting query.

Maybe so, but I'd rather get cancellations in the standby, and then read up on feedback and the other options and figure out how to make it work, than get severe bloat in the master and scratch my head wondering what's causing it.

- Heikki
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
Claudio Freire wrote:
> Without hot standby feedback, reporting queries are impossible. I've experienced it. Cancellations make it impossible to finish any decently complex reporting query.

With what setting of max_standby_streaming_delay? I would rather default that to -1 than default hot_standby_feedback on. That way what you do on the standby only affects the standby.

A default that allows anyone who has a read-only login to a standby to bloat the server by default, which may require hours of down time to correct, seems dangerous to me.

-Kevin
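[Kevin's alternative, as a sketch of the standby's postgresql.conf - illustrative only:]

```
hot_standby_feedback = off        # leave the master completely alone
max_standby_streaming_delay = -1  # never cancel a standby query over a replay
max_standby_archive_delay = -1    # conflict; replay simply falls behind instead
```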
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On Fri, Nov 30, 2012 at 6:06 PM, Kevin Grittner kgri...@mail.com wrote:
>> Without hot standby feedback, reporting queries are impossible. I've experienced it. Cancellations make it impossible to finish any decently complex reporting query.
> With what setting of max_standby_streaming_delay? I would rather default that to -1 than default hot_standby_feedback on. That way what you do on the standby only affects the standby.

1d
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
Claudio Freire klaussfre...@gmail.com writes:
> Without hot standby feedback, reporting queries are impossible. I've experienced it. Cancellations make it impossible to finish any decently complex reporting query.

The original expectation was that slave-side cancels would be infrequent. Maybe there's some fixing/tuning to be done there.

regards, tom lane
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
Claudio Freire wrote:
>> With what setting of max_standby_streaming_delay? I would rather default that to -1 than default hot_standby_feedback on. That way what you do on the standby only affects the standby.
> 1d

Was there actually a transaction hanging open for an entire day on the standby? Was it a query which actually ran that long, or an ill-behaved user or piece of software?

I have most certainly managed databases where holding up vacuuming on the source would cripple performance to the point that users would have demanded that any other process causing it must be immediately canceled. And canceling it wouldn't be enough at that point -- the bloat would still need to be fixed before they could work efficiently.

-Kevin
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On Fri, Nov 30, 2012 at 6:20 PM, Kevin Grittner kgri...@mail.com wrote:
> Claudio Freire wrote:
>>> With what setting of max_standby_streaming_delay? I would rather default that to -1 than default hot_standby_feedback on. That way what you do on the standby only affects the standby.
>> 1d
> Was there actually a transaction hanging open for an entire day on the standby? Was it a query which actually ran that long, or an ill-behaved user or piece of software?

No, and if there was, I wouldn't care for it to be cancelled. Queries were being cancelled way before that timeout was reached, probably something to do with max_keep_segments on the master side being unable to keep up for that long.

> I have most certainly managed databases where holding up vacuuming on the source would cripple performance to the point that users would have demanded that any other process causing it must be immediately canceled. And canceling it wouldn't be enough at that point -- the bloat would still need to be fixed before they could work efficiently.

I wouldn't mind occasional cancels, but these were recurring. When a query ran long enough, there was no way for it to finish, no matter how many times you tried. The master never stops being busy, that's probably a factor.
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On 30.11.2012 23:40, Claudio Freire wrote:
> On Fri, Nov 30, 2012 at 6:20 PM, Kevin Grittner kgri...@mail.com wrote:
>>> With what setting of max_standby_streaming_delay? I would rather default that to -1 than default hot_standby_feedback on. That way what you do on the standby only affects the standby.
>>> 1d
>> Was there actually a transaction hanging open for an entire day on the standby? Was it a query which actually ran that long, or an ill-behaved user or piece of software?
> No, and if there was, I wouldn't care for it to be cancelled. Queries were being cancelled way before that timeout was reached, probably something to do with max_keep_segments on the master side being unable to keep up for that long.

Running out of max_keep_segments would produce a different error, requiring a new base backup.

>> I have most certainly managed databases where holding up vacuuming on the source would cripple performance to the point that users would have demanded that any other process causing it must be immediately canceled. And canceling it wouldn't be enough at that point -- the bloat would still need to be fixed before they could work efficiently.
> I wouldn't mind occasional cancels, but these were recurring. When a query ran long enough, there was no way for it to finish, no matter how many times you tried. The master never stops being busy, that's probably a factor.

Hmm, it sounds like max_standby_streaming_delay=1d didn't work as intended for some reason. It should've given the query one day to run before canceling it. Unless the standby was running one day behind the master already, but that seems unlikely. Any chance you could reproduce that?

- Heikki
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On Fri, Nov 30, 2012 at 6:49 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
>>> I have most certainly managed databases where holding up vacuuming on the source would cripple performance to the point that users would have demanded that any other process causing it must be immediately canceled. And canceling it wouldn't be enough at that point -- the bloat would still need to be fixed before they could work efficiently.
>> I wouldn't mind occasional cancels, but these were recurring. When a query ran long enough, there was no way for it to finish, no matter how many times you tried. The master never stops being busy, that's probably a factor.
> Hmm, it sounds like max_standby_streaming_delay=1d didn't work as intended for some reason. It should've given the query one day to run before canceling it. Unless the standby was running one day behind the master already, but that seems unlikely. Any chance you could reproduce that?

I have a pre-production server with replication for these tests. I could create a fake stream of writes on it, disable feedback, and see what happens.
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On 2012-11-30 14:35:37 -0500, Robert Haas wrote:
> On Fri, Nov 30, 2012 at 2:02 PM, Andres Freund and...@2ndquadrant.com wrote:
>> Does anybody have an argument against changing the default value?
> Well, the disadvantage of it is that the standby can bloat the master, which might be surprising to some people, too. But I don't really have a lot of skin in this game.

Sure, that's a problem. But ISTM that it's a problem everyone running postgres has to know about anyway from running the master itself.

> While we're talking about changing defaults, how about changing the default value of the recovery.conf parameter 'standby_mode' to on? Not sure about anybody else, but I never want it any other way.

Hm. But only if there is a recovery.conf, I guess?

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On Fri, Nov 30, 2012 at 11:35 AM, Robert Haas robertmh...@gmail.com wrote:
> On Fri, Nov 30, 2012 at 2:02 PM, Andres Freund and...@2ndquadrant.com wrote:
>> Does anybody have an argument against changing the default value?
> Well, the disadvantage of it is that the standby can bloat the master, which might be surprising to some people, too. But I don't really have a lot of skin in this game.

Under this precept, we used to not enable hot standby feedback and instead allowed more or less unbounded staleness of the standby through very long cancellation times. Although not immediate, eventually we decided that enough people were getting confused by sufficiently long standby delay caused by bad queries and idle-in-transaction backends, so now we have enabled feedback for new database replicants, along with some fairly un-aggressive cancellation timeouts. It's all rather messy and not very satisfying. We have yet to know if feedback causes or solves problems, on average.

In very early versions we tried the default cancellation settings, and query cancellation confused everyone a *lot*. That went away in a hurry as a result, so I suppose it's not entirely unreasonable to say in retrospect that the defaults can be considered kind of bad.

Longer term, I think I'd be keen to switch all our user-controlled replication to logical, except for use cases where the workload of the standby is under our (and not the user's) control, such as for failover. Unfortunately, our experience with the feature and its use suggests that the contract granted by the mechanisms seen in hot standby is too complex for full-stack developers to keep in careful consideration along with all the other things they want to do with their application and/or have to remember about Postgres to get by.

--
fdr
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On Fri, Nov 30, 2012 at 9:46 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> On 30.11.2012 21:02, Andres Freund wrote:
>> Hi, The subject says it all. There are workloads where it's detrimental, but in general having it default to on improves the experience tremendously, because getting conflicts because of vacuum is rather confusing. In the workloads where it might not be a good idea (very long queries on the standby, many dead tuples on the primary) you need to think very carefully about the strategy for avoiding conflicts anyway, and explicit configuration is required as well. Does anybody have an argument against changing the default value?
> -1. By default, I would expect a standby server to not have any meaningful impact on the performance of the master. With hot standby feedback, you can bloat the master very badly if you're not careful.

I'm with Heikki on the -1 on this. It's certainly unexpected to have the slave affect the master by default - people will expect the master to be independent.

Also, it doesn't IMHO actually *help*. The big thing that makes it harder for people to set up replication that way is wal_level=minimal by default, and in a smaller sense max_wal_senders. (wal_level=minimal also has the interesting property that it's not enough to change it to wal_level=hot_standby if you figure it out too late - you have to turn off hot standby on the slave, start it, have it catch up, shut it down, and reenable hot standby.) And those require a *restart* of the master, which is a lot worse than a small change to the config of the *slave*. So unless you're suggesting to change the default of those two values as well, I'm not sure it really helps that much...

> Think of someone setting up a test server, by setting it up as a standby from the master. Now, when someone holds a transaction open in the test server, you get bloat in the master. Or if you set up a standby for reporting purposes - a very common use case - you would not expect a long running ad-hoc query in the standby to bloat the master. That's precisely why you set up such a standby in the first place. You could of course still turn it off, but you would have to know about it in the first place. I think it's a reasonable assumption that a standby does *not* affect the master (aside from the bandwidth and disk space required to retain/ship the WAL). If you have to remember to explicitly set a GUC to get that behavior, that's a pretty big gotcha.

+1. Having your reporting query time out *shows you* the problem. Having the master bloat for you won't show the problem until later - when it's much bigger, and it's much more pain to recover from.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On Fri, Nov 30, 2012 at 10:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Claudio Freire klaussfre...@gmail.com writes:
>> Without hot standby feedback, reporting queries are impossible. I've experienced it. Cancellations make it impossible to finish any decently complex reporting query.
> The original expectation was that slave-side cancels would be infrequent. Maybe there's some fixing/tuning to be done there.

It depends completely on the query pattern on the master. Saying that cancellations make it impossible to finish any decently complex reporting query is completely incorrect - it depends on the queries on the *master*, not on the complexity of the query on the slave. I know a lot of scenarios where query cancels pretty much never happen at all.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On 2012-11-30 22:46:06 +0200, Heikki Linnakangas wrote:
> On 30.11.2012 21:02, Andres Freund wrote:
>> Hi, The subject says it all. There are workloads where it's detrimental, but in general having it default to on improves the experience tremendously, because getting conflicts because of vacuum is rather confusing. In the workloads where it might not be a good idea (very long queries on the standby, many dead tuples on the primary) you need to think very carefully about the strategy for avoiding conflicts anyway, and explicit configuration is required as well. Does anybody have an argument against changing the default value?
> -1. By default, I would expect a standby server to not have any meaningful impact on the performance of the master. With hot standby feedback, you can bloat the master very badly if you're not careful.

True. But everyone running postgres hopefully knows that problem already, so the effect is relatively easy to explain. The other control possibilities we have are rather hard to understand and to set up, in my experience.

> Think of someone setting up a test server, by setting it up as a standby from the master. Now, when someone holds a transaction open in the test server, you get bloat in the master. Or if you set up a standby for reporting purposes - a very common use case - you would not expect a long running ad-hoc query in the standby to bloat the master. That's precisely why you set up such a standby in the first place.

But you can't do any meaningful reporting without changing the current variables around this anyway. If you have any writes on the master, barely any significant query ever completes. The two basic choices we give people suck more, imo:

* you set up a large delay: it possibly takes a very long time to catch up if the primary dies, and you don't see any up-to-date data in later queries
* you abort queries: you can't do any reporting queries

Both are unusable for most scenarios, and getting the former just right is hard. Imo a default of on works in far more scenarios than the contrary.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
All:

Well, the problem is that we have three configurations, each of which works for only one very common scenario:

- reporting slave: feedback off, very long replication_delay
- load-balancing slave: feedback on, short replication_delay
- backup/failover slave: feedback off, short replication_delay

I don't think anyone without a serious market survey can say that any of the above scenarios is more common than the others; I run into all three pretty darned frequently.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Hot Standby Feedback should default to on in 9.3+
On 2012-11-30 16:09:15 -0500, Tom Lane wrote:
> Claudio Freire klaussfre...@gmail.com writes:
>> Without hot standby feedback, reporting queries are impossible. I've experienced it. Cancellations make it impossible to finish any decently complex reporting query.
> The original expectation was that slave-side cancels would be infrequent. Maybe there's some fixing/tuning to be done there.

I've mostly seen snapshot conflicts. It's hard to say anything more precise because we don't log any additional information (which is admittedly not easy). I think it would already help a lot if ResolveRecoveryConflictWithSnapshot would log:

* the relfilenode (already passed)
* the removed xid
* the pid of the backend holding the oldest snapshot
* the oldest xid of that backend

Most of that should be easy to get. But I don't think we really can expect a very low rate of conflicts if the primary has few long-running transactions but the standby does.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
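[Until such per-conflict logging exists, the per-database conflict counters added in 9.1 at least show which kind of conflict is firing; a quick check to run on the standby, as a sketch:]

```
SELECT datname, confl_snapshot, confl_lock, confl_bufferpin,
       confl_deadlock, confl_tablespace
FROM pg_stat_database_conflicts
ORDER BY confl_snapshot DESC;
```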
Re: [HACKERS] hot standby PSQL 9.1 Windows 2008 Servers
On Tue, May 22, 2012 at 12:15 PM, chinnaobi chinna...@gmail.com wrote:
> You mean when the primary which is going to switch its role to standby might not have sent all the WAL records to the standby, and if it is switched to standby it has more WAL records than the standby which now serves as primary. Is it ??

Yes, that is possible. Or the standby might have received all the WAL records but not be caught up in terms of replaying them.

> It is actually the standby server which has to be restored from archive when it is switching to primary right .. Not the primary which is switching to standby ??

If you want to promote a standby, you can just do it (pg_ctl promote). If you have a master that you want to demote to a standby, you've got to resync it to whatever the current master is. I understand repmgr has some tooling to help automate that, although I have not played with it myself. In any event, rsync can be a big help in reducing the resync time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
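[A rough sketch of the two directions Robert describes; paths and hostnames are hypothetical, and in practice a tool like repmgr or pg_basebackup should drive the resync:]

```
# Promote the standby to primary (9.1+):
pg_ctl -D /var/lib/pgsql/9.1/data promote

# Demote the old primary: resync its data directory from the new primary.
# rsync only transfers changed files, which is what cuts the resync time.
# Take the copy while the new primary is in pg_start_backup()/pg_stop_backup()
# mode (or stopped); rsyncing a live cluster without that is not safe.
rsync -a --delete --exclude=postmaster.pid --exclude=pg_xlog \
    newprimary:/var/lib/pgsql/9.1/data/ /var/lib/pgsql/9.1/data/

# Then create a recovery.conf pointing at the new primary and start postgres.
```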
Re: [HACKERS] hot standby PSQL 9.1 Windows 2008 Servers
chinnaobi wrote:
> You mean when the primary which is going to switch its role to standby might not have sent all the WAL records to the standby, and if it is switched to standby it has more WAL records than the standby which now serves as primary. Is it ??

What happens when there is a network fault between the primary and the standby, but not between the primary and some of the clients updating it?

Similarly, if this is asynchronous replication, what if there have been commits on the primary which were still in the network buffer when the primary crashed?

Clean automated failover is not a trivial task. If you are writing your own, it would be best to follow the steps recommended in the documentation rather closely.

-Kevin
Re: [HACKERS] hot standby PSQL 9.1 Windows 2008 Servers
Dear Kevin,

Thank you for your reply. Yeah, I am writing an application using PowerShell; it's true this is not trivial, especially for a guy like me with no database background. You raised all the cases which I am muddling with, but currently I am testing this setup: the standby server is always configured from a base backup restored from the storage server, and then streaming replication (asynchronous) is started. Base backups are taken frequently. I am sure there is some data loss during switching. Still researching how to do it cleanly; please suggest any good papers on this.

Reddy.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Re-hot-standby-PSQL-9-1-Windows-2008-Servers-tp5710824p5710830.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Re: [HACKERS] hot standby PSQL 9.1 Windows 2008 Servers
Sorry, I should have mentioned: in my setup the primary and standby servers receive the same traffic, so there is no issue with "a network fault between the primary and the standby, but not between the primary and some of the clients updating it".
Re: [HACKERS] hot standby PSQL 9.1 Windows 2008 Servers
On Mon, May 14, 2012 at 8:18 AM, chinnaobi chinna...@gmail.com wrote:
> I do a base backup only the first time, when the standby is first set up for replication. Whenever the primary goes down, the standby becomes primary, and the primary becomes standby when it comes back up. When the primary becomes standby I restore data from the WAL archive and start the postgres service with streaming replication to connect to the primary. This setup is working.

I don't think this is safe. The primary might have WAL that never made it to the standby, in which case the two machines will be out of sync with each other and all sorts of bad stuff could happen.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] hot standby PSQL 9.1 Windows 2008 Servers
Dear Robert,

Thank you very much for the reply.

You mean when the primary which is going to switch its role to standby might not have sent all the WAL records to the standby, and if it is switched to standby it has more WAL records than the standby which now serves as primary. Is it ??

It is actually the standby server which has to be restored from archive when it is switching to primary, right .. Not the primary which is switching to standby ??

Regards,
Reddy.
Re: [HACKERS] Hot Standby Failover Scenario
On 02/27/2012 10:05 PM, Lucky Haryadi wrote:
> 3. When Master A fails to service, the database will be failed over to Slave B by triggering with the trigger file.

As soon as you trigger a standby, it changes to a new timeline. At that point, the series of WAL files diverges. It's no longer possible to apply them to a system that is still on the original timeline, such as your original master A in this situation.

There's a good reason for that. Let's say that A committed an additional transaction before it went down, but that commit wasn't replicated to B. You can't just move records from B over anymore in that case. The only way to make sure A is in sync again is to do a new base backup, which you can potentially accelerate using rsync to only copy what has changed.

I see a lot of people try to bypass one of the steps recommended in the manual using various schemes like yours, and they usually have a bug like this in there--sometimes obvious like this, sometimes subtle. Trying to get too clever here is dangerous to your database.

Warning: pgsql-hackers is the mailing list for people to discuss the development of PostgreSQL, not how to use it. Questions like this should be asked on either the pgsql-admin or pgsql-general mailing list. I'm not going to answer additional questions like this from you here on this list, and I doubt anyone else will either.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] Hot standby fails if any backend crashes
On Fri, Feb 3, 2012 at 4:48 AM, Tom Lane t...@sss.pgh.pa.us wrote: I think saner behavior might only require this change: /* * Any unexpected exit (including FATAL exit) of the startup * process is treated as a crash, except that we don't want to * reinitialize. */ if (!EXIT_STATUS_0(exitstatus)) { - RecoveryError = true; + if (!FatalError) + RecoveryError = true; HandleChildCrash(pid, exitstatus, _(startup process)); continue; } plus suitable comment adjustments of course. Haven't tested this yet though. Looks good, will test. It's a bit disturbing that nobody has reported this from the field yet. Seems to imply that hot standby isn't being used much. There are many people I know using it in production for more than a year now. Either they haven't seen it or they haven't reported it to us. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby fails if any backend crashes
On Thu, Feb 2, 2012 at 8:48 PM, Tom Lane t...@sss.pgh.pa.us wrote: It's a bit disturbing that nobody has reported this from the field yet. Seems to imply that hot standby isn't being used much. I have seen this, but didn't get to dig in, as I thought it could be a problem from other things done outside Postgres (it also came up in #6200, but I didn't mention it). Consider it retroactively reported. -- fdr
Re: [HACKERS] Hot standby fails if any backend crashes
I wrote: I'm currently working with Duncan Rance's test case for bug #6425, and I am observing a very nasty behavior in HEAD: once one of the hot-standby query backends crashes, the standby postmaster SIGQUIT's all its children and then just quits itself, with no log message and apparently no effort to restart. Surely this is not intended? I looked through postmaster.c and found that the cause of this is pretty obvious: if the startup process exits with any non-zero status, we assume that represents an unrecoverable error condition, and set RecoveryError which causes the postmaster to exit silently as soon as its last child is gone. But we do this even if the reason the startup process did exit(1) is that we sent it SIGQUIT as a result of a crash of some other process. Of course this logic dates from a time where the startup process could not have any siblings, so when it was written, such a thing was impossible. I think saner behavior might only require this change:

			/*
			 * Any unexpected exit (including FATAL exit) of the startup
			 * process is treated as a crash, except that we don't want to
			 * reinitialize.
			 */
			if (!EXIT_STATUS_0(exitstatus))
			{
-				RecoveryError = true;
+				if (!FatalError)
+					RecoveryError = true;
				HandleChildCrash(pid, exitstatus, _("startup process"));
				continue;
			}

plus suitable comment adjustments of course. Haven't tested this yet though. It's a bit disturbing that nobody has reported this from the field yet. Seems to imply that hot standby isn't being used much. regards, tom lane
Re: [HACKERS] Hot standby fails if any backend crashes
On Fri, Feb 3, 2012 at 1:48 PM, Tom Lane t...@sss.pgh.pa.us wrote: I wrote: I'm currently working with Duncan Rance's test case for bug #6425, and I am observing a very nasty behavior in HEAD: once one of the hot-standby query backends crashes, the standby postmaster SIGQUIT's all its children and then just quits itself, with no log message and apparently no effort to restart. Surely this is not intended? I looked through postmaster.c and found that the cause of this is pretty obvious: if the startup process exits with any non-zero status, we assume that represents an unrecoverable error condition, and set RecoveryError which causes the postmaster to exit silently as soon as its last child is gone. But we do this even if the reason the startup process did exit(1) is that we sent it SIGQUIT as a result of a crash of some other process. Of course this logic dates from a time where the startup process could not have any siblings, so when it was written, such a thing was impossible. I think saner behavior might only require this change: /* * Any unexpected exit (including FATAL exit) of the startup * process is treated as a crash, except that we don't want to * reinitialize. */ if (!EXIT_STATUS_0(exitstatus)) { - RecoveryError = true; + if (!FatalError) + RecoveryError = true; HandleChildCrash(pid, exitstatus, _(startup process)); continue; } plus suitable comment adjustments of course. Haven't tested this yet though. Looks good to me. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby off of hot standby?
On Mon, Jan 30, 2012 at 4:36 AM, Igor Schtein ischt...@gmail.com wrote: Is it possible to use a standby instance as a master/primary for another standby in Postgres 9.0? In other words, does PG 9.0 supports cascading standby configuration? No, but 9.2 will support that feature, known as cascading replication. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hot standby off of hot standby?
On 1/29/12 8:36 PM, Igor Schtein wrote: Hi, Is it possible to use a standby instance as a master/primary for another standby in Postgres 9.0? In other words, does PG 9.0 supports cascading standby configuration? No, that's a 9.2 feature in development. If you can build PostgreSQL from source and apply patches, we could use your help testing it! -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Hot Standby startup with overflowed snapshots
On Fri, Oct 28, 2011 at 3:42 AM, Chris Redekop ch...@replicon.com wrote: On a side note I am sporadically seeing another error on hot standby startup. I'm not terribly concerned about it as it is pretty rare and it will work on a retry, so it's not a big deal. The error is FATAL: out-of-order XID insertion in KnownAssignedXids. If you think it might be a bug and are interested in hunting it down, let me know and I'll help any way I can...but if you're not too worried about it then neither am I :) I'd be interested to see further details of this if you see it again, or have access to previous logs. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hot Standby startup with overflowed snapshots
oops...reply-to-all -- Forwarded message -- From: Chris Redekop ch...@replicon.com Date: Wed, Nov 2, 2011 at 8:41 AM Subject: Re: [HACKERS] Hot Standby startup with overflowed snapshots To: Simon Riggs si...@2ndquadrant.com Sure, I've got quite a few logs lying around - I've attached 3 of 'em...let me know if there are any specific things you'd like me to do or look for next time it happens. On Wed, Nov 2, 2011 at 2:59 AM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, Oct 28, 2011 at 3:42 AM, Chris Redekop ch...@replicon.com wrote: On a side note I am sporadically seeing another error on hot standby startup. I'm not terribly concerned about it as it is pretty rare and it will work on a retry, so it's not a big deal. The error is FATAL: out-of-order XID insertion in KnownAssignedXids. If you think it might be a bug and are interested in hunting it down, let me know and I'll help any way I can...but if you're not too worried about it then neither am I :) I'd be interested to see further details of this if you see it again, or have access to previous logs. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services postgresql-2011-10-27_202007.log Description: Binary data postgresql-2011-10-31_152925.log Description: Binary data postgresql-2011-11-01_094501.log Description: Binary data
Re: [HACKERS] Hot Standby startup with overflowed snapshots
Thanks for the patch Simon, but unfortunately it does not resolve the issue I am seeing. The standby still refuses to finish starting up until long after all clients have disconnected from the primary (10 minutes). I do see your new log statement on startup, but only once - it does not repeat. Is there any way for me to see what the oldest xid on the standby is, via controldata or something like that? The standby does stream to keep up with the primary while the primary has load, and then it becomes idle when the primary becomes idle (when I kill all the connections), so it appears to be current...but it just doesn't finish starting up. I'm not sure if it's relevant, but after it has sat idle for a couple minutes I start seeing these statements in the log (with the same offset every time): DEBUG: skipping restartpoint, already performed at 9/9520 On Thu, Oct 27, 2011 at 7:26 AM, Simon Riggs si...@2ndquadrant.com wrote: Chris Redekop's recent report of slow startup for Hot Standby has made me revisit the code there. Although there isn't a bug, there is a missed opportunity for starting up faster which could be the source of Chris' annoyance. The following patch allows a faster startup in some circumstances. The patch also alters the log levels for messages and gives a single simple message for this situation. The log will now say LOG: recovery snapshot waiting for non-overflowed snapshot or until oldest active xid on standby is at least %u (now %u) ...multiple times until snapshot non-overflowed or xid reached... whereas before the first LOG message shown was LOG: consistent state delayed because recovery snapshot incomplete and only later, at DEBUG2, do you see LOG: recovery snapshot waiting for %u oldest active xid on standby is %u ...multiple times until xid reached... Comments please.
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hot Standby startup with overflowed snapshots
On Thu, Oct 27, 2011 at 5:26 PM, Chris Redekop ch...@replicon.com wrote: Thanks for the patch Simon, but unfortunately it does not resolve the issue I am seeing. The standby still refuses to finish starting up until long after all clients have disconnected from the primary (10 minutes). I do see your new log statement on startup, but only once - it does not repeat. Is there any way for me to see what the oldest xid on the standby is, via controldata or something like that? The standby does stream to keep up with the primary while the primary has load, and then it becomes idle when the primary becomes idle (when I kill all the connections), so it appears to be current...but it just doesn't finish starting up. I'm not sure if it's relevant, but after it has sat idle for a couple minutes I start seeing these statements in the log (with the same offset every time): DEBUG: skipping restartpoint, already performed at 9/9520 OK, so it looks like there are 2 opportunities to improve, not just one. Try this. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services faster_hot_standby_startup_withsubxacts.v2.patch Description: Binary data
Re: [HACKERS] Hot Standby startup with overflowed snapshots
hrmz, still basically the same behaviour. I think it might be a *little* better with this patch. Before, when under load, it would start up quickly maybe 2 or 3 times out of 10 attempts...with this patch it might be up to 4 or 5 times out of 10...ish...or maybe it was just fluke *shrug*. I'm still only seeing your log statement a single time (I'm running at debug2). I have discovered something though - when the standby is in this state, if I force a checkpoint on the primary then the standby comes right up. Is there anything I can check or try for you to help figure this out? Or is it actually as designed that it could take 10-ish minutes to start up even after all clients have disconnected from the primary? On Thu, Oct 27, 2011 at 11:27 AM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, Oct 27, 2011 at 5:26 PM, Chris Redekop ch...@replicon.com wrote: Thanks for the patch Simon, but unfortunately it does not resolve the issue I am seeing. The standby still refuses to finish starting up until long after all clients have disconnected from the primary (10 minutes). I do see your new log statement on startup, but only once - it does not repeat. Is there any way for me to see what the oldest xid on the standby is, via controldata or something like that? The standby does stream to keep up with the primary while the primary has load, and then it becomes idle when the primary becomes idle (when I kill all the connections), so it appears to be current...but it just doesn't finish starting up. I'm not sure if it's relevant, but after it has sat idle for a couple minutes I start seeing these statements in the log (with the same offset every time): DEBUG: skipping restartpoint, already performed at 9/9520 OK, so it looks like there are 2 opportunities to improve, not just one. Try this. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hot Standby startup with overflowed snapshots
On Thu, Oct 27, 2011 at 10:09 PM, Chris Redekop ch...@replicon.com wrote: hrmz, still basically the same behaviour. I think it might be a *little* better with this patch. Before, when under load, it would start up quickly maybe 2 or 3 times out of 10 attempts...with this patch it might be up to 4 or 5 times out of 10...ish...or maybe it was just fluke *shrug*. I'm still only seeing your log statement a single time (I'm running at debug2). I have discovered something though - when the standby is in this state, if I force a checkpoint on the primary then the standby comes right up. Is there anything I can check or try for you to help figure this out? Or is it actually as designed that it could take 10-ish minutes to start up even after all clients have disconnected from the primary? Thanks for testing. The improvements cover specific cases, so it's not subject to chance; it's not a performance patch. It's not designed to act the way you describe, but it does. The reason this occurs is that you have a transaction-heavy workload with occasional periods of complete quiet, and a base backup time that is much less than checkpoint_timeout. If your base backup was slower, the checkpoint would have hit naturally before recovery had reached a consistent state. Which seems fairly atypical. I guess you're doing this on a test system. It seems cheap to add in a call to LogStandbySnapshot() after each call to pg_stop_backup(). Does anyone think this case is worth adding code for? Seems like one more thing to break. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hot Standby startup with overflowed snapshots
Sorry..."designed" was a poor choice of words, I meant "not unexpected". Doing the checkpoint right after pg_stop_backup() looks like it will work perfectly for me, so thanks for all your help! On a side note I am sporadically seeing another error on hot standby startup. I'm not terribly concerned about it as it is pretty rare and it will work on a retry, so it's not a big deal. The error is FATAL: out-of-order XID insertion in KnownAssignedXids. If you think it might be a bug and are interested in hunting it down, let me know and I'll help any way I can...but if you're not too worried about it then neither am I :) On Thu, Oct 27, 2011 at 4:55 PM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, Oct 27, 2011 at 10:09 PM, Chris Redekop ch...@replicon.com wrote: hrmz, still basically the same behaviour. I think it might be a *little* better with this patch. Before, when under load, it would start up quickly maybe 2 or 3 times out of 10 attempts...with this patch it might be up to 4 or 5 times out of 10...ish...or maybe it was just fluke *shrug*. I'm still only seeing your log statement a single time (I'm running at debug2). I have discovered something though - when the standby is in this state, if I force a checkpoint on the primary then the standby comes right up. Is there anything I can check or try for you to help figure this out? Or is it actually as designed that it could take 10-ish minutes to start up even after all clients have disconnected from the primary? Thanks for testing. The improvements cover specific cases, so it's not subject to chance; it's not a performance patch. It's not designed to act the way you describe, but it does. The reason this occurs is that you have a transaction-heavy workload with occasional periods of complete quiet, and a base backup time that is much less than checkpoint_timeout. If your base backup was slower, the checkpoint would have hit naturally before recovery had reached a consistent state. Which seems fairly atypical.
I guess you're doing this on a test system. It seems cheap to add in a call to LogStandbySnapshot() after each call to pg_stop_backup(). Does anyone think this case is worth adding code for? Seems like one more thing to break. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hot Standby startup with overflowed snapshots
On Thu, Oct 27, 2011 at 6:55 PM, Simon Riggs si...@2ndquadrant.com wrote: It seems cheap to add in a call to LogStandbySnapshot() after each call to pg_stop_backup(). Does anyone think this case is worth adding code for? Seems like one more thing to break. Why at that particular time? It would maybe be nice if the master could notice when it has a plausible (non-overflowed) snapshot and log it then. But that might be more code than the problem is worth. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] hot standby startup, visibility map, clog
On Thu, Jun 9, 2011 at 5:14 AM, Daniel Farina dan...@heroku.com wrote: The first fact is that turning off hot standby will let the cluster start up, but only after seeing a spate of messages like these (a dozen or dozens, not thousands): 2011-06-09 08:02:32 UTC LOG: restored log file 0002002C00C0 from archive 2011-06-09 08:02:33 UTC WARNING: xlog min recovery request 2C/C1F09658 is past current point 2C/C037B278 2011-06-09 08:02:33 UTC CONTEXT: writing block 0 of relation base/16385/16784_vm xlog redo insert: rel 1663/16385/128029; tid 114321/63 2011-06-09 08:02:33 UTC LOG: restartpoint starting: xlog Most importantly, *all* such messages are in visibility map forks (_vm). I'm reasonably confident that my code does not start reading data until pg_start_backup() has returned, and blocks on pg_stop_backup() after having read all the data. Also, the mailing list correspondence at http://archives.postgresql.org/pgsql-hackers/2010-11/msg02034.php suggests that the visibility map is not flushed at checkpoints, so perhaps with some poor timing an old page can wander onto disk even after a checkpoint barrier that pg_start_backup waits for. (I have not yet found the critical section that makes visibilitymap buffers immune to checkpoint though). I don't believe there's anything that makes visibilitymap buffers immune to checkpoint. I might be wrong. You could probably find out by inserting a few lines of code into the buffer manager to inspect each buffer tag as the buffer is written out and elog() if the fork number is VISIBILITYMAP_FORKNUM. There is some weirdness around how the LSNs of visibility map pages are set, however. See visibilitymap_set()/clear(). Clearing a visibility map bit is done without touching the LSN at all; setting the bit advances the LSN to that of the heap page, if it currently precedes that value. It's not obvious to me whether or how that's related. I'm a bit puzzled by the warning message itself, too.
I think that in order to generate that warning, you need to try to flush a page to disk whose LSN is higher than the current minimum recovery point. And the minimum recovery point is initially set to point to the redo pointer from the last checkpoint. But during an online backup, it's expected that there may be pages on disk with LSNs higher than the redo pointer from the last checkpoint. So actually now I'm not sure why people don't see this warning whenever a lengthy recovery is required to reach consistency, especially with small values of shared_buffers. I'm probably missing something here... One thing that might make it easier to understand what's going on here is to crank the log level up to DEBUG2 and post the full output from a recovery that exhibits this problem, rather than just a few lines. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hot Standby btree delete records and vacuum_defer_cleanup_age
On Fri, Mar 11, 2011 at 4:46 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Is this an open item for 9.1? Simon fixed it, commit b9075a6d2f9b07a00262a670dd60272904c79dce. Oh, thanks! Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Hot Standby btree delete records and vacuum_defer_cleanup_age
On Thu, Dec 9, 2010 at 4:55 PM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, 2010-12-09 at 00:16 +0100, Heikki Linnakangas wrote: On 09.12.2010 00:10, Heikki Linnakangas wrote: On 08.12.2010 16:00, Simon Riggs wrote: Heikki pointed out to me that the btree delete record processing does not respect vacuum_defer_cleanup_age. It should. Attached patch to implement that. Looking to commit in next few hours barring objections/suggestions, to both HEAD and 9_0_STABLE, in time for next minor release. Please note that it was Noah Misch that raised this a while ago: http://archives.postgresql.org/pgsql-hackers/2010-11/msg01919.php On closer look, that's not actually the same issue, sorry for the noise.. Heikki, this one *is* important. Will fix. Thanks for the analysis Noah. Is this an open item for 9.1? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby btree delete records and vacuum_defer_cleanup_age
On 11.03.2011 06:21, Fujii Masao wrote: On Thu, Dec 9, 2010 at 4:55 PM, Simon Riggssi...@2ndquadrant.com wrote: On Thu, 2010-12-09 at 00:16 +0100, Heikki Linnakangas wrote: On 09.12.2010 00:10, Heikki Linnakangas wrote: On 08.12.2010 16:00, Simon Riggs wrote: Heikki pointed out to me that the btree delete record processing does not respect vacuum_defer_cleanup_age. It should. Attached patch to implement that. Looking to commit in next few hours barring objections/suggestions, to both HEAD and 9_0_STABLE, in time for next minor release. Please note that it was Noah Misch that raised this a while ago: http://archives.postgresql.org/pgsql-hackers/2010-11/msg01919.php On closer look, that's not actually the same issue, sorry for the noise.. Heikki, this one *is* important. Will fix. Thanks for the analysis Noah. Is this an open item for 9.1? Simon fixed it, commit b9075a6d2f9b07a00262a670dd60272904c79dce. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On Tue, Dec 7, 2010 at 3:42 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Ok, I've committed this patch now. I can confirm that I could continue replaying the logfiles on the standby host with this patch. Thanks a lot, Joachim
Re: [HACKERS] Hot Standby tuning for btree_xlog_vacuum()
Just wanted to say thanks for the review, since I haven't yet managed to fix and commit this. I expect to later this month. On Mon, 2010-09-27 at 23:06 -0400, Robert Haas wrote: On Thu, Apr 29, 2010 at 4:12 PM, Simon Riggs si...@2ndquadrant.com wrote: Simple tuning of btree_xlog_vacuum() using an idea I had a while back, just never implemented. XXX comments removed. Allows us to avoid reading in blocks during VACUUM replay that are only required for correctness of index scans. Review: 1. The block comment in XLogConfirmBufferIsUnpinned appears to be copied from somewhere else, and doesn't really seem appropriate for a new function since it refers to the original coding of this routine. I think you could just delete the parenthesized portion of the comment. 2. This bit from ConfirmBufferIsUnpinned looks odd to me.

+	/*
+	 * Found it. Now, pin/unpin the buffer to prove it's unpinned.
+	 */
+	if (PinBuffer(buf, NULL))
+		UnpinBuffer(buf, false);

I don't think pinning and unpinning the buffer is sufficient to prove that it isn't otherwise pinned. If the buffer isn't in shared buffers at all, it seems clear that no one can have it pinned. But if it's present in shared buffers, it seems like you still need LockBufferForCleanup(). -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Hot Standby btree delete records and vacuum_defer_cleanup_age
On 08.12.2010 16:00, Simon Riggs wrote: Heikki pointed out to me that the btree delete record processing does not respect vacuum_defer_cleanup_age. It should. Attached patch to implement that. Looking to commit in next few hours barring objections/suggestions, to both HEAD and 9_0_STABLE, in time for next minor release. Please note that it was Noah Misch who raised this a while ago: http://archives.postgresql.org/pgsql-hackers/2010-11/msg01919.php -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Hot Standby btree delete records and vacuum_defer_cleanup_age
On 09.12.2010 00:10, Heikki Linnakangas wrote: On 08.12.2010 16:00, Simon Riggs wrote: Heikki pointed out to me that the btree delete record processing does not respect vacuum_defer_cleanup_age. It should. Attached patch to implement that. Looking to commit in next few hours barring objections/suggestions, to both HEAD and 9_0_STABLE, in time for next minor release. Please note that it was Noah Misch that raised this a while ago: http://archives.postgresql.org/pgsql-hackers/2010-11/msg01919.php On closer look, that's not actually the same issue, sorry for the noise.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby btree delete records and vacuum_defer_cleanup_age
On 08.12.2010 16:00, Simon Riggs wrote: Heikki pointed out to me that the btree delete record processing does not respect vacuum_defer_cleanup_age. It should. Attached patch to implement that. This doesn't look right to me. btree_xlog_delete_get_latestRemovedXid() function calculates the latest XID present on the tuples that we're removing b-tree pointers for. btree_xlog_delete_get_latestRemovedXid() is used during recovery. vacuum_defer_cleanup_age should take effect in the master, not during recovery. With the patch, btree_xlog_delete_get_latestRemovedXid() returns a value that's much smaller than it should. That's just wrong, it means that recovery in the standby will incorrectly think that all the removed tuples are old and not visible to any running read-only queries anymore, and will go ahead and remove the index tuples for them. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby btree delete records and vacuum_defer_cleanup_age
On Thu, 2010-12-09 at 00:39 +0100, Heikki Linnakangas wrote: vacuum_defer_cleanup_age should take effect in the master, not during recovery. Hmmm, more to the point, it does take effect on the master and so there is no need for this at all. What were you thinking? What was I? Doh. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Hot Standby btree delete records and vacuum_defer_cleanup_age
On Thu, 2010-12-09 at 00:16 +0100, Heikki Linnakangas wrote: On 09.12.2010 00:10, Heikki Linnakangas wrote: On 08.12.2010 16:00, Simon Riggs wrote: Heikki pointed out to me that the btree delete record processing does not respect vacuum_defer_cleanup_age. It should. Attached patch to implement that. Looking to commit in next few hours barring objections/suggestions, to both HEAD and 9_0_STABLE, in time for next minor release. Please note that it was Noah Misch that raised this a while ago: http://archives.postgresql.org/pgsql-hackers/2010-11/msg01919.php On closer look, that's not actually the same issue, sorry for the noise.. Heikki, this one *is* important. Will fix. Thanks for the analysis Noah. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On 02.12.2010 12:31, Heikki Linnakangas wrote: On 02.12.2010 13:25, Simon Riggs wrote: On Thu, 2010-12-02 at 12:41 +0200, Heikki Linnakangas wrote: On 02.12.2010 11:02, Simon Riggs wrote: The cause of the issue is that replay starts at one LSN and there is a delay until the RunningXacts WAL record occurs. If there was no delay, there would be no issue at all. In CreateCheckpoint() we start by grabbing the WALInsertLock and later recording that pointer as part of the checkpoint record. My proposal is to replace the "grab the lock" code with the insert of the RunningXacts WAL record (when wal_level set), so that recovery always starts with that record type. Oh, interesting idea. But AFAICS closing the gap between acquiring the running-xacts snapshot and writing it to the log is sufficient, I don't see what moving the running-xacts record buys us. Does it allow some further simplifications somewhere? Your patch is quite long and you do a lot more than just alter the locking. I don't think we need those changes at all and especially would not wish to backpatch that. Most of the changes to procarray.c were about removing code that's no longer necessary when we close the gap between acquiring and writing the running-xacts WAL record. You can leave it as it is as a historical curiosity, but I'd prefer to simplify it, given that we now know that it doesn't actually work correctly if the gap is not closed. Ok, I've committed this patch now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On 01.12.2010 20:51, Heikki Linnakangas wrote: Another approach would be to revisit the way the running-xacts snapshot is taken. Currently, we first take a snapshot, and then WAL-log it. There is a small window between the steps where backends can begin/end transactions, and recovery has to deal with that. When this was designed, there was long discussion on whether we should instead grab WALInsertLock and ProcArrayLock at the same time, to ensure that the running-xacts snapshot represents an up-to-date situation at the point in WAL where it's inserted. We didn't want to do that because both locks can be heavily contended. But maybe we should after all. It would make the recovery code simpler. If we want to get fancy, we wouldn't necessarily need to hold both locks for the whole duration. We could first grab ProcArrayLock and construct the snapshot. Then grab WALInsertLock and release ProcArrayLock, and finally write the WAL record and release WALInsertLock. But that would require small changes to XLogInsert. I took a look at that approach. We don't actually need to hold ProcArrayLock while the WAL-record is written, we need to hold XidGenLock. I believe that's less severe than holding the ProcArrayLock as there's already precedence for writing a WAL record while holding that: we do that when we advance to a new clog page and write a zero-clog-page record. So this is what we should do IMHO. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 6095,6102 StartupXLOG(void) StartupSUBTRANS(oldestActiveXID); StartupMultiXact(); - ProcArrayInitRecoveryInfo(oldestActiveXID); - /* * If we're beginning at a shutdown checkpoint, we know that * nothing was running on the master at this point. 
So fake-up an --- 6095,6100 *** a/src/backend/storage/ipc/procarray.c --- b/src/backend/storage/ipc/procarray.c *** *** 435,453 ProcArrayClearTransaction(PGPROC *proc) } /* - * ProcArrayInitRecoveryInfo - * - * When trying to assemble our snapshot we only care about xids after this value. - * See comments for LogStandbySnapshot(). - */ - void - ProcArrayInitRecoveryInfo(TransactionId oldestActiveXid) - { - latestObservedXid = oldestActiveXid; - TransactionIdRetreat(latestObservedXid); - } - - /* * ProcArrayApplyRecoveryInfo -- apply recovery info about xids * * Takes us through 3 states: Initialized, Pending and Ready. --- 435,440 *** *** 519,533 ProcArrayApplyRecoveryInfo(RunningTransactions running) Assert(standbyState == STANDBY_INITIALIZED); /* ! * OK, we need to initialise from the RunningTransactionsData record ! */ ! ! /* ! * Remove all xids except xids later than the snapshot. We don't know ! * exactly which ones that is until precisely now, so that is why we allow ! * xids to be added only to remove most of them again here. */ - ExpireOldKnownAssignedTransactionIds(running-nextXid); StandbyReleaseOldLocks(running-nextXid); /* --- 506,514 Assert(standbyState == STANDBY_INITIALIZED); /* ! * Release any locks belonging to old transactions that are not ! * running according to the running-xacts record. */ StandbyReleaseOldLocks(running-nextXid); /* *** *** 536,544 ProcArrayApplyRecoveryInfo(RunningTransactions running) LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); /* - * Combine the running xact data with already known xids, if any exist. * KnownAssignedXids is sorted so we cannot just add new xids, we have to ! * combine them first, sort them and then re-add to KnownAssignedXids. * * Some of the new xids are top-level xids and some are subtransactions. * We don't call SubtransSetParent because it doesn't matter yet. If we --- 517,524 LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); /* * KnownAssignedXids is sorted so we cannot just add new xids, we have to ! 
* sort them first. * * Some of the new xids are top-level xids and some are subtransactions. * We don't call SubtransSetParent because it doesn't matter yet. If we *** *** 547,597 ProcArrayApplyRecoveryInfo(RunningTransactions running) * xids to subtrans. If RunningXacts is overflowed then we don't have * enough information to correctly update subtrans anyway. */ /* ! * Allocate a temporary array so we can combine xids. The total of both ! * arrays should never normally exceed TOTAL_MAX_CACHED_SUBXIDS. */ ! xids = palloc(sizeof(TransactionId) * TOTAL_MAX_CACHED_SUBXIDS); /* ! * Get the remaining KnownAssignedXids. In most cases there won't be any ! * at all since this exists only to catch a theoretical race condition. ! */ ! nxids = KnownAssignedXidsGet(xids, InvalidTransactionId); ! if (nxids 0) !
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On Thu, 2010-12-02 at 10:39 +0200, Heikki Linnakangas wrote: On 01.12.2010 20:51, Heikki Linnakangas wrote: Another approach would be to revisit the way the running-xacts snapshot is taken. Currently, we first take a snapshot, and then WAL-log it. There is a small window between the steps where backends can begin/end transactions, and recovery has to deal with that. When this was designed, there was long discussion on whether we should instead grab WALInsertLock and ProcArrayLock at the same time, to ensure that the running-xacts snapshot represents an up-to-date situation at the point in WAL where it's inserted. We didn't want to do that because both locks can be heavily contended. But maybe we should after all. It would make the recovery code simpler. If we want to get fancy, we wouldn't necessarily need to hold both locks for the whole duration. We could first grab ProcArrayLock and construct the snapshot. Then grab WALInsertLock and release ProcArrayLock, and finally write the WAL record and release WALInsertLock. But that would require small changes to XLogInsert. I took a look at that approach. We don't actually need to hold ProcArrayLock while the WAL-record is written, we need to hold XidGenLock. I believe that's less severe than holding the ProcArrayLock as there's already precedence for writing a WAL record while holding that: we do that when we advance to a new clog page and write a zero-clog-page record. So this is what we should do IMHO. Oh, thanks for looking at this. I've been looking at this also and as we might expect had a slightly different design. First, your assessment of the locking above is better than mine. I agree with your analysis so we should do it that way. The locking issue was the reason I haven't patched this yet so I'm glad you've improved this. 
In terms of the rest of the patch, it seems we have different designs; I think I have a much simpler, less invasive solution: The cause of the issue is that replay starts at one LSN and there is a delay until the RunningXacts WAL record occurs. If there was no delay, there would be no issue at all. In CreateCheckpoint() we start by grabbing the WALInsertLock and later recording that pointer as part of the checkpoint record. My proposal is to replace the grab the lock code with the insert of the RunningXacts WAL record (when wal_level set), so that recovery always starts with that record type. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On 02.12.2010 11:02, Simon Riggs wrote: The cause of the issue is that replay starts at one LSN and there is a delay until the RunningXacts WAL record occurs. If there was no delay, there would be no issue at all. In CreateCheckpoint() we start by grabbing the WALInsertLock and later recording that pointer as part of the checkpoint record. My proposal is to replace the grab the lock code with the insert of the RunningXacts WAL record (when wal_level set), so that recovery always starts with that record type. Oh, interesting idea. But AFAICS closing the gap between acquiring the running-xacts snapshot and writing it to the log is sufficient, I don't see what moving the running-xacts record buys us. Does it allow some further simplifications somewhere? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On Thu, 2010-12-02 at 12:41 +0200, Heikki Linnakangas wrote: On 02.12.2010 11:02, Simon Riggs wrote: The cause of the issue is that replay starts at one LSN and there is a delay until the RunningXacts WAL record occurs. If there was no delay, there would be no issue at all. In CreateCheckpoint() we start by grabbing the WAInsertLock and later recording that pointer as part of the checkpoint record. My proposal is to replace the grab the lock code with the insert of the RunningXacts WAL record (when wal_level set), so that recovery always starts with that record type. Oh, interesting idea. But AFAICS closing the gap between acquiring the running-xacts snapshot and writing it to the log is sufficient, I don't see what moving the running-xacts record buys us. Does it allow some further simplifications somewhere? Your patch is quite long and you do a lot more than just alter the locking. I don't think we need those changes at all and especially would not wish to backpatch that. Earlier on this thread, we discussed: On Wed, 2010-11-24 at 15:19 +, Simon Riggs wrote: On Wed, 2010-11-24 at 12:48 +0200, Heikki Linnakangas wrote: When recovery starts, we fetch the oldestActiveXid from the checkpoint record. Let's say that it's 100. We then start replaying WAL records from the Redo pointer, and the first record (heap insert in your case) contains an Xid that's much larger than 100, say 1. We call RecordKnownAssignedXids() to make note that all xids between that range are in-progress, but there isn't enough room in the array for that. Agreed. The current code fails because of the gap between the redo pointer and the XLOG_RUNNING_XACTS WAL record. If there is no gap, there is no problem. So my preferred solution would: * Log XLOG_RUNNING_XACTS while holding XidGenLock, as you suggest * Move logging to occur at the Redo pointer That is a much smaller patch with a smaller footprint. 
-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On 02.12.2010 13:25, Simon Riggs wrote: On Thu, 2010-12-02 at 12:41 +0200, Heikki Linnakangas wrote: On 02.12.2010 11:02, Simon Riggs wrote: The cause of the issue is that replay starts at one LSN and there is a delay until the RunningXacts WAL record occurs. If there was no delay, there would be no issue at all. In CreateCheckpoint() we start by grabbing the WAInsertLock and later recording that pointer as part of the checkpoint record. My proposal is to replace the grab the lock code with the insert of the RunningXacts WAL record (when wal_level set), so that recovery always starts with that record type. Oh, interesting idea. But AFAICS closing the gap between acquiring the running-xacts snapshot and writing it to the log is sufficient, I don't see what moving the running-xacts record buys us. Does it allow some further simplifications somewhere? Your patch is quite long and you do a lot more than just alter the locking. I don't think we need those changes at all and especially would not wish to backpatch that. Most of the changes to procarray.c were about removing code that's no longer necessary when we close the gap between acquiring and writing the running-xacts WAL record. You can leave it as it is as a historical curiosity, but I'd prefer to simplify it, given that we now know that it doesn't actually work correctly if the gap is not closed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On 24.11.2010 12:48, Heikki Linnakangas wrote: When recovery starts, we fetch the oldestActiveXid from the checkpoint record. Let's say that it's 100. We then start replaying WAL records from the Redo pointer, and the first record (heap insert in your case) contains an Xid that's much larger than 100, say 1. We call RecordKnownAssignedXids() to make note that all xids between that range are in-progress, but there isn't enough room in the array for that. We normally get away with a smallish array because the array is trimmed at commit and abort records, and the special xid-assignment record to handle the case of a lot of subtransactions. We initialize the array from the running-xacts record that's written at a checkpoint. That mechanism fails in this case because the heap insert record is seen before the running-xacts record, causing all those xids in the range 100-1 to be considered in-progress. The running-xacts record that comes later would prune them, but we don't have enough slots to hold them until that. Hmm. I'm not sure off the top of my head how to fix that. Perhaps stash the xids we see during WAL replay in private memory instead of putting them in the KnownAssignedXids array until we see the running-xacts record. So, here's a patch using that approach. Another approach would be to revisit the way the running-xacts snapshot is taken. Currently, we first take a snapshot, and then WAL-log it. There is a small window between the steps where backends can begin/end transactions, and recovery has to deal with that. When this was designed, there was long discussion on whether we should instead grab WALInsertLock and ProcArrayLock at the same time, to ensure that the running-xacts snapshot represents an up-to-date situation at the point in WAL where it's inserted. We didn't want to do that because both locks can be heavily contended. But maybe we should after all. It would make the recovery code simpler. 
If we want to get fancy, we wouldn't necessarily need to hold both locks for the whole duration. We could first grab ProcArrayLock and construct the snapshot. Then grab WALInsertLock and release ProcArrayLock, and finally write the WAL record and release WALInsertLock. But that would require small changes to XLogInsert. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 6095,6102 StartupXLOG(void) StartupSUBTRANS(oldestActiveXID); StartupMultiXact(); - ProcArrayInitRecoveryInfo(oldestActiveXID); - /* * If we're beginning at a shutdown checkpoint, we know that * nothing was running on the master at this point. So fake-up an --- 6095,6100 *** a/src/backend/storage/ipc/procarray.c --- b/src/backend/storage/ipc/procarray.c *** *** 435,453 ProcArrayClearTransaction(PGPROC *proc) } /* - * ProcArrayInitRecoveryInfo - * - * When trying to assemble our snapshot we only care about xids after this value. - * See comments for LogStandbySnapshot(). - */ - void - ProcArrayInitRecoveryInfo(TransactionId oldestActiveXid) - { - latestObservedXid = oldestActiveXid; - TransactionIdRetreat(latestObservedXid); - } - - /* * ProcArrayApplyRecoveryInfo -- apply recovery info about xids * * Takes us through 3 states: Initialized, Pending and Ready. --- 435,440 *** *** 519,533 ProcArrayApplyRecoveryInfo(RunningTransactions running) Assert(standbyState == STANDBY_INITIALIZED); /* ! * OK, we need to initialise from the RunningTransactionsData record ! */ ! ! /* ! * Remove all xids except xids later than the snapshot. We don't know ! * exactly which ones that is until precisely now, so that is why we allow ! * xids to be added only to remove most of them again here. */ - ExpireOldKnownAssignedTransactionIds(running-nextXid); StandbyReleaseOldLocks(running-nextXid); /* --- 506,514 Assert(standbyState == STANDBY_INITIALIZED); /* ! * Release any locks belonging to old transactions that are not ! 
* running according to the running-xacts record. */ StandbyReleaseOldLocks(running-nextXid); /* *** *** 536,544 ProcArrayApplyRecoveryInfo(RunningTransactions running) LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); /* - * Combine the running xact data with already known xids, if any exist. * KnownAssignedXids is sorted so we cannot just add new xids, we have to ! * combine them first, sort them and then re-add to KnownAssignedXids. * * Some of the new xids are top-level xids and some are subtransactions. * We don't call SubtransSetParent because it doesn't matter yet. If we --- 517,524 LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); /* * KnownAssignedXids is sorted so we cannot just add new xids, we have to ! * sort them first. * * Some of the new xids are
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On 24.11.2010 06:56, Joachim Wieland wrote: On Tue, Nov 23, 2010 at 8:45 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 19.11.2010 23:46, Joachim Wieland wrote: FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978, pArray-maxKnownAssignedXids: 6890 Hmm, that's a lot of entries in KnownAssignedXids. Can you recompile with WAL_DEBUG, and run the recovery again with wal_debug=on ? That will print all the replayed WAL records, which is a lot of data, but it might give a hint what's going on. Sure, but this gives me only one more line: [...] LOG: redo starts at 1F8/FC00E978 LOG: REDO @ 1F8/FC00E978; LSN 1F8/FC00EE90: prev 1F8/FC00E930; xid 385669; len 21; bkpb1: Heap - insert: rel 1663/16384/18373; tid 3829898/23 FATAL: too many KnownAssignedXids CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23 LOG: startup process (PID 4587) exited with exit code 1 LOG: terminating any other active server processes Thanks, I can reproduce this now. This happens when you have a wide gap between the oldest still active xid and the latest xid. When recovery starts, we fetch the oldestActiveXid from the checkpoint record. Let's say that it's 100. We then start replaying WAL records from the Redo pointer, and the first record (heap insert in your case) contains an Xid that's much larger than 100, say 1. We call RecordKnownAssignedXids() to make note that all xids between that range are in-progress, but there isn't enough room in the array for that. We normally get away with a smallish array because the array is trimmed at commit and abort records, and the special xid-assignment record to handle the case of a lot of subtransactions. We initialize the array from the running-xacts record that's written at a checkpoint. That mechanism fails in this case because the heap insert record is seen before the running-xacts record, causing all those xids in the range 100-1 to be considered in-progress. 
The running-xacts record that comes later would prune them, but we don't have enough slots to hold them until that. Hmm. I'm not sure off the top of my head how to fix that. Perhaps stash the xids we see during WAL replay in private memory instead of putting them in the KnownAssignedXids array until we see the running-xacts record. To reproduce this, I did this in the master: postgres=# CREATE FUNCTION insertfunc(n integer) RETURNS VOID AS $$ declare i integer; begin FOR i IN 1..n LOOP BEGIN INSERT INTO foo VALUES (1); EXCEPTION WHEN division_by_zero THEN RAISE NOTICE 'divbyzero'; END; END LOOP; end; $$ LANGUAGE plpgsql; postgres=# SELECT insertfunc(1); After letting that run for a while, so that a couple of checkpoints have occurred, kill the master and start standby to recover that from archive. After it has replayed all the WAL, stop the standby and restart it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On 24.11.2010 12:48, Heikki Linnakangas wrote: On 24.11.2010 06:56, Joachim Wieland wrote: On Tue, Nov 23, 2010 at 8:45 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 19.11.2010 23:46, Joachim Wieland wrote: FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978, pArray-maxKnownAssignedXids: 6890 Hmm, that's a lot of entries in KnownAssignedXids. Can you recompile with WAL_DEBUG, and run the recovery again with wal_debug=on ? That will print all the replayed WAL records, which is a lot of data, but it might give a hint what's going on. Sure, but this gives me only one more line: [...] LOG: redo starts at 1F8/FC00E978 LOG: REDO @ 1F8/FC00E978; LSN 1F8/FC00EE90: prev 1F8/FC00E930; xid 385669; len 21; bkpb1: Heap - insert: rel 1663/16384/18373; tid 3829898/23 FATAL: too many KnownAssignedXids CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23 LOG: startup process (PID 4587) exited with exit code 1 LOG: terminating any other active server processes Thanks, I can reproduce this now. This happens when you have a wide gap between the oldest still active xid and the latest xid. When recovery starts, we fetch the oldestActiveXid from the checkpoint record. Let's say that it's 100. We then start replaying WAL records from the Redo pointer, and the first record (heap insert in your case) contains an Xid that's much larger than 100, say 1. We call RecordKnownAssignedXids() to make note that all xids between that range are in-progress, but there isn't enough room in the array for that. We normally get away with a smallish array because the array is trimmed at commit and abort records, and the special xid-assignment record to handle the case of a lot of subtransactions. We initialize the array from the running-xacts record that's written at a checkpoint. That mechanism fails in this case because the heap insert record is seen before the running-xacts record, causing all those xids in the range 100-1 to be considered in-progress. 
The running-xacts record that comes later would prune them, but we don't have enough slots to hold them until that. Hmm. I'm not sure off the top of my head how to fix that. Perhaps stash the xids we see during WAL replay in private memory instead of putting them in the KnownAssignedXids array until we see the running-xacts record. Looking closer at RecordKnownAssignedTransactionIds(), there's a related much more serious bug there too. When latestObservedXid is initialized to the oldest still-running xid, oldestActiveXid, at WAL recovery, we zero the CLOG starting from the oldestActiveXid. That will zap away the clog bits of any old transactions that had already committed before the checkpoint started, but were younger than the oldest still running transaction. The transactions will be lost :-(. It's dangerous to initialize latestObservedXid to anything to an older value. The idea of keeping the seen xids in a temporary list private to the startup process until the running-xacts record would solve that problem too. ProcArrayInitRecoveryInfo() would not be needed anymore, the KnownAssignedXids tracking would start at the first running-xacts record (or shutdown checkpoint) we see, not any sooner than that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On 24.11.2010 13:38, Heikki Linnakangas wrote: It's dangerous to initialize latestObservedXid to anything to an older value. older value than the nextXid-1 from the checkpoint record, I meant to say. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On Wed, 2010-11-24 at 12:48 +0200, Heikki Linnakangas wrote: When recovery starts, we fetch the oldestActiveXid from the checkpoint record. Let's say that it's 100. We then start replaying WAL records from the Redo pointer, and the first record (heap insert in your case) contains an Xid that's much larger than 100, say 1. We call RecordKnownAssignedXids() to make note that all xids between that range are in-progress, but there isn't enough room in the array for that. Agreed. Hmm. I'm not sure off the top of my head how to fix that. Perhaps stash the xids we see during WAL replay in private memory instead of putting them in the KnownAssignedXids array until we see the running-xacts record. Moving LogStandbySnapshot() earlier will help but won't solve it fully. Will think. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On Tue, Nov 23, 2010 at 8:45 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 19.11.2010 23:46, Joachim Wieland wrote: FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978, pArray->maxKnownAssignedXids: 6890 Hmm, that's a lot of entries in KnownAssignedXids. Can you recompile with WAL_DEBUG, and run the recovery again with wal_debug=on ? That will print all the replayed WAL records, which is a lot of data, but it might give a hint what's going on. Sure, but this gives me only one more line: [...] LOG: redo starts at 1F8/FC00E978 LOG: REDO @ 1F8/FC00E978; LSN 1F8/FC00EE90: prev 1F8/FC00E930; xid 385669; len 21; bkpb1: Heap - insert: rel 1663/16384/18373; tid 3829898/23 FATAL: too many KnownAssignedXids CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23 LOG: startup process (PID 4587) exited with exit code 1 LOG: terminating any other active server processes Joachim
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On Sun, Nov 21, 2010 at 11:48 PM, Fujii Masao masao.fu...@gmail.com wrote: -- If you suspect a bug in Hot Standby, please set trace_recovery_messages = DEBUG2 in postgresql.conf and repeat the action Always useful to know * max_connections * current number of sessions * whether we have two phase commits happening -- The trace_recovery_messages parameter does not give more output... max_connections is set to 100 there have been no sessions on the standby itself, but a few on the primary database, I don't know how much but probably not more than 10. The sessions there were doing quite some load however, among them slony synchronization, since the hot standby master database was actually a slony replica. max_prepared_transactions has not been changed from its default value of 0. Joachim
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On 19.11.2010 23:46, Joachim Wieland wrote: FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978, pArray->maxKnownAssignedXids: 6890 Hmm, that's a lot of entries in KnownAssignedXids. Can you recompile with WAL_DEBUG, and run the recovery again with wal_debug=on ? That will print all the replayed WAL records, which is a lot of data, but it might give a hint what's going on. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Hot Standby: too many KnownAssignedXids
On Sat, Nov 20, 2010 at 6:46 AM, Joachim Wieland j...@mcknight.de wrote: I still have the server, if you want me to debug anything or send a patch against 9.0.1 that gives more output, just let me know. Per previous Simon's comment, the following information would be useful. http://archives.postgresql.org/pgsql-general/2010-10/msg00154.php -- If you suspect a bug in Hot Standby, please set trace_recovery_messages = DEBUG2 in postgresql.conf and repeat the action Always useful to know * max_connections * current number of sessions * whether we have two phase commits happening -- Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Hot Standby b-tree delete records review
On Tue, 2010-11-09 at 13:34 +0200, Heikki Linnakangas wrote: (cleaning up my inbox, and bumped into this..) On 22.04.2010 12:31, Simon Riggs wrote: On Thu, 2010-04-22 at 12:18 +0300, Heikki Linnakangas wrote: Simon Riggs wrote: On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote: If none of the removed heap tuples were present anymore, we currently return InvalidTransactionId, which kills/waits out all read-only queries. But if none of the tuples were present anymore, the read-only queries wouldn't have seen them anyway, so ISTM that we should treat InvalidTransactionId return value as we don't need to kill anyone. That's not the point. The tuples were not themselves the sole focus, Yes, they were. We're replaying a b-tree deletion record, which removes pointers to some heap tuples, making them unreachable to any read-only queries. If any of them still need to be visible to read-only queries, we have a conflict. But if all of the heap tuples are gone already, removing the index pointers to them can't change the situation for any query. If any of them should've been visible to a query, the damage was done already by whoever pruned the heap tuples leaving just the tombstone LP_DEAD item pointers (in the heap) behind. You're missing my point. Those tuples are indicators of what may lie elsewhere in the database, completely unreferenced by this WAL record. Just because these referenced tuples are gone doesn't imply that all tuple versions written by the as yet-unknown-xids are also gone. We can't infer anything about the whole database just from one small group of records. Have you got an example of that? I don't need one, I have suggested the safe route. In order to infer anything, and thereby further optimise things, we would need proof that no cases can exist, which I don't have. Perhaps we can add yet, not sure about that either.
It's good to be safe rather than sorry, but I'd still like to know because I'm quite surprised by that, and got me worried that I don't understand how hot standby works as well as I thought I did. I thought the point of stopping replay/killing queries at a b-tree deletion record is precisely that it makes some heap tuples invisible to running read-only queries. If it doesn't make any tuples invisible, why do any queries need to be killed? And why was it OK for them to be running just before replaying the b-tree deletion record? I'm sorry but I'm too busy to talk further on this today. Since we are discussing a further optimisation rather than a bug, I hope it is OK to come back to this again later. Would now be a good time to revisit this? I still don't see why a b-tree deletion record should conflict with anything, if all the removed index tuples point to just LP_DEAD tombstones in the heap. I want what you say to be true. The question is: is it? We just need to explain why that will never be a problem. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services