Re: [PATCHES] [HACKERS] Subtransaction commits and Hot Standby
On Thu, 2008-09-18 at 15:59 +0100, Simon Riggs wrote: > On Tue, 2008-09-16 at 10:11 -0400, Alvaro Herrera wrote: > > > I wonder if the improved clog API required to mark multiple > > transactions as committed at once would be also useful to > > TransactionIdCommitTree which is used in regular transaction commit. > > I've hacked together this concept patch (WIP). > > Not fully tested yet, but it gives a flavour of the API rearrangements > required for atomic clog updates. It passes make check, but that's not > saying enough for a serious review yet. I expect to pick this up again > next week. I've tested this some more and am much happier with it now. Also added README details; there are no user interface or behaviour changes. The patch removes the need for RecordSubTransactionCommit() which * reduces response times of subtransaction queries because we are able to apply these changes in batches at commit time. This requires a batch-style API that now works atomically, so there is much change in transam.c * reduces the path length for visibility tests for all users viewing concurrent subtransaction activity since we are much less likely to waste time following a long trail to an uncommitted higher-level transaction * removes the need for additional WAL logging to implement subtransaction commits for Hot Standby So half the patch is refactoring, half rearranging of clog access functions to support batched-access. An early review would greatly help my work on Hot Standby. Thanks. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support Index: src/backend/access/transam/README === RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/README,v retrieving revision 1.11 diff -c -r1.11 README *** src/backend/access/transam/README 21 Mar 2008 13:23:28 - 1.11 --- src/backend/access/transam/README 23 Sep 2008 21:23:02 - *** *** 342,351 an XID. A transaction can be in progress, committed, aborted, or "sub-committed". This last state means that it's a subtransaction that's no longer running, but its parent has not updated its state yet (either it is ! still running, or the backend crashed without updating its status). A ! sub-committed transaction's status will be updated again to the final value as ! soon as the parent commits or aborts, or when the parent is detected to be ! aborted. Savepoints are implemented using subtransactions. A subtransaction is a transaction inside a transaction; its commit or abort status is not only --- 342,360 an XID. A transaction can be in progress, committed, aborted, or "sub-committed". This last state means that it's a subtransaction that's no longer running, but its parent has not updated its state yet (either it is ! still running, or the backend crashed without updating its status). Prior ! to 8.4 we updated the status to sub-committed in clog as soon as ! sub-commit had happened. It was later realised that this is not actually ! required for any purpose and the action can be deferred until transaction ! commit. The main role of marking transactions as sub-committed is to ! provide an atomic commit protocol when transaction status is spread across ! multiple clog pages. As a result whenever transaction status spreads ! across multiple pages we must use a two-phase commit protocol. The first ! phase is to mark the subtransactions as sub-committed, then we mark the ! top level transaction and all its subtransactions committed (in that order). ! So in 8.4 sub-committed state still exists, but as a transitory state as ! part of final commit. Subtransaction abort is always marked in clog as ! soon as it occurs, to allow locks to be released. Savepoints are implemented using subtransactions. A subtransaction is a transaction inside a transaction; its commit or abort status is not only Index: src/backend/access/transam/clog.c === RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/clog.c,v retrieving revision 1.47 diff -c -r1.47 clog.c *** src/backend/access/transam/clog.c 1 Aug 2008 13:16:08 - 1.47 --- src/backend/access/transam/clog.c 23 Sep 2008 20:41:17 - *** *** 80,89 static bool CLOGPagePrecedes(int page1, int page2); static void WriteZeroPageXlogRec(int pageno); static void WriteTruncateXlogRec(int pageno); ! ! ! /* ! * Record the final state of a transaction in the commit log. * * lsn must be the WAL location of the commit record when recording an async * commit. For a synchronous commit it can be InvalidXLogRecPtr, since the --- 80,105 static bool CLOGPagePrecedes(int page1, int page2); static void WriteZeroPageXlogRec(int pageno); static void WriteTruncateXlogRec(int pageno); ! static void TransactionIdSetPageStatus(TransactionId xid, int nsubxids, ! Transa
Re: [PATCHES] [HACKERS] Infrastructure changes for recovery
On Mon, 2008-09-22 at 23:06 +0100, Simon Riggs wrote: > On Thu, 2008-09-18 at 10:09 -0400, Tom Lane wrote: > > Simon Riggs <[EMAIL PROTECTED]> writes: > > > On Thu, 2008-09-18 at 09:06 -0400, Tom Lane wrote: > > >> Do we really need a checkpoint there at all? > > > > > "Timelines only change at shutdown checkpoints". > > > > Hmm. I *think* that that is just a debugging crosscheck rather than a > > critical property. But yeah, it would take some close investigation, > > which maybe isn't warranted if you have a less-invasive solution. > > OK, new patch, version 6. Some major differences to previous patch. > Ready for serious review prior to commit. I will be performing further > testing also. Version 7 I've removed the concept of interrupting a restartpoint half way through, I found a fault there. It was more ugly than the alternative and less robust. The code now waits at the end of recovery if we are in the middle of a restartpoint, but forces a do-it-more-quickly also. That means we won't always get a fast start even though we skip the shutdown checkpoint, but at least we're sure there's no chance of breakage because of concurrent activiy, state changes etc.. I'm happy with this now. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support Index: src/backend/access/transam/multixact.c === RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/multixact.c,v retrieving revision 1.28 diff -c -r1.28 multixact.c *** src/backend/access/transam/multixact.c 1 Aug 2008 13:16:08 - 1.28 --- src/backend/access/transam/multixact.c 22 Sep 2008 19:28:56 - *** *** 1543,1549 * SimpleLruTruncate would get confused. It seems best not to risk * removing any data during recovery anyway, so don't truncate. */ ! if (!InRecovery) TruncateMultiXact(); TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true); --- 1543,1549 * SimpleLruTruncate would get confused. It seems best not to risk * removing any data during recovery anyway, so don't truncate. */ ! if (!IsRecoveryProcessingMode()) TruncateMultiXact(); TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true); Index: src/backend/access/transam/xlog.c === RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xlog.c,v retrieving revision 1.317 diff -c -r1.317 xlog.c *** src/backend/access/transam/xlog.c 11 Aug 2008 11:05:10 - 1.317 --- src/backend/access/transam/xlog.c 23 Sep 2008 14:56:37 - *** *** 66,76 bool fullPageWrites = true; bool log_checkpoints = false; int sync_method = DEFAULT_SYNC_METHOD; - #ifdef WAL_DEBUG bool XLOG_DEBUG = false; #endif - /* * XLOGfileslop is the maximum number of preallocated future XLOG segments. * When we are done with an old XLOG segment file, we will recycle it as a --- 66,74 *** *** 119,124 --- 117,123 /* Are we doing recovery from XLOG? */ bool InRecovery = false; + bool reachedSafeStopPoint = false; /* Are we recovering using offline XLOG archives? */ static bool InArchiveRecovery = false; *** *** 131,137 static bool recoveryTarget = false; static bool recoveryTargetExact = false; static bool recoveryTargetInclusive = true; - static bool recoveryLogRestartpoints = false; static TransactionId recoveryTargetXid; static TimestampTz recoveryTargetTime; static TimestampTz recoveryLastXTime = 0; --- 130,135 *** *** 286,295 --- 284,295 /* * Total shared-memory state for XLOG. */ + #define XLOGCTL_BUFFER_SPACING 128 typedef struct XLogCtlData { /* Protected by WALInsertLock: */ XLogCtlInsert Insert; + char InsertPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlInsert)]; /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; *** *** 297,305 --- 297,312 uint32 ckptXidEpoch; /* nextXID & epoch of latest checkpoint */ TransactionId ckptXid; XLogRecPtr asyncCommitLSN; /* LSN of newest async commit */ + /* add data structure padding for above info_lck declarations */ + char InfoPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogwrtRqst) + - sizeof(XLogwrtResult) + - sizeof(uint32) + - sizeof(TransactionId) + - sizeof(XLogRecPtr)]; /* Protected by WALWriteLock: */ XLogCtlWrite Write; + char WritePadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlWrite)]; /* * These values do not change after startup, although the pointed-to pages *** *** 311,316 --- 318,342 int XLogCacheBlck; /* highest allocated xlog buffer index */ TimeLineID ThisTimeLineID; + /* + * IsRecoveryProcessingMode shows whether the postmaster is in a + * postmaster state earlier than PM_RUN, or not. This is a globally + * accessib
Re: [PATCHES] hash index improving v3
On Tue, 2008-09-23 at 09:13 -0400, Tom Lane wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: > > Thinks: Why not just sort all of the time and skip the debate entirely? > > The sort is demonstrably a loser for smaller indexes. Admittedly, > if the index is small then the sort can't cost all that much, but if > the (correct) threshold is some large fraction of shared_buffers then > it could still take awhile on installations with lots-o-buffers. The other realisation is that for large indexes, giving them more maintenance_work_mem probably will make them build faster 'cos we'll be sorting. So "give big indexes more memory" is still true *enough* to be broadly consistent, explainable and understandable. I do explain things in more detail on some courses, but pithy rules help busy people. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Tue, 2008-09-23 at 09:13 -0400, Tom Lane wrote: > The other side of that coin is that it's not clear this is really worth > arguing about, much less exposing a separate parameter for. Agreed. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
Simon Riggs <[EMAIL PROTECTED]> writes: > Thinks: Why not just sort all of the time and skip the debate entirely? The sort is demonstrably a loser for smaller indexes. Admittedly, if the index is small then the sort can't cost all that much, but if the (correct) threshold is some large fraction of shared_buffers then it could still take awhile on installations with lots-o-buffers. The other side of that coin is that it's not clear this is really worth arguing about, much less exposing a separate parameter for. regards, tom lane -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
On Tue, 2008-09-23 at 08:16 -0400, Tom Lane wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: > > maintenance_work_mem is already used for 3 separate operations that bear > > little resemblance to each other. If it's appropriate for all of those > > then its appropriate for this usage also. > > No, it isn't. > > The fundamental point here is that this isn't a memory allocation > parameter; it's a switchover threshold between two different behaviors. That's a little confusing since sorts switch their behaviour also, but use (some form of) work_mem, which is *also* their max allocation. I see the difficulty in understanding the algorithm's behaviour now. So shared_buffers is the wrong parameter, but even if we had a parameter it would be very difficult to set it. Thinks: Why not just sort all of the time and skip the debate entirely? I thought the main use case was for larger indexes, since that's when the number of levels in the index is significantly less than btrees? Do we need to optimise creation time of smaller hash indexes at all? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] hash index improving v3
Simon Riggs <[EMAIL PROTECTED]> writes: > maintenance_work_mem is already used for 3 separate operations that bear > little resemblance to each other. If it's appropriate for all of those > then its appropriate for this usage also. No, it isn't. The fundamental point here is that this isn't a memory allocation parameter; it's a switchover threshold between two different behaviors. regards, tom lane -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches