Re: [PATCHES] [HACKERS] Subtransaction commits and Hot Standby

2008-09-23 Thread Simon Riggs

On Thu, 2008-09-18 at 15:59 +0100, Simon Riggs wrote:
> On Tue, 2008-09-16 at 10:11 -0400, Alvaro Herrera wrote:
> 
> > I wonder if the improved clog API required to mark multiple
> > transactions as committed at once would be also useful to
> > TransactionIdCommitTree which is used in regular transaction commit.
> 
> I've hacked together this concept patch (WIP).
> 
> Not fully tested yet, but it gives a flavour of the API rearrangements
> required for atomic clog updates. It passes make check, but that's not
> saying enough for a serious review yet. I expect to pick this up again
> next week.

I've tested this some more and am much happier with it now.
Also added README details; there are no user interface or behaviour
changes.

The patch removes the need for RecordSubTransactionCommit() which

* reduces response times of subtransaction queries because we are able
to apply these changes in batches at commit time. This requires a
batch-style API that now works atomically, so there is much change in
transam.c

* reduces the path length for visibility tests for all users viewing
concurrent subtransaction activity since we are much less likely to
waste time following a long trail to an uncommitted higher-level
transaction

* removes the need for additional WAL logging to implement
subtransaction commits for Hot Standby

So half the patch is refactoring, half rearranging of clog access
functions to support batched-access.

An early review would greatly help my work on Hot Standby. Thanks.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Index: src/backend/access/transam/README
===
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/README,v
retrieving revision 1.11
diff -c -r1.11 README
*** src/backend/access/transam/README	21 Mar 2008 13:23:28 -	1.11
--- src/backend/access/transam/README	23 Sep 2008 21:23:02 -
***
*** 342,351 
  an XID.  A transaction can be in progress, committed, aborted, or
  "sub-committed".  This last state means that it's a subtransaction that's no
  longer running, but its parent has not updated its state yet (either it is
! still running, or the backend crashed without updating its status).  A
! sub-committed transaction's status will be updated again to the final value as
! soon as the parent commits or aborts, or when the parent is detected to be
! aborted.
  
  Savepoints are implemented using subtransactions.  A subtransaction is a
  transaction inside a transaction; its commit or abort status is not only
--- 342,360 
  an XID.  A transaction can be in progress, committed, aborted, or
  "sub-committed".  This last state means that it's a subtransaction that's no
  longer running, but its parent has not updated its state yet (either it is
! still running, or the backend crashed without updating its status).  Prior
! to 8.4 we updated the status to sub-committed in clog as soon as
! sub-commit had happened.  It was later realised that this is not actually
! required for any purpose and the action can be deferred until transaction
! commit. The main role of marking transactions as sub-committed is to 
! provide an atomic commit protocol when transaction status is spread across
! multiple clog pages. As a result whenever transaction status spreads
! across multiple pages we must use a two-phase commit protocol. The first
! phase is to mark the subtransactions as sub-committed, then we mark the
! top level transaction and all its subtransactions committed (in that order).
! So in 8.4 sub-committed state still exists, but as a transitory state as
! part of final commit. Subtransaction abort is always marked in clog as
! soon as it occurs, to allow locks to be released.
  
  Savepoints are implemented using subtransactions.  A subtransaction is a
  transaction inside a transaction; its commit or abort status is not only
Index: src/backend/access/transam/clog.c
===
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/clog.c,v
retrieving revision 1.47
diff -c -r1.47 clog.c
*** src/backend/access/transam/clog.c	1 Aug 2008 13:16:08 -	1.47
--- src/backend/access/transam/clog.c	23 Sep 2008 20:41:17 -
***
*** 80,89 
  static bool CLOGPagePrecedes(int page1, int page2);
  static void WriteZeroPageXlogRec(int pageno);
  static void WriteTruncateXlogRec(int pageno);
! 
! 
! /*
!  * Record the final state of a transaction in the commit log.
   *
   * lsn must be the WAL location of the commit record when recording an async
   * commit.	For a synchronous commit it can be InvalidXLogRecPtr, since the
--- 80,105 
  static bool CLOGPagePrecedes(int page1, int page2);
  static void WriteZeroPageXlogRec(int pageno);
  static void WriteTruncateXlogRec(int pageno);
! static void TransactionIdSetPageStatus(TransactionId xid, int nsubxids, 
! 	Transa

Re: [PATCHES] [HACKERS] Infrastructure changes for recovery

2008-09-23 Thread Simon Riggs

On Mon, 2008-09-22 at 23:06 +0100, Simon Riggs wrote:
> On Thu, 2008-09-18 at 10:09 -0400, Tom Lane wrote:
> > Simon Riggs <[EMAIL PROTECTED]> writes:
> > > On Thu, 2008-09-18 at 09:06 -0400, Tom Lane wrote:
> > >> Do we really need a checkpoint there at all?
> > 
> > > "Timelines only change at shutdown checkpoints".
> > 
> > Hmm.  I *think* that that is just a debugging crosscheck rather than a
> > critical property.  But yeah, it would take some close investigation,
> > which maybe isn't warranted if you have a less-invasive solution.
> 
> OK, new patch, version 6. Some major differences to previous patch.

> Ready for serious review prior to commit. I will be performing further
> testing also.

Version 7

I've removed the concept of interrupting a restartpoint half way
through, I found a fault there. It was more ugly than the alternative
and less robust. The code now waits at the end of recovery if we are in
the middle of a restartpoint, but forces a do-it-more-quickly also. That
means we won't always get a fast start even though we skip the shutdown
checkpoint, but at least we're sure there's no chance of breakage
because of concurrent activiy, state changes etc..

I'm happy with this now.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Index: src/backend/access/transam/multixact.c
===
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/multixact.c,v
retrieving revision 1.28
diff -c -r1.28 multixact.c
*** src/backend/access/transam/multixact.c	1 Aug 2008 13:16:08 -	1.28
--- src/backend/access/transam/multixact.c	22 Sep 2008 19:28:56 -
***
*** 1543,1549 
  	 * SimpleLruTruncate would get confused.  It seems best not to risk
  	 * removing any data during recovery anyway, so don't truncate.
  	 */
! 	if (!InRecovery)
  		TruncateMultiXact();
  
  	TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
--- 1543,1549 
  	 * SimpleLruTruncate would get confused.  It seems best not to risk
  	 * removing any data during recovery anyway, so don't truncate.
  	 */
! 	if (!IsRecoveryProcessingMode())
  		TruncateMultiXact();
  
  	TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
Index: src/backend/access/transam/xlog.c
===
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.317
diff -c -r1.317 xlog.c
*** src/backend/access/transam/xlog.c	11 Aug 2008 11:05:10 -	1.317
--- src/backend/access/transam/xlog.c	23 Sep 2008 14:56:37 -
***
*** 66,76 
  bool		fullPageWrites = true;
  bool		log_checkpoints = false;
  int 		sync_method = DEFAULT_SYNC_METHOD;
- 
  #ifdef WAL_DEBUG
  bool		XLOG_DEBUG = false;
  #endif
- 
  /*
   * XLOGfileslop is the maximum number of preallocated future XLOG segments.
   * When we are done with an old XLOG segment file, we will recycle it as a
--- 66,74 
***
*** 119,124 
--- 117,123 
  
  /* Are we doing recovery from XLOG? */
  bool		InRecovery = false;
+ bool		reachedSafeStopPoint = false;
  
  /* Are we recovering using offline XLOG archives? */
  static bool InArchiveRecovery = false;
***
*** 131,137 
  static bool recoveryTarget = false;
  static bool recoveryTargetExact = false;
  static bool recoveryTargetInclusive = true;
- static bool recoveryLogRestartpoints = false;
  static TransactionId recoveryTargetXid;
  static TimestampTz recoveryTargetTime;
  static TimestampTz recoveryLastXTime = 0;
--- 130,135 
***
*** 286,295 
--- 284,295 
  /*
   * Total shared-memory state for XLOG.
   */
+ #define	XLOGCTL_BUFFER_SPACING	128
  typedef struct XLogCtlData
  {
  	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
+ 	char	InsertPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlInsert)];
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
***
*** 297,305 
--- 297,312 
  	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncCommitLSN; /* LSN of newest async commit */
+ 	/* add data structure padding for above info_lck declarations */
+ 	char	InfoPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogwrtRqst) 
+ - sizeof(XLogwrtResult)
+ - sizeof(uint32)
+ - sizeof(TransactionId)
+ - sizeof(XLogRecPtr)];
  
  	/* Protected by WALWriteLock: */
  	XLogCtlWrite Write;
+ 	char	WritePadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlWrite)];
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
***
*** 311,316 
--- 318,342 
  	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
  	TimeLineID	ThisTimeLineID;
  
+ 	/*
+ 	 * IsRecoveryProcessingMode shows whether the postmaster is in a
+ 	 * postmaster state earlier than PM_RUN, or not. This is a globally
+ 	 * accessib

Re: [PATCHES] hash index improving v3

2008-09-23 Thread Simon Riggs

On Tue, 2008-09-23 at 09:13 -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > Thinks: Why not just sort all of the time and skip the debate entirely?
> 
> The sort is demonstrably a loser for smaller indexes.  Admittedly,
> if the index is small then the sort can't cost all that much, but if
> the (correct) threshold is some large fraction of shared_buffers then
> it could still take awhile on installations with lots-o-buffers.

The other realisation is that for large indexes, giving them more
maintenance_work_mem probably will make them build faster 'cos we'll be
sorting. So "give big indexes more memory" is still true *enough* to be
broadly consistent, explainable and understandable. I do explain things
in more detail on some courses, but pithy rules help busy people.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] hash index improving v3

2008-09-23 Thread Simon Riggs

On Tue, 2008-09-23 at 09:13 -0400, Tom Lane wrote:

> The other side of that coin is that it's not clear this is really worth
> arguing about, much less exposing a separate parameter for.

Agreed.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] hash index improving v3

2008-09-23 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes:
> Thinks: Why not just sort all of the time and skip the debate entirely?

The sort is demonstrably a loser for smaller indexes.  Admittedly,
if the index is small then the sort can't cost all that much, but if
the (correct) threshold is some large fraction of shared_buffers then
it could still take awhile on installations with lots-o-buffers.

The other side of that coin is that it's not clear this is really worth
arguing about, much less exposing a separate parameter for.

regards, tom lane

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] hash index improving v3

2008-09-23 Thread Simon Riggs

On Tue, 2008-09-23 at 08:16 -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > maintenance_work_mem is already used for 3 separate operations that bear
> > little resemblance to each other. If it's appropriate for all of those
> > then its appropriate for this usage also.
> 
> No, it isn't.
> 
> The fundamental point here is that this isn't a memory allocation
> parameter; it's a switchover threshold between two different behaviors.

That's a little confusing since sorts switch their behaviour also, but
use (some form of) work_mem, which is *also* their max allocation. I see
the difficulty in understanding the algorithm's behaviour now.

So shared_buffers is the wrong parameter, but even if we had a parameter
it would be very difficult to set it. 

Thinks: Why not just sort all of the time and skip the debate entirely?
I thought the main use case was for larger indexes, since that's when
the number of levels in the index is significantly less than btrees? Do
we need to optimise creation time of smaller hash indexes at all?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] hash index improving v3

2008-09-23 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes:
> maintenance_work_mem is already used for 3 separate operations that bear
> little resemblance to each other. If it's appropriate for all of those
> then its appropriate for this usage also.

No, it isn't.

The fundamental point here is that this isn't a memory allocation
parameter; it's a switchover threshold between two different behaviors.

regards, tom lane

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches