Re: [PATCHES] configure option for XLOG_BLCKSZ

2008-05-02 Thread Mark Wong
On Fri, May 2, 2008 at 9:16 AM, Tom Lane <[EMAIL PROTECTED]> wrote:
> "Mark Wong" <[EMAIL PROTECTED]> writes:
>
> > I still believe it makes sense to have them separated.  I did have
>  > some data, which has since been destroyed, that suggested there were
>  > some system characterization differences for OLTP workloads with
>  > PostgreSQL.  Let's hope those disks get delivered to Portland soon. :)
>
>  Fair enough.  It's not that much more code to have another configure
>  switch --- will go do that.
>
>  If we are allowing blocksize and relation seg size to have configure
>  switches, seems that symmetry would demand that XLOG_SEG_SIZE be
>  configurable as well.  Thoughts?

I don't have a feel for this one, but when we get the disks set up we
can certainly test to see what effects it has. :)

Regards,
Mark

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] configure option for XLOG_BLCKSZ

2008-05-02 Thread Mark Wong
On Fri, May 2, 2008 at 8:50 AM, Tom Lane <[EMAIL PROTECTED]> wrote:
> "Mark Wong" <[EMAIL PROTECTED]> writes:
>
> > As someone who has tested varying both those parameters it feels
>  > awkward to have a configure option for one and not the other, or vice
>  > versa.  I have slightly stronger feelings for having them both as
>  > configure options because it's easier to script, but feel a little
>  > more strongly about having BLCKSZ and XLOG_BLCKSZ both as either
>  > configure options or in pg_config_manual.h.  To have them such that
>  > one needs to change them in different manners makes a tad more work in
>  > automating testing.  So my case is just for ease of testing.
>
>  Well, that's a fair point.  Another issue though is whether it makes
>  sense for XLOG_BLCKSZ to be different from BLCKSZ at all, at least in
>  the default case.  They are both the unit of I/O and it's not clear
>  why you'd want different units.  Mark, has your testing shown any
>  indication that they really ought to be separately configurable?
>  I could see having the same configure switch set both of 'em.

I still believe it makes sense to have them separated.  I did have
some data, which has since been destroyed, that suggested there were
some system characterization differences for OLTP workloads with
PostgreSQL.  Let's hope those disks get delivered to Portland soon. :)

Regards,
Mark



Re: [PATCHES] configure option for XLOG_BLCKSZ

2008-05-02 Thread Mark Wong
On Fri, May 2, 2008 at 12:04 AM, Joshua D. Drake <[EMAIL PROTECTED]> wrote:
>
> Tom Lane wrote:
>
> > "Mark Wong" <[EMAIL PROTECTED]> writes:
> >
> > > I saw that a patch was committed that exposed a configure switch for
> > > BLCKSZ.  I was hoping that I could do the same for XLOG_BLCKSZ.
> > >
> >
> > Well, we certainly *could*, but what's the use-case really?  The case
> > for varying BLCKSZ is marginal already, and I've seen none at all for
> > varying XLOG_BLCKSZ.  Why do we need to make it easier than "edit
> > pg_config_manual.h"?
> >
>
>  The use case I could see is for performance testing, but I would concur that
> it doesn't take much to modify pg_config_manual.h. In thinking about it,
> this might actually be a foot gun. You have a new pg guy who downloads the
> source and thinks to himself, "Hey, I have a 4k block size as formatted on my
> hard disk." Then all of a sudden he has a PostgreSQL that is incompatible
> with everything else.

As someone who has tested varying both those parameters it feels
awkward to have a configure option for one and not the other, or vice
versa.  I have slightly stronger feelings for having them both as
configure options because it's easier to script, but feel a little
more strongly about having BLCKSZ and XLOG_BLCKSZ both as either
configure options or in pg_config_manual.h.  To have them such that
one needs to change them in different manners makes a tad more work in
automating testing.  So my case is just for ease of testing.

Regards,
Mark



[PATCHES] configure option for XLOG_BLCKSZ

2008-05-01 Thread Mark Wong
Hi all,

I saw that a patch was committed that exposed a configure switch for
BLCKSZ.  I was hoping that I could do the same for XLOG_BLCKSZ.  I
think I got the configure.in, sgml, pg_config_manual.h, and
pg_config.h.in changes correct.

Regards,
Mark
Index: configure
===
RCS file: /projects/cvsroot/pgsql/configure,v
retrieving revision 1.592
diff -c -r1.592 configure
*** configure	2 May 2008 01:08:22 -	1.592
--- configure	2 May 2008 04:39:34 -
***
*** 1374,1379 
--- 1374,1380 
--with-libs=DIRSalternative spelling of --with-libraries
--with-pgport=PORTNUM   set default port number [5432]
--with-blocksize=BLOCKSIZE  set block size in kB [8]
+   --with-xlog-blocksize=BLOCKSIZE  set xlog block size in kB [8]
--with-segsize=SEGSIZE  set segment size in GB [1]
--with-tcl  build Tcl modules (PL/Tcl)
--with-tclconfig=DIRtclConfig.sh is in DIR
***
*** 2602,2607 
--- 2603,2658 
  _ACEOF
  
  
+ { echo "$as_me:$LINENO: checking for xlog block size" >&5
+ echo $ECHO_N "checking for xlog block size... $ECHO_C" >&6; }
+ 
+ pgac_args="$pgac_args with_xlog_blocksize"
+ 
+ 
+ # Check whether --with-xlog-blocksize was given.
+ if test "${with_xlog_blocksize+set}" = set; then
+   withval=$with_xlog_blocksize;
+   case $withval in
+ yes)
+   { { echo "$as_me:$LINENO: error: argument required for --with-xlog-blocksize option" >&5
+ echo "$as_me: error: argument required for --with-xlog-blocksize option" >&2;}
+{ (exit 1); exit 1; }; }
+   ;;
+ no)
+   { { echo "$as_me:$LINENO: error: argument required for --with-xlog-blocksize option" >&5
+ echo "$as_me: error: argument required for --with-xlog-blocksize option" >&2;}
+{ (exit 1); exit 1; }; }
+   ;;
+ *)
+   xlog_blocksize=$withval
+   ;;
+   esac
+ 
+ else
+   xlog_blocksize=8
+ fi
+ 
+ 
+ case ${xlog_blocksize} in
+   1) XLOG_BLCKSZ=1024;;
+   2) XLOG_BLCKSZ=2048;;
+   4) XLOG_BLCKSZ=4096;;
+   8) XLOG_BLCKSZ=8192;;
+  16) XLOG_BLCKSZ=16384;;
+  32) XLOG_BLCKSZ=32768;;
+   *) { { echo "$as_me:$LINENO: error: Invalid block size. Allowed values are 1,2,4,8,16,32." >&5
+ echo "$as_me: error: Invalid block size. Allowed values are 1,2,4,8,16,32." >&2;}
+{ (exit 1); exit 1; }; }
+ esac
+ { echo "$as_me:$LINENO: result: ${xlog_blocksize}kB" >&5
+ echo "${ECHO_T}${xlog_blocksize}kB" >&6; }
+ 
+ 
+ cat >>confdefs.h <<_ACEOF
+ #define XLOG_BLCKSZ ${XLOG_BLCKSZ}
+ _ACEOF
+ 
+ 
  #
  # File segment size
  #
Index: configure.in
===
RCS file: /projects/cvsroot/pgsql/configure.in,v
retrieving revision 1.558
diff -c -r1.558 configure.in
*** configure.in	2 May 2008 01:08:26 -	1.558
--- configure.in	2 May 2008 04:39:34 -
***
*** 249,254 
--- 249,278 
   Changing BLCKSZ requires an initdb.
  ]) 
  
+ AC_MSG_CHECKING([for xlog block size])
+ PGAC_ARG_REQ(with, xlog-blocksize, [  --with-xlog-blocksize=BLOCKSIZE  set xlog block size in kB [[8]]],
+  [xlog_blocksize=$withval],
+  [xlog_blocksize=8])
+ case ${xlog_blocksize} in
+   1) XLOG_BLCKSZ=1024;;
+   2) XLOG_BLCKSZ=2048;;
+   4) XLOG_BLCKSZ=4096;;
+   8) XLOG_BLCKSZ=8192;;
+  16) XLOG_BLCKSZ=16384;;
+  32) XLOG_BLCKSZ=32768;;
+   *) AC_MSG_ERROR([Invalid block size. Allowed values are 1,2,4,8,16,32.])
+ esac
+ AC_MSG_RESULT([${xlog_blocksize}kB])
+ 
+ AC_DEFINE_UNQUOTED([XLOG_BLCKSZ], ${XLOG_BLCKSZ}, [
+  Size of a WAL file block.  This need have no particular relation to BLCKSZ.
+  XLOG_BLCKSZ must be a power of 2, and if your system supports O_DIRECT I/O,
+  XLOG_BLCKSZ must be a multiple of the alignment requirement for direct-I/O
+  buffers, else direct I/O may fail.
+ 
+  Changing XLOG_BLCKSZ requires an initdb.
+ ]) 
+ 
  #
  # File segment size
  #
Index: doc/src/sgml/installation.sgml
===
RCS file: /projects/cvsroot/pgsql/doc/src/sgml/installation.sgml,v
retrieving revision 1.308
diff -c -r1.308 installation.sgml
*** doc/src/sgml/installation.sgml	2 May 2008 01:08:26 -	1.308
--- doc/src/sgml/installation.sgml	2 May 2008 04:39:36 -
***
*** 1104,1109 
--- 1104,1123 

  

+--with-xlog-blocksize=BLOCKSIZE
+
+ 
+  Set the xlog block size, in kilobytes.  This is the unit
+  of storage and I/O within the WAL files.  The default, 8 kilobytes,
+  is suitable for most situations; but other values may be useful
+  in special cases.
+  The value must be a power of 2 between 1 and 32 (kilobytes).
+  Note that changing this value requires an initdb.
+ 
+
+   
+ 
+   
 --disable-spinlocks
 
  
Index: src/include/pg_config.h.in
===
RCS fil
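The size validation performed by the new configure check can be sketched as
follows (illustrative Python mirroring the case statement above; it is not
part of the patch itself):

```python
# Allowed xlog block sizes in kB, mirroring the configure case statement.
ALLOWED_KB = (1, 2, 4, 8, 16, 32)

def xlog_blcksz_bytes(kb: int = 8) -> int:
    """Map a --with-xlog-blocksize argument (in kB) to an XLOG_BLCKSZ value."""
    if kb not in ALLOWED_KB:
        raise ValueError("Invalid block size. Allowed values are 1,2,4,8,16,32.")
    return kb * 1024
```

As in the configure script, anything other than a power of 2 between 1 and 32
is rejected, and the default of 8 kB maps to the familiar 8192-byte page.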

[PATCHES] Proposing correction to posix_fadvise() usage in xlog.c

2008-02-29 Thread Mark Wong
I believe I have a correction to the usage of posix_fadvise() in
xlog.c.  Basically, posix_fadvise() is being called right before the
WAL segment file is closed, where it effectively does nothing, instead
of when the file is opened.  This proposed correction calls
posix_fadvise() in three locations, to make sure POSIX_FADV_DONTNEED
is applied in each of the three cases where a WAL segment file is
opened for writing.

I'm hesitant to post any data I have because I only have a little pc
with a SATA drive in it.  My hardware knowledge on SATA controllers
and drives is a little weak, but my testing with dbt-2 is showing the
performance dropping.  I am guessing that SATA drives have write cache
enabled by default so it seems to make sense that using
POSIX_FADV_DONTNEED will cause writes to be slower by writing through
the disk cache.  Again, assuming that is possible with SATA hardware.

If memory serves, one of the supposed wins here is that in a scenario
where we are not expecting to re-read the WAL writes, we also do not
want those writes to flush other data out of the operating system
disk cache.  But I'm not sure how best to test for correctness.

Anyway, I hope I'm not way off but I'm sure someone will correct me. :)
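A sketch of the proposed call pattern -- advising the kernel at open time
rather than just before close -- might look like this (illustrative names
only, not the actual xlog.c change, which is in C):

```python
import os

def open_wal_segment_for_write(path: str) -> int:
    # Advise the kernel when the segment is opened for writing, rather than
    # just before close: we do not expect to re-read these pages, so they
    # should not displace other data in the OS cache.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    if hasattr(os, "posix_fadvise"):  # Unix-only (Python 3.3+)
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    return fd
```

A length of 0 in the posix_fadvise() call means "to the end of the file", so
the advice covers the whole segment.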

Regards,
Mark


pgsql-log-fadvise.patch
Description: Binary data

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] WIP: splitting BLCKSZ

2006-04-03 Thread Mark Wong
Here's an updated patch with help from Simon.  Once I get a test system
going again in the lab I'll start posting some data.  I'm planning a
combination of block sizes (BLCKSZ and XLOG_BLCKSZ) and number of WAL
buffers.

Thanks,
Mark

Index: src/backend/access/transam/xlog.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.229
diff -c -r1.229 xlog.c
*** src/backend/access/transam/xlog.c	28 Mar 2006 22:01:16 -	1.229
--- src/backend/access/transam/xlog.c	3 Apr 2006 15:03:09 -
***
*** 113,122 
  
  /*
   * Limitation of buffer-alignment for direct IO depends on OS and filesystem,
!  * but BLCKSZ is assumed to be enough for it.
   */
  #ifdef O_DIRECT
! #define ALIGNOF_XLOG_BUFFER		BLCKSZ
  #else
  #define ALIGNOF_XLOG_BUFFER		ALIGNOF_BUFFER
  #endif
--- 113,122 
  
  /*
   * Limitation of buffer-alignment for direct IO depends on OS and filesystem,
!  * but XLOG_BLCKSZ is assumed to be enough for it.
   */
  #ifdef O_DIRECT
! #define ALIGNOF_XLOG_BUFFER		XLOG_BLCKSZ
  #else
  #define ALIGNOF_XLOG_BUFFER		ALIGNOF_BUFFER
  #endif
***
*** 130,136 
  
  /* User-settable parameters */
  int			CheckPointSegments = 3;
! int			XLOGbuffers = 8;
  char	   *XLogArchiveCommand = NULL;
  char	   *XLOG_sync_method = NULL;
  const char	XLOG_sync_method_default[] = DEFAULT_SYNC_METHOD_STR;
--- 130,136 
  
  /* User-settable parameters */
  int			CheckPointSegments = 3;
! int			XLOGbuffers = 16;
  char	   *XLogArchiveCommand = NULL;
  char	   *XLOG_sync_method = NULL;
  const char	XLOG_sync_method_default[] = DEFAULT_SYNC_METHOD_STR;
***
*** 374,380 
  	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
! 	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + BLCKSZ */
  	Size		XLogCacheByte;	/* # bytes in xlog buffers */
  	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
  	TimeLineID	ThisTimeLineID;
--- 374,380 
  	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
! 	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
  	Size		XLogCacheByte;	/* # bytes in xlog buffers */
  	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
  	TimeLineID	ThisTimeLineID;
***
*** 397,403 
  
  /* Free space remaining in the current xlog page buffer */
  #define INSERT_FREESPACE(Insert)  \
! 	(BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
  /* Construct XLogRecPtr value for current insertion point */
  #define INSERT_RECPTR(recptr,Insert,curridx)  \
--- 397,403 
  
  /* Free space remaining in the current xlog page buffer */
  #define INSERT_FREESPACE(Insert)  \
! 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
  /* Construct XLogRecPtr value for current insertion point */
  #define INSERT_RECPTR(recptr,Insert,curridx)  \
***
*** 441,447 
  static uint32 readSeg = 0;
  static uint32 readOff = 0;
  
! /* Buffer for currently read page (BLCKSZ bytes) */
  static char *readBuf = NULL;
  
  /* Buffer for current ReadRecord result (expandable) */
--- 441,447 
  static uint32 readSeg = 0;
  static uint32 readOff = 0;
  
! /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
  
  /* Buffer for current ReadRecord result (expandable) */
***
*** 706,712 
  	 * If cache is half filled then try to acquire write lock and do
  	 * XLogWrite. Ignore any fractional blocks in performing this check.
  	 */
! 	LogwrtRqst.Write.xrecoff -= LogwrtRqst.Write.xrecoff % BLCKSZ;
  	if (LogwrtRqst.Write.xlogid != LogwrtResult.Write.xlogid ||
  		(LogwrtRqst.Write.xrecoff >= LogwrtResult.Write.xrecoff +
  		 XLogCtl->XLogCacheByte / 2))
--- 706,712 
  	 * If cache is half filled then try to acquire write lock and do
  	 * XLogWrite. Ignore any fractional blocks in performing this check.
  	 */
! 	LogwrtRqst.Write.xrecoff -= LogwrtRqst.Write.xrecoff % XLOG_BLCKSZ;
  	if (LogwrtRqst.Write.xlogid != LogwrtResult.Write.xlogid ||
  		(LogwrtRqst.Write.xrecoff >= LogwrtResult.Write.xrecoff +
  		 XLogCtl->XLogCacheByte / 2))
***
*** 1228,1239 
  	{
  		/* crossing a logid boundary */
  		NewPageEndPtr.xlogid += 1;
! 		NewPageEndPtr.xrecoff = BLCKSZ;
  	}
  	else
! 		NewPageEndPtr.xrecoff += BLCKSZ;
  	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) BLCKSZ);
  
  	Insert->curridx = nextidx;
  	Insert->currpage = NewPage;
--- 1228,1239 
  	{
  		/* crossing a logid boundary */
  		NewPageEndPtr.xlogid += 1;
! 		NewPageEndPtr.xrecoff = XLOG_BLCKSZ;
  	}
  	else
! 		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
  	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size)

Re: [PATCHES] WIP: splitting BLCKSZ

2006-03-23 Thread Mark Wong
On Wed, 22 Mar 2006 14:19:48 -0500
Tom Lane <[EMAIL PROTECTED]> wrote:

> Mark Wong <[EMAIL PROTECTED]> writes:
> > I proposed to explore splitting BLCKSZ into separate values for logging
> > and data to see if there might be anything to gain:
> > http://archives.postgresql.org/pgsql-hackers/2006-03/msg00745.php
> > My first pass was to do more or less a search and replace (attached) and
> > I am already running into trouble with a 'make check' (below).  I'm
> > guessing that when initdb is run, I'm not properly saving the values
> > that I've defined for DATA_BLCKSZ and possibly LOG_BLCKSZ.
> 
> I'd suggest leaving BLCKSZ as-is and inventing XLOG_BLCKSZ to be used
> only within the WAL code; should make for a *far* smaller patch.
> Offhand I don't think that anything except xlog.c knows the WAL block
> size --- it should be fairly closely associated with dependencies on
> XLOG_SEG_SIZE, if you are looking for something to grep for.

Ok, I have attached something much smaller.  Appears to pass a 'make
check' but I'll keep going to make sure it's really correct and works.

Thanks,
Mark

Index: src/backend/access/transam/xlog.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.227
diff -c -r1.227 xlog.c
*** src/backend/access/transam/xlog.c	5 Mar 2006 15:58:22 -	1.227
--- src/backend/access/transam/xlog.c	23 Mar 2006 19:13:31 -
***
*** 113,122 
  
  /*
   * Limitation of buffer-alignment for direct IO depends on OS and filesystem,
!  * but BLCKSZ is assumed to be enough for it.
   */
  #ifdef O_DIRECT
! #define ALIGNOF_XLOG_BUFFER		BLCKSZ
  #else
  #define ALIGNOF_XLOG_BUFFER		ALIGNOF_BUFFER
  #endif
--- 113,122 
  
  /*
   * Limitation of buffer-alignment for direct IO depends on OS and filesystem,
!  * but XLOG_BLCKSZ is assumed to be enough for it.
   */
  #ifdef O_DIRECT
! #define ALIGNOF_XLOG_BUFFER		XLOG_BLCKSZ
  #else
  #define ALIGNOF_XLOG_BUFFER		ALIGNOF_BUFFER
  #endif
***
*** 374,380 
  	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
! 	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + BLCKSZ */
  	Size		XLogCacheByte;	/* # bytes in xlog buffers */
  	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
  	TimeLineID	ThisTimeLineID;
--- 374,380 
  	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
! 	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
  	Size		XLogCacheByte;	/* # bytes in xlog buffers */
  	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
  	TimeLineID	ThisTimeLineID;
***
*** 397,403 
  
  /* Free space remaining in the current xlog page buffer */
  #define INSERT_FREESPACE(Insert)  \
! 	(BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
  /* Construct XLogRecPtr value for current insertion point */
  #define INSERT_RECPTR(recptr,Insert,curridx)  \
--- 397,403 
  
  /* Free space remaining in the current xlog page buffer */
  #define INSERT_FREESPACE(Insert)  \
! 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
  /* Construct XLogRecPtr value for current insertion point */
  #define INSERT_RECPTR(recptr,Insert,curridx)  \
***
*** 441,447 
  static uint32 readSeg = 0;
  static uint32 readOff = 0;
  
! /* Buffer for currently read page (BLCKSZ bytes) */
  static char *readBuf = NULL;
  
  /* Buffer for current ReadRecord result (expandable) */
--- 441,447 
  static uint32 readSeg = 0;
  static uint32 readOff = 0;
  
! /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
  
  /* Buffer for current ReadRecord result (expandable) */
***
*** 662,668 
  			{
  COMP_CRC32(rdata_crc,
  		   page,
! 		   BLCKSZ);
  			}
  			else
  			{
--- 662,668 
  			{
  COMP_CRC32(rdata_crc,
  		   page,
! 		   XLOG_BLCKSZ);
  			}
  			else
  			{
***
*** 672,678 
  		   bkpb->hole_offset);
  COMP_CRC32(rdata_crc,
  		   page + (bkpb->hole_offset + bkpb->hole_length),
! 		   BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
  			}
  		}
  	}
--- 672,678 
  		   bkpb->hole_offset);
  COMP_CRC32(rdata_crc,
  		   page + (bkpb->hole_offset + bkpb->hole_length),
! 		   XLOG_BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
  			}
  		}
  	}
***
*** 705,711 
  	 * If cache is half filled then try to acquire write lock and do
  	 * XLogWrite. Ignore any fractional blocks in performing this check.
  	 */
! 	LogwrtRqst.Write.xrecoff -= LogwrtRqst.Write.xrecoff % BLCKSZ

Re: [PATCHES] [HACKERS] Autovacuum loose ends

2005-08-12 Thread Mark Wong
On Fri, 12 Aug 2005 18:42:09 -0400
Alvaro Herrera <[EMAIL PROTECTED]> wrote:

> On Fri, Aug 12, 2005 at 03:16:04PM -0700, Mark Wong wrote:
> > On Fri, 12 Aug 2005 17:49:41 -0400
> > Alvaro Herrera <[EMAIL PROTECTED]> wrote:
> > 
> > > Notice how the subindexes are wrong ... I think it should be 1:3 for
> > > i_orders, no?  Apparently indexes_scan.data has the same problem.
> > 
> > Whoops!  I think I fixed it for real now and the charts should be
> > updated now.  It was broken slightly more previously.
> 
> Hmm, did you fix the 42 case only?  The other one is broken too ...

The other dev4-015 cases should be fixed too.
 
> Also, it seems the "tran_lock.out" file captured wrong input -- I think
> you mean "WHERE transactionid IS NULL" in the query instead of "WHERE
> transaction IS NULL".

Hmm, ok I can try that in a future test run.  I'm not very familiar with
this table, what's the difference between transaction and transactionid?

> I wonder what the big down-spikes (?) at minutes ~45 and ~85 correspond
> to.  Are those checkpoints?  The IO vmstat chart would indicate that, I
> think.

That's correct, those should be checkpoints. 
 
> Anyway, it's interesting to see the performance go up with autovacuum
> on.  I certainly didn't expect that in this kind of test.

I think in Mary's case it was hurting, but she's running the workload
dramatically differently.  I think she was planning to revisit that after
we sort out what's going on with the grouped WAL writes.

Mark



Re: [PATCHES] [HACKERS] Autovacuum loose ends

2005-08-12 Thread Mark Wong
On Fri, 12 Aug 2005 17:49:41 -0400
Alvaro Herrera <[EMAIL PROTECTED]> wrote:

> On Fri, Aug 12, 2005 at 10:49:43AM -0700, Mark Wong wrote:
> > I thought I'd run a couple of tests to see if it would be helpful
> > against CVS from Aug 3, 2005.
> > 
> > Here's a run with autovacuum turned off:
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/42/
> > 5186.55 notpm
> > 
> > Autovacuum on with default settings:
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/38/
> > 5462.23 notpm
> 
> Just noticed what seems to be a bug: in
> 
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/42/db/index_info.input
> 
> plot "index_info.data" using 1:2 title "i_customer" with lines, \
> "index_info.data" using 1:2 title "i_orders" with lines, \
> "index_info.data" using 1:3 title "pk_customer" with lines, \
> "index_info.data" using 1:4 title "pk_district" with lines, \
> "index_info.data" using 1:5 title "pk_item" with lines, \
> "index_info.data" using 1:6 title "pk_new_order" with lines, \
> "index_info.data" using 1:7 title "pk_order_line" with lines, \
> "index_info.data" using 1:8 title "pk_orders" with lines, \
> "index_info.data" using 1:9 title "pk_stock" with lines, \
> "index_info.data" using 1:11 title "pk_warehouse" with lines
> 
> Notice how the subindexes are wrong ... I think it should be 1:3 for
> i_orders, no?  Apparently indexes_scan.data has the same problem.

Whoops!  I think I fixed it for real now and the charts should be
updated now.  It was broken slightly more previously.

> It called my attention that the pk_warehouse index seems to have a very
> different usage in both runs in index_info, but in indexes_scan they
> seem similar.

Thanks,
Mark



Re: [PATCHES] [HACKERS] Autovacuum loose ends

2005-08-12 Thread Mark Wong
I thought I'd run a couple of tests to see if it would be helpful
against CVS from Aug 3, 2005.

Here's a run with autovacuum turned off:
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/42/
5186.55 notpm

Autovacuum on with default settings:
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/38/
5462.23 notpm

Would it help more to try a series of parameter changes?

Mark



Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-08-11 Thread Mark Wong
Ok, I finally got a couple of tests done against CVS from Aug 3, 2005.
I'm not sure if I'm showing anything insightful though.  I've learned
that fdatasync and O_DSYNC are simply fsync and O_SYNC respectively on
Linux, which you guys may have already known.  There appears to be a
fair performance decrease in using open_sync.  Just to double check, am
I correct in understanding only open_sync uses O_DIRECT?

fdatasync
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/38/
5462 notpm

open_sync
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/40/
4860 notpm

Mark



Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-08-06 Thread Mark Wong
Here are comments that Daniel McNeil made earlier, which I neglected
to forward until now.  I've cc'ed him and Mark Havercamp, whom some of
you got to meet the other day.

Mark

-

With O_DIRECT on Linux, when the write() returns the i/o has been
transferred to the disk.  

Normally, this i/o will be DMAed directly from user-space to the
device.  The current exception is when doing an O_DIRECT write to a
hole in a file.  (If a program does a truncate() or lseek()/write()
that makes a file larger, the file system does not allocate space
between the old end of file and the new end of file.)  An O_DIRECT
write to a hole like this requires the file system to allocate space,
and there is a race condition between the O_DIRECT write doing the
allocation and then initializing the newly allocated data, and any
other process that attempts a buffered (page cache) read of the same
area of the file -- it was possible for the read to see data from the
allocated region before the O_DIRECT write() completed.  The fix in
Linux is for the O_DIRECT write() to fall back to buffered i/o to do
the write() and then flush the data from the page cache to the disk.

A write() with O_DIRECT only means the data has been transferred to
the disk.  Depending on the file system and mount options, it does
not mean the metadata for the file has been written to disk (see the
fsync man page).  Fsync() will guarantee the data and metadata have
been written to disk.

Lastly, if a disk has a write-back cache, an O_DIRECT write() does not
guarantee that the disk has put the data on the physical media.
I think some of the journaling file systems now support i/o barriers
on commit, which will flush the disk's write-back cache.  (I'm still
looking at the kernel code to see how this is done.)

Conclusion:

O_DIRECT + fsync() can make sense.  It avoids copying the data
to the page cache before it is written and will also guarantee
that the file's metadata is written to disk.  It also
prevents the page cache from filling up with write data that
will never be read (I assume it is only read if a recovery
is necessary -- which should be rare).  It can also
help disks with write-back caches when using a journaling
file system that uses i/o barriers.  You would want to use
large writes, since the kernel page cache won't be writing
multiple pages for you.

I need to look at the kernel code more to comment on O_DIRECT with
O_SYNC.

Questions:

Does the database transaction logger preallocate the log file?

Does the logger care about the order in which each write hits the disk?

Now someone else can comment on my comments.

Daniel
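Daniel's conclusion -- O_DIRECT for the data transfer plus fsync() for the
metadata -- can be sketched as follows.  This is a hedged illustration in
Python, not PostgreSQL code; the page-size alignment unit and the fallback to
buffered i/o on filesystems without O_DIRECT support are assumptions:

```python
import mmap
import os

PAGE = mmap.PAGESIZE  # alignment unit assumed here; real O_DIRECT limits vary

def o_direct_write(path: str, data: bytes) -> None:
    # Pad to whole pages: O_DIRECT transfers must be aligned in buffer
    # address, file offset, and length.
    padded = len(data) + (-len(data) % PAGE) or PAGE
    buf = mmap.mmap(-1, padded)  # anonymous mmap memory is page-aligned
    buf.write(data)
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    try:
        fd = os.open(path, flags | os.O_DIRECT, 0o600)
    except (AttributeError, OSError):
        # Platform or filesystem without O_DIRECT: fall back to buffered i/o.
        fd = os.open(path, flags, 0o600)
    try:
        os.write(fd, buf)  # with O_DIRECT this transfers straight to the device
        os.fsync(fd)       # still needed so the file's *metadata* reaches disk
    finally:
        os.close(fd)
        buf.close()
```

Note the large, page-multiple writes: as Daniel says, with the page cache out
of the picture nothing batches small writes for you.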



Re: [PATCHES] COPY FROM performance improvements

2005-07-21 Thread Mark Wong
I just ran through a few tests with the v14 patch against 100GB of data
from dbt3 and found a 30% improvement; 3.6 hours vs 5.3 hours.  Just to
give a few details, I only loaded data and started a COPY in parallel
for each the data files:
http://www.testing.osdl.org/projects/dbt3testing/results/fast_copy/

Here's a visual of my disk layout, for those familiar with the database schema:

http://www.testing.osdl.org/projects/dbt3testing/results/fast_copy/layout-dev4-010-dbt3.html

I have 6 arrays of fourteen 15k rpm drives in a split-bus configuration
attached to a 4-way itanium2 via 6 compaq smartarray pci-x controllers.

Let me know if you have any questions.

Mark



Re: [PATCHES] COPY FROM performance improvements

2005-07-19 Thread Mark Wong
Whoopsies, yeah good point about the PRIMARY KEY.  I'll fix that.

Mark

On Tue, 19 Jul 2005 18:17:52 -0400
Andrew Dunstan <[EMAIL PROTECTED]> wrote:

> Mark,
> 
> You should definitely not be doing this sort of thing, I believe:
> 
> CREATE TABLE orders (
>   o_orderkey INTEGER,
>   o_custkey INTEGER,
>   o_orderstatus CHAR(1),
>   o_totalprice REAL,
>   o_orderDATE DATE,
>   o_orderpriority CHAR(15),
>   o_clerk CHAR(15),
>   o_shippriority INTEGER,
>   o_comment VARCHAR(79),
>   PRIMARY KEY (o_orderkey))
> 
> Create the table with no constraints, load the data, then set up primary keys 
> and whatever other constraints you want using ALTER TABLE. Last time I did a 
> load like this (albeit 2 orders of magnitude smaller) I saw a 50% speedup 
> from deferring constraint creation.
> 
> 
> cheers
> 
> andrew
> 
> 
> 
> Mark Wong wrote:
> 
> >Hi Alon,
> >
> >Yeah, that helps.  I just need to break up my scripts a little to just
> >load the data and not build indexes.
> >
> >Is the following information good enough to give a guess about the data
> >I'm loading, if you don't mind? ;)  Here's a link to my script to create
> >tables:
> >http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in
> >
> >File sizes:
> >-rw-r--r--  1 markw 50 2.3G Jul  8 15:03 customer.tbl
> >-rw-r--r--  1 markw 50  74G Jul  8 15:03 lineitem.tbl

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] COPY FROM performance improvements

2005-07-19 Thread Mark Wong
Hi Alon,

Yeah, that helps.  I just need to break up my scripts a little to just
load the data and not build indexes.

Is the following information good enough to give a guess about the data
I'm loading, if you don't mind? ;)  Here's a link to my script to create
tables:
http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in

File sizes:
-rw-r--r--  1 markw 50 2.3G Jul  8 15:03 customer.tbl
-rw-r--r--  1 markw 50  74G Jul  8 15:03 lineitem.tbl
-rw-r--r--  1 markw 50 2.1K Jul  8 15:03 nation.tbl
-rw-r--r--  1 markw 50  17G Jul  8 15:03 orders.tbl
-rw-r--r--  1 markw 50 2.3G Jul  8 15:03 part.tbl
-rw-r--r--  1 markw 50  12G Jul  8 15:03 partsupp.tbl
-rw-r--r--  1 markw 50  391 Jul  8 15:03 region.tbl
-rw-r--r--  1 markw 50 136M Jul  8 15:03 supplier.tbl

Number of rows:
# wc -l *.tbl
1500 customer.tbl
   600037902 lineitem.tbl
  25 nation.tbl
   15000 orders.tbl
2000 part.tbl
8000 partsupp.tbl
   5 region.tbl
 100 supplier.tbl

Thanks,
Mark

On Tue, 19 Jul 2005 14:05:56 -0700
"Alon Goldshuv" <[EMAIL PROTECTED]> wrote:

> Hi Mark,
> 
> I improved the data *parsing* capabilities of COPY, and didn't touch the
> data conversion or data insertion parts of the code. The parsing improvement
> will vary largely depending on the ratio of parsing -to- converting and
> inserting. 
> 
> Therefore, the speed increase really depends on the nature of your data:
> 
> 100GB file with
> long data rows (lots of parsing)
> Small number of columns (small number of attr conversions per row)
> less rows (less tuple insertions)
> 
> Will show the best performance improvements.
> 
> However, same file size 100GB with
> Short data rows (minimal parsing)
> large number of columns (large number of attr conversions per row)
> AND/OR
> more rows (more tuple insertions)
> 
> Will show improvements but not as significant.
> In general I'll estimate 40%-95% improvement in load speed for the 1st case
> and 10%-40% for the 2nd. But that also depends on the hardware, disk speed
> etc... This is for TEXT format. As for CSV, it may be faster but not as much
> as I specified here. BINARY will stay the same as before.
> 
> HTH
> Alon.
> 
> 
> 
> 
> 
> 
> On 7/19/05 12:54 PM, "Mark Wong" <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, 14 Jul 2005 17:22:18 -0700
> > "Alon Goldshuv" <[EMAIL PROTECTED]> wrote:
> > 
> >> I revisited my patch and removed the code duplications that were there, and
> >> added support for CSV with buffered input, so CSV now runs faster too
> >> (although it is not as optimized as the TEXT format parsing). So now
> >> TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original 
> >> file.
> > 
> > Hi Alon,
> > 
> > I'm curious, what kind of system are you testing this on?  I'm trying to
> > load 100GB of data in our dbt3 workload on a 4-way itanium2.  I'm
> > interested in the results you would expect.
> > 
> > Mark
> > 
> 



Re: [PATCHES] COPY FROM performance improvements

2005-07-19 Thread Mark Wong
On Thu, 14 Jul 2005 17:22:18 -0700
"Alon Goldshuv" <[EMAIL PROTECTED]> wrote:

> I revisited my patch and removed the code duplications that were there, and
> added support for CSV with buffered input, so CSV now runs faster too
> (although it is not as optimized as the TEXT format parsing). So now
> TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original file.

Hi Alon,

I'm curious, what kind of system are you testing this on?  I'm trying to
load 100GB of data in our dbt3 workload on a 4-way itanium2.  I'm
interested in the results you would expect.

Mark



Re: [PATCHES] A couple of patches for PostgreSQL 64bit support

2005-07-18 Thread Mark Wong
Hi,

I grabbed the patches to try, but I was wondering if it would be more
interesting to try them against CVS rather than 8.0.3 (and if it would
be easy to port :)?

Mark



Re: [PATCHES] [HACKERS] WAL: O_DIRECT and multipage-writer

2005-03-22 Thread Mark Wong
On Tue, Jan 25, 2005 at 06:06:23PM +0900, ITAGAKI Takahiro wrote:
> Environment:
>   OS : Linux kernel 2.6.9
>   CPU: Pentium 4 3GHz
>   disk   : ATA 5400rpm (Data and WAL are placed on same partition.)
>   memory : 1GB
>   config : shared_buffers=1, wal_buffers=256,
>XLOG_SEG_SIZE=256MB, checkpoint_segment=4

Hi Itagaki,

In light of this thread, have you compared the performance on
Linux-2.4?

Direct io on block device has performance regression on 2.6.x kernel
http://www.ussg.iu.edu/hypermail/linux/kernel/0503.1/0328.html

Mark



Re: [PATCHES] WIP: buffer manager rewrite (take 2)

2005-03-02 Thread Mark Wong
On Wed, Mar 02, 2005 at 08:48:35PM -0500, Tom Lane wrote:
> Mark Wong <[EMAIL PROTECTED]> writes:
> > CVS from 20050301 [ plus clock-sweep buffer manager ] :
> > http://www.osdl.org/projects/dbt2dev/results/dev4-010/314/
> > throughput 5483.01
> > I only ran this for 30 minutes, as opposed to 60, but it looks
> > promising.
> 
> > So about a 50% increase in throughput for my test.  Not too shabby. ;)
> 
> Sweet ... and the response-time improvement is just stunning.  I think
> we may have a winner.  Is there a reason you didn't run the test longer
> though?

I would normally run for 1 hour, but I guess my fingers were thinking
something different at the time. =P

Mark



Re: [PATCHES] WIP: buffer manager rewrite (take 2)

2005-03-02 Thread Mark Wong
On Wed, Feb 16, 2005 at 07:50:28PM -0500, Tom Lane wrote:
> Second iteration of buffer manager rewrite.  This uses the idea of a
> usage counter instead of just a recently-used flag bit.  I allowed the
> counter to go up to 5, but some playing around with that value would
> be interesting.  (Tweak BM_MAX_USAGE_COUNT in
> src/include/storage/buf_internals.h, then recompile the files in
> src/backend/storage/buffer/.)  Also there are more GUC variables now
> for controlling the bgwriter.
> 

I see a huge performance increase when applied to CVS from 20050301.

Baseline against 8.0.1:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/309/
throughput 3639.97

CVS from 20050301:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/314/
throughput 5483.01
I only ran this for 30 minutes, as opposed to 60, but it looks
promising.

So about a 50% increase in throughput for my test.  Not too shabby. ;)

Mark



Re: [PATCHES] [HACKERS] WAL: O_DIRECT and multipage-writer (+ memory leak)

2005-03-01 Thread Mark Wong
On Thu, Feb 03, 2005 at 07:25:55PM +0900, ITAGAKI Takahiro wrote:
> Hello everyone.
> 
> I fixed two bugs in the patch that I sent before.
> Check and test new one, please.

Ok, finally got back into the office and was able to run 1 set of
tests.

So the new baseline result with 8.0.1:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/309/
Throughput: 3639.97

Results with the patch but open_direct not set:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/308/
Throughput: 3494.72

Results with the patch and open_direct set:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/312/
Throughput: 3489.69

You can verify that wal_sync_method is set to open_direct under
the "database parameters" link, but I'm wondering if I missed
something.  It looks a little odd that the performance dropped.

Mark



Re: [PATCHES] [HACKERS] WAL: O_DIRECT and multipage-writer

2005-01-27 Thread Mark Wong
Hmm... I don't remember specifying a datatype.  I suppose whatever the
default one is. :)

I'll be happy to test again, just let me know.

Mark

On Fri, Jan 28, 2005 at 06:28:32AM +0900, ITAGAKI Takahiro wrote:
> Thanks for testing, Mark!
> 
> > I gave this a try with DBT-2, but got a core dump on our ia64 system.
> > I hope this isn't a random thing, like I ran into previously.  Maybe
> > I'll try again, but postgres dumped core.
> 
> Sorry, this seems to be my patch's bug.
> Which datatype did you compile with? LP64, ILP64, or LLP64?
> If you used LLP64, I think the cause is buffer alignment routine
> because of sizeof(long) != sizeof(void*).
> 
> I'll fix it soon...
> 
> 
> ITAGAKI Takahiro



Re: [PATCHES] [HACKERS] WAL: O_DIRECT and multipage-writer

2005-01-27 Thread Mark Wong
Hi everyone,

I gave this a try with DBT-2, but got a core dump on our ia64 system.
I hope this isn't a random thing, like I ran into previously.  Maybe
I'll try again, but postgres dumped core.  Binary and core here:
http://developer.osdl.org/markw/pgsql/core/2morefiles.tar.bz2

#0  FunctionCall2 (flinfo=0x0, arg1=0, arg2=0) at fmgr.c:1141
1141result = FunctionCallInvoke(&fcinfo);
(gdb) bt
#0  FunctionCall2 (flinfo=0x0, arg1=0, arg2=0) at fmgr.c:1141
#1  0x403bdb80 in FunctionCall2 (flinfo=Cannot access memory at address 0x0) at fmgr.c:1141
#2  0x403bdb80 in FunctionCall2 (flinfo=Cannot access memory at address 0x0) at fmgr.c:1141

Over and over again, so I'll keep the backtrace short.

Mark



Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-27 Thread Mark Wong
On Mon, Oct 25, 2004 at 11:34:25AM -0400, Jan Wieck wrote:
> On 10/22/2004 4:09 PM, Kenneth Marshall wrote:
> 
> > On Fri, Oct 22, 2004 at 03:35:49PM -0400, Jan Wieck wrote:
> >> On 10/22/2004 2:50 PM, Simon Riggs wrote:
> >> 
> >> >I've been using the ARC debug options to analyse memory usage on the
> >> >PostgreSQL 8.0 server. This is a precursor to more complex performance
> >> >analysis work on the OSDL test suite.
> >> >
> >> >I've simplified some of the ARC reporting into a single log line, which
> >> >is enclosed here as a patch on freelist.c. This includes reporting of:
> >> >- the total memory in use, which wasn't previously reported
> >> >- the cache hit ratio, which was slightly incorrectly calculated
> >> >- a useful-ish value for looking at the "B" lists in ARC
> >> >(This is a patch against cvstip, but I'm not sure whether this has
> >> >potential for inclusion in 8.0...)
> >> >
> >> >The total memory in use is useful because it allows you to tell whether
> >> >shared_buffers is set too high. If it is set too high, then memory usage
> >> >will continue to grow slowly up to the max, without any corresponding
> >> >increase in cache hit ratio. If shared_buffers is too small, then memory
> >> >usage will climb quickly and linearly to its maximum.
> >> >
> >> >The last one I've called "turbulence" in an attempt to ascribe some
> >> >useful meaning to B1/B2 hits - I've tried a few other measures though
> >> >without much success. Turbulence is the hit ratio of B1+B2 lists added
> >> >together. By observation, this is zero when ARC gives smooth operation,
> >> >and goes above zero otherwise. Typically, turbulence occurs when
> >> >shared_buffers is too small for the working set of the database/workload
> >> >combination and ARC repeatedly re-balances the lengths of T1/T2 as a
> >> >result of "near-misses" on the B1/B2 lists. Turbulence doesn't usually
> >> >cut in until the cache is fully utilized, so there is usually some delay
> >> >after startup.
> >> >
> >> >We also recently discussed that I would add some further memory analysis
> >> >features for 8.1, so I've been trying to figure out how.
> >> >
> >> >The idea that B1, B2 represent something really useful doesn't seem to
> >> >have been borne out - though I'm open to persuasion there.
> >> >
> >> >I originally envisaged a "shadow list" operating in extension of the
> >> >main ARC list. This will require some re-coding, since the variables and
> >> >macros are all hard-coded to a single set of lists. No complaints, just
> >> >it will take a little longer than we all thought (for me, that is...)
> >> >
> >> >My proposal is to alter the code to allow an array of memory linked
> >> >lists. The actual list would be [0] - other additional lists would be 
> >> >created dynamically as required i.e. not using IFDEFs, since I want this
> >> >to be controlled by a SIGHUP GUC to allow on-site tuning, not just lab
> >> >work. This will then allow reporting against the additional lists, so
> >> >that cache hit ratios can be seen with various other "prototype"
> >> >shared_buffer settings.
> >> 
> >> All the existing lists live in shared memory, so that dynamic approach 
> >> suffers from the fact that the memory has to be allocated during ipc_init.
> >> 
> >> What do you think about my other theory to make C actually 2x effective 
> >> cache size and NOT to keep T1 in shared buffers but to assume T1 lives 
> >> in the OS buffer cache?
> >> 
> >> 
> >> Jan
> >> 
> > Jan,
> > 
> > From the articles that I have seen on the ARC algorithm, I do not think
> > that using the effective cache size to set C would be a win. The design
> > of the ARC process is to allow the cache to optimize its use in response
> > to the actual workload. It may be the best use of the cache in some cases
> > to have the entire cache allocated to T1 and similarly for T2. If fact,
> > the ability to alter the behavior as needed is one of the key advantages.
> 
> Only the "working set" of the database, that is the pages that are very 
> frequently used, are worth holding in shared memory at all. The rest 
> should be copied in and out of the OS disc buffers.
> 
> The problem is, with a too small directory ARC cannot guesstimate what 
> might be in the kernel buffers. Nor can it guesstimate what recently was 
> in the kernel buffers and got pushed out from there. That results in a 
> way too small B1 list, and therefore we don't get B1 hits when in fact 
> the data was found in memory. B1 hits are what increase the T1 target, 
> and since we are missing them with a too small directory size, our 
> implementation of ARC is probably using a T2 size larger than the 
> working set. That is not optimal.
> 
> If we would replace the dynamic T1 buffers with a max_backends*2 area of 
> shared buffers, use a C value representing the effective cache size and 
> limit the T1target on the lower bound to effective cache size - shared 
> buffers, then we basically moved the T1 cache into the OS buffers