Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-05 Thread Heikki Linnakangas

On 06.02.2012 05:48, Bruce Momjian wrote:

> On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote:
>> In the proposed scheme there are two flag bits set on the page to
>> indicate whether the field is used as a checksum rather than a version
>> number. So it's possible the checksum could look like a valid page
>> version number, but we'd still be able to tell the difference.
>
> Thanks.  Clearly we don't need 16 bits to represent our page version
> number because we rarely change it. The other good thing is I don't
> remember anyone wanting additional per-page storage in the past few
> years except for a checksum.


There's this idea that I'd like to see implemented: 
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php


In any case, I think it's a very bad idea to remove the page version 
field. How are we supposed to ever be able to change the page format 
again if we don't even have a version number on the page? I strongly 
oppose removing it.


I'm also not very comfortable with the idea of having flags on the page 
indicating whether it has a checksum or not. It's not hard to imagine a 
software or firmware bug or hardware failure that would cause the pd_flags 
field to be zeroed out altogether. It would be more robust if the 
expected bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to 
deal with that at upgrade somehow. And it still feels a bit whacky anyway.



> I wonder if we should just dedicate 3 page header bits, call that the
> page version number, and set this new version number to 1, and assume
> all previous versions were zero, and have them look in the old page
> version location if the new version number is zero.  I am basically
> thinking of how we can plan ahead to move the version number to a new
> location and have a defined way of finding the page version number using
> old and new schemes.


Three bits seems short-sighted, but yeah, something like 6-8 bits should 
be enough. On the whole, though, I think we should bite the bullet and 
invent a way to extend the page header at upgrade.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

2012-02-05 Thread Robert Haas
On Sun, Feb 5, 2012 at 4:17 PM, Dan Scales  wrote:
> Thanks for the detailed followup.  I do see how Postgres is tuned for
> having a bunch of memory available that is not in shared_buffers, both
> for the OS buffer cache and other memory allocations.  However, Postgres
> seems to run fine in many "large shared_memory" configurations that I
> gave performance numbers for, including 5G shared_buffers for an 8G
> machine, 3G shared_buffers for a 6G machine, etc.  There just has to be
> sufficient extra memory beyond the shared_buffers cache.

I agree that you could probably set shared_buffers to 3GB on a 6GB
machine and get decent performance - but would it be the optimal
performance, and for what workload?  To really figure out whether this
patch is a win, you need to get the system optimally tuned for the
unpatched sources (which we can't tell whether you've done, since you
haven't posted the configuration settings or any comparative figures
for different settings, or any details on which commit you tested
against) and then get the system optimally tuned for the patched
sources with double_writes=on, and then see whether there's a gain.

> I think the pgbench run is pointing out a problem that this double_writes
> implementation has with BULK_WRITEs.  As you point out, the
> BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions.

Bulk reads will have the same problem.  Consider loading a bunch of
data into a new table with COPY, and then scanning the table.  The
table scan will be a "bulk read" and every page will be dirtied
setting hint bits.  Another thing to worry about is vacuum, which also
uses a BufferAccessStrategy.  Greg Smith has done some previous
benchmarking showing that when the kernel is too aggressive about
flushing dirty data to disk, vacuum becomes painfully slow.  I suspect
this patch is going to have that problem in spades (but it would be
good to test that).  Checkpoints might be a problem, too, since they
flush a lot of dirty data, and that's going to require a lot of extra
fsyncing with this implementation.  It certainly seems that unless you
have a pg_xlog and the data separated and a battery-backed write cache
for each, checkpoints might be really slow.  I'm not entirely
convinced they'll be fast even if you have all that (but it would be
good to test that, too).

> I'm not sure if there is a great solution that always works for that
> issue.  However, I do notice that BULK_WRITE data isn't WAL-logged unless
> archiving/replication is happening.  As I understand it, if the
> BULK_WRITE data isn't being WAL-logged, then it doesn't have to be
> double-written either.  The BULK_WRITE data is not officially synced and
> committed until it is all written, so there doesn't have to be any
> torn-page protection for that data, which is why the WAL logging can be
> omitted.  The double-write implementation can be improved by marking each
> buffer if it doesn't need torn-page protection.  These buffers would be
> those new pages that are explicitly not WAL-logged, even when
> full_page_writes is enabled.  When such a buffer is eventually synced
> (perhaps because of an eviction), it would not be double-written.  This
> would often avoid double-writes for BULK_WRITE, etc., especially since
> the administrator is often not archiving or doing replication when doing
> bulk loads.

I agree - this optimization seems like a must.  I'm not sure that it's
sufficient, but it certainly seems necessary.  It's not going to help
with VACUUM, though, so I think that case needs some careful looking
at to determine how bad the regression is and what can be done to
mitigate it.  In particular, I note that I suggested an idea that
might help in the final paragraph of my last email.

My general feeling about this patch is that it needs a lot more work
before we should consider committing it.  Your tests so far overlook
quite a few important problem cases (bulk loads, SELECT on large
unhinted tables, vacuum speed, checkpoint duration, and others) and
still mostly show it losing to full_page_writes, sometimes by large
margins.  Even in the one case where you got an 8% speedup, it's not
really clear that the same speedup (or an even bigger one) couldn't
have been gotten by some other kind of tuning.  I think you really
need to spend some more time thinking about how to blunt the negative
impact on the cases where it hurts, and increase the benefit in the
cases where it helps.  The approach seems to have potential, but it
seems way too immature to think about shipping it at this point.  (You
may have been thinking along similar lines since I note that the patch
is marked "WIP".)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-05 Thread Bruce Momjian
On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote:
> On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian  wrote:
> > On Sat, Dec 24, 2011 at 03:56:58PM +, Simon Riggs wrote:
> >> > Also, as far as I can see this patch usurps the page version field,
> >> > which I find unacceptably short-sighted.  Do you really think this is
> >> > the last page layout change we'll ever make?
> >>
> >> No, I don't. I hope and expect the next page layout change to
> >> reintroduce such a field.
> >>
> >> But since we're agreed now that upgrading is important, changing page
> >> format isn't likely to be happening until we get an online upgrade
> >> process. So future changes are much less likely. If they do happen, we
> >> have some flag bits spare that can be used to indicate later versions.
> >> It's not the prettiest thing in the world, but it's a small ugliness
> >> in return for an important feature. If there was a way without that, I
> >> would have chosen it.
> >
> > Have you considered the CRC might match a valid page version number?
> > Is that safe?
> 
> In the proposed scheme there are two flag bits set on the page to
> indicate whether the field is used as a checksum rather than a version
> number. So it's possible the checksum could look like a valid page
> version number, but we'd still be able to tell the difference.

Thanks.  Clearly we don't need 16 bits to represent our page version
number because we rarely change it.  The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.  

I wonder if we should just dedicate 3 page header bits, call that the
page version number, and set this new version number to 1, and assume
all previous versions were zero, and have them look in the old page
version location if the new version number is zero.  I am basically
thinking of how we can plan ahead to move the version number to a new
location and have a defined way of finding the page version number using
old and new schemes.

I don't want to get to a point where we need a bit per page version
number.
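
[A minimal sketch of the old/new lookup being described; PD_VERSION_BITS, PD_VERSION_SHIFT and hypothetical_page_version() are made-up names for illustration only, not from any patch:]

#include "postgres.h"
#include "storage/bufpage.h"

#define PD_VERSION_BITS		0x0038	/* assumed: three currently-spare pd_flags bits */
#define PD_VERSION_SHIFT	3

static inline uint16
hypothetical_page_version(PageHeader page)
{
	uint16		newver = (page->pd_flags & PD_VERSION_BITS) >> PD_VERSION_SHIFT;

	if (newver != 0)
		return newver;		/* new scheme: versions 1..7 live in the flag bits */

	/* new field is zero: fall back to the old pd_pagesize_version location */
	return page->pd_pagesize_version & 0x00FF;
}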

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] disable prompting by default in createuser

2012-02-05 Thread Josh Kupershmidt
On Wed, Feb 1, 2012 at 1:13 PM, Peter Eisentraut  wrote:
> On Sun, 2012-01-15 at 18:14 -0500, Josh Kupershmidt wrote:
>> I see this patch includes a small change to dropuser, to make the
>> 'username' argument mandatory if --interactive is not set, for
>> symmetry with createuser's new behavior. That's dandy, though IMO we
>> shouldn't have "-i" be shorthand for "--interactive" with dropuser,
>> and something different with createuser (i.e. we should just get rid
>> of the "i" alias for dropuser).
>
> Well, all the other tools also support -i for prompting.

Taking a look at the current ./src/bin/scripts executables, I see only
2 out of 9 (`dropdb` and `dropuser`) which have "-i" mean
"--interactive", and `reindexdb` has another meaning for "-i"
entirely. So I'm not sure there's such a clear precedent for having
"-i" mean "--interactive" within our scripts, at least.

> I'd rather get
> rid of -i for --inherit, but I fear that will break things as well.  I'm
> not sure what to do.

I think breaking backwards compatibility probably won't fly (and
should probably be handled by another patch, anyway). I guess it's OK
to keep the patch's current behavior, given we are already
inconsistent about what "-i" means.

>> i.e. createuser tries taking either $PGUSER or the current username as
>> a default user to create, while dropuser just bails out. Personally, I
>> prefer just bailing out if no create/drop user is specified, but
>> either way I think they should be consistent.
>
> That is intentional long-standing behavior.  createdb/dropdb work the
> same way.

OK.

Josh

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-05 Thread Shigeru Hanada

2012/2/2 Marko Kreen :

> I think you want this instead:
>
>  https://commitfest.postgresql.org/action/patch_view?id=769


Somehow I've missed this cool feature.  Thanks for the suggestion!

--
Shigeru Hanada


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-05 Thread Tom Lane
Simon Riggs  writes:
> Please post the patch rather than fixing directly. There's some subtle
> stuff there and it would be best to discuss first.

And here's a proposed patch for not throwing ERROR during replay of DROP
TABLESPACE.  I had originally thought this would be a one-liner
s/ERROR/LOG/, but on inspection destroy_tablespace_directories() really
needs to be changed too, so that it doesn't throw error for unremovable
directories.

regards, tom lane

diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 8ed6d06bb2370a54349f2a86ee13e08a381eacf2..ed0c9ea3ee5a372c6926f41fc0fc93ed7d80d6cd 100644
*** a/src/backend/commands/tablespace.c
--- b/src/backend/commands/tablespace.c
*** create_tablespace_directories(const char
*** 626,634 
  /*
   * destroy_tablespace_directories
   *
!  * Attempt to remove filesystem infrastructure
   *
!  * 'redo' indicates we are redoing a drop from XLOG; okay if nothing there
   *
   * Returns TRUE if successful, FALSE if some subdirectory is not empty
   */
--- 626,638 
  /*
   * destroy_tablespace_directories
   *
!  * Attempt to remove filesystem infrastructure for the tablespace.
   *
!  * 'redo' indicates we are redoing a drop from XLOG; in that case we should
!  * not throw an ERROR for problems, just LOG them.  The worst consequence of
!  * not removing files here would be failure to release some disk space, which
!  * does not justify throwing an error that would require manual intervention
!  * to get the database running again.
   *
   * Returns TRUE if successful, FALSE if some subdirectory is not empty
   */
*** destroy_tablespace_directories(Oid table
*** 681,686 
--- 685,700 
  			pfree(linkloc_with_version_dir);
  			return true;
  		}
+ 		else if (redo)
+ 		{
+ 			/* in redo, just log other types of error */
+ 			ereport(LOG,
+ 	(errcode_for_file_access(),
+ 	 errmsg("could not open directory \"%s\": %m",
+ 			linkloc_with_version_dir)));
+ 			pfree(linkloc_with_version_dir);
+ 			return false;
+ 		}
  		/* else let ReadDir report the error */
  	}
  
*** destroy_tablespace_directories(Oid table
*** 704,710 
  
  		/* remove empty directory */
  		if (rmdir(subfile) < 0)
! 			ereport(ERROR,
  	(errcode_for_file_access(),
  	 errmsg("could not remove directory \"%s\": %m",
  			subfile)));
--- 718,724 
  
  		/* remove empty directory */
  		if (rmdir(subfile) < 0)
! 			ereport(redo ? LOG : ERROR,
  	(errcode_for_file_access(),
  	 errmsg("could not remove directory \"%s\": %m",
  			subfile)));
*** destroy_tablespace_directories(Oid table
*** 716,738 
  
  	/* remove version directory */
  	if (rmdir(linkloc_with_version_dir) < 0)
! 		ereport(ERROR,
  (errcode_for_file_access(),
   errmsg("could not remove directory \"%s\": %m",
  		linkloc_with_version_dir)));
  
  	/*
  	 * Try to remove the symlink.  We must however deal with the possibility
  	 * that it's a directory instead of a symlink --- this could happen during
  	 * WAL replay (see TablespaceCreateDbspace), and it is also the case on
  	 * Windows where junction points lstat() as directories.
  	 */
  	linkloc = pstrdup(linkloc_with_version_dir);
  	get_parent_directory(linkloc);
  	if (lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))
  	{
  		if (rmdir(linkloc) < 0)
! 			ereport(ERROR,
  	(errcode_for_file_access(),
  	 errmsg("could not remove directory \"%s\": %m",
  			linkloc)));
--- 730,759 
  
  	/* remove version directory */
  	if (rmdir(linkloc_with_version_dir) < 0)
! 	{
! 		ereport(redo ? LOG : ERROR,
  (errcode_for_file_access(),
   errmsg("could not remove directory \"%s\": %m",
  		linkloc_with_version_dir)));
+ 		pfree(linkloc_with_version_dir);
+ 		return false;
+ 	}
  
  	/*
  	 * Try to remove the symlink.  We must however deal with the possibility
  	 * that it's a directory instead of a symlink --- this could happen during
  	 * WAL replay (see TablespaceCreateDbspace), and it is also the case on
  	 * Windows where junction points lstat() as directories.
+ 	 *
+ 	 * Note: in the redo case, we'll return true if this final step fails;
+ 	 * there's no point in retrying it.
  	 */
  	linkloc = pstrdup(linkloc_with_version_dir);
  	get_parent_directory(linkloc);
  	if (lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))
  	{
  		if (rmdir(linkloc) < 0)
! 			ereport(redo ? LOG : ERROR,
  	(errcode_for_file_access(),
  	 errmsg("could not remove directory \"%s\": %m",
  			linkloc)));
*** destroy_tablespace_directories(Oid table
*** 740,746 
  	else
  	{
  		if (unlink(linkloc) < 0)
! 			ereport(ERROR,
  	(errcode_for_file_access(),
  	 errmsg("could not remove symbolic link \"%s\": %m",
  			linkloc)));
--- 761,767 
  	else
  	{
  		if (unlink(linkloc) < 0)
! 			ereport(redo ? LOG : ERROR,
  	(errcode_for_file_access(),
  	 er

Re: [HACKERS] initdb and fsync

2012-02-05 Thread Noah Misch
On Sun, Feb 05, 2012 at 10:53:20AM -0800, Jeff Davis wrote:
> On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
> > If we add fsync calls to the initdb process, they should cover the entire
> > data directory tree.  This patch syncs files that initdb.c writes, but we
> > ought to also sync files that bootstrap-mode backends had written.
> 
> It doesn't make sense for initdb to take responsibility to sync files
> created by the backend. If there are important files that the backend
> creates, it should be the backend's responsibility to fsync them (and
> their parent directory, if needed). And if they are unimportant to the
> backend, then there is no reason for initdb to fsync them.

I meant primarily to illustrate the need to be comprehensive, not comment on
which executable should fsync a particular file.  Bootstrap-mode backends do
not sync anything during an initdb run on my system.  With your patch, we'll
fsync a small handful of files and leave nearly everything else vulnerable.

That being said, having each backend fsync its own writes will mean syncing
certain files several times within a single initdb run.  If the penalty from
that proves high enough, we may do well to instead have initdb.c sync
everything just once.
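
[A standalone sketch, not the posted patch, of what having initdb sync the whole tree once could look like; error reporting and symlink handling are glossed over:]

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void
fsync_recursively(const char *path)
{
	struct stat st;

	if (lstat(path, &st) < 0)
		return;					/* sketch: silently skip unreadable entries */

	if (S_ISREG(st.st_mode))
	{
		int			fd = open(path, O_RDONLY);

		if (fd >= 0)
		{
			(void) fsync(fd);
			close(fd);
		}
	}
	else if (S_ISDIR(st.st_mode))
	{
		DIR		   *dir = opendir(path);
		struct dirent *de;
		char		child[4096];

		if (dir == NULL)
			return;
		while ((de = readdir(dir)) != NULL)
		{
			if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
				continue;
			snprintf(child, sizeof(child), "%s/%s", path, de->d_name);
			fsync_recursively(child);
		}
		closedir(dir);

		/* sync the directory itself so the entries it contains are durable */
		int			dfd = open(path, O_RDONLY);

		if (dfd >= 0)
		{
			(void) fsync(dfd);
			close(dfd);
		}
	}
}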

> > initdb should do these syncs by default and offer an option to disable them.
> 
> For test frameworks that run initdb often, that makes sense.
> 
> But for developers, it doesn't make sense to spend 0.5s typing an option
> that saves you 0.3s. So, we'd need some more convenient way to choose
> the no-fsync option, like an environment variable that developers can
> set. Or maybe developers don't care about 0.3s?

Developers have shell aliases/functions/scripts and command history.  I
wouldn't object to having an environment variable control it, but I would not
personally find that more convenient than a command-line switch.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-05 Thread Tom Lane
Simon Riggs  writes:
> Please post the patch rather than fixing directly. There's some subtle
> stuff there and it would be best to discuss first.

Here's a proposed patch for the issues around unlocked updates of
shared-memory state.  After going through this I believe that there is
little risk of any real problems in the current state of the code; this
is more in the nature of future-proofing against foreseeable changes.
(One such is that we'd discussed fixing the age() function to work
during Hot Standby.)  So I suggest applying this to HEAD but not
back-patching.

Except for one thing.  I realized while looking at the NEXTOID replay
code that it is completely broken: it only advances
ShmemVariableCache->nextOid when that's less than the value in the WAL
record.  So that comparison fails if the OID counter wraps around during
replay.  I've fixed this in the attached patch by just forcibly
assigning the new value instead of trying to be smart, and I think
probably that aspect of it needs to be back-patched.
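
[A small standalone program, not PostgreSQL code, showing why the comparison-based advance misbehaves once the OID counter wraps around during replay, and why a forced assignment does not:]

#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;

int
main(void)
{
	Oid			nextOid = 4294967000u;	/* in-memory counter, close to wraparound */
	Oid			walOid = 1000u;			/* value logged after the counter wrapped */

	/* comparison-based advance: never fires, replay leaves the counter behind */
	if (nextOid < walOid)
		nextOid = walOid;
	printf("comparison-based advance: %u\n", (unsigned) nextOid);

	/* forced assignment, as in the fix described above */
	nextOid = walOid;
	printf("forced assignment:        %u\n", (unsigned) nextOid);
	return 0;
}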

regards, tom lane

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 454ca310f339db34a4cc352899fea1d663abf93b..0f4cea124d7436380c730203e6cfd1518bc5d3b2 100644
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
*** CheckPointMultiXact(void)
*** 1655,1662 
   * Set the next-to-be-assigned MultiXactId and offset
   *
   * This is used when we can determine the correct next ID/offset exactly
!  * from a checkpoint record.  We need no locking since it is only called
!  * during bootstrap and XLog replay.
   */
  void
  MultiXactSetNextMXact(MultiXactId nextMulti,
--- 1655,1663 
   * Set the next-to-be-assigned MultiXactId and offset
   *
   * This is used when we can determine the correct next ID/offset exactly
!  * from a checkpoint record.  Although this is only called during bootstrap
!  * and XLog replay, we take the lock in case any hot-standby backends are
!  * examining the values.
   */
  void
  MultiXactSetNextMXact(MultiXactId nextMulti,
*** MultiXactSetNextMXact(MultiXactId nextMu
*** 1664,1671 
--- 1665,1674 
  {
  	debug_elog4(DEBUG2, "MultiXact: setting next multi to %u offset %u",
  nextMulti, nextMultiOffset);
+ 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
  	MultiXactState->nextMXact = nextMulti;
  	MultiXactState->nextOffset = nextMultiOffset;
+ 	LWLockRelease(MultiXactGenLock);
  }
  
  /*
*** MultiXactSetNextMXact(MultiXactId nextMu
*** 1674,1685 
   *
   * This is used when we can determine minimum safe values from an XLog
   * record (either an on-line checkpoint or an mxact creation log entry).
!  * We need no locking since it is only called during XLog replay.
   */
  void
  MultiXactAdvanceNextMXact(MultiXactId minMulti,
  		  MultiXactOffset minMultiOffset)
  {
  	if (MultiXactIdPrecedes(MultiXactState->nextMXact, minMulti))
  	{
  		debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
--- 1677,1690 
   *
   * This is used when we can determine minimum safe values from an XLog
   * record (either an on-line checkpoint or an mxact creation log entry).
!  * Although this is only called during XLog replay, we take the lock in case
!  * any hot-standby backends are examining the values.
   */
  void
  MultiXactAdvanceNextMXact(MultiXactId minMulti,
  		  MultiXactOffset minMultiOffset)
  {
+ 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
  	if (MultiXactIdPrecedes(MultiXactState->nextMXact, minMulti))
  	{
  		debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
*** MultiXactAdvanceNextMXact(MultiXactId mi
*** 1691,1696 
--- 1696,1702 
  	minMultiOffset);
  		MultiXactState->nextOffset = minMultiOffset;
  	}
+ 	LWLockRelease(MultiXactGenLock);
  }
  
  /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6e84cd0a21671486693e7f94d5fda8efdedf4bb4..3d08e92d3a747cc9426206dc0623ff390f18b09c 100644
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
*** PrescanPreparedTransactions(TransactionI
*** 1715,1720 
--- 1715,1724 
  			/*
  			 * Examine subtransaction XIDs ... they should all follow main
  			 * XID, and they may force us to advance nextXid.
+ 			 *
+ 			 * We don't expect anyone else to modify nextXid, hence we don't
+ 			 * need to hold a lock while examining it.  We still acquire the
+ 			 * lock to modify it, though.
  			 */
  			subxids = (TransactionId *)
  (buf + MAXALIGN(sizeof(TwoPhaseFileHeader)));
*** PrescanPreparedTransactions(TransactionI
*** 1726,1733 
--- 1730,1739 
  if (TransactionIdFollowsOrEquals(subxid,
   ShmemVariableCache->nextXid))
  {
+ 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
  	ShmemVariableCache->nextXid = subxid;
  	TransactionIdAdvance(ShmemVariableC

Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-05 Thread Simon Riggs
On Sun, Feb 5, 2012 at 9:03 PM, Tom Lane  wrote:
>>> * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
>>> if it fails to clean out tablespace directories.  This seems to me to be
>>> the height of folly, especially when the failure is more or less an
>>> expected case.  If the error occurs the database is dead in the water,
>>> because that error is actually a PANIC and will recur on subsequent
>>> restart attempts.  Therefore there is no way to recover short of manual
>>> intervention to clean out the non-empty directory.  And why are we
>>> pulling the fire alarm like this?  Well, uh, it's because we might fail
>>> to recover some disk space in the dropped tablespace.  Seems to me to be
>>> a lot better to just elog(LOG) and move on.  This is quite analogous to
>>> the case of failing to unlink a file after commit --- wasting disk space
>>> might be bad, but it's very much the lesser evil compared to this.
>
>> If the sysadmin is managing the db properly then this shouldn't ever
>> happen - the only cause is if the tablespace being dropped is being
>> used as a temp tablespace on the standby.
>
> Right, but that is an expected/foreseeable situation.  It should not
> lead to a dead-and-unrestartable database.
>
>> If you just LOG, when exactly would we get rid of the tablespace?
>
> The tablespace *is* gone, or at least its catalog entries are.  All we
> are trying to do here is release some underlying disk space.  It's
> exactly analogous to the case where we drop a table and then find (post
> commit) that unlinking the disk file fails for some weird reason.
> We've done what we can to clean the disk space and should just let it
> go --- there is no risk to database integrity in leaving some files
> behind, so killing the server is a huge overreaction.

I agree the tablespace entries are gone, but that won't stop existing
users from continuing.

If we're not sure of the reason why tablespace removal fails it
doesn't seem safe to continue to me.

But since this is a rare corner case, and we already try to remove
users, then LOG seems OK.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

2012-02-05 Thread Dan Scales
Thanks for the detailed followup.  I do see how Postgres is tuned for
having a bunch of memory available that is not in shared_buffers, both
for the OS buffer cache and other memory allocations.  However, Postgres
seems to run fine in many "large shared_memory" configurations that I
gave performance numbers for, including 5G shared_buffers for an 8G
machine, 3G shared_buffers for a 6G machine, etc.  There just has to be
sufficient extra memory beyond the shared_buffers cache.

I think the pgbench run is pointing out a problem that this double_writes
implementation has with BULK_WRITEs.  As you point out, the
BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions.
I'm not sure if there is a great solution that always works for that
issue.  However, I do notice that BULK_WRITE data isn't WAL-logged unless
archiving/replication is happening.  As I understand it, if the
BULK_WRITE data isn't being WAL-logged, then it doesn't have to be
double-written either.  The BULK_WRITE data is not officially synced and
committed until it is all written, so there doesn't have to be any
torn-page protection for that data, which is why the WAL logging can be
omitted.  The double-write implementation can be improved by marking each
buffer if it doesn't need torn-page protection.  These buffers would be
those new pages that are explicitly not WAL-logged, even when
full_page_writes is enabled.  When such a buffer is eventually synced
(perhaps because of an eviction), it would not be double-written.  This
would often avoid double-writes for BULK_WRITE, etc., especially since
the administrator is often not archiving or doing replication when doing
bulk loads.
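
[A rough sketch of the buffer-marking idea, assuming a spare buffer-header flag; BM_NO_TORN_PAGE_PROTECTION and buffer_needs_double_write() are hypothetical names, not part of the posted patch:]

#include "postgres.h"
#include "storage/buf_internals.h"

/* hypothetical flag, set wherever a new page is written without WAL logging */
#define BM_NO_TORN_PAGE_PROTECTION	(1 << 15)

static bool
buffer_needs_double_write(volatile BufferDesc *buf)
{
	/*
	 * Pages written without WAL (e.g. BULK_WRITE into a newly created
	 * relation) are not relied upon until the relation is synced as a
	 * whole, so a torn write cannot hurt them and the double-write can
	 * be skipped.
	 */
	return (buf->flags & BM_NO_TORN_PAGE_PROTECTION) == 0;
}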

However, overall, I think the idea is that double writes are an optional
optimization.  The user would only turn it on in existing configurations
where it helps or only slightly hurts performance, and where greatly
reducing the size of the WAL logs is beneficial.  It might also be
especially beneficial when there is a small amount of FLASH or other
kind of fast storage that the double-write files can be stored on.

Thanks,

Dan


- Original Message -
From: "Robert Haas" 
To: "Dan Scales" 
Cc: "PG Hackers" 
Sent: Friday, February 3, 2012 1:48:54 PM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

On Fri, Feb 3, 2012 at 3:14 PM, Dan Scales  wrote:
> Thanks for the feedback!  I think you make a good point about the small size 
> of dirty data in the OS cache.  I think what you can say about this 
> double-write patch is that it will not work well for configurations that 
> have a small Postgres cache and a large OS cache, since every write from the 
> Postgres cache requires double-writes and an fsync.

The general guidance for setting shared_buffers these days is 25% of
RAM up to a maximum of 8GB, so the configuration that you're
describing as not optimal for this patch is the one normally used when
running PostgreSQL.  I've run across several cases where larger values
of shared_buffers are a huge win, because the entire working set can
then be accommodated in shared_buffers.  But it's certainly not the
case that all working sets fit.

And in this case, I think that's beside the point anyway.  I had
shared_buffers set to 8GB on a machine with much more memory than
that, but the database created by pgbench -i -s 10 is about 156 MB, so
the problem isn't that there is too little PostgreSQL cache available.
 The entire database fits in shared_buffers, with most of it left
over.  However, because of the BufferAccessStrategy stuff, pages start
to get forced out to the OS pretty quickly.  Of course, we could
disable the BufferAccessStrategy stuff when double_writes is in use,
but bear in mind that the reason we have it in the first place is to
prevent cache trashing effects.  It would be imprudent of us to throw
that out the window without replacing it with something else that
would provide similar protection.  And even if we did, that would just
delay the day of reckoning.  You'd be able to blast through and dirty
the entirety of shared_buffers at top speed, but then as soon as you
started replacing pages performance would slow to an utter crawl, just
as it did here, only you'd need a bigger scale factor to trigger the
problem.

The more general point here is that there are MANY aspects of
PostgreSQL's design that assume that shared_buffers accounts for a
relatively small percentage of system memory.  Here's another one: we
assume that backends that need temporary memory for sorts and hashes
(i.e. work_mem) can just allocate it from the OS.  If we were to start
recommending setting shared_buffers to large percentages of the
available memory, we'd probably have to rethink that.  Most likely,
we'd need some kind of in-core mechanism for allocating temporary
memory from the shared memory segment.  And here's yet another one: we
assume that it is better to recycle old WAL files and overwrite the
contents rather than create n

Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-05 Thread Tom Lane
Simon Riggs  writes:
> On Sun, Feb 5, 2012 at 7:18 PM, Tom Lane  wrote:
>> * seq_redo() has the same disease, since we allow SELECT * FROM
>> sequences.

> Why do we do that?

It's the only existing way to obtain the parameters of a sequence.
Even if we felt like inventing a different API for doing that, it'd take
years to change every client that knows about this.

>> * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
>> if it fails to clean out tablespace directories.  This seems to me to be
>> the height of folly, especially when the failure is more or less an
>> expected case.  If the error occurs the database is dead in the water,
>> because that error is actually a PANIC and will recur on subsequent
>> restart attempts.  Therefore there is no way to recover short of manual
>> intervention to clean out the non-empty directory.  And why are we
>> pulling the fire alarm like this?  Well, uh, it's because we might fail
>> to recover some disk space in the dropped tablespace.  Seems to me to be
>> a lot better to just elog(LOG) and move on.  This is quite analogous to
>> the case of failing to unlink a file after commit --- wasting disk space
>> might be bad, but it's very much the lesser evil compared to this.

> If the sysadmin is managing the db properly then this shouldn't ever
> happen - the only cause is if the tablespace being dropped is being
> used as a temp tablespace on the standby.

Right, but that is an expected/foreseeable situation.  It should not
lead to a dead-and-unrestartable database.

> If you just LOG, when exactly would we get rid of the tablespace?

The tablespace *is* gone, or at least its catalog entries are.  All we
are trying to do here is release some underlying disk space.  It's
exactly analogous to the case where we drop a table and then find (post
commit) that unlinking the disk file fails for some weird reason.
We've done what we can to clean the disk space and should just let it
go --- there is no risk to database integrity in leaving some files
behind, so killing the server is a huge overreaction.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-05 Thread Simon Riggs
On Sun, Feb 5, 2012 at 7:18 PM, Tom Lane  wrote:
> Pursuant to the recent discussion about bugs 6200 and 6245, I went
> trawling through all the WAL redo routines looking for bugs such as
> inadequate locking or transiently trashing shared buffers.  Here's
> what I found:
>
> * As we already knew, RestoreBkpBlocks is broken because it transiently
> trashes a shared buffer, which another process could be accessing while
> holding only a pin.

Agreed

> * seq_redo() has the same disease, since we allow SELECT * FROM
> sequences.

Why do we do that?

> * Everything else seems to be free of that specific issue; in particular
> the index-related replay routines are at fairly low risk since we don't
> have any coding rules allowing index pages to be examined without
> holding a buffer lock.

Yep

> * There are assorted replay routines that assume they can whack fields
> of ShmemVariableCache around without any lock.  However, it's pretty
> inconsistent; about half do it like that, while the other half assume
> that they can read ShmemVariableCache without lock but should acquire
> lock to modify it.  I think the latter coding rule is a whole lot safer
> in the presence of Hot Standby and should be adopted throughout.

Agreed

> * Same goes for MultiXactSetNextMXact and MultiXactAdvanceNextMXact.
> It's not entirely clear to me that no read-only transaction can ever
> examine the shared-memory variables they change.  In any case, if there
> is in fact no other process examining those variables, there can be no
> contention for the lock so it should be cheap to get.

Row locking requires a WAL record to be written, so that whole path is
dead during HS.

> * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
> if it fails to clean out tablespace directories.  This seems to me to be
> the height of folly, especially when the failure is more or less an
> expected case.  If the error occurs the database is dead in the water,
> because that error is actually a PANIC and will recur on subsequent
> restart attempts.  Therefore there is no way to recover short of manual
> intervention to clean out the non-empty directory.  And why are we
> pulling the fire alarm like this?  Well, uh, it's because we might fail
> to recover some disk space in the dropped tablespace.  Seems to me to be
> a lot better to just elog(LOG) and move on.  This is quite analogous to
> the case of failing to unlink a file after commit --- wasting disk space
> might be bad, but it's very much the lesser evil compared to this.

If the sysadmin is managing the db properly then this shouldn't ever
happen - the only cause is if the tablespace being dropped is being
used as a temp tablespace on the standby.

The ERROR is appropriate because we first try to remove the files. If
they won't go we then raise an all-session conflict and then try
again. Only when we fail the second time does it ERROR, which seems
OK.

If you just LOG, when exactly would we get rid of the tablespace?

> Barring objections I'm going to fix all this stuff and back-patch as
> far as 9.0 where hot standby was added.

Please post the patch rather than fixing directly. There's some subtle
stuff there and it would be best to discuss first.

Thanks

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6425: Bus error in slot_deform_tuple

2012-02-05 Thread Simon Riggs
On Sat, Feb 4, 2012 at 6:49 PM, Tom Lane  wrote:
> Simon Riggs  writes:
>> The cause here is data changing underneath the user. Your patch solves
>> the most obvious error, but it still allows other problems if applying
>> the backup block changes data. If the backup block doesn't do anything
>> at all then we don't need to apply it either.
>
> This is nonsense.  What applying the backup block does is to apply the
> change that the WAL record would otherwise have applied, except we
> decided to make it store a full-page image instead.

Yep, you're right, my bad.

Got a head cold, so will lay off a few days from too much thinking.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-05 Thread Simon Riggs
On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian  wrote:
> On Sat, Dec 24, 2011 at 03:56:58PM +, Simon Riggs wrote:
>> > Also, as far as I can see this patch usurps the page version field,
>> > which I find unacceptably short-sighted.  Do you really think this is
>> > the last page layout change we'll ever make?
>>
>> No, I don't. I hope and expect the next page layout change to
>> reintroduce such a field.
>>
>> But since we're agreed now that upgrading is important, changing page
>> format isn't likely to be happening until we get an online upgrade
>> process. So future changes are much less likely. If they do happen, we
>> have some flag bits spare that can be used to indicate later versions.
>> It's not the prettiest thing in the world, but it's a small ugliness
>> in return for an important feature. If there was a way without that, I
>> would have chosen it.
>
> Have you considered the CRC might match a valid page version number?
> Is that safe?

In the proposed scheme there are two flag bits set on the page to
indicate whether the field is used as a checksum rather than a version
number. So it's possible the checksum could look like a valid page
version number, but we'd still be able to tell the difference.
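
[For illustration, a minimal sketch of the rule being described; PD_CHECKSUM1/PD_CHECKSUM2 and their bit values are assumptions here, not taken from the actual patch:]

#include "postgres.h"
#include "storage/bufpage.h"

#define PD_CHECKSUM1	0x0008		/* assumed spare pd_flags bits */
#define PD_CHECKSUM2	0x0010

static inline bool
page_field_holds_checksum(PageHeader page)
{
	/*
	 * Only when both bits are set is the 16-bit field interpreted as a
	 * checksum; otherwise it is still the page version number, so a
	 * checksum that happens to equal a valid version number is harmless.
	 */
	return (page->pd_flags & (PD_CHECKSUM1 | PD_CHECKSUM2)) ==
		(PD_CHECKSUM1 | PD_CHECKSUM2);
}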

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-05 Thread Kohei KaiGai
2012/2/1 12:15 Shigeru Hanada :
> (2012/01/30 4:39), Kohei KaiGai wrote:
>> I checked the "fdw_helper_funcs_v3.patch", "pgsql_fdw_v5.patch" and
>> "pgsql_fdw_pushdown_v1.patch". My comments are below.
>
> Thanks for the review!
>
>> [BUG]
>> Even though pgsql_fdw tries to push-down qualifiers being executable
>> on the remote side in deparseSql(), it does not remove qualifiers
>> being pushed down from the baserel->baserestrictinfo, thus, these
>> qualifiers are eventually executed twice.
>>
>> See the result of this EXPLAIN.
>>    postgres=# EXPLAIN SELECT * FROM ft1 WHERE a > 2 AND f_leak(b);
>>  QUERY PLAN
>>
>> --
>> Foreign Scan on ft1  (cost=107.43..122.55 rows=410 width=36)
>>   Filter: (f_leak(b) AND (a > 2))
>>   Remote SQL: DECLARE pgsql_fdw_cursor_0 SCROLL CURSOR FOR SELECT
>> a, b FROM public.t1 WHERE (a > 2)
>>(3 rows)
>>
>> My expectation is (a > 2) being executed on the remote-side and f_leak(b)
>> being executed on the local-side. But, filter of foreign-scan on ft1 has both
>> of qualifiers. It has to be removed, if a RestrictInfo get pushed-down.
>
> It's intentional that pgsql_fdw keeps pushed-down qualifiers in
> baserestrictinfo, because I saw some undesirable behavior when I
> implemented that optimization and a plan was reused, but it's not clear
> to me now.  I'll try to recall what I saw...
>
> BTW, I think evaluating pushed-down qualifiers again on the local side is
> safe and has no semantic problem, though we must pay a little for such
> overhead.  Do you have a concern about performance?
>
Yes. In my opinion, one significant benefit of pgsql_fdw is executing
qualifiers on the distributed nodes; that enables us to utilize multiple
CPU resources efficiently.
Duplicate checks are a reliable way to keep invisible tuples filtered
out, indeed, but they degrade one of the competitive characteristics of
pgsql_fdw.

https://github.com/kaigai/pg_strom/blob/master/plan.c#L693
In my module, qualifiers executable on the device side are detached
from baserel->baserestrictinfo, and the remaining qualifiers are chained
to the list.
The is_device_executable_qual() is equivalent to is_foreign_expr() in
the pgsql_fdw module.
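
[For what it's worth, a minimal sketch of that kind of split, with assumed variable names and an assumed is_foreign_expr() signature -- not the pgsql_fdw code itself:]

/* in the FDW's planning step, with the usual root/baserel at hand */
List	   *local_quals = NIL;
List	   *remote_quals = NIL;
ListCell   *lc;

foreach(lc, baserel->baserestrictinfo)
{
	RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

	if (is_foreign_expr(root, baserel, rinfo->clause))
		remote_quals = lappend(remote_quals, rinfo);	/* deparse into remote SQL */
	else
		local_quals = lappend(local_quals, rinfo);		/* keep for the local Filter */
}

/* leave only the locally-checked qualifiers behind */
baserel->baserestrictinfo = local_quals;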

Of course, it is your decision, and I might miss something.

BTW, what is the undesirable behavior on your previous implementation?

>> [Design comment]
>> I'm not sure why store_result() uses MessageContext to save
>> the Tuplestorestate within PgsqlFdwExecutionState.
>> The source code comment says it is used to avoid memory leaks in error
>> cases.  I also have similar experience from implementing my own fdw
>> module, so I can understand that the per-scan context has already been
>> cleared by the time the resource-release callback runs; thus, handles to
>> external resources have to be saved in a separate memory context.
>> In my personal opinion, the right design is to declare a memory context for
>> pgsql_fdw itself, instead of abusing an existing memory context.
>> (A wiser design is to define a sub-memory-context for each foreign scan,
>> then remove the sub-memory-context after releasing the handles.)
>
> I simply chose a built-in context which has a long enough lifespan, but now
> I think that using MessageContext directly is not the recommended way.  As
> you say, creating a new context as a child of MessageContext for each scan
> in BeginForeignScan (or the first IterateForeignScan) would be better.
> Please see the attached patch.
>
> One other option is getting rid of the tuplestore by holding result rows
> as a PGresult, and tracking it for error cases which might happen.
> A ResourceOwner callback can be used to release the PGresult on error,
> similar to PGconn.
>
If we could have the set of results in a per-query memory context (thus,
no need for an explicit release on error), that would be a more ideal
design.  Is it possible to implement that based on the libpq APIs?

Please note that the per-query memory context has already been released
by the time the ResourceOwner callback is launched, so this approach is
not workable if libpq requires releasing some resource.
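
[A minimal sketch of the per-scan child-of-MessageContext approach mentioned above, guessing at the attached patch; festate->scan_cxt and festate->tuples are assumed field names:]

/* in BeginForeignScan (or the first IterateForeignScan) */
festate->scan_cxt = AllocSetContextCreate(MessageContext,
										  "pgsql_fdw per-scan data",
										  ALLOCSET_DEFAULT_MINSIZE,
										  ALLOCSET_DEFAULT_INITSIZE,
										  ALLOCSET_DEFAULT_MAXSIZE);

/* ... later, when storing the fetched rows ... */
oldcxt = MemoryContextSwitchTo(festate->scan_cxt);
festate->tuples = tuplestore_begin_heap(true, false, work_mem);
MemoryContextSwitchTo(oldcxt);

/* EndForeignScan (and error cleanup of MessageContext) reclaims it all */
MemoryContextDelete(festate->scan_cxt);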

>> [Design comment]
>> When should "BEGIN" be issued on the remote side?
>> The connect_pg_server() is the only chance to issue a "BEGIN" command
>> on the remote side, when the connection is opened.  However, the
>> connection is kept open unless an error is raised.  Thus, the remote side
>> will continue to work within a single transaction block, even if the
>> local side iterates through pairs of "BEGIN" and "COMMIT".
>> I'd like to suggest closing the transaction block at the end of the
>> scan, the transaction, or the sub-transaction.
>
> Indeed, remote transactions should be terminated at some point.
> Terminating at the end of a scan seems troublesome because a connection
> might be shared by multiple scans in a query.  I'd prefer aborting the
> remote transaction at the end of the local query.  Please se

Re: [HACKERS] JSON output functions.

2012-02-05 Thread Andrew Dunstan



On 02/05/2012 02:31 PM, Stefan Keller wrote:

> Hi Andrew
>
> Nice work!
>
> Just for completeness: Did you also think of including geometry types
> in JSON output functions in later releases? There's a nice extension
> of JSON called GeoJSON for a starting point.



[side note: please don't top-reply on -hackers. See 
]


Currently, in array_to_json and row_to_json the only special cases are:

 * record types are output as JSON records
 * array types are output as JSON arrays
 * numeric types are output without quoting
 * boolean types are output as unquoted true or false
 * NULLs are output as NULL


Everything else is output as its text representation, suitably quoted 
and escaped.


If you want to change how those operate, now rather than later would be 
the best time. But later we could certainly add various other 
foo_to_json functions for things like geometry types and hstores.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-02-05 Thread Tom Lane
Jeff Davis  writes:
> On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
>> If we add fsync calls to the initdb process, they should cover the entire
>> data directory tree.  This patch syncs files that initdb.c writes, but we
>> ought to also sync files that bootstrap-mode backends had written.

> It doesn't make sense for initdb to take responsibility to sync files
> created by the backend.

No, but the more interesting question is whether bootstrap mode troubles
to fsync its writes.  I'm not too sure about that either way ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON output functions.

2012-02-05 Thread Stefan Keller
Hi Andrew

Nice work!

Just for completeness: Did you also think of including geometry types
in JSON output functions in later releases? There's a nice extension
of JSON called GeoJSON for a starting point.

Yours, Stefan


2012/2/3 Andrew Dunstan :
>
>
> On 02/02/2012 12:20 PM, Pavel Stehule wrote:
>>
>> 2012/2/2 Andrew Dunstan:
>>>
>>>
>>> On 02/02/2012 04:35 AM, Abhijit Menon-Sen wrote:

 At 2012-02-01 18:48:28 -0500, andrew.duns...@pgexperts.com wrote:
>
> For now I'm inclined not to proceed with that, and leave it as an
> optimization to be considered later if necessary. Thoughts?

 I agree, there doesn't seem to be a pressing need to do it now.

>>>
>>> OK, here's my final version of the patch for constructor functions. If
>>> there's no further comment I'll go with this.
>>
>> These function are super, Thank you
>>
>> Do you plan to fix the issue with row attribute names in 9.2?
>
>
>
>
> Yeah. Tom did some initial work which he published here:
> ,
> noting:
>
>   It's not really ideal with respect to
>   the ValuesScan case, because what you get seems to always be the
>   hard-wired "columnN" names for VALUES columns, even if you try to
>   override that with an alias
>   ...
>   Curiously, it works just fine if the VALUES can be folded
>
> and later he said:
>
>   Upon further review, this patch would need some more work even for the
>   RowExpr case, because there are several places that build RowExprs
>   without bothering to build a valid colnames list.  It's clearly soluble
>   if anyone cares to put in the work, but I'm not personally excited
>   enough to pursue it ..
>
> I'm going to look at that issue first, since the unfolded VALUES clause
> seems like something of an obscure corner case. Feel free to chime in if you
> can.
>
>
> cheers
>
>
> andrew
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Report: race conditions in WAL replay routines

2012-02-05 Thread Tom Lane
Pursuant to the recent discussion about bugs 6200 and 6245, I went
trawling through all the WAL redo routines looking for bugs such as
inadequate locking or transiently trashing shared buffers.  Here's
what I found:

* As we already knew, RestoreBkpBlocks is broken because it transiently
trashes a shared buffer, which another process could be accessing while
holding only a pin.

* seq_redo() has the same disease, since we allow SELECT * FROM
sequences.

* Everything else seems to be free of that specific issue; in particular
the index-related replay routines are at fairly low risk since we don't
have any coding rules allowing index pages to be examined without
holding a buffer lock.

* There are assorted replay routines that assume they can whack fields
of ShmemVariableCache around without any lock.  However, it's pretty
inconsistent; about half do it like that, while the other half assume
that they can read ShmemVariableCache without lock but should acquire
lock to modify it.  I think the latter coding rule is a whole lot safer
in the presence of Hot Standby and should be adopted throughout.

* Same goes for MultiXactSetNextMXact and MultiXactAdvanceNextMXact.
It's not entirely clear to me that no read-only transaction can ever
examine the shared-memory variables they change.  In any case, if there
is in fact no other process examining those variables, there can be no
contention for the lock so it should be cheap to get.

* Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
if it fails to clean out tablespace directories.  This seems to me to be
the height of folly, especially when the failure is more or less an
expected case.  If the error occurs the database is dead in the water,
because that error is actually a PANIC and will recur on subsequent
restart attempts.  Therefore there is no way to recover short of manual
intervention to clean out the non-empty directory.  And why are we
pulling the fire alarm like this?  Well, uh, it's because we might fail
to recover some disk space in the dropped tablespace.  Seems to me to be
a lot better to just elog(LOG) and move on.  This is quite analogous to
the case of failing to unlink a file after commit --- wasting disk space
might be bad, but it's very much the lesser evil compared to this.

Barring objections I'm going to fix all this stuff and back-patch as
far as 9.0 where hot standby was added.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] plpgsql leaking memory when stringifying datums

2012-02-05 Thread Jan Urbański
While chasing a PL/Python memory leak, I did a few tests with PL/pgSQL
and I think there are places where memory is not freed sufficiently early.

Attached are two functions that when run will make the backend's memory
consumption increase until they finish. With both, the cause is
convert_value_to_string that calls a datum's output function, which for
toasted data results in an allocation.

The proposed patch changes convert_value_to_string to call the output
function in the per-tuple memory context and then copy the result string
back to the original context.
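
[A sketch of roughly what that change amounts to; this is not the exact patch text:]

static char *
convert_value_to_string(PLpgSQL_execstate *estate, Datum value, Oid valtype)
{
	char	   *result;
	MemoryContext oldcontext;
	Oid			typoutput;
	bool		typIsVarlena;

	/* run the output function (which may detoast/palloc) in per-tuple memory */
	oldcontext = MemoryContextSwitchTo(estate->eval_econtext->ecxt_per_tuple_memory);
	getTypeOutputInfo(valtype, &typoutput, &typIsVarlena);
	result = OidOutputFunctionCall(typoutput, value);
	MemoryContextSwitchTo(oldcontext);

	/* copy just the result string back into the caller's context */
	return pstrdup(result);
}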

The comment in that function says that callers generally pfree its
result, but that wasn't the case with exec_stmt_raise, so I added a
pfree() there as well.

With that I was still left with a leak in the typecast() test function
and it turns out that sticking an exec_eval_cleanup into exec_move_row
fixed it. The regression tests pass, but I'm not 100% sure if it's
actually safe.

Since convert_value_to_string needed to access the PL/pgSQL's execution
state to get its hands on the per-tuple context, I needed to pass it to
every caller that didn't have it already, which means exec_cast_value
and exec_simple_cast_value.  Does anyone have a better idea?

The initial diagnosis and proposed solution are by Andres Freund - thanks!

Cheers,
Jan
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index bf952b6..a691905 100644
*** a/src/pl/plpgsql/src/pl_exec.c
--- b/src/pl/plpgsql/src/pl_exec.c
*** static void exec_move_row(PLpgSQL_execst
*** 188,201 
  static HeapTuple make_tuple_from_row(PLpgSQL_execstate *estate,
  	PLpgSQL_row *row,
  	TupleDesc tupdesc);
! static char *convert_value_to_string(Datum value, Oid valtype);
! static Datum exec_cast_value(Datum value, Oid valtype,
  Oid reqtype,
  FmgrInfo *reqinput,
  Oid reqtypioparam,
  int32 reqtypmod,
  bool isnull);
! static Datum exec_simple_cast_value(Datum value, Oid valtype,
  	   Oid reqtype, int32 reqtypmod,
  	   bool isnull);
  static void exec_init_tuple_store(PLpgSQL_execstate *estate);
--- 188,204 
  static HeapTuple make_tuple_from_row(PLpgSQL_execstate *estate,
  	PLpgSQL_row *row,
  	TupleDesc tupdesc);
! static char *convert_value_to_string(PLpgSQL_execstate *estate,
! 	Datum value, Oid valtype);
! static Datum exec_cast_value(PLpgSQL_execstate *estate,
! Datum value, Oid valtype,
  Oid reqtype,
  FmgrInfo *reqinput,
  Oid reqtypioparam,
  int32 reqtypmod,
  bool isnull);
! static Datum exec_simple_cast_value(PLpgSQL_execstate *estate,
! 	   Datum value, Oid valtype,
  	   Oid reqtype, int32 reqtypmod,
  	   bool isnull);
  static void exec_init_tuple_store(PLpgSQL_execstate *estate);
*** plpgsql_exec_function(PLpgSQL_function *
*** 430,436 
  		else
  		{
  			/* Cast value to proper type */
! 			estate.retval = exec_cast_value(estate.retval, estate.rettype,
  			func->fn_rettype,
  			&(func->fn_retinput),
  			func->fn_rettypioparam,
--- 433,440 
  		else
  		{
  			/* Cast value to proper type */
! 			estate.retval = exec_cast_value(&estate, estate.retval,
! 			estate.rettype,
  			func->fn_rettype,
  			&(func->fn_retinput),
  			func->fn_rettypioparam,
*** exec_stmt_fori(PLpgSQL_execstate *estate
*** 1757,1763 
  	 * Get the value of the lower bound
  	 */
  	value = exec_eval_expr(estate, stmt->lower, &isnull, &valtype);
! 	value = exec_cast_value(value, valtype, var->datatype->typoid,
  			&(var->datatype->typinput),
  			var->datatype->typioparam,
  			var->datatype->atttypmod, isnull);
--- 1761,1767 
  	 * Get the value of the lower bound
  	 */
  	value = exec_eval_expr(estate, stmt->lower, &isnull, &valtype);
! 	value = exec_cast_value(estate, value, valtype, var->datatype->typoid,
  			&(var->datatype->typinput),
  			var->datatype->typioparam,
  			var->datatype->atttypmod, isnull);
*** exec_stmt_fori(PLpgSQL_execstate *estate
*** 1772,1778 
  	 * Get the value of the upper bound
  	 */
  	value = exec_eval_expr(estate, stmt->upper, &isnull, &valtype);
! 	value = exec_cast_value(value, valtype, var->datatype->typoid,
  			&(var->datatype->typinput),
  			var->datatype->typioparam,
  			var->datatype->atttypmod, isnull);
--- 1776,1782 
  	 * Get the value of the upper bound
  	 */
  	value = exec_eval_expr(estate, stmt->upper, &isnull, &valtype);
! 	value = exec_cast_value(estate, value, valtype, var->datatype->typoid,
  			&(var->datatype->typinput),
  			var->datatype->typioparam,
  			var->datatype->atttypmod, isnull);
*** exec_stmt_fori(PLpgSQL_execstate *estate
*** 1789,1795 
  	if (stmt->step)
  	{
  		value = exec_eval_expr(estate, stmt->step, &isnull, &valtype);
! 		value = exec_cast_value(value, valtype, var->datatype->typoid,
  &(var->datatype->typinput),
  			

[HACKERS] pl/python long-lived allocations in datum->dict transformation

2012-02-05 Thread Jan Urbański
Consider this:

create table arrays as select array[random(), random(), random(),
random(), random(), random()] as a from generate_series(1, 100);

create or replace function plpython_outputfunc() returns void as $$
c = plpy.cursor('select a from arrays')
for row in c:
pass
$$ language plpythonu;

When running the function, every datum gets transformed into a Python
object (each row becoming a dict), which involves calling the type's
output function and thus allocating memory. The memory is allocated in
the SPI context, so it accumulates until the function is finished.

This is annoying for functions that plough through large tables, doing
some calculation. Attached is a patch that does the conversion of
PostgreSQL Datums into Python dict objects in a scratch memory context
that gets reset after each tuple.
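
A minimal sketch of the pattern (the function and context names here are
illustrative, not taken from the patch):

#include "postgres.h"
#include "utils/memutils.h"

/*
 * Sketch only: everything allocated while "scratch" is current is thrown
 * away wholesale by the reset, so per-row garbage never piles up in the
 * SPI context.
 */
static void
convert_rows_with_scratch_context(void)
{
	MemoryContext scratch;
	MemoryContext oldcontext;

	scratch = AllocSetContextCreate(CurrentMemoryContext,
									"per-row scratch ctx",
									ALLOCSET_DEFAULT_MINSIZE,
									ALLOCSET_DEFAULT_INITSIZE,
									ALLOCSET_DEFAULT_MAXSIZE);

	/* one pass per row; the actual row-fetching loop is elided */
	oldcontext = MemoryContextSwitchTo(scratch);
	/* ... call output functions, build the per-row Python objects ... */
	MemoryContextSwitchTo(oldcontext);
	MemoryContextReset(scratch);

	MemoryContextDelete(scratch);
}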

Cheers,
Jan
diff --git a/src/pl/plpython/plpy_main.c b/src/pl/plpython/plpy_main.c
index ae9d87e..79d7784 100644
*** a/src/pl/plpython/plpy_main.c
--- b/src/pl/plpython/plpy_main.c
***
*** 12,17 
--- 12,18 
  #include "executor/spi.h"
  #include "miscadmin.h"
  #include "utils/guc.h"
+ #include "utils/memutils.h"
  #include "utils/syscache.h"
  
  #include "plpython.h"
*** plpython_inline_handler(PG_FUNCTION_ARGS
*** 268,273 
--- 269,279 
  
  	MemSet(&proc, 0, sizeof(PLyProcedure));
  	proc.pyname = PLy_strdup("__plpython_inline_block");
+ 	proc.tmp_ctx = AllocSetContextCreate(TopMemoryContext,
+ 		 "PL/Python temporary ctx",
+ 		 ALLOCSET_DEFAULT_MINSIZE,
+ 		 ALLOCSET_DEFAULT_INITSIZE,
+ 		 ALLOCSET_DEFAULT_MAXSIZE);
  	proc.result.out.d.typoid = VOIDOID;
  
  	PG_TRY();
diff --git a/src/pl/plpython/plpy_procedure.c b/src/pl/plpython/plpy_procedure.c
index 229966a..f539cec 100644
*** a/src/pl/plpython/plpy_procedure.c
--- b/src/pl/plpython/plpy_procedure.c
***
*** 12,17 
--- 12,18 
  #include "catalog/pg_type.h"
  #include "utils/builtins.h"
  #include "utils/hsearch.h"
+ #include "utils/memutils.h"
  #include "utils/syscache.h"
  
  #include "plpython.h"
*** PLy_procedure_create(HeapTuple procTup,
*** 169,174 
--- 170,180 
  	proc->setof = NULL;
  	proc->src = NULL;
  	proc->argnames = NULL;
+ 	proc->tmp_ctx = AllocSetContextCreate(TopMemoryContext,
+ 		  "PL/Python temporary ctx",
+ 		  ALLOCSET_DEFAULT_MINSIZE,
+ 		  ALLOCSET_DEFAULT_INITSIZE,
+ 		  ALLOCSET_DEFAULT_MAXSIZE);
  
  	PG_TRY();
  	{
*** PLy_procedure_delete(PLyProcedure *proc)
*** 411,416 
--- 417,424 
  		PLy_free(proc->src);
  	if (proc->argnames)
  		PLy_free(proc->argnames);
+ 	if (proc->tmp_ctx)
+ 		MemoryContextDelete(proc->tmp_ctx);
  }
  
  /*
diff --git a/src/pl/plpython/plpy_procedure.h b/src/pl/plpython/plpy_procedure.h
index e986c7e..5ee73fa 100644
*** a/src/pl/plpython/plpy_procedure.h
--- b/src/pl/plpython/plpy_procedure.h
*** typedef struct PLyProcedure
*** 30,35 
--- 30,36 
  	PyObject   *code;			/* compiled procedure code */
  	PyObject   *statics;		/* data saved across calls, local scope */
  	PyObject   *globals;		/* data saved across calls, global scope */
+ 	MemoryContext tmp_ctx;
  } PLyProcedure;
  
  /* the procedure cache entry */
diff --git a/src/pl/plpython/plpy_typeio.c b/src/pl/plpython/plpy_typeio.c
index d5cac9f..6f0eb46 100644
*** a/src/pl/plpython/plpy_typeio.c
--- b/src/pl/plpython/plpy_typeio.c
***
*** 23,28 
--- 23,29 
  #include "plpy_typeio.h"
  
  #include "plpy_elog.h"
+ #include "plpy_procedure.h"
  
  
  /* I/O function caching */
*** PLy_output_record_funcs(PLyTypeInfo *arg
*** 258,268 
  	Assert(arg->is_rowtype == 1);
  }
  
  PyObject *
  PLyDict_FromTuple(PLyTypeInfo *info, HeapTuple tuple, TupleDesc desc)
  {
! 	PyObject   *volatile dict;
! 	int			i;
  
  	if (info->is_rowtype != 1)
  		elog(ERROR, "PLyTypeInfo structure describes a datum");
--- 259,274 
  	Assert(arg->is_rowtype == 1);
  }
  
+ /*
+  * Transform a tuple into a Python dict object. The transformation can result
+  * in memory allocation, so it's done in a scratch memory context.
+  */
  PyObject *
  PLyDict_FromTuple(PLyTypeInfo *info, HeapTuple tuple, TupleDesc desc)
  {
! 	PyObject		*volatile dict;
! 	MemoryContext	oldcontext;
! 	int			i;
  
  	if (info->is_rowtype != 1)
  		elog(ERROR, "PLyTypeInfo structure describes a datum");
*** PLyDict_FromTuple(PLyTypeInfo *info, Hea
*** 271,276 
--- 277,284 
  	if (dict == NULL)
  		PLy_elog(ERROR, "could not create new dictionary");
  
+ 	oldcontext = MemoryContextSwitchTo(PLy_curr_procedure->tmp_ctx);
+ 
  	PG_TRY();
  	{
  		for (i = 0; i < info->in.r.natts; i++)
*** PLyDict_FromTuple(PLyTypeInfo *info, Hea
*** 298,308 
--- 306,322 
  	}
  	PG_CATCH();
  	{
+ 		MemoryContextSwitchTo(oldcontext);
+ 		MemoryContextReset(PLy_curr_procedure->tmp_ctx);
+ 
  		Py_DECREF(dict);
  		PG_RE_THROW();
  	}
  	PG_END_TRY();
  
+ 	MemoryContext

Re: [HACKERS] initdb and fsync

2012-02-05 Thread Jeff Davis
On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
> If we add fsync calls to the initdb process, they should cover the entire data
> directory tree.  This patch syncs files that initdb.c writes, but we ought to
> also sync files that bootstrap-mode backends had written.

It doesn't make sense for initdb to take responsibility to sync files
created by the backend. If there are important files that the backend
creates, it should be the backend's responsibility to fsync them (and
their parent directory, if needed). And if they are unimportant to the
backend, then there is no reason for initdb to fsync them.
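
For reference, the usual shape of "sync the file and its parent
directory" looks roughly like this (a minimal POSIX sketch; names are
illustrative and error handling is abbreviated):

#include <fcntl.h>
#include <unistd.h>

/*
 * Sketch only: flush the file's contents, then flush the directory so
 * the entry naming the file is durable as well.
 */
static int
fsync_file_and_parent(const char *filepath, const char *dirpath)
{
	int		fd;

	fd = open(filepath, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	close(fd);

	fd = open(dirpath, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	close(fd);

	return 0;
}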

> An optimization
> like the pg_flush_data() call in copy_file() may reduce the speed penalty.

That worked pretty well. It took the time down by about 100ms on my
machine, which closes the gap significantly.
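
The underlying idea is to ask the kernel to start write-back early so
that a later fsync() has less left to do. One way to issue such a hint
on Linux (just a sketch using sync_file_range(); not necessarily what
pg_flush_data() does on every platform):

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Sketch only: nudge the kernel to begin writing the file's dirty pages
 * now; the eventual fsync() then finds most of the data already flushed.
 * sync_file_range() is Linux-specific.
 */
static void
hint_writeback(int fd)
{
	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
}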

> initdb should do these syncs by default and offer an option to disable them.

For test frameworks that run initdb often, that makes sense.

But for developers, it doesn't make sense to spend 0.5s typing an option
that saves you 0.3s. So, we'd need some more convenient way to choose
the no-fsync option, like an environment variable that developers can
set. Or maybe developers don't care about 0.3s?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] basic pgbench runs with various performance-related patches

2012-02-05 Thread Robert Haas
On Sat, Feb 4, 2012 at 11:59 AM, Greg Smith  wrote:
> On 01/24/2012 08:58 AM, Robert Haas wrote:
>>
>> One somewhat odd thing about these numbers is that, on permanent
>> tables, all of the patches seemed to show regressions vs. master in
>> single-client throughput.  That's a slightly difficult result to
>> believe, though, so it's probably a testing artifact of some kind.
>
> It looks like you may have run the ones against master first, then the ones
> applying various patches.  The one test artifact I have to be very careful
> to avoid in that situation is that later files on the physical disk are
> slower than earlier ones.  There's a >30% difference between the fastest
> part of a regular hard drive, the logical beginning, and its end.  Multiple
> test runs tend to creep forward onto later sections of the disk, so the
> comparison ends up biased toward the earlier run.  To eliminate that bias when it gets
> bad, I normally either a) run each test 3 times, interleaved, or b) rebuild
> the filesystem in between each initdb.
>
> I'm not sure that's the problem you're running into, but it's the only one
> I've been hit by that matches the suspicious part of your results.

I don't think that's it, because tests on various branches were
interleaved; moreover, I don't believe master was the first one in the
rotation.  I think I had them in alphabetical order by branch name,
actually.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers