Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Thom Brown
On 9 September 2011 01:04, George Barnett gbarn...@atlassian.com wrote:
 After looking through the code I found that when postgres calls write() it 
 doesn't retry.  In order to address the issue with the PANIC in the WAL 
 writer I set the sync method to o_sync which solved the issue in that part of 
 the code, however I was still seeing failures in other areas of the code 
 (such as the FileWrite function).  Following this, I spoke to an NFS guru who 
 pointed out that writes under linux are not guaranteed to complete unless you 
 open up O_SYNC or similar on the file handle.

Have you run the test with varying wal_sync_method values?  On Linux
the default is fdatasync because historically Linux hasn't supported
O_DSYNC (a wal_sync_method value of open_datasync).  But I believe as
of Kernel 2.6.33 it does support it.  Have you tried modifying this
parameter in your tests?  Are you even using Linux? (you haven't
specified)

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Peter Eisentraut
On fre, 2011-09-09 at 10:04 +1000, George Barnett wrote:
 After looking through the code I found that when postgres calls
 write() it doesn't retry.  In order to address the issue with the
 PANIC in the WAL writer I set the sync method to o_sync which solved
 the issue in that part of the code, however I was still seeing
 failures in other areas of the code (such as the FileWrite function).
 Following this, I spoke to an NFS guru who pointed out that writes
 under linux are not guaranteed to complete unless you open up O_SYNC
 or similar on the file handle.

I had this problem many years ago.  I recall that changing the mount
options for NFS also fixed it.  Could you post what mount options you
are using?

(We eventually moved away from NFS at that time, so I didn't pursue it
further, but my analysis back then matched yours.)



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Log crashed backend's query (activity string)

2011-09-09 Thread Marti Raudsepp
On Thu, Sep 8, 2011 at 03:22, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Sep 7, 2011 at 5:24 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 I remember we had bugs whereby an encoding conversion would fail,
 leading to elog trying to report this problem, but this attempt would
 also incur a conversion step, failing recursively until elog's stack got
 full.  I'm not saying this is impossible to solve, just something to
 keep in mind.

Looking at elog.c, this only seems to apply to messages sent to the
client from a backend connection. No conversion is done for log
messages.

 Can we do something like: pass through ASCII characters unchanged, and
 output anything with the high-bit set as \x<hexdigit><hexdigit>?  That
 might be garbled in some cases, but the goal here is not perfection.
 We're just trying to give the admin (or PostgreSQL-guru-for-hire) a
 clue where to start looking for the problem.

Or we might just replace them with '?'. This has the advantage of not
expanding query length 4x if it does happen to be corrupted. The vast
majority of queries are ASCII-only anyway.
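
For illustration, the scrubbing could be as simple as this (a
hypothetical helper, not an actual patch):

    /*
     * Scrub a query string in place so it is safe to emit to the log
     * without encoding conversion: any byte with the high bit set is
     * replaced with '?'.
     */
    static void
    ascii_safe_scrub(char *str)
    {
        for (; *str != '\0'; str++)
        {
            if ((unsigned char) *str >= 0x80)
                *str = '?';
        }
    }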

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unite recovery.conf and postgresql.conf

2011-09-09 Thread Magnus Hagander
On Fri, Sep 9, 2011 at 11:56, Fujii Masao masao.fu...@gmail.com wrote:
 Hi,

 http://archives.postgresql.org/pgsql-hackers/2010-12/msg02343.php

 In previous discussion, we've reached the consensus that we should unite
 recovery.conf and postgresql.conf. The attached patch does that. The
 patch is WIP, I'll have to update the document, but if you notice something,
 please feel free to comment.

 This patch allows us to specify recovery parameters like primary_conninfo
 in postgresql.conf. You can change some of them without restarting
 the server (i.e., by using pg_ctl reload) and see the current settings by
 SHOW command. But note that recovery.conf is still required as a status
 file for archive recovery and standby server even if you put all the settings
 in postgresql.conf. Recovery parameters in postgresql.conf only have effect
 when recovery.conf exists.

Why exactly do we need to keep recovery.conf in this case?


 For backward compatibility, this patch still allows us to specify recovery
 parameters in recovery.conf. But, as in past years, you cannot reload
 recovery.conf and see the current settings by SHOW command. We should
 recommend to put all the settings in postgresql.conf and empty recovery.conf,
 but you can also put all recovery parameters in recovery.conf.

Perhaps we should just have people use recovery.conf as an include
file in this case?

As a whole, I'm not sure I like the idea of having such critical
parameters in two different places. Do we really need to care about
the backwards compatibility of something like that this much?


 If the same parameter is specified in both files, the setting in recovery.conf
 overrides that in postgresql.conf. In this case, SHOW command displays
 the settings in postgresql.conf even though they are not used at all. Even if

Now, *that* is just broken, IMHO. The SHOW command should show what is
actually in use, period. Anything else is guaranteed to confuse the hell
out of anybody trying to use it.


 you change the settings in postgresql.conf and reload it, they have no effect
 because the settings in recovery.conf are used preferentially.

 These limitations on recovery.conf might confuse users. But since most
 users will put all the settings in postgresql.conf if we recommend doing
 that, I don't think it's worth complicating the source code to get rid of
 those limitations.

I think we should just get rid of supporting them in recovery.conf
completely. Recovery settings are critical, and making that confusing
to the user is not a good thing.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL low watermark during base backup

2011-09-09 Thread Magnus Hagander
On Tue, Sep 6, 2011 at 22:35, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, Sep 5, 2011 at 11:38 AM, Magnus Hagander mag...@hagander.net wrote:
 On Sun, Sep 4, 2011 at 19:02, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, Sep 2, 2011 at 6:52 PM, Magnus Hagander mag...@hagander.net wrote:

 Attached patch implements a low watermark wal location in the
 walsender shmem array. Setting this value in a walsender prevents
 transaction log removal prior to this point - similar to how
 wal_keep_segments work, except with an absolute number rather than
 relative. For now, this is set when running a base backup with WAL
 included - to prevent the required WAL to be recycled away while the
 backup is running, without having to guesstimate the value for
 wal_keep_segments. (There could be other ways added to set it in the
 future, but that's the only one I've done for now)

 It obviously needs some documentation updates as well, but I wanted to
 get some comments on the way it's done before I work on those.

 I'm not yet fully available for a discussion on this, but not sure I like 
 this.

 You don't have to guess the setting of wal_keep_segments, you
 calculate it exactly from the size of your WAL disk. No other
 calculation is easy or accurate.

 Uh, no. What about the (very large number of) cases where pg is just
 sitting on one partition, possibly shared with a whole lot of other
 services? You'd need to set it to all-of-your-disk, which is something
 that will change over time.

 Maybe I wasn't entirely clear in the submission, but if it wasn't
 obvious: the use-case for this is the small and simple installations
 that need a simple way of doing a reliable online backup. This is the
 pg_basebackup -x use case altogether - for example, anybody bigger
 likely has archive logging set up already, in which case this
 functionality is not interesting at all.

 I understand the need for a reliable backup, problem is they won't get
 one like this.

 If your disk fills, the backup cannot end correctly, so you must
 somehow avoid the disk filling while the backup is taken.

The same thing will happen if your archive_command stops working - the
disk fills up. There are plenty of scenarios whereby the disk can fill
up.

There are a lot of cases where this really isn't a risk, and I believe
this would be a helpful feature in many of those *simple* cases.


 Removing the safety that prevents the disk from filling doesn't
 actually prevent it filling.

 If you must have this then make pg_basebackup copy xlog files
 regularly during the backup. That way your backup can take forever and
 your primary disk won't fill up. In many cases it actually will take
 forever, but at least we don't take down the primary.

There is a patch to do something like that as well sitting on the CF
page. I don't believe one necessarily excludes the other.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unite recovery.conf and postgresql.conf

2011-09-09 Thread Fujii Masao
On Fri, Sep 9, 2011 at 7:21 PM, Magnus Hagander mag...@hagander.net wrote:
 On Fri, Sep 9, 2011 at 11:56, Fujii Masao masao.fu...@gmail.com wrote:
 Hi,

 http://archives.postgresql.org/pgsql-hackers/2010-12/msg02343.php

 In previous discussion, we've reached the consensus that we should unite
 recovery.conf and postgresql.conf. The attached patch does that. The
 patch is WIP, I'll have to update the document, but if you notice something,
 please feel free to comment.

 This patch allows us to specify recovery parameters like primary_conninfo
 in postgresql.conf. You can change some of them without restarting
 the server (i.e., by using pg_ctl reload) and see the current settings by
 SHOW command. But note that recovery.conf is still required as a status
 file for archive recovery and standby server even if you put all the settings
 in postgresql.conf. Recovery parameters in postgresql.conf only have effect
 when recovery.conf exists.

 Why exactly do we need to keep recovery.conf in this case?

PostgreSQL automatically reinitializes after a backend crash. Imagine the case
where this happens after archive recovery. Since restore_command is left set
in postgresql.conf and still has effect (if we've completely removed
recovery.conf), the reinitialization will make the server go into archive
recovery mode again unexpectedly.

We can avoid this problem by using recovery.conf. At the end of archive
recovery, recovery.conf is renamed to recovery.done. So when the
reinitialization happens, recovery.conf doesn't exist and restore_command has
no effect, which prevents the server from going into archive recovery mode
again.

 For backward compatibility, this patch still allows us to specify recovery
 parameters in recovery.conf. But, as in past years, you cannot reload
 recovery.conf and see the current settings by SHOW command. We should
 recommend to put all the settings in postgresql.conf and empty recovery.conf,
 but you can also put all recovery parameters in recovery.conf.

 Perhaps we should just have people use recovery.conf as an include
 file in this case?

Yeah, that was my first idea, but I've given up on adopting it. The problem
is that recovery.conf is renamed to recovery.done at the end of recovery.
This means that all the settings in recovery.conf are reset when archive
recovery ends. If you run pg_ctl reload after archive recovery, you'll get
something like the following error message and the reload will always fail:

    LOG:  parameter "standby_mode" cannot be changed without restarting the server

 As a whole, I'm not sure I like the idea of having such critical
 parameters in two different places. Do we really need to care about
 the backwards compatibility of something like that this much?

I don't have a strong opinion about this. But in previous discussion, Simon
argued that we should:
http://archives.postgresql.org/pgsql-hackers/2010-10/msg00017.php

 If the same parameter is specified in both files, the setting in recovery.conf
 overrides that in postgresql.conf. In this case, SHOW command displays
 the settings in postgresql.conf even though they are not used at all. Even if

 Now, *that* is just broken, IMHO. The SHOW command should show what is
 actually in use, period. Anything else is guaranteed to confuse the hell
 out of anybody trying to use it.

I'm afraid we would have to add very complicated code to fix that problem.
Though of course we can easily fix it if we don't care about backward
compatibility.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unite recovery.conf and postgresql.conf

2011-09-09 Thread Magnus Hagander
On Fri, Sep 9, 2011 at 13:15, Fujii Masao masao.fu...@gmail.com wrote:
 On Fri, Sep 9, 2011 at 7:21 PM, Magnus Hagander mag...@hagander.net wrote:
 On Fri, Sep 9, 2011 at 11:56, Fujii Masao masao.fu...@gmail.com wrote:
 Hi,

 http://archives.postgresql.org/pgsql-hackers/2010-12/msg02343.php

 In previous discussion, we've reached the consensus that we should unite
 recovery.conf and postgresql.conf. The attached patch does that. The
 patch is WIP, I'll have to update the document, but if you notice something,
 please feel free to comment.

 This patch allows us to specify recovery parameters like primary_conninfo
 in postgresql.conf. You can change some of them without restarting
 the server (i.e., by using pg_ctl reload) and see the current settings by
 SHOW command. But note that recovery.conf is still required as a status
 file for archive recovery and standby server even if you put all the 
 settings
 in postgresql.conf. Recovery parameters in postgresql.conf only have effect
 when recovery.conf exists.

 Why exactly do we need to keep recovery.conf in this case?

 PostgreSQL automatically reinitializes after a backend crash. Imagine the case
 where this happens after archive recovery. Since restore_command is left set
 in postgresql.conf and still has effect (if we've completely removed
 recovery.conf), the reinitialization will make the server go into archive
 recovery mode again unexpectedly.

 We can avoid this problem by using recovery.conf. At the end of archive
 recovery, recovery.conf is renamed to recovery.done. So when the
 reinitialization happens, recovery.conf doesn't exist and restore_command
 has no effect, which prevents the server from going into archive recovery
 mode again.

Ah, ugh. Good point.

I have to wonder though, if it wouldn't be less confusing to just get
rid of recovery.conf and use a *different* file for this. Just to make
it clear it's not a config file, but just a boolean exists/notexists
state.


 For backward compatibility, this patch still allows us to specify recovery
 parameters in recovery.conf. But, as in past years, you cannot reload
 recovery.conf and see the current settings by SHOW command. We should
 recommend to put all the settings in postgresql.conf and empty 
 recovery.conf,
 but you can also put all recovery parameters in recovery.conf.

 Perhaps we should just have people use recovery.conf as an include
 file in this case?

 Yeah, that was my first idea, but I've given up on adopting it. The problem
 is that recovery.conf is renamed to recovery.done at the end of recovery.
 This means that all the settings in recovery.conf are reset when archive
 recovery ends. If you run pg_ctl reload after archive recovery, you'll get
 something like the following error message and the reload will always fail:

     LOG:  parameter "standby_mode" cannot be changed without restarting the server

Good point. And I believe another argument for actually using
completely different files ;)


 As a whole, I'm not sure I like the idea of having such critical
 parameters in two different places. Do we really need to care about
 the backwards compatibility of something like that this much?

 I don't have a strong opinion about this. But in previous discussion,
 Simon argued that we should:
 http://archives.postgresql.org/pgsql-hackers/2010-10/msg00017.php

I see his argument, but I think reduced confusion is a more important one.


 If the same parameter is specified in both files, the setting in
 recovery.conf overrides that in postgresql.conf. In this case, SHOW command
 displays the settings in postgresql.conf even though they are not used at
 all. Even if

 Now, *that* is just broken, IMHO. The SHOW command should show what is
 actually in use, period. Anything else is guaranteed to confuse the hell
 out of anybody trying to use it.

 I'm afraid we would have to add very complicated code to fix that problem.
 Though of course we can easily fix it if we don't care about backward
 compatibility.

That is an even bigger reason to drop backwards compatibility. Unless
someone else can come up with a neat way of fixing it.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL low watermark during base backup

2011-09-09 Thread Dimitri Fontaine
Magnus Hagander mag...@hagander.net writes:
 If you must have this then make pg_basebackup copy xlog files
 regularly during the backup. That way your backup can take forever and
 your primary disk won't fill up. In many cases it actually will take
 forever, but at least we don't take down the primary.

 There is a patch to do something like that as well sitting on the CF
 page. I don't believe one necessarily excludes the other.

I'm not getting why we need the later one when we have this older one?

-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL low watermark during base backup

2011-09-09 Thread Magnus Hagander
On Fri, Sep 9, 2011 at 13:40, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 Magnus Hagander mag...@hagander.net writes:
 If you must have this then make pg_basebackup copy xlog files
 regularly during the backup. That way your backup can take forever and
 your primary disk won't fill up. In many cases it actually will take
 forever, but at least we don't take down the primary.

 There is a patch to do something like that as well sitting on the CF
 page. I don't believe one necessarily excludes the other.

 I'm not getting why we need the later one when we have this older one?

One of them is for the simple case. It requires a single connection to
the server, and it supports things like writing to tarfiles and
compression.

The other one is more complex. It uses multiple connections (one for
the base, one for the xlog), and as such doesn't support writing to
files, only directories.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] regress test failed

2011-09-09 Thread Peter Eisentraut
On sön, 2011-09-04 at 12:06 -0400, Andrew Dunstan wrote:
 Maybe we need a few members that test a large number of locales.
 (Anyone feel like donating resources? I'm currently providing
 resources for seven, which I think is sufficient :-) ) 

If we're just testing different configuration combinations (as opposed
to different operating systems and architectures), we could also
consider running this off some cloud provider.  That way, we could add
dozens of members for $20 a month or so.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL low watermark during base backup

2011-09-09 Thread Florian Pflug
On Sep9, 2011, at 13:48 , Magnus Hagander wrote:
 On Fri, Sep 9, 2011 at 13:40, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 Magnus Hagander mag...@hagander.net writes:
 If you must have this then make pg_basebackup copy xlog files
 regularly during the backup. That way your backup can take forever and
 your primary disk won't fill up. In many cases it actually will take
 forever, but at least we don't take down the primary.
 
 There is a patch to do something like that as well sitting on the CF
 page. I don't believe one necessarily excludes the other.
 
 I'm not getting why we need the later one when we have this older one?
 
 One of them is for the simple case. It requires a single connection to
 the server, and it supports things like writing to tarfiles and
 compression.
 
  The other one is more complex. It uses multiple connections (one for
  the base, one for the xlog), and as such doesn't support writing to
  files, only directories.

I guess the real question is, why can't we stream the WALs as they are
generated, instead of at the end, even over a single connection and when
writing tarfiles?

Couldn't we send all available WAL after each single data-file instead
of waiting for all data files to be transferred before sending WAL?

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL low watermark during base backup

2011-09-09 Thread Dimitri Fontaine
Florian Pflug f...@phlo.org writes:
 Couldn't we send all available WAL after each single data-file instead
 of waiting for all data files to be transferred before sending WAL?

+1 (or maybe not at the file boundary but rather driven by archive
command with some internal hooking, as the backend needs some new
provisions here anyway).

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unite recovery.conf and postgresql.conf

2011-09-09 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 I have to wonder though, if it wouldn't be less confusing to just get
 rid of recovery.conf and use a *different* file for this. Just to make
 it clear it's not a config file, but just a boolean exists/notexists
 state.

+1.  If it's not a configuration file anymore, it shouldn't be called
one.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL low watermark during base backup

2011-09-09 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 On Fri, Sep 9, 2011 at 13:40, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 I'm not getting why we need the later one when we have this older one?

 One of them is for the simple case. It requires a single connection to
 the server, and it supports things like writing to tarfiles and
 compression.

 The other one is more complex. It uses multiple connections (one for
 the base, one for the xlog), and as such doesn't support writing to
 files, only directories.

I'm with Dimitri on this one: let's not invent two different ways to do
the same thing.  Let's pick the better one, or meld them somehow, so
we only have one implementation to support going forward.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Tom Lane
George Barnett gbarn...@atlassian.com writes:
 [ patch to retry writes on NFS ]

I'm having a hard time getting excited about this idea, because IMO
NFS is insufficiently reliable to run a database on, and no patch like
this can fix that.  However, some concrete points:

1. If writes need to be retried, why not reads?  (No, that's not an
invitation to expand the scope of the patch; it's a question about NFS
implementation.)

2. What is the rationale for supposing that a retry a nanosecond later
will help?  If it will help, why didn't the kernel just do that?

3. What about EINTR?  If you believe that this is necessary, then the
kernel logically has to return EINTR when it would like you to retry but
it hasn't been able to write any bytes yet.  If you get a zero return
you have to assume that means out-of-disk-space.

4. As coded, the patch behaves incorrectly if you get a zero return on a
retry.  If we were going to do this, I think we'd need to absorb the
errno-munging currently done by callers into the writeAll function.

On the whole I think you'd be better off lobbying your NFS implementors
to provide something closer to the behavior of every other filesystem on
the planet.  Or checking to see if you need to adjust your NFS
configuration, as the other responders mentioned.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REVIEW proposal: a validator for configuration files

2011-09-09 Thread Alexey Klyukin
Hello,

On Sep 7, 2011, at 5:00 PM, Tom Lane wrote:

 Andy Colson a...@squeakycode.net writes:
 Where did the other warnings go?  It's right though, line 570 is bad.  It
 also seems to have killed the server.  I have not gotten through the history
 of messages regarding this patch, but is it supposed to kill the server if
 there is a syntax error in the config file?
 
 The historical behavior is that a configuration file error detected
 during postmaster startup should prevent the server from starting, but
 an error detected during reload should only result in a LOG message and
 the reload not occurring.  I don't believe anyone will accept a patch
 that causes the server to quit on a failed reload.  There has however
 been some debate about the exact extent of ignoring bad values during
 reload --- currently the theory is ignore the whole file if anything is
 wrong, but there's some support for applying all non-bad values as long
 as the overall file syntax is okay.

This patch actually aims to do the latter, applying all good values and
reporting the bad ones. It removes the calls to set_config_option with
changeVal = false from ProcessConfigFile, trying to apply every option at
once and incrementing a warnings counter if set_config_option returns false.
After some investigation I've discovered that set_config_option returns
false even if a variable's value is unchanged but the change is attempted in
the wrong GucContext. The particular case is reloading the configuration
file variables during SIGHUP: if, say, shared_buffers is commented out, the
call to set_config_option with changeVal = true will return false due to
this code:

   if (prohibitValueChange)
   {
       if (*conf->variable != newval)
           ereport(elevel,
                   (errcode(ERRCODE_CANT_CHANGE_RUNTIME_PARAM),
                    errmsg("parameter \"%s\" cannot be changed without restarting the server",
                           name)));
       return false;
   }


Due to the above, set_config_option returns false for unchanged
PGC_POSTMASTER variables during SIGHUP, so it's impossible to distinguish
between valid and invalid values and report the latter with a single call of
this function per variable. What do you think about changing the code above
to return true if the variable is actually unchanged?
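
In other words, something along these lines (a sketch of the suggestion
only, not an actual patch):

   if (prohibitValueChange)
   {
       /*
        * Suggested change: report success for an unchanged value, so
        * that a SIGHUP reload can tell bad values apart from unchanged
        * PGC_POSTMASTER variables.
        */
       if (*conf->variable == newval)
           return true;

       ereport(elevel,
               (errcode(ERRCODE_CANT_CHANGE_RUNTIME_PARAM),
                errmsg("parameter \"%s\" cannot be changed without restarting the server",
                       name)));
       return false;
   }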

This explains the WARNINGs emitted during reload even for a pristine
configuration file right after installation. I'm looking into why the server
gets killed during reload if there is a parse error; it might well be
related to the problem above.

--
Alexey Klyukin    http://www.commandprompt.com
The PostgreSQL Company – Command Prompt, Inc.





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unite recovery.conf and postgresql.conf

2011-09-09 Thread Simon Riggs
On Fri, Sep 9, 2011 at 3:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 I have to wonder though, if it wouldn't be less confusing to just get
 rid of recovery.conf and use a *different* file for this. Just to make
 it clear it's not a config file, but just a boolean exists/notexists
 state.

 +1.  If it's not a configuration file anymore, it shouldn't be called
 one.

+1 to rename file

+1 to overall concept, just thinking same myself, not looked at patch yet

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Bernd Helmle



--On 9. September 2011 10:27:22 -0400 Tom Lane t...@sss.pgh.pa.us wrote:


On the whole I think you'd be better off lobbying your NFS implementors
to provide something closer to the behavior of every other filesystem on
the planet.  Or checking to see if you need to adjust your NFS
configuration, as the other responders mentioned.


You really need at least the mount options 'hard' _and_ 'nointr' on NFS
mounts, otherwise you are out of luck. The Oracle and DB2 guys recommend
those settings too; without them, even a momentary network glitch can
disturb things unreasonably.


--
Thanks

Bernd

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_last_xact_insert_timestamp

2011-09-09 Thread Robert Haas
On Thu, Sep 8, 2011 at 8:42 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Another idea to avoid spinlock contention is to save the timestamp in
 PgBackendStatus (which contains information for pg_stat_activity).
 This enables us to write and read the timestamp without spinlock.
 Comments?

That seems like a possibly promising approach, in that each backend
could update the information separately, and it's the reader's job to
go find the maximum of all those values when needed.  So the overhead
is (properly, in this case) placed on the reader instead of the
writer.  But it's a bit tricky, because when the reader wants that
maximum, it has to take into account inactive backends that may have
committed transactions before exiting, not just the ones that are
still connected.
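
For illustration, the reader side might look roughly like this (a
sketch only; the PgBackendStatus field and the shared maximum folded in
by exiting backends are hypothetical, the latter being one way to
handle the inactive-backend problem):

    /*
     * Sketch: each backend stores its latest commit timestamp in its
     * own PgBackendStatus slot without taking a spinlock; the reader
     * takes the maximum over all slots, plus a shared value that
     * exiting backends fold their timestamp into before detaching.
     */
    TimestampTz
    GetLastXactInsertTimestamp(void)
    {
        TimestampTz result = shmem->departed_max_xact_ts;   /* hypothetical */
        int         i;

        for (i = 0; i < MaxBackends; i++)
        {
            volatile PgBackendStatus *be = &BackendStatusArray[i];
            TimestampTz ts = be->st_last_xact_insert_ts;    /* hypothetical field */

            if (ts > result)
                result = ts;
        }
        return result;
    }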

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] force_not_null option support for file_fdw

2011-09-09 Thread Kohei Kaigai
I've marked this patch as Ready for Committer, and hope to see it
committed.

Thanks,
--
NEC Europe Ltd, SAP Global Competence Center
KaiGai Kohei kohei.kai...@emea.nec.com


 -Original Message-
 From: Shigeru Hanada [mailto:shigeru.han...@gmail.com]
 Sent: 9. September 2011 06:03
 To: Kohei Kaigai
 Cc: Kohei KaiGai; PostgreSQL-development
 Subject: Re: [HACKERS] force_not_null option support for file_fdw
 
 Thanks for the review, Kaigai-san.
 
 (2011/09/09 0:47), Kohei Kaigai wrote:
  I found one other point to be fixed:
  On get_force_not_null(), it makes a list of attribute names with the
  force_not_null option.

  +   foreach (cell, options)
  +   {
  +       DefElem    *def = (DefElem *) lfirst(cell);
  +       const char *value = defGetString(def);
  +
  +       if (strcmp(def->defname, "force_not_null") == 0 &&
  +           strcmp(value, "true") == 0)
  +       {
  +           columns = lappend(columns,
  +                             makeString(NameStr(attr->attname)));
  +           elog(DEBUG1, "%s: force_not_null", NameStr(attr->attname));
  +       }
  +
  +   }
 
  makeString() does not copy the supplied string itself, so it is not
  preferable to reference NameStr(attr->attname) across ReleaseSysCache().
  I'd like to suggest supplying a copied attname, using pstrdup(), to
  makeString().
 
 Oops, fixed.
 [ I should check some of my projects for this issue... ]
 
 Attached patch also includes some cosmetic changes such as removing useless 
 blank lines.
 
 Regards,
 --
 Shigeru Hanada
 
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Moving core timestamp typedefs/macros somewhere else

2011-09-09 Thread Tom Lane
I wrote:
 I propose moving the Timestamp/Interval typedefs, as well as some basic
 associated macros such as the ones mentioned above (basically, lines
 25-100 of utils/timestamp.h, plus the DT_NOBEGIN/DT_NOEND stuff, plus
 the Julian-date macros in datetime.h), into a separate header file that
 contains no backend-only declarations (eg, not fmgr.h stuff; right
 offhand I don't think it would depend on anything except c.h).

I've committed this patch, but there was one aspect that remains
unfinished.  I had hoped to remove the duplicative parts of ecpg's dt.h
header in favor of including datatype/timestamp.h, along the lines of
the attached patch.  However, while ecpg itself compiles with that
change, its test programs do not; apparently they include
pgtypes_timestamp.h without previously including anything that defines
typedef int32.  I'm unsure if there's a reasonable way to work around
that --- any thoughts?

regards, tom lane

*** src/interfaces/ecpg/include/pgtypes_timestamp.h.orig	Fri Sep  9 13:11:12 2011
--- src/interfaces/ecpg/include/pgtypes_timestamp.h	Fri Sep  9 13:08:18 2011
***************
*** 6,18 ****
  /* pgtypes_interval.h includes ecpg_config.h */
  #include <pgtypes_interval.h>
  
! #ifdef HAVE_INT64_TIMESTAMP
! typedef int64 timestamp;
! typedef int64 TimestampTz;
! #else
! typedef double timestamp;
! typedef double TimestampTz;
! #endif
  
  #ifdef __cplusplus
  extern		"C"
--- 6,14 ----
  /* pgtypes_interval.h includes ecpg_config.h */
  #include <pgtypes_interval.h>
  
! #include <datatype/timestamp.h>
! 
! typedef TimestampTz timestamp;
  
  #ifdef __cplusplus
  extern		"C"
*** src/interfaces/ecpg/pgtypeslib/dt.h.orig	Fri Sep  9 13:11:12 2011
--- src/interfaces/ecpg/pgtypeslib/dt.h	Fri Sep  9 13:06:49 2011
***************
*** 7,25 ****
  
  #define MAXTZLEN			 10
  
- #ifdef HAVE_INT64_TIMESTAMP
- 
- typedef int32 fsec_t;
- #else
- 
- typedef double fsec_t;
- 
- /* round off to MAX_TIMESTAMP_PRECISION decimal places */
- /* note: this is also used for rounding off intervals */
- #define TS_PREC_INV 1000000.0
- #define TSROUND(j) (rint(((double) (j)) * TS_PREC_INV) / TS_PREC_INV)
- #endif
- 
  #define USE_POSTGRES_DATES		0
  #define USE_ISO_DATES			1
  #define USE_SQL_DATES			2
--- 7,12 ----
***************
*** 243,278 ****
  } while(0)
  #endif
  
- /* in both timestamp.h and ecpg/dt.h */
- #define DAYS_PER_YEAR	365.25	/* assumes leap year every four years */
- #define MONTHS_PER_YEAR 12
- /*
-  *	DAYS_PER_MONTH is very imprecise.  The more accurate value is
-  *	365.2425/12 = 30.436875, or '30 days 10:29:06'.  Right now we only
-  *	return an integral number of days, but someday perhaps we should
-  *	also return a 'time' value to be used as well.	ISO 8601 suggests
-  *	30 days.
-  */
- #define DAYS_PER_MONTH	30		/* assumes exactly 30 days per month */
- #define HOURS_PER_DAY	24		/* assume no daylight savings time changes */
- 
- /*
-  *	This doesn't adjust for uneven daylight savings time intervals or leap
-  *	seconds, and it crudely estimates leap years.  A more accurate value
-  *	for days per years is 365.2422.
-  */
- #define SECS_PER_YEAR	(36525 * 864)	/* avoid floating-point computation */
- #define SECS_PER_DAY	86400
- #define SECS_PER_HOUR	3600
- #define SECS_PER_MINUTE 60
- #define MINS_PER_HOUR	60
- 
- #ifdef HAVE_INT64_TIMESTAMP
- #define USECS_PER_DAY	INT64CONST(86400000000)
- #define USECS_PER_HOUR	INT64CONST(3600000000)
- #define USECS_PER_MINUTE INT64CONST(60000000)
- #define USECS_PER_SEC	INT64CONST(1000000)
- #endif
  
  /*
   * Date/time validation
--- 230,235 ----
***************
*** 280,302 ****
   */
  #define isleap(y) (((y) % 4) == 0 && (((y) % 100) != 0 || ((y) % 400) == 0))
  
- /* Julian date support for date2j() and j2date()
-  *
-  * IS_VALID_JULIAN checks the minimum date exactly, but is a bit sloppy
-  * about the maximum, since it's far enough out to not be especially
-  * interesting.
-  */
- 
- #define JULIAN_MINYEAR (-4713)
- #define JULIAN_MINMONTH (11)
- #define JULIAN_MINDAY (24)
- #define JULIAN_MAXYEAR (5874898)
- 
- #define IS_VALID_JULIAN(y,m,d) ((((y) > JULIAN_MINYEAR) \
-   || (((y) == JULIAN_MINYEAR) && (((m) > JULIAN_MINMONTH) \
-   || (((m) == JULIAN_MINMONTH) && ((d) >= JULIAN_MINDAY))))) \
-   && ((y) < JULIAN_MAXYEAR))
- 
  #define UTIME_MINYEAR (1901)
  #define UTIME_MINMONTH (12)
  #define UTIME_MINDAY (14)
--- 237,242 ----
***************
*** 311,336 ****
   || (((y) == UTIME_MAXYEAR) && (((m) < UTIME_MAXMONTH) \
	 || (((m) == UTIME_MAXMONTH) && ((d) <= UTIME_MAXDAY)))))
  
- #ifdef HAVE_INT64_TIMESTAMP
- 
- #define DT_NOBEGIN		(-INT64CONST(0x7fffffffffffffff) - 1)
- #define DT_NOEND		(INT64CONST(0x7fffffffffffffff))
- #else
- 
- #ifdef HUGE_VAL
- #define DT_NOBEGIN		(-HUGE_VAL)
- #define DT_NOEND		(HUGE_VAL)
- #else
- #define DT_NOBEGIN		(-DBL_MAX)
- #define DT_NOEND		(DBL_MAX)
- #endif
- #endif   /* HAVE_INT64_TIMESTAMP */
- 
- #define TIMESTAMP_NOBEGIN(j)	do {(j) = DT_NOBEGIN;} while (0)
- #define TIMESTAMP_NOEND(j)			do

Re: [HACKERS] unite recovery.conf and postgresql.conf

2011-09-09 Thread Josh Berkus
On 9/9/11 7:05 AM, Tom Lane wrote:
 Magnus Hagander mag...@hagander.net writes:
 I have to wonder though, if it wouldn't be less confusing to just get
 rid of recovery.conf and use a *different* file for this. Just to make
 it clear it's not a config file, but just a boolean exists/notexists
 state.
 
 +1.  If it's not a configuration file anymore, it shouldn't be called
 one.

I'm in favor of this.  People are sufficiently confused by the existing
behavior that we're not going to confuse them further by changing it.

In fact: currently we have a trigger file concept in recovery.conf.
Might we unite the trigger file concept with the concept of having a
file to indicate that the server is a replica on startup?

i.e. in postgresql.conf we have two settings:

replication_status = {master,standby}
failover_trigger_file = ''  #defaults to $PGDATA/master_trigger

If replication_status = master, then the server is a standalone or
master node.

If replication_status = standby, then the server is a PITR, warm
standby, or hot standby replica.

If the failover_trigger_file is created and detected, then PostgreSQL
will reload in master mode.  replication_status will get set to master
so that it's easy to tell this from the command line.

The above would be a lot easier to comprehend than the current behavior,
and would allow us all of the same options the current behavior does.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Noah Misch
On Fri, Sep 09, 2011 at 10:27:22AM -0400, Tom Lane wrote:
 George Barnett gbarn...@atlassian.com writes:
  [ patch to retry writes on NFS ]
 
 I'm having a hard time getting excited about this idea, because IMO
 NFS is insufficiently reliable to run a database on, and no patch like
 this can fix that.  However, some concrete points:
 
 1. If writes need to be retried, why not reads?  (No, that's not an
 invitation to expand the scope of the patch; it's a question about NFS
 implementation.)

To support all POSIX:2008-conforming read() implementations, we must indeed
retry reads.  I suppose the OP never encountered this in practice, though.

 2. What is the rationale for supposing that a retry a nanosecond later
 will help?  If it will help, why didn't the kernel just do that?

POSIX:2008 unconditionally permits[1] partial writes of requests exceeding 512
bytes (_POSIX_PIPE_BUF).  We shouldn't complain when a kernel provides a
conforming write(), even if it appears that the kernel achieved little by
using some freedom afforded it.

 3. What about EINTR?  If you believe that this is necessary, then the
 kernel logically has to return EINTR when it would like you to retry but
 it hasn't been able to write any bytes yet.  If you get a zero return
 you have to assume that means out-of-disk-space.

Assuming conforming SA_RESTART for write()/read(), this will not happen.  The
call will restart and resume blocking until it writes something.

nm

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] fsyncing data to disk

2011-09-09 Thread Nulik Nol
Hi,
this is not exactly a PostgreSQL question, but input from the hackers
list would be invaluable for me.
I am coding my own database engine, and I decided not to implement a
transaction engine because it implies too much code.
But to achieve the Durability of ACID I need a 100% reliable write to
disk. By design no record in my DB will be larger than 512 bytes, so I
am using a page size of 512 bytes, which matches the size of a disk
block. So will every write() I execute, followed by an fdatasync()
call, be 100% written? It won't make a 300-byte write if I tell it to
write 512 and the power goes off, or will it? I am going to use a whole
partition device for the DB (like /dev/sda1), so no filesystem code
will be used. Also, I am using asynchronous IO (aio_read and aio_write)
and I don't know if they can be combined with the fdatasync() syscall.

Will appreciate your comments

Regards

-- 
==
The power of zero is infinite

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Kevin Grittner
Noah Misch n...@leadboat.com wrote:
 
 We shouldn't complain when a kernel provides a conforming write(),
 even if it appears that the kernel achieved little by using some
 freedom afforded it.
 
I remember we had some compiler warnings in the logging area because
we were making the assumption that no implementation would ever use
that freedom.  I suggested we replace the simple, unchecked calls to
write() with a function which did what we expected through an
API-conforming loop:
 
http://archives.postgresql.org/pgsql-hackers/2011-02/msg01719.php
 
The response was that we could ignore the documented API because we
had never seen nor heard of it being true for writes to disk
files.  I'm still uncomfortable with that.  Where I have seen
people code to implementation details rather than the documented
API, it has often not turned out well in the long run.
 
I'd still be willing to put together a patch for that if people buy
into it.
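
Such an API-conforming loop might look roughly like this (a minimal
sketch of the idea, not an actual submitted patch):

    #include <errno.h>
    #include <unistd.h>

    /*
     * Write the whole buffer, retrying on partial writes and on EINTR,
     * since POSIX permits write() to return before writing everything.
     * Returns 0 on success, -1 with errno set on error.
     */
    static int
    write_all(int fd, const void *buf, size_t len)
    {
        const char *p = buf;

        while (len > 0)
        {
            ssize_t n = write(fd, p, len);

            if (n < 0)
            {
                if (errno == EINTR)
                    continue;       /* interrupted before any bytes were written */
                return -1;
            }
            if (n == 0)
            {
                errno = ENOSPC;     /* defensive: assume out of disk space */
                return -1;
            }
            p += n;
            len -= (size_t) n;
        }
        return 0;
    }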
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsyncing data to disk

2011-09-09 Thread Florian Pflug
On Sep9, 2011, at 20:15 , Nulik Nol wrote:
 this is not exactly a Postgresql question, but an input from hackers
 list like this would be invaluable for me.
 I am coding my own database engine, and I decided to do not implement
 transaction engine because it implies too much code.
 But to achieve the Durability of ACID I need a 100% reliable write to
 disk. By design no record in my DB will be larger than 512 bytes, so I
 am using the page size of 512 bytes,

Beware that there *are* disks with block sizes other than 512 bytes. For
example, at least for 2.5" disks, 4096 bytes/block is becoming quite
common these days.

 that matches the size of the disk
 block, so every write() I will execute with the following fdatasync()
 call will be 100% written, is that correct? It won't make a 300 byte
 write if I tell it to write 512 and the power goes off or will it?

Since error correction is done per-block, it's very unlikely that you'd see
only 300 of the 512 bytes overwritten - the drive would detect uncorrectable
data corruption and report an error instead. Whether that error is reported back
to the application as an IO error or as a zeroed-out block probably depends on
the OS.

What you actually seem to want is a stronger all-or-nothing guarantee which
precludes the error case. AFAIK, most disk drives kind of do that, because
the various capacitors which stabilize the power supply usually hold enough
charge to complete a write once it's started, and because they stop operating
if the power drops below some threshold. But I doubt that they provide any
hard guarantees in this area; I guess it's more of a best-effort thing.

To get hard guarantees, you'll need to use a RAID controller with a
battery-backed cache. Or use a journal/WAL like postgres (and most filesystems)
do, and protect journal/WAL entries with a checksum to detect partially written
entries.
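
As an illustration of that last approach, a journal record might carry
a CRC over its payload, verified during recovery (a sketch only,
assuming a simple bitwise CRC-32; this is not how PostgreSQL's WAL is
actually laid out):

    #include <stddef.h>
    #include <stdint.h>

    /* A journal record protected by a checksum over its payload. */
    typedef struct
    {
        uint32_t    len;        /* payload length in bytes */
        uint32_t    crc;        /* CRC-32 of the payload */
        char        payload[];  /* 'len' bytes follow */
    } JournalRecord;

    /* Bitwise CRC-32 (reflected, polynomial 0xEDB88320); slow but small. */
    static uint32_t
    crc32_buf(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t    crc = 0xFFFFFFFF;

        while (len--)
        {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320 & (0U - (crc & 1)));
        }
        return crc ^ 0xFFFFFFFF;
    }

    /* A record whose stored CRC does not match was torn mid-write. */
    static int
    record_is_valid(const JournalRecord *rec)
    {
        return crc32_buf(rec->payload, rec->len) == rec->crc;
    }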

 I am going to use the whole partition device for the DB (like /dev/sda1)
 , so no filesystem code will be used. Also I am using asynchronous IO
 (the aio_read and aio_write) and I don't know if they can be combined
 with the fdatasync() syscall?

Someone else (maybe the POSIX spec?) must answer that as I know very little
about asynchronous IO.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql.conf archive_command example

2011-09-09 Thread Florian Pflug
On Sep8, 2011, at 15:09 , Aidan Van Dyk wrote:
 Personally, I think both of these show examples of why PG should be
 looking hard at either providing a simple robust local directory based
 archive_command, or very seriously pointing users at properly written
 tools like omnipitr, pitrtools, walmgr, etc...
 
 Neither of those cases should ever happen.  If you're copying a file
 into the archive and making it appear non-atomically in your archive,
 you're doing something wrong.

+1000.

Archiving WAL should be done by copying to a temp file and moving it
into place. Before returning success, one should probably also do the
fsync incantations the linux kernel guys argued are necessary to prevent
the file from appearing empty if the machine crashes shortly after the
move. (Yeah, they fixed that after enough people complained, but the fact
that they even went as far as arguing their behaviour is correct according
to POSIX makes me uneasy...)

It'd be very cool if we shipped a tool that did that correctly (pg_walcopy
maybe?) on all supported platforms.
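
Roughly, such a tool would have to do the following (a sketch with
abbreviated error handling; the function name is made up):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Copy 'src' to 'dst' atomically: write a temp file in the target
     * directory, fsync it, rename() it into place, then fsync the
     * directory so the rename itself survives a crash.  A real tool
     * would also loop on partial writes, per the NFS thread.
     */
    static int
    archive_copy(const char *src, const char *tmp,
                 const char *dst, const char *dstdir)
    {
        char        buf[8192];
        ssize_t     n;
        int         dirfd;
        int         in = open(src, O_RDONLY);
        int         out = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);

        if (in < 0 || out < 0)
            return -1;
        while ((n = read(in, buf, sizeof(buf))) > 0)
        {
            if (write(out, buf, n) != n)
                return -1;
        }
        if (n < 0 || fsync(out) < 0)
            return -1;
        close(in);
        close(out);
        if (rename(tmp, dst) < 0)
            return -1;
        /* fsync the directory to persist the rename */
        dirfd = open(dstdir, O_RDONLY);
        if (dirfd < 0 || fsync(dirfd) < 0)
            return -1;
        close(dirfd);
        return 0;
    }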

best regards,
Florian Pflug




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql.conf archive_command example

2011-09-09 Thread Martijn van Oosterhout
On Fri, Sep 09, 2011 at 08:59:43PM +0200, Florian Pflug wrote:
 Archiving WAL should be done by copying to a temp file and moving it
 into place. Before returning success, one should probably also do the
 fsync incantations the linux kernel guys argued are necessary to prevent
 the file from appearing empty if the machine crashes shortly after the
 move. (Yeah, they fixed that after enough people complained, but the fact
 that they even went as far as arguing their behaviour is correct according
 to POSIX makes me uneasy...)

Well, they fixed it for ext2/3/4, but that doesn't change the fact that
most other filesystems don't provide the same guarantees. If you want
to be sure the file contents hit the disk, you need to do an fsync.
 
(If you suggested to people we could add a new WAL sync method that
wrote the data to disk without fsync and renamed it over an existing
file and assured them that the data would survive a crash, they'd say
you're nuts).

 It'd be very cool if we shipped a tool that did that correctly (pg_walcopy
 maybe?) on all supported platforms.

It's hard enough to get right that shipping a tool that works properly
is eminently sensible. If only to demonstrate how it should be done.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] memory-related bugs

2011-09-09 Thread Daniel Farina
On Thu, Sep 8, 2011 at 1:56 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Daniel Farina dan...@heroku.com writes:
 On Tue, Sep 6, 2011 at 12:00 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm still of the opinion that there's no real need to avoid memcpy with
 identical source and destination, so I didn't apply this first patch.

 I am leery of memcpy with overlapping regions.  I know that it can
 cause an infinite loop on ssse3 architectures, having to do with some
 backwards-iteration it does:
 https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/609290

 The linked page offers no support for your claim of an infinite loop,
 and in any case it's discussing a case involving *overlapping* regions,
 not *identical* regions.

What do you mean?  As per my original bug report, or in this case
(possibly both; I should have dumped the registers, which I'll do if I
see it again...)?  I'm able to believe that things are fine with all
known memcpys today in this case.

 The reason why this concern is irrelevant for identical regions is that
 no matter what order the memcpy routine copies the bytes in, it's
 necessarily storing the exact same data into each byte that was there
 before.  The only way that strange results could be produced is if the
 routine transiently set some byte to a value other than its final value,
 which would mean that it must be intending to store to that location
 more than once, which would be silly and inefficient.

This is a good point, I do understand there is a distinction between
the degenerate-case memcpy-to-identical region and the
overlapping-case.

In my naivety, I'd ask what the costs are of pedantic adherence to
this common guideline and, in the event someone somewhere does something
that is not expected (or has a slow-but-not-technically-buggy memcpy)
and things break, how easy the failure would be to pick out.  But I
don't seriously expect an answer, because this code path hasn't been a
problem for so many years, and measuring those things is pretty hard.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] augmenting MultiXacts to improve foreign keys

2011-09-09 Thread Alvaro Herrera

Excerpts from Alvaro Herrera's message of Tue Aug 09 13:01:04 -0400 2011:

 To implement this, we need to augment MultiXact to store the lock type
 that each comprising Xid holds on the tuple.  Two bits per Xid are
 needed.  My thinking is that we could simply extend the members to add
 a byte for every four members.

So I've been working on this, and I've noticed that somehow it seems to
have turned into a giant snowball.  I'd like opinions on the items below
before I continue work here; if any of these ideas turns out to be a
showstopper, I'd like to know sooner rather than later.

1. since MultiXacts are going to contain updates and not just locks, it
means they will need to persist beyond OldestXmin -- in fact,
pg_multixact is going to have the same truncation rules as pg_clog,
namely the vacuum freeze horizon.  Currently they are truncated very
quickly; this is not going to be true anymore.

2. This also means that we may have to resolve MultiXacts to their
comprising TransactionIds when tqual.c is doing visibility checks on the
tuples.  Right now, the code simply does things like this:

if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
    /* MultiXacts are currently only allowed to lock tuples */
    Assert(tuple->t_infomask & HEAP_IS_LOCKED);
    return true;
}
/* further code checks HeapTupleHeaderGetXmax(tuple) */

It's now going to need to do something more like

if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
    if (tuple->t_infomask & HEAP_IS_LOCKED)
        return true;
    xmax = HeapTupleGetUpdateXid(tuple);
}
else
    xmax = HeapTupleHeaderGetXmax(tuple);
/* further code checks our derived xmax */

where the HeapTupleGetUpdateXid function looks more or less like this

TransactionId
HeapTupleGetUpdateXid(HeapTupleHeader tuple)
{
    TransactionId    update_xact = InvalidTransactionId;
    MultiXactMember *members;
    int              nmembers;

    Assert(!(tuple->t_infomask & HEAP_XMAX_IS_NOT_UPDATE));
    Assert(tuple->t_infomask & HEAP_XMAX_IS_MULTI);

    /* fetch the members of the MultiXact stored in xmax */
    nmembers = GetMultiXactIdMembers(HeapTupleHeaderGetXmax(tuple), &members);

    if (nmembers > 0)
    {
        int i;

        for (i = 0; i < nmembers; i++)
        {
            /* KEY SHARE lockers are okay -- ignore them */
            if (members[i].status == MultiXactStatusKeyShare)
                continue;
            /* there should be at most one updater */
            Assert(update_xact == InvalidTransactionId);
            /* other status values are not acceptable because they
             * conflict with update */
            Assert(members[i].status == MultiXactStatusUpdate);
            update_xact = members[i].xid;
        }
    }

    return update_xact;
}


Which leads us to:

3. I've come up with HEAP_XMAX_IS_NOT_UPDATE in t_infomask, which means
that the Xmax, being a MultiXact, does not contain an update Xid.  This
reuses the old value of HEAP_XMAX_SHARED_LOCK.  I've used this rather
weird semantics for these reasons:

a) it's pg_upgrade compatible.  Any tuple that has the SHARED_LOCK bit
from the older version set cannot contain an update, and so the bit is
automatically right.
b) it quickly avoids having to run the GetMultiXactIdMembers thingy
(which is expensive) in the common case that there's no update.
c) it lets me keep the HEAP_IS_LOCKED semantics; which means this tuple
is only locked by Xmax, not updated, which is used extensively in
various places.
/*
 * if any of these bits is set, xmax hasn't deleted the tuple, only locked it.
 */
#define HEAP_IS_LOCKED  (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_IS_NOT_UPDATE | \
 HEAP_XMAX_KEYSHR_LOCK)


4. FOR SHARE is not a very widely used clause, I think.  FOR UPDATE is
part of the standard and thus it warrants quick innards, i.e. its own
hint bit.  FOR KEY SHARE is going to be used by RI triggers and so it
should also be quick; I've also given it its own hint bit.  However,
FOR SHARE is probably not used much and I've relegated it to being
mandatorily stored in MultiXact.  That is, if someone requests a FOR
SHARE lock on a tuple, it will get a singleton MultiXact.  The reason
for this is that I didn't want to use one more hint bit.

5. I've now used three hint bits -- reused HEAP_XMAX_SHARED_LOCK as
HEAP_XMAX_IS_NOT_UPDATE (already explained); used the free hint bit from
t_infomask as HEAP_XMAX_KEYSHR_LOCK (should be obvious); and I've used
0x2000 in t_infomask2 as HEAP_UPDATE_KEY_INTACT, to mean that this tuple
has been updated but the key columns have not been modified.  This lets
heapam.c know that this tuple can be further key-share locked.

6. When locking a tuple that is being concurrently updated, the locking
transaction must obviously walk up the update chain (t_self) 

Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?

2011-09-09 Thread Daniel Farina
On Wed, Aug 24, 2011 at 1:04 PM, Daniel Farina dan...@heroku.com wrote:
 On Wed, Aug 24, 2011 at 12:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Assuming the issue really is the physical unlinks (which I agree I'd
 like to see some evidence for), I wonder whether the problem could be
 addressed by moving smgrDoPendingDeletes() to after locks are released,
 instead of before, in CommitTransaction/AbortTransaction.  There does
 not seem to be any strong reason why we have to do that before lock
 release, since incoming potential users of a table should not be trying
 to access the old physical storage after that anyway.

 Alright, since this concern about confirming the expensive part of
 index dropping has come up a few times but otherwise the waters are
 warm, I'll go ahead and do some work to pin things down a bit before
 we continue working on those assumptions.


This suspicion seems to be proven correct: an opportunity arose to
remove some indexes on a live system, and I took it to carefully
control and time the process.  There's not much relationship between
the size of the index and the delay, but the pauses are still very
real.  On the other hand, the first time this was noticed, the system
was under significantly higher load.

I'd still like to do something to solve this problem, though: even if
the time-consuming part of the process is not file unlinking, it's
clearly something that happens after the AccessExclusiveLock is
acquired, based on our other measurements.

Back to the drawing board...

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against multiple instances per cluster

2011-09-09 Thread Greg Stark
On Thu, Sep 8, 2011 at 10:03 PM, Magnus Hagander mag...@hagander.net wrote:
 Would there be a way to prevent this abhorrent scenario from coming
 into existence?

 There are plenty of clustering products out there that are really
 designed for one thing primarily, and that's dealing with this kind of
 fencing.

Wouldn't those products exist to *allow* you to set up an environment
like this safely?

I think what Thom is saying is that it would be nice if we could notice
that the situation looks bad and *stop* the user from doing this at all.

We could do that easily if we were willing to trade off some
convenience for users who don't have shared storage by just removing
the code for determining if there's a stale lock file.

Also, if the shared filesystem happened to have a working locking
server and we used the right file-locking API, then we would be able to
notice an apparently stale lock file that is nonetheless locked by
another postgres instance. There was some talk about using one of the
locking APIs a while back.
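
To illustrate, a lock-based liveness check could look something like
this (POSIX fcntl advisory locks, which NFS arbitrates through lockd
when that works; purely an illustrative sketch, not the interlock
postgres actually uses):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to take an exclusive advisory lock over the whole lock file.
 * Returns 1 if we got it (no live holder, even on another host),
 * 0 if another process still holds it.
 */
static int
lockfile_is_free(int fd)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* zero length locks to end of file */

    return fcntl(fd, F_SETLK, &fl) == 0;
}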


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large C files

2011-09-09 Thread Greg Stark
On Fri, Sep 9, 2011 at 6:57 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 In particular, I'd like to know what
 boundaries it is envisaged that the code should be refactored along to
 increase its conceptual integrity, or to better separate concerns. I
 assume that that's the idea, since each new .c file would have to have
 a discrete purpose.

 I'd like to see it split into routines involved in writing WAL, and those
 involved in recovery. And maybe a third file for archiving-related stuff.

Having a clean API for working with WAL independently of recovery
would let us have a maintainable xlogdump tool that doesn't constantly
get out of sync with the WAL archive format. It would also make it
easier to write things like prefetching logic that requires reading
the upcoming xlog before it's time to actually replay it.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-09 Thread Bruce Momjian
Tom Lane wrote:
 George Barnett gbarn...@atlassian.com writes:
  [ patch to retry writes on NFS ]
 
 I'm having a hard time getting excited about this idea, because IMO
 NFS is insufficiently reliable to run a database on, and no patch like
 this can fix that.  However, some concrete points:
 
 1. If writes need to be retried, why not reads?  (No, that's not an
 invitation to expand the scope of the patch; it's a question about NFS
 implementation.)
 
 2. What is the rationale for supposing that a retry a nanosecond later
 will help?  If it will help, why didn't the kernel just do that?
 
 3. What about EINTR?  If you believe that this is necessary, then the
 kernel logically has to return EINTR when it would like you to retry but
 it hasn't been able to write any bytes yet.  If you get a zero return
 you have to assume that means out-of-disk-space.
 
 4. As coded, the patch behaves incorrectly if you get a zero return on a
 retry.  If we were going to do this, I think we'd need to absorb the
 errno-munging currently done by callers into the writeAll function.
 
 On the whole I think you'd be better off lobbying your NFS implementors
 to provide something closer to the behavior of every other filesystem on
 the planet.  Or checking to see if you need to adjust your NFS
 configuration, as the other responders mentioned.

Can our NFS documentation be improved (section 17.2.1)?

http://www.postgresql.org/docs/9.1/static/creating-cluster.html
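
For illustration, the retry loop under discussion, with the EINTR and
zero-return handling from points 3 and 4 folded in, might look like the
sketch below (the name writeAll comes from the thread; everything else
is illustrative, not the submitted patch):

#include <errno.h>
#include <unistd.h>

/*
 * Write the whole buffer, retrying short writes and EINTR.
 * Returns len on success, or -1 with errno set; a zero return
 * from write() is treated as out-of-disk-space, per point 4.
 */
static ssize_t
writeAll(int fd, const char *buf, size_t len)
{
    size_t  written = 0;

    while (written < len)
    {
        ssize_t rc = write(fd, buf + written, len - written);

        if (rc < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before any bytes: retry */
            return -1;          /* hard error; caller inspects errno */
        }
        if (rc == 0)
        {
            errno = ENOSPC;     /* no bytes accepted: assume disk full */
            return -1;
        }
        written += rc;          /* short write: retry the remainder */
    }
    return (ssize_t) written;
}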

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unite recovery.conf and postgresql.conf

2011-09-09 Thread Greg Stark
On Fri, Sep 9, 2011 at 6:40 PM, Josh Berkus j...@agliodbs.com wrote:
 I'm in favor of this.  People are sufficiently confused by the existing
 behavior that we're not going to confuse them further by changing it.


Fwiw, as someone who *was* confused previously, it now makes perfect
sense to me. We have postgresql.conf, which always applies, and then
recovery.conf, which can have all the same options but applies them
only during recovery. That's much clearer than having two configuration
files with two disjoint sets of options and good luck remembering
which options belong in which file. And it still serves a useful
purpose if you have options like recovery_target that you only want to
apply during recovery and then plan to remove.
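
For instance, under that scheme a recovery.conf could mix the classic
recovery settings with ordinary GUCs that should hold only while
recovering (a hypothetical example; the values are made up):

# recovery.conf -- consulted only while the server is in recovery
restore_command = 'cp /mnt/archive/%f "%p"'
recovery_target_time = '2011-09-09 12:00:00'
log_min_messages = debug2       # extra logging during recovery only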



-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsyncing data to disk

2011-09-09 Thread Greg Stark
On Fri, Sep 9, 2011 at 7:46 PM, Florian Pflug f...@phlo.org wrote:
 I am going to use the whole partition device for the DB (like /dev/sda1)
 , so no filesystem code will be used. Also I am using asynchronous IO
 (the aio_read and aio_write) and I don't know if they can be combined
 with the fdatasync() syscall?

 Someone else (maybe the POSIX spec?) must answer that as I know very little
 about asynchronous IO.


There's an aio_fsync as part of the aio API, but you could use fsync
or fdatasync -- I assume you would have to wait for the aio_write to
have finished before you issue the fsync. But if you're going to
fdatasync all your writes right away, you may as well open with O_DSYNC,
which is, I gather, exactly how aio is intended to be used.
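
To make that concrete, the wait-then-sync pattern might look roughly
like this (a sketch assuming POSIX AIO on an already-opened device;
error handling is abbreviated and the names are illustrative):

#include <aio.h>
#include <string.h>
#include <unistd.h>

/* Issue one asynchronous write, wait for it, then force it to disk. */
static int
write_and_sync(int fd, void *buf, size_t len, off_t offset)
{
    struct aiocb cb;
    const struct aiocb *list[1];

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = offset;

    if (aio_write(&cb) != 0)
        return -1;

    list[0] = &cb;
    if (aio_suspend(list, 1, NULL) != 0)    /* wait for completion */
        return -1;
    if (aio_return(&cb) < 0)                /* check the write's result */
        return -1;

    return fdatasync(fd);                   /* now make it durable */
}

Opening with O_DSYNC instead, as above, would let you drop the
fdatasync() call entirely, since each write is then durable once it
completes.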

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] nonempty default log_line_prefix (was [COMMITTERS] pgsql: Simplify handling of the timezone GUC)

2011-09-09 Thread Alvaro Herrera

Excerpts from Tom Lane's message of vie sep 09 18:59:45 -0300 2011:
 Simplify handling of the timezone GUC by making initdb choose the default.
 
 We were doing some amazingly complicated things in order to avoid running
 the very expensive identify_system_timezone() procedure during GUC
 initialization.  But there is an obvious fix for that, which is to do it
 once during initdb and have initdb install the system-specific default into
 postgresql.conf, as it already does for most other GUC variables that need
 system-environment-dependent defaults.  This means that the timezone (and
 log_timezone) settings no longer have any magic behavior in the server.
 Per discussion.

Hmm, I was hoping that this might open the door for setting a nonempty
default log_line_prefix, but apparently not :-(  Seems
setup_formatted_start_time still expects GUC initialization to have run
before we do anything with it.  Pity ...

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] nonempty default log_line_prefix (was [COMMITTERS] pgsql: Simplify handling of the timezone GUC)

2011-09-09 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of vie sep 09 18:59:45 -0300 2011:
 Simplify handling of the timezone GUC by making initdb choose the default.

 Hmm, I was hoping that this might open the door for setting a nonempty
 default log_line_prefix, but apparently not :-(  Seems
 setup_formatted_start_time is still expecting that GUC initialization
 would be called before we do anything with it.  Pity ...

I think it actually would work now; you'd just have to accept that any
log messages that come out during GUC initialization might appear in GMT.
I didn't really experiment with that angle of it --- feel free.

But having said that, I'm not sure that there's a consensus for changing
the default log_line_prefix.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large C files

2011-09-09 Thread Peter Geoghegan
I've produced the refinement of my little shell script anticipated by
my last e-mail (using sed to remove irrelevant variations in
__func__.12345-type symbol names). I decided to test it for:

1. Detecting behavioural changes when removing existing pgrminclude
ignore headers (basically, headers that won't break the build if
removed, but will break Postgres). I stuck with .c files here for the
sake of convenience.

2. False positives, by including a bunch of random headers.

The first file tested was pl_comp.c. The script detected the change in
behaviour due to the omission of that header (incidentally, the way
the header is used there strikes me as a bit ugly). The script did not
produce false positives with the addition of these headers:

#include "commands/portalcmds.h"
#include "commands/conversioncmds.h"
#include "commands/defrem.h"
#include "commands/proclang.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
#include "commands/typecmds.h"
#include "commands/collationcmds.h"
#include "commands/prepare.h"
#include "commands/explain.h"
#include "commands/comment.h"
#include "commands/cluster.h"
#include "commands/discard.h"
#include "commands/trigger.h"
#include "commands/user.h"
#include "commands/view.h"
#include "commands/sequence.h"
#include "commands/dbcommands.h"
#include "commands/async.h"
#include "commands/lockcmds.h"
#include "commands/vacuum.h"

The second file tested was regerror.c. Similarly, the omission was
detected, and the addition of the same headers did not produce a false
positive.

It's very difficult or impossible to anticipate how effective the tool
will be in practice, but when you consider that it works and does not
produce false positives for the first three real-world cases tested, it
seems reasonable to assume that it's at least worth having around. Tom
recently said of a previous pgrminclude campaign in July 2006 that "It
took us two weeks to mostly recover, but we were still dealing with
some fallout in December." I think that makes the case for adding this
tool, or some refinement of it, as a complement to pgrminclude in
src/tools fairly compelling.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


nm-diff.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] superusers are members of all roles?

2011-09-09 Thread Bruce Momjian
Robert Haas wrote:
 On Sat, May 7, 2011 at 11:42 PM, Bruce Momjian br...@momjian.us wrote:
  Is this a TODO?
 
 I think so.

Added to TODO:

Address problem where superusers are assumed to be members of all groups

http://archives.postgresql.org/pgsql-hackers/2011-04/msg00337.php 

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] new createuser option for replication role

2011-09-09 Thread Fujii Masao
Hi,

Currently createuser cannot create a role with the REPLICATION privilege
because it doesn't have any option for that, which sometimes annoys
me when setting up replication. I'd like to propose adding the new
options -x (--replication) and -X (--no-replication) to createuser.
-x allows the new user to do replication, and -X disallows it. The
default is -X. Is it worth creating the patch?

Though I'd like to use -r and -R as the option names, they are already
taken by the CREATEROLE privilege. So I'm thinking of using -x and
-X, derived from XLOG. But does anyone have a better option name?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers