Re: [HACKERS] Condition to become the standby mode.

2013-07-26 Thread Andres Freund
Hi,

On 2013-07-26 13:19:34 +0900, Sawada Masahiko wrote:
 When the slave server starts, it performs the following
 steps in StartupXLOG():
 1. Read the latest CheckPoint record LSN from the pg_control file.
 2. Try to fetch the CheckPoint record from the pg_xlog directory first.
   ( The server tries to read back to the prior CheckPoint record from pg_control )
 3. If it is not in pg_xlog, the slave server requests the CheckPoint
 record from the primary server.

 In #3, this works only when StandbyMode is true. For StandbyMode to be
 true, the database cluster state should be DB_SHUTDOWNED (it is one of
 the conditions).
 That is, the slave server can try to fetch the checkpoint record from the
 master server only after the slave server shut down cleanly.

It also will fetch the xlog from remote when we've found a backup label
(the read_backup_label() branch in StartupXLOG() will set StandbyMode to
true). Which will be the case if we're recovering from something like a
base backup.

The reason we don't fetch from remote if there's no backup label *and*
the last checkpoint wasn't a shutdown checkpoint is that that's doing
*local* crash recovery. Usually it will happen because we needed to
restart after a crash (or immediate restart which basically is the
same).
There's one valid case where you can get into the situation otherwise as
well, which is that the *entire* database directory has been copied via
an atomic snapshot. But in that case you *need* to also snapshot
pg_xlog.
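For illustration, here is a minimal sketch (not the actual StartupXLOG()
code) of the rule described above: WAL is fetched from the primary only
when standby mode is entered, and standby mode is entered only when we can
be sure this is not plain local crash recovery.

---
#include <stdbool.h>

static bool
enters_standby_mode(bool backup_label_found, bool db_state_shutdowned)
{
	if (backup_label_found)
		return true;	/* restoring e.g. a base backup */
	if (db_state_shutdowned)
		return true;	/* pg_control says DB_SHUTDOWNED: clean shutdown */
	return false;		/* otherwise local crash recovery: read pg_xlog only */
}
---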

Maybe there's another valid scenario?

 If this problem is solved, it becomes possible to fail back
 by removing all the WAL records in pg_xlog before the server
 starts as the slave server.
 ( And if we also use synchronous_transfer, which I'm proposing, I think
 we can fail back without taking a full backup )

I still have *massive* doubts about the concept. But anyway, if you want
to do so, you should generate a backup label that specifies the startup
location.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] inconsistent state after crash recovery

2013-07-26 Thread Andres Freund
Hi,

On 2013-07-26 13:33:13 +0900, Satoshi Nagayasu wrote:
 I received a question about inconsistent state after crash recovery.
 
 When a table file is broken (or just lost), PostgreSQL cannot recover
 the whole table, and does not show any notice while recovering.
 I think that means an inconsistent state.
 
 (1) create a table, and fill records.
 (2) process a checkpoint.
 (3) fill more records.
 (4) force a crash, and delete the table file.
 (5) run recovery on restarting.
 (6) only records after the checkpoint can be recovered.
 
 For example, the attached log shows that PostgreSQL can recover
 only 1058 records in a table which contained 2000 records
 before the crash, and does not say anything in the server log.
 
 Is this expected or acceptable?

I'd say it's both. WAL replay doesn't have the knowledge to detect that
in all too many cases. Nearly always a page's contents will be restored
by a full page image the first time it gets changed, so the pages will
individually look completely normal.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] comment for fast promote

2013-07-26 Thread Fujii Masao
On Fri, Jul 26, 2013 at 11:19 AM, Tomonari Katsumata
katsumata.tomon...@po.ntts.co.jp wrote:
 Hi Fujii-san,

 Thank you for response.


 (2013/07/25 21:15), Fujii Masao wrote:
 On Thu, Jul 25, 2013 at 5:33 PM, Tomonari Katsumata
 katsumata.tomon...@po.ntts.co.jp wrote:
 Hi,

 Now I'm reading xlog.c in 93_stable to study fast promote,
 and I have a question.

 I think it has an extra unlink command for the promote file.
 (on line 9937)
 ---
 9934 if (stat(FAST_PROMOTE_SIGNAL_FILE, &stat_buf) == 0)
 9935 {
 9936 unlink(FAST_PROMOTE_SIGNAL_FILE);
 9937 unlink(PROMOTE_SIGNAL_FILE);
 9938 fast_promote = true;
 9939 }
 ---

 Is this command necessary?

 Yes, it prevents PROMOTE_SIGNAL_FILE from remaining even if
 both promote files exist.

 The command (unlink(PROMOTE_SIGNAL_FILE)) here is for an
 unusual case,
 namely when both of the procedures below have been done:
  - the user created a promote file in PGDATA
  - the user issued pg_ctl promote

 I understand the reason.
 But I think it's better to unlink(PROMOTE_SIGNAL_FILE) before
 unlink(FAST_PROMOTE_SIGNAL_FILE).
 Because FAST_PROMOTE_SIGNAL_FILE is definitely there but
 PROMOTE_SIGNAL_FILE is only sometimes there.
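(For illustration, the proposed reordering would look like this; a sketch
of the suggestion, not committed code:)
---
if (stat(FAST_PROMOTE_SIGNAL_FILE, &stat_buf) == 0)
{
	unlink(PROMOTE_SIGNAL_FILE);		/* only sometimes there; ignore failure */
	unlink(FAST_PROMOTE_SIGNAL_FILE);	/* definitely there */
	fast_promote = true;
}
---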

I could not understand why that's better. Could you elaborate on that?

 And I have another question related to this behavior.
 I think the TriggerFile should be removed too.
 This is a corner case, but it can happen.
 What do you think of it?

I don't have a strong opinion about that. I've never heard a complaint
about the current behavior so far.

 One question is: do we really still need to support normal promote?
 pg_ctl promote provides only a way to do fast promotion. If we want to
 do normal promotion, we need to create PROMOTE_SIGNAL_FILE
 and send the SIGUSR1 signal to the postmaster by hand. This seems messy.

 I think that we should either remove normal promotion altogether, or change
 pg_ctl promote so that it also provides a way to do normal promotion.

 I think the merit of fast promote is
  - allowing quick connections by skipping the checkpoint
 and its demerit is
  - crash recovery taking a little bit longer

 If a crash soon after promoting seldom happens,
 and fast promote never breaks the consistency of the database cluster,
 I think we don't need normal promotion.

You can execute a checkpoint after fast promotion for that.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements

2013-07-26 Thread Karol Trzcionka
On 26.07.2013 02:44, Fabrízio de Royes Mello wrote:
 Should be... I fixed that in the attached patch.
Hello, as far as I can see there are more inconsistent places.
First style:
OperatorCreate
---
Second style:
ProcedureCreate
TypeCreate
DefineTSParser
DefineType
DefineEnum
---
Third style:
CreateCast
DefineTSDictionary
DefineTSTemplate
DefineTSConfiguration
DefineRange
DefineCompositeType
Regards,
Karol




Re: [HACKERS] dynamic background workers, round two

2013-07-26 Thread Andres Freund
On 2013-07-25 12:35:30 -0400, Robert Haas wrote:
 On Wed, Jul 24, 2013 at 5:34 PM, Andres Freund and...@2ndquadrant.com wrote:
  This seems like a sensible idea to me. But, in the context of dynamic
  query, don't we also need the reverse infrastructure of notifying a
  bgworker that the client that requested it to be started has died?
  Ending up with a dozen bgworkers running because the normal backend
  FATALed doesn't seem nice.
 
 I think that will be desirable for many, although not all, uses of
 background workers.  For example, you might want to be able to do
 something like SELECT pg_background('CREATE INDEX CONCURRENTLY ...')
 and then exit your session and have the background worker continue to
 run.  Or maybe somebody writes a trigger that starts a background
 worker (if there's not one already working) and queues up work for it,
 and the background worker continues processing the queue
 until it is empty, and then exits.

 But for parallel query specifically, yeah, we'll probably want the
 untimely demise of either the original backend or any of its workers
 to result in the death of all the workers and an abort of the parent's
 transaction.  However, there's no need for the postmaster to be
 involved in any of that.  For example, once the worker process is up
 and running and the parent has its PID, it can use an on_shmem_exit
 hook or a PG_ENSURE_ERROR_CLEANUP() block to send SIGTERM to any
 still-living workers.
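For illustration, a minimal sketch of the cleanup idea quoted above,
assuming a hypothetical pid-tracking array; the (int code, Datum arg)
callback shape is what PG_ENSURE_ERROR_CLEANUP() expects:

---
static pid_t worker_pids[64];	/* hypothetical bookkeeping, filled as workers start */
static int   nworker_pids = 0;

static void
terminate_live_workers(int code, Datum arg)
{
	int		i;

	for (i = 0; i < nworker_pids; i++)
		if (worker_pids[i] != 0)
			kill(worker_pids[i], SIGTERM);	/* best effort; worker may be gone */
}

/* ... in the parent, once the worker PIDs are known: */
PG_ENSURE_ERROR_CLEANUP(terminate_live_workers, (Datum) 0);
{
	/* drive the parallel query */
}
PG_END_ENSURE_ERROR_CLEANUP(terminate_live_workers, (Datum) 0);
---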

It doesn't need to be the postmaster, but I think we need to provide
central infrastructure for that. I don't want this to end up being
redone poorly in multiple places.
I just wanted to mention it, it obviously doesn't need to be implemented
at the same time.

  +#define InvalidPid   (-1)
  +
 
  That value strikes me as a rather bad idea: as a pid it represents
  init's process group, i.e. kill(InvalidPid) will try to kill init... I'd
  rather not spread that outside async.c.
 
 Well, if we ever try to kill(InvalidPid), that's a PostgreSQL bug.
 And if the postgres process has permission to send signals to init,
 that's an OS bug (or a PostgreSQL bug, since we refuse to run as
 root).  So I'm not very worried about it.

I don't think somebody mistakenly kill()ing somebody with -1 is all too
likely, but it's a valid value to pass. That's why I dislike using it as a
constant for an invalid value.

 I've got two functions that
 need distinguishable return values for the not-yet-started case and
 the already-dead case, neither of which can be real PIDs.  If we don't
 use 0 and -1, the alternative is presumably to complicate the calling
 signature with another out parameter.  That does not seem like a net
 improvement.

I actually think those cases would be better served by an error code
mechanism which is extensible.

  +/*
  + * Report the PID of a newly-launched background worker in shared memory.
  + *
  + * This function should only be called from the postmaster.
  + */
  +void
  +ReportBackgroundWorkerPID(RegisteredBgWorker *rw)
  +{
  + BackgroundWorkerSlot *slot;
  +
 + Assert(rw->rw_shmem_slot < max_worker_processes);
 + slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
 + slot->pid = rw->rw_pid;
 +
 + if (rw->rw_worker.bgw_notify_pid != 0)
 + kill(rw->rw_worker.bgw_notify_pid, SIGUSR1);
  +}
 
  Any reason not to use SendProcSignal() properly here? Just that you
  don't know the BackendId? It seems unclean to reuse SIGUSR1 without using
  the infrastructure built for that.
 
 We're in the postmaster here.  The postmaster currently uses kill
 directly in all cases, rather than SendProcSignal(), and I'd rather
 not change that.  Any code that the postmaster calls has got to be
 safe against arbitrary shared memory corruption, and I'd rather keep
 that footprint as small as possible.  For example, if we added
 bgw_backend_id, the postmaster would have to bounds check it to make
 sure it didn't seg fault.  The more of that kind of stuff we add, the
 more chances there are to destabilize the postmaster.  I'd like to
 keep it as minimal as possible.

I can see the point in the argument, but it still seems to be grotty
architecturally.

  + for (;;)
  + {
  + pid = GetBackgroundWorkerPid(handle);
  + if (pid != InvalidPid)
  + break;
  + rc = WaitLatch(&MyProc->procLatch,
  +                WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
  + if (rc & WL_POSTMASTER_DEATH)
  + return InvalidPid;
 
  This basically ignores postmaster death. With unix_latch.c that
  shouldn't be a problem because afaics its implementation should
  correctly report an error in the next WaitLatch without waiting as well. I
  am not sure about the win32 implementation; is that ok there as well?
 
 I don't understand what you mean by this.  Why does it ignore
 postmaster death?

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith

On 7/25/13 6:02 PM, didier wrote:

It was surely already discussed, but why isn't PostgreSQL writing
its cache sequentially to a temporary file?


If you do that, reads of the data will have to traverse that temporary 
file to assemble their data.  You'll make every later reader pay the 
random I/O penalty that's being avoided right now.  Checkpoints are 
already postponing these random writes as long as possible. You have to 
take care of them eventually though.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Hannu Krosing
On 07/26/2013 11:42 AM, Greg Smith wrote:
 On 7/25/13 6:02 PM, didier wrote:
 It was surely already discussed, but why isn't PostgreSQL writing
 its cache sequentially to a temporary file?

 If you do that, reads of the data will have to traverse that temporary
 file to assemble their data.  You'll make every later reader pay the
 random I/O penalty that's being avoided right now.  Checkpoints are
 already postponing these random writes as long as possible. You have
 to take care of them eventually though.

Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
random fs pages on one large disk page and having an extra index layer
for resolving the random-to-sequential ordering.

I would not dismiss the idea without more tests and discussion.

We could have a system where the checkpoint does sequential writes of dirty
buffers to alternating synced holding files (a checkpoint log :) )
and only the background writer does random writes, with no forced sync at all.


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Hannu Krosing
On 07/26/2013 11:42 AM, Greg Smith wrote:
 On 7/25/13 6:02 PM, didier wrote:
 It was surely already discussed, but why isn't PostgreSQL writing
 its cache sequentially to a temporary file?

 If you do that, reads of the data will have to traverse that temporary
 file to assemble their data.  
In case of crash recovery, a sequential read of this file could be
performed as a first step.

This should work fairly well in most cases, at least when shared_buffers
at recovery is not smaller than the latest run of checkpoint-written
dirty buffers.




-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] install libpq.dll in bin directory on Windows / Cygwin

2013-07-26 Thread marco atzeri

On 7/25/2013 11:02 PM, Alvaro Herrera wrote:

Andrew Dunstan wrote:

Jeff Janes asked me about this, and Bruce just tripped up on it.
Usually on Windows it's necessary to have libpq.dll/cygpq.dll either
in the PATH or in the same directory as client .exe files. The
buildfarm client has for many years simply copied this dll from the
installation lib to the installation bin directory after running
make install. But I can't really see why we don't do that as part
of make install anyway. I haven't tested but I think something
like this patch would achieve this goal - it would fix something
that's tripped a lot of people up over the years.


Seems a reasonable workaround for a silly platform bug.  Do you need to
patch the MSVC stuff as well?


Comments? If we do this, should it be backported?


To 9.3, sure, but not further back, as there's probably little point.


On Cygwin there's no need to go before 9.3.
We already move them during the package build as a workaround.




+ifneq (,$(findstring $(PORTNAME), win32 cygwin))
+   $(INSTALL_DATA) $(shlib) '$(DESTDIR)$(bindir)/$(shlib)'
+endif


I wonder if someday we will create a win64 $(PORTNAME) value, or
something like that, and then we'll have to update the port list on
which these hacks are applied all over the place.  Not complaining
about this patch in particular, just an idle thought.



Andrew,
are you planning to also include

http://www.postgresql.org/message-id/e1uqoyt-0007qc...@wrigleys.postgresql.org
http://www.postgresql.org/message-id/51b59794.3000...@gmail.com

to get rid of dllwrap?

Regards
Marco





Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith

On 7/26/13 5:59 AM, Hannu Krosing wrote:

Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
random fs pages on one large disk page and having an extra index layer
for resolving the random-to-sequential ordering.


If your solution to avoiding random writes now is to do sequential ones 
into a buffer, you'll pay for it by having more expensive random reads 
later.  In the SSD write buffer case, that works only because those 
random reads are very cheap.  Do the same thing on a regular drive, and 
you'll be paying a painful penalty *every* time you read in return for 
saving work *once* when you write.  That only makes sense when your 
workload is near write-only.


It's possible to improve on this situation by having some sort of 
background process that goes back and cleans up the random data, 
converting it back into sequentially ordered writes again.  SSD 
controllers also have firmware that does this sort of work, and Postgres 
might do it as part of vacuum cleanup.  But note that such work faces 
exactly the same problems as writing the data out in the first place.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])

2013-07-26 Thread Amit Kapila
On Friday, July 26, 2013 10:35 AM Alvaro Herrera wrote:
 Josh Berkus wrote:
  On 07/25/2013 02:02 PM, Tom Lane wrote:
   Robert Haas robertmh...@gmail.com writes:
  
   My thought is that people might put postgresql.conf in a directory
   that only contains configuration files and isn't writeable by the
   postgres user.  So I would expect to find postgresql.auto.conf in
   the data directory always.
  
   Yeah, exactly.  I think putting it anywhere but $PGDATA is a bad idea,
   and a sysadmin who thinks he knows better probably doesn't.
 
  Please see Greg Smith's fairly strong arguments for putting it in the
  config/ directory.
 
 As far as I see, there are two arguments he makes:
 
 1. We ought to have conf.d/ (not config/) so that tools have a way to
    deploy snippets (e.g. pgtune)
 
 2. We ought to have all the config files together so that versioning
    tools (Puppet) can just say "keep all files within directory X
    versioned" and not have to care about specific file names, etc.
 
 
 
 I can buy (1), because that's a pretty common design for daemons
 nowadays.  But I think that's its own patch, and there's no reason that
 this patch should be messing with this.  I don't care all that much
 about (2), but I have no problem with doing that.
 
 So we could have two patches: a first one that introduces a conf.d subdir
 that's automatically parsed after postgresql.conf, and another one that
 implements ALTER SYSTEM by using a file within conf.d.  The reason I say
 we need a separate patch for conf.d is that I think it'd be easier to
 argue about it in isolation than having it be entangled with the ALTER
 SYSTEM stuff.  The main contention point I see is where conf.d lives;
 the two options are in $PGDATA or together with postgresql.conf.  Tom
 and Robert, above, say it should be in $PGDATA; but this goes against
 Debian packaging and the Linux FHS (or whatever that thing is called).

Sure, I think we can split it into 2 patches, but I doubt that will make it
any easier to get this feature completed.
The contention point can delay the first patch (automatic parsing of
conf.d), which can further delay the ALTER SYSTEM SET patch, and that will
eventually lead to its death.

With Regards,
Amit Kapila.





Re: [HACKERS] inconsistent state after crash recovery

2013-07-26 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-07-26 13:33:13 +0900, Satoshi Nagayasu wrote:
 Is this expected or acceptable?

 I'd say it's both.

Postgres is built on the assumption that the underlying filesystem is
reliable, ie, once you've successfully fsync'd some data that data won't
disappear.  If the filesystem fails to honor that contract, it's a
filesystem bug not a Postgres bug.  Nor is it reasonable to expect
Postgres to be able to detect every such violation.  As an example,
would you expect crash recovery to notice the disappearance of a file
that was touched nowhere in the replayed actions?

regards, tom lane




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Tom Lane
Greg Smith g...@2ndquadrant.com writes:
 On 7/26/13 5:59 AM, Hannu Krosing wrote:
  Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
  random fs pages on one large disk page and having an extra index layer
  for resolving the random-to-sequential ordering.

 If your solution to avoiding random writes now is to do sequential ones 
 into a buffer, you'll pay for it by having more expensive random reads 
 later.

What I'd point out is that that is exactly what WAL does for us, ie
convert a bunch of random writes into sequential writes.  But sooner or
later you have to put the data where it belongs.

regards, tom lane




Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])

2013-07-26 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 The main contention point I see is where conf.d lives;
 the two options are in $PGDATA or together with postgresql.conf.  Tom
 and Robert, above, say it should be in $PGDATA; but this goes against
 Debian packaging and the Linux FHS (or whatever that thing is called).

Ordinarily, if postgresql.conf is not in $PGDATA, it will be somewhere
that the postmaster does not (and should not) have write permissions
for.  I have no objection to inventing a conf.d subdirectory, I just say
that it must be under $PGDATA.  The argument that this is against FHS
is utter nonsense, because anything we write there is not static
configuration, it's just data.

Come to think of it, maybe part of the reason we're having such a hard
time getting to consensus is that people are conflating the snippet
part with the writable part?  I mean, if you are thinking you want
system-management tools to be able to drop in configuration fragments as
separate files, there's a case to be made for a conf.d subdirectory that
lives somewhere that the postmaster can't necessarily write.  We just
mustn't confuse that with support for ALTER SYSTEM SET.  I strongly
believe that ALTER SYSTEM SET must not be designed to write anywhere
outside $PGDATA.
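For illustration, the reading order implied by this argument could be
sketched like so (helper names are hypothetical, not the actual guc-file.l
API):

---
static void
read_all_configuration(const char *config_dir, const char *pgdata)
{
	/* static configuration; may live outside $PGDATA, read-only for us */
	parse_config_file(config_dir, "postgresql.conf");

	/* drop-in fragments from system-management tools */
	parse_config_snippets(config_dir, "conf.d");

	/* the file ALTER SYSTEM writes; always under $PGDATA, read last so
	 * its settings win */
	parse_config_file(pgdata, "postgresql.auto.conf");
}
---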

regards, tom lane




Re: [HACKERS] getting rid of SnapshotNow

2013-07-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 OK.  I've taken care of all remaining uses of SnapshotNow in the code
 base.  I think we can go ahead and remove it, now.  Patch attached.

 (And there was, hopefully, much rejoicing.)

What about SnapshotSelf?

regards, tom lane




Re: [HACKERS] dynamic background workers, round two

2013-07-26 Thread Robert Haas
On Fri, Jul 26, 2013 at 5:34 AM, Andres Freund and...@2ndquadrant.com wrote:
 It doesn't need to be the postmaster, but I think we need to provide
 central infrastructure for that. I don't want this to end up being
 redone poorly in multiple places.
 I just wanted to mention it, it obviously doesn't need to be implemented
 at the same time.

OK, makes sense.  I'm still working out a design for this part of it;
I think I want to get the dynamic shared memory stuff working first.

  +#define InvalidPid   (-1)
  +
 
  That value strikes me as a rather bad idea: as a pid it represents
  init's process group, i.e. kill(InvalidPid) will try to kill init... I'd
  rather not spread that outside async.c.

 Well, if we ever try to kill(InvalidPid), that's a PostgreSQL bug.
 And if the postgres process has permission to send signals to init,
 that's an OS bug (or a PostgreSQL bug, since we refuse to run as
 root).  So I'm not very worried about it.

 I don't think somebody mistakenly kill()ing somebody with -1 is all too
 likely, but it's a valid value to pass. That's why I dislike using it as a
 constant for an invalid value.

Well, so is any negative number, but we should never be trying to kill
process groups.

 I've got two functions that
 need distinguishable return values for the not-yet-started case and
 the already-dead case, neither of which can be real PIDs.  If we don't
 use 0 and -1, the alternative is presumably to complicate the calling
 signature with another out parameter.  That does not seem like a net
 improvement.

 I actually think those cases would be better served by an error code
 mechanism which is extensible.

Hmm.  What signature would you suggest?
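For illustration only, one possible shape for such an interface (a sketch,
not something either participant has posted): the pid travels through an
out parameter and is defined only when the status says the worker is
running, so no pid value needs to be overloaded as "invalid".

---
typedef enum
{
	BGWH_STARTED,			/* worker is running; *pidp is valid */
	BGWH_NOT_YET_STARTED,	/* postmaster has not launched it yet */
	BGWH_STOPPED,			/* worker has already exited */
	BGWH_POSTMASTER_DIED	/* it will never be launched */
} BgwHandleStatus;

extern BgwHandleStatus GetBackgroundWorkerPid(BackgroundWorkerHandle *handle,
                                              pid_t *pidp);
---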

 I can see the point in the argument, but it still seems to be grotty
 architecturally.

Suck it up.  :-)

I considered the idea of using SIGUSR2, but it didn't seem worth it.

 I don't understand what you mean by this.  Why does it ignore postmaster 
 death?

 The other callers that check for WL_POSTMASTER_DEATH actually perform
 drastic measures like exit()ing upon postmaster death. Here you just
 return. With the unix latch implementation it seems to be guaranteed
 that the next WaitLatch() done somewhere else will also return
 WL_POSTMASTER_DEATH. I am not sure whether it might actually consume
 the death in the windows implementation.

Well, IMHO, every backend should exit as quick as possible upon
discovering that the postmaster has died.  However, for complex
political reasons (ah-Tom-Lane-choo!), our policy is to have the
auxiliary processes exit and regular backends soldier on.  This seems
completely stupid to me, but it's not this patch's job to reverse that
policy decision.

To put that another way, one could equally well ask why WaitLatch(),
or for that matter PostmasterIsAlive(), doesn't contain ereport(FATAL)
upon discovering that the postmaster has given up the ghost.  And the
answer is that it's the job of those functions to provide information,
not make policy decisions.

 I don't have a problem with the restriction, but I'd like to see a check
 against it. Maybe check for MyBackendId != InvalidBackendId in
 RegisterDynamicBackgroundWorker()? That would also prevent starting
 further bgworkers before BackgroundWorkerInitializeConnection() is done
 in a connected bgworker which seems to be a good thing.

Well, that's easy enough to fix.  Should we Assert() or elog() or what?
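For illustration, the elog() variant might look like this (an excerpt-style
sketch, not the committed code):

---
bool
RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
                                BackgroundWorkerHandle **handle)
{
	/* refuse registration from processes without a valid BackendId,
	 * e.g. a bgworker that has not yet called
	 * BackgroundWorkerInitializeConnection() */
	if (MyBackendId == InvalidBackendId)
		elog(ERROR, "dynamic background workers can only be registered from a regular backend");

	/* ... normal registration path continues here ... */
}
---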

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] getting rid of SnapshotNow

2013-07-26 Thread Andres Freund
On 2013-07-26 08:49:38 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  OK.  I've taken care of all remaining uses of SnapshotNow in the code
  base.  I think we can go ahead and remove it, now.  Patch attached.
 
  (And there was, hopefully, much rejoicing.)
 
 What about SnapshotSelf?

I thought about that yesterday. I think we should replace the usages
which aren't easily removable (the constraint stuff) with an MVCC
snapshot, just one that treats our transaction's current CommandId as
visible. That should be doable?
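For illustration, a rough sketch of that idea (it glosses over snapshot
copying/registration, and such an API does not exist as-is):

---
static Snapshot
get_self_visible_mvcc_snapshot(void)
{
	/* start from an ordinary MVCC snapshot ... */
	Snapshot	snap = GetLatestSnapshot();

	/* ... but bump the command-id horizon so changes made by our own
	 * transaction's current command count as visible too */
	snap->curcid = GetCurrentCommandId(false) + 1;
	return snap;
}
---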

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] getting rid of SnapshotNow

2013-07-26 Thread Andres Freund
On 2013-07-25 19:24:53 -0400, Robert Haas wrote:
 (And there was, hopefully, much rejoicing.)

Definitely! Thanks.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] getting rid of SnapshotNow

2013-07-26 Thread Robert Haas
On Fri, Jul 26, 2013 at 8:49 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 OK.  I've taken care of all remaining uses of SnapshotNow in the code
 base.  I think we can go ahead and remove it, now.  Patch attached.

 (And there was, hopefully, much rejoicing.)

 What about SnapshotSelf?

Well, that's still used in _bt_check_unique, unique_key_recheck
(trigger function to do a deferred uniqueness check), RI_FKey_check,
and rather extensively by sepgsql.  I don't really have much desire to
do the work to get rid of it, though.

Getting rid of SnapshotNow is arguably important on the grounds that
third-party code may be using it, and doing this will force them to do
it the new way instead, and that's got some value.  I'm not sure if
anything's already been committed that relies on MVCC catalog access,
but several things have certainly been proposed and it's a good bet
that 9.4 will rely on the catalog access using MVCC-semantics, so
forcing third-party code to stop using SnapshotNow will prevent subtle
bugs.

But there's no similar joy for SnapshotSelf.  You can argue that all
of the things that we're doing with it are crufty, but nobody's
complaining about any of them, and some of them are in places where
the cost of an additional MVCC snapshot on every iteration might be
much more serious than anything we ever saw for catalog scans.  So I'm
personally content to leave it well enough alone.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Extensions makefiles - coverage

2013-07-26 Thread Robert Haas
On Thu, Jul 25, 2013 at 11:07 AM, Ronan Dunklau rdunk...@gmail.com wrote:
 Hello.

 I was having trouble figuring out how to use the coverage targets when
 using an extension.
 I am using approximately the layout that was proposed here:
 http://www.postgresql.org/message-id/51bb1b6e.2070...@dunslane.net
 It looks like everything is hard-coded to take the source and the
 gcda, gcno files in the base directory, but these files lie in a src
 directory with the proposed layout.

 It may be better to base the .gcda file discovery on the OBJS
 variables when using PGXS.

 Please find attached a small patch that implements this. There is
 probably a better way than the redundant rm $(gcda_files) / rm *.gcda
 to clean up the generated files.

 With the attached patch, the following targets seem to have the same
 behaviour as on the current HEAD, both on the whole tree and on
 individual contrib modules:

 - coverage-html
 - clean
 - coverage-clean
 - clean-coverage

 I noticed that make clean leaves gcda and gcov files on the current
 HEAD, and this is no different with the given patch.

 I also tested it against several pgxn extensions, and it seems to work fine.

You can ensure that your patch doesn't get forgotten about by adding it here:

https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread didier
Hi,


On Fri, Jul 26, 2013 at 11:42 AM, Greg Smith g...@2ndquadrant.com wrote:

 On 7/25/13 6:02 PM, didier wrote:

  It was surely already discussed, but why isn't PostgreSQL writing
  its cache sequentially to a temporary file?


 If you do that, reads of the data will have to traverse that temporary
 file to assemble their data.  You'll make every later reader pay the random
 I/O penalty that's being avoided right now.  Checkpoints are already
 postponing these random writes as long as possible. You have to take care
 of them eventually though.


 No, the log file is only used at recovery time.

In the checkpoint code:
- loop over the cache, marking dirty buffers with BM_CHECKPOINT_NEEDED as in
the current code
- other workers can't write and evict these marked buffers to disk;
there's a race with fsync.
- the checkpoint fsyncs now, or after the next step.
- the checkpoint loops again and saves these buffers to the log, clearing
BM_CHECKPOINT_NEEDED but *not* clearing BM_DIRTY; of course many buffers
will be written again, as they are when a checkpoint isn't running.
- checkpoint done.

During recovery you have to load the log into cache first, before applying the WAL.
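For illustration, a compact sketch of that sequence (helper names are
hypothetical; the flags are the ones named above):

---
static void
checkpoint_with_sequential_log(void)
{
	/* pass 1: mark dirty buffers, as the current code does */
	mark_dirty_buffers(BM_CHECKPOINT_NEEDED);

	/* fsync the data files now, or after the next step */
	fsync_data_files();

	/* pass 2: dump the marked buffers sequentially into the log file,
	 * clearing BM_CHECKPOINT_NEEDED but deliberately keeping BM_DIRTY,
	 * so the usual random writes still happen later */
	write_marked_buffers_to_log("checkpoint.log");

	/* recovery then loads checkpoint.log sequentially before WAL replay */
}
---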

Didier


Re: [HACKERS] getting rid of SnapshotNow

2013-07-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Jul 26, 2013 at 8:49 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 What about SnapshotSelf?

 Well, that's still used in _bt_check_unique, unique_key_recheck
 (trigger function to do a deferred uniqueness check), RI_FKey_check,
 and rather extensively by sepgsql.  I don't really have much desire to
 do the work to get rid of it, though.

Hm.  I agree the first three may be all right, but I can't help
suspecting that sepgsql is doing the wrong thing here.

regards, tom lane




Re: [HACKERS] Extensions makefiles - coverage

2013-07-26 Thread Ronan Dunklau
Thank you for the tip, it's done.

2013/7/26 Robert Haas robertmh...@gmail.com:
 On Thu, Jul 25, 2013 at 11:07 AM, Ronan Dunklau rdunk...@gmail.com wrote:
 Hello.

 I was having trouble figuring out how to use the coverage targets when
 using an extension.
 I am using approximately the layout that was proposed here:
 http://www.postgresql.org/message-id/51bb1b6e.2070...@dunslane.net
 It looks like everything is hard-coded to take the source and the
 gcda, gcno files in the base directory, but these files lie in a src
 directory with the proposed layout.

 It may be better to base the .gcda file discovery on the OBJS
 variables when using PGXS.

 Please find attached a small patch that implements this. There is
 probably a better way than the redundant rm $(gcda_files) / rm *.gcda
 to clean up the generated files.

 With the attached patch, the following targets seem to have the same
 behaviour as on the current HEAD, both on the whole tree and on
 individual contrib modules:

 - coverage-html
 - clean
 - coverage-clean
 - clean-coverage

 I noticed that make clean leaves gcda and gcov files on the current
 HEAD, and this is no different with the given patch.

 I also tested it against several pgxn extensions, and it seems to work fine.

 You can ensure that your patch doesn't get forgotten about by adding it here:

 https://commitfest.postgresql.org/action/commitfest_view/open

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith

On 7/26/13 9:14 AM, didier wrote:

During recovery you have to load the log into cache first, before applying WAL.


Checkpoints exist to bound recovery time after a crash.  That is their 
only purpose.  What you're suggesting moves a lot of work into the 
recovery path, which will increase how long recovery takes.


More work at recovery time means someone who uses the default of 
checkpoint_timeout='5 minutes', expecting that crash recovery won't take 
very long, will discover it does take a longer time now.  They'll be 
forced to shrink the value to get the same recovery time as they do 
currently.  You might need to make checkpoint_timeout 3 minutes instead, 
if crash recovery now has all this extra work to deal with.  And when 
the time between checkpoints drops, it will slow the fundamental 
efficiency of checkpoint processing down.  You will end up writing out 
more data in the end.


The interval between checkpoints and recovery time are all related.  If 
you let any one side of the current requirements slip, it makes the rest 
easier to deal with.  Those are all trade-offs though, not improvements. 
 And this particular one is already an option.


If you want less checkpoint I/O per capita and don't care about recovery 
time, you don't need a code change to get it.  Just make 
checkpoint_timeout huge.  A lot of checkpoint I/O issues go away if you 
only do a checkpoint per hour, because instead of random writes you're 
getting sequential ones to the WAL.  But when you crash, expect to be 
down for a significant chunk of an hour, as you go back to sort out all 
of the work postponed before.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] getting rid of SnapshotNow

2013-07-26 Thread Robert Haas
On Fri, Jul 26, 2013 at 9:18 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Jul 26, 2013 at 8:49 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 What about SnapshotSelf?

 Well, that's still used in _bt_check_unique, unique_key_recheck
 (trigger function to do a deferred uniqueness check), RI_FKey_check,
 and rather extensively by sepgsql.  I don't really have much desire to
 do the work to get rid of it, though.

 Hm.  I agree the first three may be all right, but I can't help
 suspecting that sepgsql is doing the wrong thing here.

sepgsql is using SnapshotSelf to find the old version of a tuple that
was updated by the core code just before.  That should be safe in the
sense that there can't be a currently-committing transaction somewhere
else that's updated that tuple, if we know that our own uncommitted
transaction has done a transactional update.  There was a recent
thread discussing whether another API might be better, and I'd be
prepared to concede that it might be.  But I don't think it's
drop-dead broken.

Not that I really object if someone wants to have a go at getting rid
of SnapshotSelf, but I think it'd be worth articulating what we hope
to accomplish by so doing.  For example, the btree README says the
following about the deletion algorithm:

---
...  The reason we do it is to
provide an interlock between non-full VACUUM and indexscans.  Since VACUUM
deletes index entries before deleting tuples, the super-exclusive lock
guarantees that VACUUM can't delete any heap tuple that an indexscanning
process might be about to visit.  (This guarantee works only for simple
indexscans that visit the heap in sync with the index scan, not for bitmap
scans.  We only need the guarantee when using non-MVCC snapshot rules; in
an MVCC snapshot, it wouldn't matter if the heap tuple were replaced with
an unrelated tuple at the same TID, because the new tuple wouldn't be
visible to our scan anyway.)
---

Obviously, when we were using SnapshotNow for catalog access, changing
anything here was a non-starter.  But now that we're not, it might be
worth asking whether there are few enough users of non-MVCC rules that
we could apply some suitable treatment to those that remain and then
change the locking protocol here.  And if we did do that, would there
be enough performance benefit to justify the work?  I don't have
answers to those questions, and the answer may well be that we should
leave things as they are, but I think the questions are worth thinking
about.

Aside from any possible advantage in further trimming the list of
available snapshot types, I'd like to spend some time thinking about
what we can do that's safe and useful in terms of reducing lock
levels; or even adding completely new facilities that would have been
DOA in the old world.  I have high hopes in both areas, but I wouldn't
be surprised to find that there are problems we haven't thought about
yet.  I think our dependence on SnapshotNow has wormed itself into our
design choices in deep ways, and I suspect it's going to take a good
deal of thought to figure out exactly what we can improve and how.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith

On 7/26/13 8:32 AM, Tom Lane wrote:

What I'd point out is that that is exactly what WAL does for us, ie
convert a bunch of random writes into sequential writes.  But sooner or
later you have to put the data where it belongs.


Hannu was observing that SSD often doesn't do that at all.  They can 
maintain logical->physical translation tables that track where each 
block was written to, forever.  When read seeks are really inexpensive, 
the only pressure to reorder blocks is wear leveling.


That doesn't really help with regular drives though, where the low seek 
time assumption doesn't play out so well.  The whole idea of writing 
things sequentially and then sorting them out later was all the rage in 2001 
for ext3 on Linux, as part of the data=journal mount option.  You can 
go back and see that people were confused but excited about the 
performance at 
http://www.ibm.com/developerworks/linux/library/l-fs8/index.html


Spoiler:  if you use a workload that has checkpoint issues, it doesn't 
help PostgreSQL latency.  Just like using a large write cache, you gain 
some burst performance, but eventually you pay for it with extra latency 
somewhere.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] getting rid of SnapshotNow

2013-07-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Jul 26, 2013 at 9:18 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Well, that's still used in _bt_check_unique, unique_key_recheck
 (trigger function to do a deferred uniqueness check), RI_FKey_check,
 and rather extensively by sepgsql.  I don't really have much desire to
 do the work to get rid of it, though.

 Hm.  I agree the first three may be all right, but I can't help
 suspecting that sepgsql is doing the wrong thing here.

 sepgsql is using SnapshotSelf to find the old version of a tuple that
 was updated by the core code just before.

Oh.  OK, then it reduces to the same case as the other three, ie we're
looking at tuples we know to be update-locked.

 [ interesting ruminations snipped ]

Yeah, removing SnapshotNow catalog access certainly opens the doors
for a lot of new thinking.

regards, tom lane




Re: [HACKERS] getting rid of SnapshotNow

2013-07-26 Thread Andres Freund
On 2013-07-26 09:50:32 -0400, Robert Haas wrote:
 sepgsql is using SnapshotSelf to find the old version of a tuple that
 was updated by the core code just before.  That should be safe in the
 sense that there can't be a currently-committing transaction somewhere
 else that's updated that tuple, if we know that our own uncommitted
 transaction has done a transactional update.  There was a recent
 thread discussing whether another API might be better, and I'd be
 prepared to concede that it might be.  But I don't think it's
 drop-dead broken.

It's safe for the tuples updated in that transaction, but it's not safe
to look at anything else if you expect results without the SnapshotNow
problems. E.g. looking at a newly created attribute is fine, but
iterating over all attributes is not necessarily.

I am more concerned about the care needed when placing
CommandCounterIncrement()'s somewhere though. It seems more than likely
that this will get repeatedly broken, even if it's not atm (which I
doubt). E.g. inheritance handling seems to be rather wonky WRT this.

 Not that I really object if someone wants to have a go at getting rid
 of SnapshotSelf, but I think it'd be worth articulating what we hope
 to accomplish by so doing.

Agreed. From the internal usages there doesn't seem to be too much
pressure.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Condition to become the standby mode.

2013-07-26 Thread Fujii Masao
On Fri, Jul 26, 2013 at 3:08 PM, Andres Freund and...@2ndquadrant.com wrote:
 Hi,

 On 2013-07-26 13:19:34 +0900, Sawada Masahiko wrote:
  When the slave server starts, it performs the following
  steps in StartupXLOG():
  1. Read the latest CheckPoint record LSN from the pg_control file.
  2. Try to fetch the CheckPoint record from the pg_xlog directory first.
    ( The server tries to read back to the prior CheckPoint record from pg_control )
  3. If it is not in pg_xlog, the slave server requests the CheckPoint
  record from the primary server.

  In #3, this works only when StandbyMode is true. For StandbyMode to be
  true, the database cluster state should be DB_SHUTDOWNED (it is one of
  the conditions).
  That is, the slave server can try to fetch the checkpoint record from the
  master server only after the slave server shut down cleanly.

 It also will fetch the xlog from remote when we've found a backup label
 (the read_backup_label() branch in StartupXLOG() will set StandbyMode to
 true). Which will be the case if we're recovering from something like a
 base backup.

 The reason we don't fetch from remote if there's no backup label *and*
 the last checkpoint wasn't a shutdown checkpoint is that that's doing
 *local* crash recovery.

Yes, in that case, local crash recovery occurs first. After it ends (i.e.,
we reach the end of the WAL available locally), we can enable standby
mode and fetch the WAL from the remote server.

  If this problem is solved, it becomes possible to fail back
  by removing all the WAL records in pg_xlog before the server
  starts as the slave server.
  ( And if we also use synchronous_transfer, which I'm proposing, I think
  we can fail back without taking a full backup )

 I still have *massive* doubts about the concept. But anyway, if you want
 to do so, you should generate a backup label that specifies the startup
 location.

Generating a backup label doesn't seem to be enough because there is
no backup-end WAL record and we cannot know the consistency point.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Condition to become the standby mode.

2013-07-26 Thread Andres Freund
On 2013-07-26 23:47:59 +0900, Fujii Masao wrote:
  If this problem is solved, it becomes possible to fail back
  by removing all the WAL records in pg_xlog before the server
  starts as the slave server.
  ( And if we also use synchronous_transfer, which I'm proposing, I think
  we can fail back without taking a full backup )
 
  I still have *massive* doubts about the concept. But anyway, if you want
  to do so, you should generate a backup label that specifies the startup
  location.
 
 Generating a backup label doesn't seem to be enough because there is
 no backup-end WAL record and we cannot know the consistency point.

Since 9.2 we allow generation of base backups from standbys; the
infrastructure built for that should be sufficient to pass the LSN at which
consistency is achieved.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread didier
Hi,


On Fri, Jul 26, 2013 at 3:41 PM, Greg Smith g...@2ndquadrant.com wrote:

 On 7/26/13 9:14 AM, didier wrote:

 During recovery you have to load the log into cache first, before applying
 WAL.


 Checkpoints exist to bound recovery time after a crash.  That is their
 only purpose.  What you're suggesting moves a lot of work into the recovery
  path, which will increase how long recovery takes.

 Yes, it's slower, but you're sequentially reading only one file, at most
the size of your buffer cache; moreover it takes constant time.

Let's say you make a checkpoint and crash just after, with a next-to-empty
WAL.

Now recovery is very fast, but you have to repopulate your cache with
random reads from requests.

With the snapshot it's slower, but you read, sequentially again, a lot of
hot cache you will need later when the db starts serving requests.

Of course the worst case is if it crashes just before a checkpoint: most of
the snapshot data are stale and will be overwritten by WAL ops.

But if the WAL recovery is CPU bound, loading from the snapshot may be
done concurrently while replaying the WAL.

More work at recovery time means someone who uses the default of
 checkpoint_timeout='5 minutes', expecting that crash recovery won't take
 very long, will discover it does take a longer time now.  They'll be forced
 to shrink the value to get the same recovery time as they do currently.
  You might need to make checkpoint_timeout 3 minutes instead, if crash
 recovery now has all this extra work to deal with.  And when the time
 between checkpoints drops, it will slow the fundamental efficiency of
 checkpoint processing down.  You will end up writing out more data in the
 end.

Yes, it's a trade-off: now you're paying the price at checkpoint time, every
time; with the log you're paying only once, at recovery.


 The interval between checkpoints and recovery time are all related.  If
 you let any one side of the current requirements slip, it makes the rest
 easier to deal with.  Those are all trade-offs though, not improvements.
  And this particular one is already an option.

 If you want less checkpoint I/O per capita and don't care about recovery
 time, you don't need a code change to get it.  Just make checkpoint_timeout
 huge.  A lot of checkpoint I/O issues go away if you only do a checkpoint
 per hour, because instead of random writes you're getting sequential ones
 to the WAL.  But when you crash, expect to be down for a significant chunk
 of an hour, as you go back to sort out all of the work postponed before.

It's not the same: it's a snapshot saved and loaded in constant time, unlike
the WAL log.

Didier


Re: [HACKERS] Condition to become the standby mode.

2013-07-26 Thread Fujii Masao
On Fri, Jul 26, 2013 at 11:55 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-07-26 23:47:59 +0900, Fujii Masao wrote:
  If this problem is solved, it becomes possible to fail back
  by removing all the WAL records in pg_xlog before the server
  starts as the slave server.
  ( And if we also use synchronous_transfer, which I'm proposing, I think
  we can fail back without taking a full backup )
 
  I still have *massive* doubts about the concept. But anyway, if you want
  to do so, you should generate a backup label that specifies the startup
  location.

 Generating a backup label doesn't seem to be enough because there is
 no backup-end WAL record and we cannot know the consistency point.

 Since 9.2 we allow generation of base backups from standbys; the
 infrastructure built for that should be sufficient to pass the LSN at which
 consistency is achieved.

Yeah, right. I'd forgotten about that.

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_upgrade -j broken on Windows

2013-07-26 Thread Bruce Momjian
On Thu, Jul 25, 2013 at 10:57:28AM -0400, Bruce Momjian wrote:
 Everyone should be aware that the 9.3 pg_upgrade -j/--jobs option on
 Windows is currently broken, and hopefully will be fixed by the next
 beta.
 
 Someone at PGDay UK told me they were getting pg_upgrade -j crashes on
 Windows.  Andrew Dunstan was able to reproduce the crash, and that has
 been fixed, but there is still a race condition that I am trying to
 diagnose.

After three days of testing and creating a Windows MinGW build
environment (with Andrew's help), I have come up with the attached patch
which fixes the pg_upgrade -j race condition on Windows.  In summary,
creating a file with fopen() from a non-primary thread and then calling
system() soon after can result in a file-in-use error.  The solution was
to issue the fopen() after system() in such cases.

This would be applied to head and 9.3.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] pg_upgrade -j broken on Windows

2013-07-26 Thread Bruce Momjian
On Fri, Jul 26, 2013 at 01:27:34PM -0400, Bruce Momjian wrote:
 On Thu, Jul 25, 2013 at 10:57:28AM -0400, Bruce Momjian wrote:
  Everyone should be aware that the 9.3 pg_upgrade -j/--jobs option on
  Windows is currently broken, and hopefully will be fixed by the next
  beta.
  
  Someone at PGDay UK told me they were getting pg_upgrade -j crashes on
  Windows.  Andrew Dunstan was able to reproduce the crash, and that has
  been fixed, but there is still a race condition that I am trying to
  diagnose.
 
 After three days of testing and creating a Windows MinGW build
 environment (with Andrew's help), I have come up with the attached patch
 which fixes the pg_upgrade -j race condition on Windows.  In summary,
 creating a file with fopen() from a non-primary thread and then calling
 system() soon after can result in a file-in-use error.  The solution was
 to issue the fopen() after system() in such cases.
 
 This would be applied to head and 9.3.

Sorry, patch attached.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/exec.c b/contrib/pg_upgrade/exec.c
new file mode 100644
index ef123a8..3ec6e1c
*** a/contrib/pg_upgrade/exec.c
--- b/contrib/pg_upgrade/exec.c
*************** static void validate_exec(const char *di
*** 21,26 ****
--- 21,27 ----
  
  #ifdef WIN32
  static int	win32_check_directory_write_permissions(void);
+ DWORD   mainThreadId = 0;
  #endif
  
  
*************** static int	win32_check_directory_write_p
*** 37,48 ****
   * If throw_error is true, this raises a PG_FATAL error and pg_upgrade
   * terminates; otherwise it is just reported as PG_REPORT and exec_prog()
   * returns false.
   */
  bool
  exec_prog(const char *log_file, const char *opt_log_file,
  		  bool throw_error, const char *fmt,...)
  {
! 	int			result;
  	int			written;
  
  #define MAXCMDLEN (2 * MAXPGPATH)
--- 38,51 ----
   * If throw_error is true, this raises a PG_FATAL error and pg_upgrade
   * terminates; otherwise it is just reported as PG_REPORT and exec_prog()
   * returns false.
+  *
+  * The code requires it be called first from the primary thread on Windows.
   */
  bool
  exec_prog(const char *log_file, const char *opt_log_file,
  		  bool throw_error, const char *fmt,...)
  {
! 	int			result = 0;
  	int			written;
  
  #define MAXCMDLEN (2 * MAXPGPATH)
*************** exec_prog(const char *log_file, const ch
*** 50,55 ****
--- 53,64 ----
  	FILE	   *log;
  	va_list		ap;
  
+ #ifdef WIN32
+ 	/* We assume we are called from the primary thread first */
+ 	if (mainThreadId == 0)
+ 		mainThreadId = GetCurrentThreadId();
+ #endif
+ 
  	written = strlcpy(cmd, SYSTEMQUOTE, sizeof(cmd));
  	va_start(ap, fmt);
  	written += vsnprintf(cmd + written, MAXCMDLEN - written, fmt, ap);
*************** exec_prog(const char *log_file, const ch
*** 61,66 ****
--- 70,91 ----
  	if (written >= MAXCMDLEN)
  		pg_log(PG_FATAL, "command too long\n");
  
+ 	pg_log(PG_VERBOSE, "%s\n", cmd);
+ 
+ #ifdef WIN32
+ 	/*
+ 	 * For some reason, Windows issues a file-in-use error if we write data
+ 	 * to the log file from a non-primary thread just before we create a
+ 	 * subprocess that also writes to the same log file.  One fix is to
+ 	 * sleep for 100ms.  A cleaner fix is to write to the log file _after_
+ 	 * the subprocess has completed, so we do this only when writing from
+ 	 * a non-primary thread.  fflush(), running system() twice, and
+ 	 * pre-creating the file do not seem to help.
+ 	 */
+ 	if (mainThreadId != GetCurrentThreadId())
+ 		result = system(cmd);
+ #endif
+ 
  	log = fopen(log_file, "a");
  
  #ifdef WIN32
*************** exec_prog(const char *log_file, const ch
*** 84,94 ****
  
  	if (log == NULL)
  		pg_log(PG_FATAL, "cannot write to log file %s\n", log_file);
  #ifdef WIN32
! 	fprintf(log, "\n\n");
  #endif
- 	pg_log(PG_VERBOSE, "%s\n", cmd);
  	fprintf(log, "command: %s\n", cmd);
  
  	/*
  	 * In Windows, we must close the log file at this point so the file is not
--- 109,126 ----
  
  	if (log == NULL)
  		pg_log(PG_FATAL, "cannot write to log file %s\n", log_file);
+ 
  #ifdef WIN32
! 	/* Are we printing "command:" before its output? */
! 	if (mainThreadId == GetCurrentThreadId())
! 		fprintf(log, "\n\n");
  #endif
  	fprintf(log, "command: %s\n", cmd);
+ #ifdef WIN32
+ 	/* Are we printing "command:" after its output? */
+ 	if (mainThreadId != GetCurrentThreadId())
+ 		fprintf(log, "\n\n");
+ #endif
  
  	/*
  	 * In Windows, we must close the log file at this point so the file is not
*************** exec_prog(const char *log_file, const ch
*** 96,102 ****
  	 */
  	fclose(log);
  
! 	result = system(cmd);
  
  	if (result != 0)
  	{
--- 128,138 ----
  	 */
  	fclose(log);
  
! #ifdef WIN32
! 	/* see comment above */
! 	if (mainThreadId == GetCurrentThreadId())
! #endif
! 		result = system(cmd);
  
  	if (result != 0)
  	{
*** exec_prog(const char *log_file, const 

Re: [HACKERS] getting rid of SnapshotNow

2013-07-26 Thread Robert Haas
On Fri, Jul 26, 2013 at 10:16 AM, Andres Freund and...@2ndquadrant.com wrote:
 I am more concerned about the care needed when placing
 CommandCounterIncrement()'s somewhere though. It seems more than likely
 that this will get repeatedly broken, even if it's not atm (which I
 doubt). E.g. inheritance handling seems to be rather wonky WRT this.

There may well be bugs.  I am fine with reviewing patches to improve
the code in this area, but I don't plan to take it upon myself to
rewrite that code.  Either it's working as expected, or nobody's using
it, because we're not getting any bug reports.
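
For anyone who wants to audit, the pattern each call site has to get
right is roughly this (a simplified sketch, not any particular site):

	/* make a catalog change ... */
	simple_heap_insert(rel, tup);
	CatalogUpdateIndexes(rel, tup);

	/*
	 * ... and bump the command counter before anything later in the
	 * same command scans that catalog and needs to see the new row.
	 */
	CommandCounterIncrement();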

 Not that I really object if someone wants to have a go at getting rid
 of SnapshotSelf, but I think it'd be worth articulating what we hope
 to accomplish by so doing.

 Agreed. From the internal usages there doesn't seem to be too much
 pressure.

So unless there are objections to the patch as posted, I'm going to
apply that next week.  This in no way precludes more work in this area
later, but since we're likely to break third-party code with this
change, we might as well get it out of the way as early in the release
cycle as possible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])

2013-07-26 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Come to think of it, maybe part of the reason we're having such a hard
 time getting to consensus is that people are conflating the snippet
 part with the writable part?  I mean, if you are thinking you want
 system-management tools to be able to drop in configuration fragments as
 separate files, there's a case to be made for a conf.d subdirectory that
 lives somewhere that the postmaster can't necessarily write.  We just
 mustn't confuse that with support for ALTER SYSTEM SET.  I strongly
 believe that ALTER SYSTEM SET must not be designed to write anywhere
 outside $PGDATA.

Agreed.  To continue that thought, I find it *very* unlikely that a
given environment would use *both* a tool like puppet to manage the
files in their conf.d *and* have people using ALTER SYSTEM SET.  You're
going to do one or the other, almost certainly; not least because those
are very likely two different teams, and only one of them is going to
be responsible for the PG system config.
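
If anything, the config-management side reduces to one line in the
real postgresql.conf, pointing at a directory the postmaster need not
be able to write (path hypothetical):

	# fragments owned by the config-management team
	include_dir '/etc/postgresql/conf.d'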

Thanks,

Stephen


Re: [HACKERS] [GENERAL] postgres FDW cost estimation options unrecognized in 9.3-beta1

2013-07-26 Thread Tom Lane
Lonni J Friedman netll...@gmail.com writes:
 nightly=# ALTER SERVER cuda_db10 OPTIONS (SET use_remote_estimate 'true') ;
 ERROR:  option "use_remote_estimate" not found

 Am I doing something wrong, or is this a bug?

[ experiments... ]  You need to say ADD, not SET, to add a new option to
the list.  SET might more appropriately be spelled REPLACE, because it
requires that the object already have a defined value for the option,
which will be replaced.

Our documentation appears not to disclose this fine point, but a look
at the SQL-MED standard says it's operating per spec.  The standard also
says that ADD is an error if the option is already defined, which is a
bit more defensible, but still not exactly what I'd call user-friendly.
And the error we issue for that case is pretty misleading too:

regression=# ALTER SERVER cuda_db10 OPTIONS (use_remote_estimate 'true') ;
ALTER SERVER
regression=# ALTER SERVER cuda_db10 OPTIONS (use_remote_estimate 'false') ;
ERROR:  option "use_remote_estimate" provided more than once

I think we could do with both more documentation, and better error
messages for these cases.  In the SET-where-you-should-use-ADD case,
perhaps

ERROR:  option "use_remote_estimate" has not been set
HINT: Use ADD not SET to define an option that wasn't already set.

In the ADD-where-you-should-use-SET case, perhaps

ERROR:  option "use_remote_estimate" is already set
HINT: Use SET not ADD to change an option's value.

The "provided more than once" wording would be appropriate if the same
option is specified more than once in the command text, but I'm not sure
that it's worth the trouble to detect that case.
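
In code form, I'd imagine something like this wherever the OPTIONS list
is merged (an untested sketch; "isset" stands in for however we detect
that the option is already defined):

	if (def->defaction == DEFELEM_SET && !isset)
		ereport(ERROR,
				(errcode(ERRCODE_UNDEFINED_OBJECT),
				 errmsg("option \"%s\" has not been set", def->defname),
				 errhint("Use ADD not SET to define an option that wasn't already set.")));
	else if (def->defaction == DEFELEM_ADD && isset)
		ereport(ERROR,
				(errcode(ERRCODE_DUPLICATE_OBJECT),
				 errmsg("option \"%s\" is already set", def->defname),
				 errhint("Use SET not ADD to change an option's value.")));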

Thoughts, better wordings?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] postgres FDW cost estimation options unrecognized in 9.3-beta1

2013-07-26 Thread Lonni J Friedman
On Fri, Jul 26, 2013 at 3:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Lonni J Friedman netll...@gmail.com writes:
 nightly=# ALTER SERVER cuda_db10 OPTIONS (SET use_remote_estimate 'true') ;
 ERROR:  option "use_remote_estimate" not found

 Am I doing something wrong, or is this a bug?

 [ experiments... ]  You need to say ADD, not SET, to add a new option to
 the list.  SET might more appropriately be spelled REPLACE, because it
 requires that the object already have a defined value for the option,
 which will be replaced.

 Our documentation appears not to disclose this fine point, but a look
 at the SQL-MED standard says it's operating per spec.  The standard also
 says that ADD is an error if the option is already defined, which is a
 bit more defensible, but still not exactly what I'd call user-friendly.
 And the error we issue for that case is pretty misleading too:

 regression=# ALTER SERVER cuda_db10 OPTIONS (use_remote_estimate 'true') ;
 ALTER SERVER
 regression=# ALTER SERVER cuda_db10 OPTIONS (use_remote_estimate 'false') ;
 ERROR:  option "use_remote_estimate" provided more than once

 I think we could do with both more documentation, and better error
 messages for these cases.  In the SET-where-you-should-use-ADD case,
 perhaps

 ERROR:  option "use_remote_estimate" has not been set
 HINT: Use ADD not SET to define an option that wasn't already set.

 In the ADD-where-you-should-use-SET case, perhaps

 ERROR:  option "use_remote_estimate" is already set
 HINT: Use SET not ADD to change an option's value.

 The "provided more than once" wording would be appropriate if the same
 option is specified more than once in the command text, but I'm not sure
 that it's worth the trouble to detect that case.

 Thoughts, better wordings?

Thanks Tom, I've confirmed that using ADD was the solution.  I think
your suggested ERROR & HINT text is an excellent improvement.  It
definitely would have given me the clue I was missing earlier.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Patch to add support of IF NOT EXISTS to others CREATE statements

2013-07-26 Thread Abhijit Menon-Sen
At 2013-07-26 10:39:00 +0200, karl...@gmail.com wrote:

 Hello, as I can see there are more inconsistent places.

Right. This is what I was referring to in my original review. All of the
relevant sites (pre-patch) that currently do:

    if (already exists)
        ereport(ERROR …)

should instead be made to do:

    if (already exists)
    {
        if (ifNotExists)
        {
            ereport(NOTICE …)
            return
        }

        ereport(ERROR …)
    }

or even (very slightly easier to review):

    if (already exists && ifNotExists)
    {
        ereport(NOTICE …)
        return
    }

    if (already exists)
        ereport(ERROR …)

I don't care much which of the two is used, so long as it's (a) the same
everywhere, and (b) there's no "if (!ifNotExists)" anywhere.
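
In real C, each site would then come out looking something like this
(names hypothetical, just to show the shape):

    if (OidIsValid(existingOid))
    {
        if (stmt->if_not_exists)
        {
            ereport(NOTICE,
                    (errcode(ERRCODE_DUPLICATE_OBJECT),
                     errmsg("widget \"%s\" already exists, skipping",
                            stmt->name)));
            return InvalidOid;
        }

        ereport(ERROR,
                (errcode(ERRCODE_DUPLICATE_OBJECT),
                 errmsg("widget \"%s\" already exists", stmt->name)));
    }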

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] currval and DISCARD ALL

2013-07-26 Thread Fabrízio de Royes Mello

On 25-07-2013 05:32, suresh.balasubra wrote:

Disclaimer: I am no hacker, just a PostgreSQL user, trying to provide a user
scenario where DISCARD SEQUENCES functionality is required.

We have designed and developed a small application-development platform for
which the backend is PostgreSQL.

There is a DBLayer which is responsible for generating SQL statements for all
the INSERT, UPDATE, and DELETE operations. Data can be pushed into multiple
tables using this layer. What we provide to this layer is just a DataSet
(.NET). DataTables in the DataSet will be named after their respective
tables. We also use DataColumn extended properties to push in additional
logic. All of this happens within a transaction and in one shot: just one
hit to the DB, by appending SQL statements in the proper order.

There is an interesting feature that we have built into this DBLayer:
auto linking. All tables in our system have a serial field 'id'.
Suppose there is a master table (let us call it 'voucher') and a detail
table ('voucher_lines'), and we are using the layer to push one record to
the master table and 50 records to the detail table. The 'voucher_lines'
table will have an integer column 'voucher_id'. '*_id' fields are
automatically populated with currval('*_id_seq'). All this works like a charm.

Now, imagine we want to push data into another table (say 'invoice') which
also has a field 'voucher_id'. This is a different activity, not connected
with the above-mentioned transaction. In this scenario the field still gets
populated, because currval('voucher_id_seq') returns a value. But we do not
want it to be populated. What we want is to resolve '*_id' fields into
values only within a transaction; after the transaction we want to be done
with the session state. If we could have cleared that state somehow
(DISCARD ALL does it, but not the currval sequence info) after the first
transaction, it would have worked for us.



I already sent a patch to implement DISCARD SEQUENCES [1], which I expect
to be part of the 9.4 release.


Regards,

[1] https://commitfest.postgresql.org/action/patch_view?id=1171

--
   Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers