Re: [HACKERS] Could we replace SysV semaphores with latches?

2012-06-07 Thread Heikki Linnakangas

On 07.06.2012 07:09, Tom Lane wrote:

There has been regular griping in this list about our dependence on SysV
shared memory, but not so much about SysV semaphores, even though the
latter cause their fair share of issues; as seen for example in
buildfarm member spoonbill's recent string of failures:

creating template1 database in 
/home/pgbuild/pgbuildfarm/HEAD/pgsql.25563/src/test/regress/./tmp_check/data/base/1
 ... FATAL:  could not create semaphores: No space left on device
DETAIL:  Failed system call was semget(1, 17, 03600).
HINT:  This error does *not* mean that you have run out of disk space.  It 
occurs when either the system limit for the maximum number of semaphore sets 
(SEMMNI), or the system wide maximum number of semaphores (SEMMNS), would be 
exceeded.  You need to raise the respective kernel parameter.  Alternatively, 
reduce PostgreSQL's consumption of semaphores by reducing its max_connections 
parameter.
The PostgreSQL documentation contains more information about 
configuring your system for PostgreSQL.
child process exited with exit code 1

It strikes me that we have recently put together an independent but just
about equivalent waiting mechanism in the form of latches.  And not only
that, but there's already a latch for each process.  Could we replace
our usages of SysV semaphores with WaitLatch on the procLatch?  Unlike
the situation with shared memory where we need some secondary features
(mumble shm_nattch mumble), I think we aren't really using anything
interesting about SysV semaphores except for the raw ability to wait for
somebody to signal us.


We would have to performance-test that carefully. We use semaphores in 
lwlocks, so it's performance-critical. A latch might well be slower, 
especially on platforms where a signal does not interrupt sleep and we 
rely on the signal handler and self-pipe to wake up.


Although perhaps we could improve the latch implementation. pselect() 
might perform better than the self-pipe trick, on platforms where it works.
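Just to make the idea concrete, a latch-based wait for a semaphore-like
counter might look something like this. This is only a sketch: the counter
and the try_decrement() helper are made up for illustration (built here on a
GCC builtin), not existing code, and the real lwlock path would need much
more care:

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/latch.h"
    #include "storage/proc.h"

    /* Hypothetical atomic decrement-if-positive helper (not existing code). */
    static bool
    try_decrement(volatile int *count)
    {
        int     old = *count;

        while (old > 0)
        {
            if (__sync_bool_compare_and_swap(count, old, old - 1))
                return true;
            old = *count;
        }
        return false;
    }

    /* Sketch of a semaphore-style wait built on the per-process latch. */
    static void
    semaphore_lock_via_latch(volatile int *count)
    {
        for (;;)
        {
            if (try_decrement(count))
                return;             /* got the "semaphore" */

            /* block until some other backend sets our latch */
            (void) WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1L);
            ResetLatch(&MyProc->procLatch);
        }
    }

The unlock side would then increment the counter and SetLatch() the waiter's
procLatch.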


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Interrupting long external library calls

2012-06-07 Thread Sandro Santilli
On Sat, May 26, 2012 at 01:24:26PM +0200, Florian Pflug wrote:
 On May26, 2012, at 12:40 , Simon Riggs wrote:
  On 25 May 2012 17:34, Tom Lane t...@sss.pgh.pa.us wrote:
  I assume that the geos::util::Interrupt::request() call sets a flag
  somewhere that's going to be periodically checked in long-running
  loops.  Would it be possible for the periodic checks to include a
  provision for a callback into Postgres-specific glue code, wherein
  you could test the same flags CHECK_FOR_INTERRUPTS does?  A similar
  approach might then be usable in other contexts, and it seems safer
  to me than messing with a host environment's signal handling.
  
  Can we do that as a macro, e.g.
  
  POSTGRES_LIBRARY_INTERRUPT_CHECK()
  
  Otherwise the fix to this problem may be worse - faulty interrupt
  handlers are worse than none at all.
 
 As it stands, ProcessInterrupts() sometimes returns instead of
 ereport()ing even if InterruptPending is set, interrupts aren't
 held off, and the code isn't in a critical section. That happens if
 QueryCancelPending (or however that's called) gets reset after a
 SIGINT arrived but before CHECK_FOR_INTERRUPTS() is called. Or at
 least that is how I interpret the comment at the bottom of that
 function...
 
 We could thus easily provide POSTGRES_LIBRARY_INTERRUPT_CHECK() if
 we're content with the (slim, probably) chance of false positives.
 
 Or we'd need to refactor things in a way that allows the hypothetical
 POSTGRES_LIBRARY_INTERRUPT_CHECK() to re-use the tests in
 ProcessInterrupts(), but without modifying any of the flags.

So back to this, shortly after discovering (sorry for the ignorance, but I
just don't care for programming in a black-box environment) that Windows
doesn't support POSIX signals.

If I understood correctly, the CHECK_FOR_INTERRUPTS PostgreSQL macro also
takes care of event dispatching on Windows, so that if nothing calls that
macro there's no way a pqsignal handler would be invoked. Is that right?

In that case I can understand Tom's advice about providing a callback,
and then I would only need to perform the event-flushing part from
within the callback, and only for Windows.

Does that sound right?
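To make sure I understand the shape of it, something like this is what I
have in mind (just a sketch; the GEOS hook name and the exact flag test are
assumptions on my part):

    #include "postgres.h"
    #include "miscadmin.h"
    #include "geos_c.h"     /* assumed to declare GEOS_interruptRequest() */

    /*
     * Sketch of a Postgres-side glue callback that GEOS would invoke from its
     * periodic interrupt checks.  It tests the same flags CHECK_FOR_INTERRUPTS
     * consults, but never ereport()s from inside foreign library code; it just
     * asks GEOS to unwind, and the next real CHECK_FOR_INTERRUPTS() does the
     * rest.
     */
    static void
    postgis_interrupt_callback(void)
    {
        if (InterruptPending && (QueryCancelPending || ProcDiePending))
            GEOS_interruptRequest();
    }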

--strk; 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Ability to listen on two unix sockets

2012-06-07 Thread Gurjeet Singh
On Wed, Jun 6, 2012 at 1:55 PM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Jun 6, 2012 at 10:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Well, that's what I wanted to discuss before Honza starts coding.
  It's not obvious that there are any use-cases for more than two.
  It's also not clear whether there is any value in supporting run-time
  rather than build-time configuration of the socket locations.  The
  Fedora use-case has no need of that, but if people can point to other
  cases where it would be sensible, we can write the patch that way.
 
  You might think we should design this exactly like the TCP-socket
  multiple-listen-addresses case, ie just have a config variable
  containing a list of directory names.  The sticking point there is
  that the directories aren't really interchangeable.  In particular,
  there is still going to be one directory that is the one hard-wired
  into libpq.  So whereas multiple TCP sockets really are pretty much
  interchangeable, I think in the Unix-socket case we are going to have
  to think of it as being a primary socket and one or more alternate
  sockets.  Is there a reason to have more than one alternate, and if
  so what is the use-case?
 
  (BTW, we would probably just adopt the Debian solution if we were
  sure there were no non-libpq clients out there; but we aren't.)

 I recently had an urge to make it possible for the postmaster to
 listen on multiple ports and even went so far as to code up a patch to
 allow that.  It still applies, with offsets, so I'll attach it here.
 So I guess I'm +1 on the idea of allowing N UNIX sockets rather than
 limiting it to N=2, and really I'd like to do one better and allow
 listening on multiple TCP ports as well.  Since the PID file contains
 the port number, multiple TCP sockets stop being interchangeable as
 soon as you allow multiple ports, but that's not very difficult to
 handle.  Now, you might ask whether this has any real-world value, and
 obviously I'm going to say yes or I wouldn't be proposing it.  The
 reason for wanting multiple UNIX sockets is because those sockets
 might be in different places that are not all equally accessible to
 everyone, because of things like chroot.  But of course the same thing
 is possible in the network space using iptables and similar tools.
 For example, you might want to have users connect to application A
 using port 5432, and to  application B using port 15432.  Now you can
 use network monitoring tools to see how much data each application is
 sending and receiving, without needing deep packet inspection.  You
 can firewall those ports differently to provide access to different
 groups of users.  And you can even decide, if the database gets
 overloaded, to cut off access to one of those ports, so that the
 application causing the problem becomes inaccessible but the rest of
 the database ceases being overloaded and you can still operate.  Of
 course, you could also do that by changing pg_hba.conf, but for some
 people it might be more convenient (or feel more bullet-proof) to do
 it using network management tools.  There are probably other use
 cases, as well.


+1 for multiple TCP port numbers.

A few days ago I started working on enabling Postgres to communicate using
WebSockets protocol (acting as a wrapper around FE/BE), and I found it
difficult (not impossible) to use the same port for communicating FE/BE
protocol and for https+WebSockets too. It would have been a lot simpler if
I could say that WebSockets is enabled on 5431 and FE/BE on 5432.

Regards,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Avoiding adjacent checkpoint records

2012-06-07 Thread Simon Riggs
On 6 June 2012 20:08, Tom Lane t...@sss.pgh.pa.us wrote:

 In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the
 rule for when to skip checkpoints on the grounds that not enough
 activity has happened since the last one.  However, that commit left the
 comment block about it in a nonsensical state:

    * If this isn't a shutdown or forced checkpoint, and we have not switched
    * to the next WAL file since the start of the last checkpoint, skip the
    * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
    * when the system is idle. That wastes log space, and more importantly it
    * exposes us to possible loss of both current and previous checkpoint
    * records if the machine crashes just as we're writing the update.
    * (Perhaps it'd make even more sense to checkpoint only when the previous
    * checkpoint record is in a different xlog page?)

 The new code entirely fails to prevent writing adjacent checkpoint
 records, because what it checks is the distance from the previous
 checkpoint's REDO pointer, not the previous checkpoint record itself.

You were completely involved in the original discussion of this, as
were others.
I made the change to use the redo pointer at your request. So when you
say "Simon did this", what you mean is that Simon acted according to group
consensus on an agreed feature.

How come this is a valid discussion? Why does making changes here make
sense when other changes are said to destabilise the code and delay
release?

Should I revisit all the things others have done that I disagree with as well?

No, I should not. Nor should we be trawling through changes made by me either.

I'll look some more at this, but you should consider why this thread
even exists.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time for pgindent run?

2012-06-07 Thread Magnus Hagander
On Thursday, June 7, 2012, Robert Haas wrote:

 On Tue, Jun 5, 2012 at 10:25 AM, Bruce Momjian br...@momjian.us wrote:
  On Tue, Jun 05, 2012 at 10:21:14AM -0400, Tom Lane wrote:
  Bruce Momjian br...@momjian.us writes:
   Is everyone ready for me to run pgindent?  We are nearing the first
   commit-fest (June 15) and will have to branch the git tree soon.
 
  Also, we should do the pgindent run well before the commitfest, so that
  authors of pending patches have time to rebase their patches in case
  pgindent changes the code they are patching ...
 
  Ah, good point.  That will affect commit-fest patches.  We could run it
  only on the 9.3 branch, but that makes double-patching very hard.
 
  Is everyone good for a pgindent run this week?

 The sooner the better.


+1.

Anything people have pending for 9.2 has to be pretty small by now (I
certainly hope), and it would give contributors a good chance to rebase
before the first CF for 9.3.

//Magnus



-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


[HACKERS] SP-GIST docs and examples

2012-06-07 Thread Simon Riggs
I know Oleg has been discussing quad trees for years now, so SP-GIST
sounds like a great feature.

The docs on SP-GIST simply suggest people read the code, which is way
below our normal standard of documentation.

CREATE INDEX contains no examples involving SP-GIST indexes.

It seems likely that the first thing a user will ask is "When can I use
it? How do I create one? When should I use it in comparison with
existing methods?"

ISTM that if we mention this as a major feature in 9.2, and I think we
should, there should be something more about these in the shared
docs.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] \conninfo and SSL

2012-06-07 Thread Magnus Hagander
On Wednesday, June 6, 2012, Alvaro Herrera wrote:


 Excerpts from Robert Haas's message of mié jun 06 14:45:46 -0400 2012:
  On Sun, Jun 3, 2012 at 5:30 AM, Alastair Turner b...@ctrlf5.co.za wrote:
   A one-line change adds the SSL info on its own line like
  
   --
   You are connected to database "scratch" as user "scratch" on host
   "127.0.0.1" at port "5432".
   SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)
   --
  
   Does this need a more integrated presentation, and therefore a broader
   change to make it translatable?
 
  +1 for doing it that way.

 Yeah, printSSLInfo already outputs translated stuff so this should be
 OK.  Merging both messages into a single translatable unit would be
 pretty cumbersome, for no practical gain.


Seems like a very low-impact change. Are people Ok with sneaking this into
9.2?
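For reference, the shape of the change would be roughly this (a sketch of
the psql side only; "db" and "host" stand in for the locals psql already has
at that point, and printSSLInfo() is the helper psql already uses at
connection time):

    /* Sketch, not the exact patch: \conninfo output plus the SSL line. */
    static void
    print_conninfo_with_ssl(const char *db, const char *host)
    {
        printf(_("You are connected to database \"%s\" as user \"%s\" on host \"%s\" at port \"%s\".\n"),
               db, PQuser(pset.db), host, PQport(pset.db));
        printSSLInfo();     /* prints "SSL connection (cipher: ..., bits: ...)" */
    }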

//Magnus



-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] pg_receivexlog and feedback message

2012-06-07 Thread Magnus Hagander
On Thursday, June 7, 2012, Fujii Masao wrote:

 On Thu, Jun 7, 2012 at 5:05 AM, Magnus Hagander mag...@hagander.net wrote:
  On Wed, Jun 6, 2012 at 8:26 PM, Fujii Masao masao.fu...@gmail.com wrote:
  On Tue, Jun 5, 2012 at 11:44 PM, Magnus Hagander mag...@hagander.net wrote:
  On Tue, Jun 5, 2012 at 4:42 PM, Fujii Masao masao.fu...@gmail.com wrote:
  On Tue, Jun 5, 2012 at 9:53 PM, Magnus Hagander mag...@hagander.net wrote:
   Right now, pg_receivexlog sets:
  replymsg->write = InvalidXLogRecPtr;
  replymsg->flush = InvalidXLogRecPtr;
  replymsg->apply = InvalidXLogRecPtr;
  
   when it sends its status updates.
  
   I'm thinking it should set replymsg->write = blockpos instead.
 
  Why? That way you can see in pg_stat_replication what has actually
  been received by pg_receivexlog - not just what we last sent. This
 can
  be useful in combination with an archive_command that can block WAL
  recycling until it has been saved to the standby. And it would be
  useful as a general monitoring thing as well.
 
   I think the original reason was that it shouldn't interfere with
  synchronous replication - but it does take away a fairly useful
  usecase...
 
   I think that not only replymsg->write but also ->flush should be set
  to
   blockpos in pg_receivexlog, which allows pg_receivexlog to behave
   as a synchronous standby, so we can write WAL to both local and remote
   synchronously. I believe there are some use cases for synchronous
   pg_receivexlog.
 
  pg_receivexlog doesn't currently fsync() after every write. It only
  fsync():s complete files. So we'd need to set ->flush only at the end
  of a segment, right?
 
  Yes.
 
  Currently the status update is sent for each status interval. In sync
  replication, transaction has to wait for a while even after
 pg_receivexlog
  has written or flushed the WAL data.
 
  So we should add new option which specifies whether pg_receivexlog
  sends the status packet back as soon as it writes or flushes the WAL
  data, like the walreceiver does?
 
  That might be useful, but I think that's 9.3 material at this point.

 Fair enough. That's new feature rather than a bugfix.

   But I think we can get the change to set the write location in as a bugfix.

 Also set the flush location? Sending the flush location back seems
 helpful when using pg_receivexlog for WAL archiving purposes. By
 seeing the flush location we can ensure that the WAL file has been archived
 durably (IOW, that the WAL file has been flushed in the remote archive area).


You can do that with the write location as well, as long as you round it
off to complete segments, can't you?

In fact that's exactly the usecase that got me to realize we were missing
this :-)
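So concretely, something like this is what I'm after (sketch only;
"lastFlushedSegmentEnd" is just a stand-in name for however we end up
tracking the last completed, fsync():ed segment):

    /* Sketch of the status reply we're discussing; only the field names
     * replymsg->write/flush/apply are real, the rest is illustrative. */
    replymsg->write = blockpos;                 /* what we have actually received/written */
    replymsg->flush = lastFlushedSegmentEnd;    /* end of last completed, fsynced segment,
                                                 * if we decide to report flush as well */
    replymsg->apply = InvalidXLogRecPtr;        /* pg_receivexlog never applies WAL */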

//Magnus



-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] SP-GIST docs and examples

2012-06-07 Thread Magnus Hagander
On Thursday, June 7, 2012, Simon Riggs wrote:

 I know Oleg has been discussing quad trees for years now, so SP-GIST
 sounds like a great feature.

 The docs on SP-GIST simply suggest people read the code, which is way
 below our normal standard of documentation.


It's the same as we have for normal GiST indexes, no? It's under "Internals";
I think it's OK in there.


 CREATE INDEX contains no examples involving SP-GIST indexes.

 It seems likely that the first thing a user will ask is "When can I use
 it? How do I create one? When should I use it in comparison with
 existing methods?"


This is the problem though - it only has internals documentation, and no
end-user documentation.

Not sure GiST has that much end-user documentation either. And it is
included in things like the FTI docs.



 ISTM that if we mention this as a major feature in 9.2, and I think we
 should, that there should be something more about these in the shared
 docs.


+1, it should definitely get some more docs if possible.

//Magnus




-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Interrupting long external library calls

2012-06-07 Thread Florian Pflug
On Jun7, 2012, at 10:20 , Sandro Santilli wrote:
 On Sat, May 26, 2012 at 01:24:26PM +0200, Florian Pflug wrote:
 On May26, 2012, at 12:40 , Simon Riggs wrote:
 On 25 May 2012 17:34, Tom Lane t...@sss.pgh.pa.us wrote:
 I assume that the geos::util::Interrupt::request() call sets a flag
 somewhere that's going to be periodically checked in long-running
 loops.  Would it be possible for the periodic checks to include a
 provision for a callback into Postgres-specific glue code, wherein
 you could test the same flags CHECK_FOR_INTERRUPTS does?  A similar
 approach might then be usable in other contexts, and it seems safer
 to me than messing with a host environment's signal handling.
 
 Can we do that as a macro, e.g.
 
 POSTGRES_LIBRARY_INTERRUPT_CHECK()
 
 Otherwise the fix to this problem may be worse - faulty interrupt
 handlers are worse than none at all.
 
 As it stands, ProcessInterrupts() sometimes returns instead of
 ereport()ing even if InterruptPending is set, interrupts aren't
 held off, and the code isn't in a critical section. That happens if
 QueryCancelPending (or however that's called) gets reset after a
 SIGINT arrived but before CHECK_FOR_INTERRUPTS() is called. Or at
 least that is how I interpret the comment at the bottom of that
 function...
 
 We could thus easily provide POSTGRES_LIBRARY_INTERRUPT_CHECK() if
 we're content with the (slim, probably) chance of false positives.
 
 Or we'd need to refactor things in a way that allows the hypothetical
 POSTGRES_LIBRARY_INTERRUPT_CHECK() to re-use the tests in
 ProcessInterrupts(), but without modifying any of the flags.
 
 So back to this, shortly after discovering (sorry for ignorance, but I
 just don't care about programming in a black box environment) that windows
 doesn't support posix signals.
 
 If I understood correctly the CHECK_FOR_INTERRUPTS postgresql function
 does also take care of events dispatching within windows, so that if
 nothing calls that macro there's no way that a pqsignal handler would
 be invoked. Is that right ?

Yes. 

 In that case I can understand Tom's advice about providing a callback,
 and then I would only need to perform the events flushing part of
 from within the callback, and only for windows.

Why would you need a signal handler in the library at all, then? Just
test the same flags that CHECK_FOR_INTERRUPTS does in the callback, and
call your interrupt request method if they indicate abort. (Or, slightly
cleaner maybe, allow the callback to abort processing by returning false)

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Interrupting long external library calls

2012-06-07 Thread Sandro Santilli
On Thu, Jun 07, 2012 at 12:00:09PM +0200, Florian Pflug wrote:
 On Jun7, 2012, at 10:20 , Sandro Santilli wrote:

  In that case I can understand Tom's advice about providing a callback,
  and then I would only need to perform the events flushing part of
  from within the callback, and only for windows.
 
 Why would you need a signal handler in the library at all, then? Just
 test the same flags that CHECK_FOR_INTERRUPTS does in the callback, and
 call your interrupt request method if they indicate abort. (Or, slightly
 cleaner maybe, allow the callback to abort processing by returning false)

I'm just afraid that invoking a callback (from a library to user code)
could be significantly slower than simply looking up a variable, especially
if the interrupt checking is performed very frequently. But maybe I'm
being overparanoid.

--strk;

  ,--o-. 
  |   __/  |Delivering high quality PostGIS 2.0 !
  |  / 2.0 |http://strk.keybit.net - http://vizzuality.com
  `-o--'


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Interrupting long external library calls

2012-06-07 Thread Florian Pflug
On Jun7, 2012, at 12:08 , Sandro Santilli wrote:

 On Thu, Jun 07, 2012 at 12:00:09PM +0200, Florian Pflug wrote:
 On Jun7, 2012, at 10:20 , Sandro Santilli wrote:
 
 In that case I can understand Tom's advice about providing a callback,
 and then I would only need to perform the events flushing part of
 from within the callback, and only for windows.
 
 Why would you need a signal handler in the library at all, then? Just
 test the same flags that CHECK_FOR_INTERRUPTS does in the callback, and
 call your interrupt request method if they indicate abort. (Or, slightly
 cleaner maybe, allow the callback to abort processing by returning false)
 
 I'm just afraid that invoking a callback (from a library to user code)
 could be significantly slower than simply lookup a variable, especially
 if the interruption checking is performed very frequently. But maybe I'm
 being overparanoid.

It's definitely slower, probably at least an order of magnitude. But you
don't have to call that callback very often - once every few hundred
milliseconds is already more than plenty. Isn't there a place in the library
where you could put it where it'd be called with roughly that frequency?
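For instance, something along these lines inside the library's long-running
loops would keep the per-iteration overhead down to a counter test
(illustrative only; the names and the threshold are made up):

    /* Returns 0 to request an abort of the current operation. */
    static int (*interrupt_callback) (void) = NULL;

    #define CHECK_EVERY 100000      /* made-up threshold */

    static void
    long_running_operation(long n_iterations)
    {
        long    i;

        for (i = 0; i < n_iterations; i++)
        {
            if (interrupt_callback != NULL &&
                (i % CHECK_EVERY) == 0 &&
                !interrupt_callback())
                break;              /* caller requested abort */

            /* ... expensive per-iteration geometry work ... */
        }
    }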

IMHO the advantage of not messing with signal handling in the library
is more than worth the slight overhead of a callback, but YMMV.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_receivexlog and feedback message

2012-06-07 Thread Fujii Masao
On Thu, Jun 7, 2012 at 6:25 PM, Magnus Hagander mag...@hagander.net wrote:
 On Thursday, June 7, 2012, Fujii Masao wrote:

 On Thu, Jun 7, 2012 at 5:05 AM, Magnus Hagander mag...@hagander.net
 wrote:
  On Wed, Jun 6, 2012 at 8:26 PM, Fujii Masao masao.fu...@gmail.com
  wrote:
  On Tue, Jun 5, 2012 at 11:44 PM, Magnus Hagander mag...@hagander.net
  wrote:
  On Tue, Jun 5, 2012 at 4:42 PM, Fujii Masao masao.fu...@gmail.com
  wrote:
  On Tue, Jun 5, 2012 at 9:53 PM, Magnus Hagander mag...@hagander.net
  wrote:
  Right now, pg_receivexlog sets:
                          replymsg->write = InvalidXLogRecPtr;
                          replymsg->flush = InvalidXLogRecPtr;
                          replymsg->apply = InvalidXLogRecPtr;
 
   when it sends its status updates.
  
   I'm thinking it should set replymsg->write = blockpos instead.
 
  Why? That way you can see in pg_stat_replication what has actually
  been received by pg_receivexlog - not just what we last sent. This
  can
  be useful in combination with an archive_command that can block WAL
  recycling until it has been saved to the standby. And it would be
  useful as a general monitoring thing as well.
 
   I think the original reason was that it shouldn't interfere with
  synchronous replication - but it does take away a fairly useful
  usecase...
 
   I think that not only replymsg->write but also ->flush should be set
  to
  blockpos in pg_receivexlog. Which allows pg_receivexlog to behave
  as synchronous standby, so we can write WAL to both local and remote
  synchronously. I believe there are some use cases for synchronous
  pg_receivexlog.
 
  pg_receivexlog doesn't currently fsync() after every write. It only
   fsync():s complete files. So we'd need to set ->flush only at the end
  of a segment, right?
 
  Yes.
 
  Currently the status update is sent for each status interval. In sync
  replication, transaction has to wait for a while even after
  pg_receivexlog
  has written or flushed the WAL data.
 
  So we should add new option which specifies whether pg_receivexlog
  sends the status packet back as soon as it writes or flushes the WAL
  data, like the walreceiver does?
 
  That might be useful, but I think that's 9.3 material at this point.

 Fair enough. That's new feature rather than a bugfix.

   But I think we can get the change to set the write location in as a bugfix.

 Also set the flush location? Sending the flush location back seems
 helpful when using pg_receivexlog for WAL archiving purpose. By
 seeing the flush location we can ensure that WAL file has been archived
 durably (IOW, WAL file has been flushed in remote archive area).


 You  can do that with the write location as well, as long as you round it
 off to complete segments, can't you?

You mean to prevent pg_receivexlog from sending back the end of WAL file
as the write location *before* it completes the WAL file? If so, yes. But
why do you want to keep the flush location invalid?

Regards,

-- 
Fujii Masao

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_receivexlog and feedback message

2012-06-07 Thread Magnus Hagander
On Thursday, June 7, 2012, Fujii Masao wrote:

 On Thu, Jun 7, 2012 at 6:25 PM, Magnus Hagander mag...@hagander.net
 wrote:
  On Thursday, June 7, 2012, Fujii Masao wrote:
 
  On Thu, Jun 7, 2012 at 5:05 AM, Magnus Hagander mag...@hagander.net
  wrote:
   On Wed, Jun 6, 2012 at 8:26 PM, Fujii Masao masao.fu...@gmail.com
   wrote:
   On Tue, Jun 5, 2012 at 11:44 PM, Magnus Hagander 
 mag...@hagander.net
   wrote:
   On Tue, Jun 5, 2012 at 4:42 PM, Fujii Masao masao.fu...@gmail.com
   wrote:
   On Tue, Jun 5, 2012 at 9:53 PM, Magnus Hagander 
 mag...@hagander.net
   wrote:
   Right now, pg_receivexlog sets:
   replymsg->write = InvalidXLogRecPtr;
   replymsg->flush = InvalidXLogRecPtr;
   replymsg->apply = InvalidXLogRecPtr;
   
    when it sends its status updates.
   
    I'm thinking it should set replymsg->write = blockpos instead.
  
   Why? That way you can see in pg_stat_replication what has actually
   been received by pg_receivexlog - not just what we last sent. This
   can
   be useful in combination with an archive_command that can block
 WAL
   recycling until it has been saved to the standby. And it would be
   useful as a general monitoring thing as well.
  
    I think the original reason was that it shouldn't interfere with
   synchronous replication - but it does take away a fairly useful
   usecase...
  
    I think that not only replymsg->write but also ->flush should be
 set
   to
   blockpos in pg_receivexlog. Which allows pg_receivexlog to behave
   as synchronous standby, so we can write WAL to both local and
 remote
   synchronously. I believe there are some use cases for synchronous
   pg_receivexlog.
  
   pg_receivexlog doesn't currently fsync() after every write. It only
    fsync():s complete files. So we'd need to set ->flush only at the
 end
   of a segment, right?
  
   Yes.
  
   Currently the status update is sent for each status interval. In sync
   replication, transaction has to wait for a while even after
   pg_receivexlog
   has written or flushed the WAL data.
  
   So we should add new option which specifies whether pg_receivexlog
   sends the status packet back as soon as it writes or flushes the WAL
   data, like the walreceiver does?
  
   That might be useful, but I think that's 9.3 material at this point.
 
  Fair enough. That's new feature rather than a bugfix.
 
    But I think we can get the change to set the write location in as a bugfix.
 
  Also set the flush location? Sending the flush location back seems
  helpful when using pg_receivexlog for WAL archiving purpose. By
  seeing the flush location we can ensure that WAL file has been archived
  durably (IOW, WAL file has been flushed in remote archive area).
 
 
  You  can do that with the write location as well, as long as you round it
 You mean to prevent pg_receivexlog from sending back the end of WAL file
 as the write location *before* it completes the WAL file? If so, yes. But
 why do you want to keep the flush location invalid?


No. pg_receivexlog sends back the correct write location. Whoever does the
check (through pg_stat_replication) rounds down, so it only counts it once
pg_receivexlog has acknowledged receiving the whole file.
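In code, the monitoring side would be doing something like this (a sketch;
"reported_write_location" is a stand-in name, and it assumes the flat 64-bit
LSN arithmetic planned for 9.3 -- with today's two-part XLogRecPtr the same
thing just takes a couple more lines):

    /* Sketch: only count WAL as safely archived up to the last *complete*
     * segment boundary at or below the reported write location. */
    XLogRecPtr  write_pos = reported_write_location;    /* from pg_stat_replication */
    XLogRecPtr  safe_up_to = write_pos - (write_pos % XLogSegSize);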

I'm not against doing the flush location as well, I'm just worried about
feature-creep :-) But let's see how big a change that would turn out to
be...

//Magnus



-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] SP-GIST docs and examples

2012-06-07 Thread Robert Haas
On Thu, Jun 7, 2012 at 5:19 AM, Simon Riggs si...@2ndquadrant.com wrote:
 I know Oleg has been discussing quad trees for years now, so SP-GIST
 sounds like a great feature.

 The docs on SP-GIST simply suggest people read the code, which is way
 below our normal standard of documentation.

I completely agree.  The GIN and GIST documentation are pretty
deficient as well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GIST docs and examples

2012-06-07 Thread Thom Brown
On 7 June 2012 12:52, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jun 7, 2012 at 5:19 AM, Simon Riggs si...@2ndquadrant.com wrote:
 I know Oleg has been discussing quad trees for years now, so SP-GIST
 sounds like a great feature.

 The docs on SP-GIST simply suggest people read the code, which is way
 below our normal standard of documentation.

 I completely agree.  The GIN and GIST documentation are pretty
 deficient as well.

Didn't Andrew Gierth volunteer to help rewrite those some time back?
Maybe nudge him about it again.

-- 
Thom

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] creating objects in pg_catalog

2012-06-07 Thread Robert Haas
On Wed, Jun 6, 2012 at 5:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 rhaas=# create table pg_catalog.tom (a int);
 ERROR:  permission denied to create pg_catalog.tom

 The offending error check is in heap_create(), and based on what
 you're saying here it seems like we should just rip it out.

 Hmm.  Yeah, it seems like the regular permissions tests on the schemas
 in question should be enough to keep Joe User from making tables there,
 and I do not see a reason why the backend would care if there are
 non-catalog tables laying about in pg_catalog.

 Checking the commit history, it seems this was originally a test to
 prevent people from creating tables named pg_xxx:

 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=f5a10e722c052006886b678995695001958a#patch3

 which may or may not have been critical once upon a time, but surely is
 not any more.

 So no objection to removing that particular test.

Patch attached.  Barring objections, I'll apply this to 9.3 once we branch.
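For the archives, the check being ripped out is approximately this
(paraphrased from memory; see heap_create() in src/backend/catalog/heap.c
for the exact text):

    if (!allow_system_table_mods &&
        (IsSystemNamespace(relnamespace) || IsToastNamespace(relnamespace)) &&
        IsNormalProcessingMode())
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("permission denied to create \"%s.%s\"",
                        get_namespace_name(relnamespace), relname),
                 errdetail("System catalog modifications are currently disallowed.")));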

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


remove-heap-create-check.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GIST docs and examples

2012-06-07 Thread Oleg Bartunov

On Thu, 7 Jun 2012, Thom Brown wrote:


On 7 June 2012 12:52, Robert Haas robertmh...@gmail.com wrote:

On Thu, Jun 7, 2012 at 5:19 AM, Simon Riggs si...@2ndquadrant.com wrote:

I know Oleg has been discussing quad trees for years now, so SP-GIST
sounds like a great feature.

The docs on SP-GIST simply suggest people read the code, which is way
below our normal standard of documentation.


I completely agree.  The GIN and GIST documentation are pretty
deficient as well.


Didn't Andrew Gierth volunteer to help rewrite those some time back?
Maybe nudge him about it again.


Oh, please !


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas

2012-06-07 Thread Sergey Koposov

On Tue, 5 Jun 2012, Simon Riggs wrote:


Sounds less good and we'd need reasonable proof it actually did
anything useful without being dangerous.


Doing an initial unlocked test speeds things up another 2.69 fold (on
top of 3.55 for your patch) for me, with 1GB of shared buffers.  That
seems like it should be worthwhile.

How do we go about getting reasonable proof that it is safe?


That's enough for me.


So is it planned to apply that patch for 9.2 ?

Thanks,
S

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Early hint bit setting

2012-06-07 Thread Merlin Moncure
On Wed, Jun 6, 2012 at 5:41 PM, Jim Nasby j...@nasby.net wrote:
 On 5/30/12 4:42 PM, Ants Aasma wrote:

 I was thinking about what is the earliest time where we could set hint
 bits. This would be just after the commit has been made visible.


 Except that's only true when there are no other transactions running. That's
 been one of the big sticking points about trying to proactively set hint
 bits; in a real system you're not going to gain very much unless you wait a
 while before setting them.

Are you sure?  The relevant code to set hint bits during a tuple scan
looks like this:

    else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple)))
    {
        if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
            return false;       /* inserted after scan started */

        if (tuple->t_infomask & HEAP_XMAX_INVALID)      /* xid invalid */
            return true;

        if (tuple->t_infomask & HEAP_IS_LOCKED)         /* not deleter */
            return true;

        Assert(!(tuple->t_infomask & HEAP_XMAX_IS_MULTI));

        if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmax(tuple)))
        {
            /* deleting subtransaction must have aborted */
            SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
                        InvalidTransactionId);
            return true;
        }

        if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
            return true;        /* deleted after scan started */
        else
            return false;       /* deleted before scan started */
    }
    else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple)))
        return false;
    else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
        SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                    HeapTupleHeaderGetXmin(tuple));
    else
    {
        /* it must have aborted or crashed */
        SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                    InvalidTransactionId);
        return false;
    }

The backend that commits the transaction knows that the transaction is
committed and that it's not in progress (at least from its own point of
view).  Why do you have to wait for other transactions in progress to
finish?  Setting the xmin-committed hint bit doesn't keep you from
checking the xmax-based rules.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] page is not marked all-visible warning in regression tests

2012-06-07 Thread Robert Haas
On Wed, Jun 6, 2012 at 3:07 PM, Andres Freund and...@2ndquadrant.com wrote:
 Hm. For a short while I thought there would be an issue with heap_delete and
 IOS because the deleting transaction can commit without any barriers happening
 on the IOS side. But that only seems to be possible with non MVCC snapshots
 which are currently not allowed with index only scans.

Well, one, commits are irrelevant; the page ceases to be all-visible
as soon as the delete happens.  And two, you can't do a delete or a
commit without a memory barrier - every LWLockAcquire() or
LWLockRelease() is a full barrier, so any operation that requires a
buffer content lock is both preceded and followed by a full barrier.

Proposed patch attached.  This adds some more comments in various
places, and implements your suggestion of retesting the visibility-map
bit when we detect a possible mismatch with the page-level bit.
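The retest itself looks roughly like this (a sketch of the vacuumlazy.c hunk,
not the final wording):

    /*
     * Sketch: if the heap page is not marked all-visible but the VM bit
     * appeared to be set, re-check the visibility map before warning, in
     * case the bit was concurrently cleared after we last looked at it.
     */
    else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
             visibilitymap_test(onerel, blkno, &vmbuffer))
    {
        elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
             relname, blkno);
        visibilitymap_clear(onerel, blkno, vmbuffer);
    }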

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


vm-test-cleanup.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] page is not marked all-visible warning in regression tests

2012-06-07 Thread Andres Freund
On Thursday, June 07, 2012 03:20:50 PM Robert Haas wrote:
 On Wed, Jun 6, 2012 at 3:07 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  Hm. For a short while I thought there would be an issue with heap_delete
  and IOS because the deleting transaction can commit without any barriers
  happening on the IOS side. But that only seems to be possible with non
  MVCC snapshots which are currently not allowed with index only scans.
 Well, one, commits are irrelevant; the page ceases to be all-visible
 as soon as the delete happens.
It's not irrelevant for the code as it stands if non-MVCC snapshots were 
allowed. Unless I miss something, even disregarding memory-ordering issues, 
there is no interlocking providing protection against the index scan doing the 
visibilitymap_test() and some concurrent backend doing a heap_delete + commit 
directly after that. If a SnapshotNow were used, this could result in different 
results between an IOS and a normal indexscan, because the normal indexscan 
would lock the heap page before doing the visibility testing and thus would 
see the new visibility information. But then a SnapshotNow wouldn't be used if 
that were a problem...
For normal snapshots it's not a problem, because with regard to the snapshot 
everything on the page is still visible: I think that can only happen for 
deletions and HOT updates, because the index page would need to get locked 
otherwise. Deletions aren't visible yet, and HOT is irrelevant for the IOS by 
definition.

 And two, you can't do a delete or a commit without a memory barrier - every
 LWLockAcquire() or LWLockRelease() is a full barrier, so any operation that
 requires a buffer content lock is both preceded and followed by a full
 barrier.
The memory barrier on the deleting side is irrelevant if there is no memory 
barrier on the reading side.

 Proposed patch attached.  This adds some more comments in various
 places, and implements your suggestion of retesting the visibility-map
 bit when we detect a possible mismatch with the page-level bit.
Thanks, will look at it in a bit.

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] XLog changes for 9.3

2012-06-07 Thread Heikki Linnakangas
When I worked on the XLogInsert scaling patch, it became apparent that 
some changes to the WAL format would make it a lot easier. So for 9.3, 
I'd like to do some refactoring:


1. Use a 64-bit integer instead of the two-variable log/seg 
representation, for identifying a WAL segment. This has no user-visible 
effect, but makes the code a bit simpler.


2. Don't waste the last WAL segment in each logical 4GB file. Currently, 
we skip the WAL segment ending in "FF". The comments claim that 
wasting the last segment ensures that we don't have problems 
representing last-byte-position-plus-1, but in my experience, it just 
makes things more complicated. You have two ways to represent the 
segment boundary, and some functions are picky about which one is used. For 
example, XLogWrite() assumes that when you want to flush to the end of a 
logical log file, you use the 5/FF000000 representation, not 
6/00000000. Other functions, like XLogPageRead(), expect the latter.


This is a backwards-incompatible change for external utilities that know 
how the WAL segment numbering works. Hopefully there aren't too many of 
those around.


3. Move the only field, xl_rem_len, from the continuation record header 
straight to the xlog page header, eliminating XLogContRecord altogether. 
This makes it easier to calculate in advance how much space a WAL record 
requires, as it no longer depends on how many pages it has to be split 
across. This wastes 4-8 bytes on every xlog page, but that's not much.


4. Allow WAL record header to be split across page boundaries. 
Currently, if there are less than SizeOfXLogRecord bytes left on the 
current WAL page, it is wasted, and the next record is inserted at the 
beginning of the next page. The problem with that is again that it makes 
it impossible to know in advance exactly how much space a WAL record 
requires, because it depends on how many bytes need to be wasted at the 
end of current page.


These changes will help the XLogInsert scaling patch, by making the 
space calculations simpler. In essence, to reserve space for a WAL 
record of size X, you just need to do "bytepos += X".  There are a lot 
more details to that, like mapping from the contiguous byte position 
to an XLogRecPtr that takes page headers into account, and noticing 
RedoRecPtr changes safely, but it's a start.
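For example, the segment numbering from point 1 and the byte-position-to-LSN 
mapping could look roughly like this. This is only a sketch: the names are 
to be decided, it assumes XLogRecPtr itself also becomes a plain 64-bit 
integer, and it ignores the long page headers at segment boundaries, which 
need extra handling:

    /* Sketch: a single 64-bit segment number replaces the (log, seg) pair. */
    typedef uint64 XLogSegNo;

    #define XLByteToSeg(xlrp, segno) \
        ((segno) = (xlrp) / XLogSegSize)

    /* Map a contiguous "usable byte position", which skips page headers
     * entirely, back to an XLogRecPtr.  Assumes a flat 64-bit LSN. */
    #define UsableBytesInPage   (XLOG_BLCKSZ - SizeOfXLogShortPHD)

    static XLogRecPtr
    XLogBytePosToRecPtr(uint64 bytepos)
    {
        uint64      fullpages = bytepos / UsableBytesInPage;
        uint64      offset = bytepos % UsableBytesInPage;

        return (XLogRecPtr) (fullpages * XLOG_BLCKSZ + SizeOfXLogShortPHD + offset);
    }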


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas

2012-06-07 Thread Tom Lane
Sergey Koposov kopo...@ast.cam.ac.uk writes:
 On Tue, 5 Jun 2012, Simon Riggs wrote:
 Sounds less good and we'd need reasonable proof it actually did
 anything useful without being dangerous.

 Doing an initial unlocked test speeds things up another 2.69 fold (on
 top of 3.55 for your patch) for me, with 1GB of shared buffers.  That
 seems like it should be worthwhile.

 How do we go about getting reasonable proof that it is safe?

 That's enough for me.

Say what?  That's a performance result and proves not a damn thing about
safety.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding adjacent checkpoint records

2012-06-07 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:

 there is no guarantee that we'll manage to reach a database state
 that is consistent with data already flushed out to disk during
 the last checkpoint.
 
Robert Haas robertmh...@gmail.com wrote: 
 
 I know of real customers who would have suffered real data loss
 had this code been present in the server version they were using. 
 Checkpoints are the *only* mechanism by which SLRU pages get
 flushed to disk on a mostly-idle system. That means if something
 happens to your pg_xlog directory, and you haven't had a
 checkpoint, you're screwed. 
 
Simon Riggs si...@2ndquadrant.com wrote: 
 
 How come this is a valid discussion? Why does making changes here
 make sense when other changes are said to destabilise the code and
 delay release?
 
I think the standard has been pretty clear -- at this point in a
release we fix bugs and regressions from previous releases.  The bar
for introducing anything which doesn't qualify under either of those
is very high in terms of being obviously useful and *very* low risk.
 
 Should I revisit all the things others have done that I disagree
 with as well?
 
If you can spot any serious bugs, I would sure appreciate it if you
point them out while we're still in beta.  I think we all would.
 
 I'll look some more at this, but you should consider why this
 thread even exists.
 
Because the beta release contains a new data loss bug which needs to
be fixed.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding adjacent checkpoint records

2012-06-07 Thread Robert Haas
On Wed, Jun 6, 2012 at 6:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wrote:
 Actually, it looks like there is an extremely simple way to handle this,
 which is to move the call of LogStandbySnapshot (which generates the WAL
 record in question) to before the checkpoint's REDO pointer is set, but
 after we have decided that we need a checkpoint.

 On further contemplation, there is a downside to that idea, which
 probably explains why the code was written as it was: if we place the
 XLOG_RUNNING_XACTS WAL record emitted during a checkpoint before rather
 than after the checkpoint's REDO point, then a hot standby slave
 starting up from that checkpoint won't process the XLOG_RUNNING_XACTS
 record.  That means its KnownAssignedXids machinery won't be fully
 operational until the master starts another checkpoint, which might be
 awhile.  So this could result in undesirable delay in hot standby mode
 becoming active.

 I am not sure how significant this really is though.  Comments?

I suspect that's pretty significant.

 If we don't like that, I can think of a couple of other ways to get there,
 but they have their own downsides:

 * Instead of trying to detect after-the-fact whether any concurrent
 WAL activity happened during the last checkpoint, we could detect it
 during the checkpoint and then keep the info in a static variable in
 the checkpointer process until next time.  However, I don't see any
 bulletproof way to do this without adding at least one or two lines
 of code within XLogInsert, which I'm sure Robert will complain about.

My main concern here is to avoid doing anything that will make things
harder for Heikki's WAL insert scaling patch, which I'm hoping will
get done for 9.3.

What do you have in mind, exactly?  I feel like ProcLastRecPtr might
be enough information.  After logging running xacts, we can check
whether ProcLastRecPtr is equal to the redo pointer.  If so, then
nothing got written to WAL between the time we began the checkpoint
and the time we wrote that record.  If, through further, similar
gyrations, we can then work out whether the checkpoint record
immediately follows the running-xacts record, we're there.  That's
pretty ugly, I guess, but it seems possible.
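In other words, something like this in CreateCheckPoint() (just a sketch;
the flag name is made up):

    /* ... LogStandbySnapshot() has just written the running-xacts record,
     * so ProcLastRecPtr now points at that record.  If it equals the REDO
     * pointer we just set, nothing else was inserted into WAL in between. */
    if (XLByteEQ(ProcLastRecPtr, checkPoint.redo))
        only_running_xacts_since_redo = true;   /* made-up flag */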

Alternatively, we could keep a flag in XLogCtl-Insert indicating
whether anything that requires a new checkpoint has happened since the
last checkpoint.  This could be set inside the existing block that
tests whether RedoRecPtr is out of date, so any given backend would
only do it once per checkpoint cycle.  We'd have a hard-coded special
case that would skip setting the flag for a running-xacts record.  I
kind of hate to shove even that much extra code in there, but as long
as it doesn't mess up what Heikki has in mind maybe it'd be OK...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Could we replace SysV semaphores with latches?

2012-06-07 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 07.06.2012 07:09, Tom Lane wrote:
 It strikes me that we have recently put together an independent but just
 about equivalent waiting mechanism in the form of latches.  And not only
 that, but there's already a latch for each process.  Could we replace
 our usages of SysV semaphores with WaitLatch on the procLatch?

 Would have to performance test that carefully. We use semaphores in 
 lwlocks, so it's performance critical. A latch might well be slower, 
 especially on platforms where a signal does not interrupt sleep, and we 
 rely on the signal handler and self-pipe to wake up.

By the time you've reached the point where you conclude you have to
block, all hope of high performance has gone out the window anyway,
so I can't get terribly excited about that.  But in general, yeah,
our current implementation of latches could very possibly use some
more work to improve performance.  I think we've latch-ified enough
code to make that worth doing already.

 Although perhaps we could improve the latch implementation. pselect() 
 might perform better than the self-pipe trick, on platforms where it works.

AFAIK pselect does not fix the basic race condition: what if somebody
else does SetLatch just before you reach the blocking kernel call?
You still end up needing a self-pipe.

I would be more inclined to look into OS-specific primitives such as
futexes on Linux.  (No idea whether those actually would be suitable,
just pointing out that they exist.)  Our semaphore-based API was always
both overcomplicated and underspecified, but I think we have got latch
semantics nailed down well enough that implementations built on
OS-specific primitives could be a reasonable thing.
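(For illustration only, the raw Linux primitive looks about like this;
whether it can be made to fit the latch semantics, portably enough to be
worth carrying, is exactly the open question.)

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* FUTEX_WAIT sleeps only if *addr still equals "expected", so the
     * check-then-sleep race that plagues pselect() doesn't arise: the
     * kernel does the comparison atomically. */
    static void
    futex_wait(volatile int *addr, int expected)
    {
        syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
    }

    static void
    futex_wake(volatile int *addr)
    {
        syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
    }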

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Andres Freund
On Thursday, June 07, 2012 03:50:35 PM Heikki Linnakangas wrote:
 When I worked on the XLogInsert scaling patch, it became apparent that
 some changes to the WAL format would make it a lot easier. So for 9.3,
 I'd like to do some refactoring:

 1. Use a 64-bit integer instead of the two-variable log/seg
 representation, for identifying a WAL segment. This has no user-visible
 effect, but makes the code a bit simpler.
+1

We can define a sensible InvalidXLogRecPtr instead of doing that locally in 
loads of places! Yipee.

 2. Don't waste the last WAL segment in each logical 4GB file. Currently,
 we skip the WAL segment ending with FF. The comments claim that
 wasting the last segment ensures that we don't have problems
 representing last-byte-position-plus-1, but in my experience, it just
 makes things more complicated. You have two ways to represent the
 segment boundary, and some functions are picky on which one is used. For
 example, XLogWrite() assumes that when you want to flush to the end of a
 logical log file, you use the 5/FF00 representation, not
 6/. Other functions, like XLogPageRead(), expect the latter.
 
 This is a backwards-incompatible change for external utilities that know
 how the WAL segment numbering works. Hopefully there aren't too many of
 those around.
+1

 3. Move the only field, xl_rem_len, from the continuation record header
 straight to the xlog page header, eliminating XLogContRecord altogether.
 This makes it easier to calculate in advance how much space a WAL record
 requires, as it no longer depends on how many pages it has to be split
 across. This wastes 4-8 bytes on every xlog page, but that's not much.
+1. I don't think this will waste a measurable amount in real-world 
scenarios. A very big percentage of pages have continuation records.

 4. Allow WAL record header to be split across page boundaries.
 Currently, if there are less than SizeOfXLogRecord bytes left on the
 current WAL page, it is wasted, and the next record is inserted at the
 beginning of the next page. The problem with that is again that it makes
 it impossible to know in advance exactly how much space a WAL record
 requires, because it depends on how many bytes need to be wasted at the
 end of current page.
+0.5. It's somewhat convenient to be able to look at a record before you have 
reassembled it over multiple pages. But it's probably not worth the 
implementation complexity.
If we do that we can remove all the alignment padding as well. Which would be a 
problem for you anyway, wouldn't it?

 These changes will help the XLogInsert scaling patch, by making the
 space calculations simpler. In essence, to reserve space for a WAL
 record of size X, you just need to do bytepos += X.  There's a lot
 more details with that, like mapping from the contiguous byte position
 to an XLogRecPtr that takes page headers into account, and noticing
 RedoRecPtr changes safely, but it's a start.
Hm. Wouldn't you need to remove short/long page headers for that as well? 


Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] page is not marked all-visible warning in regression tests

2012-06-07 Thread Robert Haas
On Thu, Jun 7, 2012 at 9:41 AM, Andres Freund and...@2ndquadrant.com wrote:
 Well, one, commits are irrelevant; the page ceases to be all-visible
 as soon as the delete happens.
 Its not irrelevant for the code as it stands if non-mvcc snapshots were
 allowed. Unless I miss something, even disregarding memory ordering issues,
 there is no interlocking providing protection against the index doing the
 visibilitymap_test() and some concurrent backend doing a heap_delete + commit
 directly after that.

Hmm.  Well, the backend that did the heap_delete would clear the
visibility map bit when it did the delete, before releasing the lock
on the heap buffer.  So normally we'll see that buffer as
not-all-visible as soon as the delete is performed, even if no commit
has happened yet.  But I guess there could be a memory-ordering issue
there for a SnapshotNow scan, because as you say there's no barrier on
the read side in that case.  The delete + commit would have to happen
after we finish fetching the index TID but before we test the
visibility map bit.  So I suppose that if we wanted to support
index-only scans  under SnapshotNow maybe we'd want to lock and unlock
the visibility map page when testing each bit.  But that almost seems
like overkill anyway: for it to matter, someone would have to be
relying on starting a scan, deleting a tuple in another transaction
while the scan was in progress, and having the scan see the delete
when it reached the deleted tuple.  But of course if the scan had take
a microsecond longer to reach the deleted tuple it wouldn't have seen
it as deleted anyway, so it's a bit hard to see how any such code
could be race-free anyhow.

 Proposed patch attached.  This adds some more comments in various
 places, and implements your suggestion of retesting the visibility-map
 bit when we detect a possible mismatch with the page-level bit.
 Thanks, will look at it in a bit.

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 When I worked on the XLogInsert scaling patch, it became apparent that 
 some changes to the WAL format would make it a lot easier. So for 9.3, 
 I'd like to do some refactoring:

 1. Use a 64-bit integer instead of the two-variable log/seg 
 representation, for identifying a WAL segment. This has no user-visible 
 effect, but makes the code a bit simpler.

 2. Don't waste the last WAL segment in each logical 4GB file. Currently, 
 we skip the WAL segment ending with FF. The comments claim that 
 wasting the last segment ensures that we don't have problems 
 representing last-byte-position-plus-1, but in my experience, it just 
 makes things more complicated.

I think that's actually an indivisible part of point #1.  The issue in
the 32+32 representation is that you'd overflow the low-order half when
trying to represent last-byte-of-file-plus-1, and have to do something
with propagating that to the high half.  In a 64-bit continuous
addressing scheme the problem goes away, and it would just get more
complicated not less to preserve the hole.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] page is not marked all-visible warning in regression tests

2012-06-07 Thread Andres Freund
On Thursday, June 07, 2012 04:27:32 PM Robert Haas wrote:
 On Thu, Jun 7, 2012 at 9:41 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  Proposed patch attached.  This adds some more comments in various
  places, and implements your suggestion of retesting the visibility-map
  bit when we detect a possible mismatch with the page-level bit.
  
  Thanks, will look at it in a bit.
I wonder if
		/* mark page all-visible, if appropriate */
		if (all_visible && !PageIsAllVisible(page))
		{
			PageSetAllVisible(page);
			MarkBufferDirty(buf);
			visibilitymap_set(onerel, blkno, InvalidXLogRecPtr, vmbuffer,
							  visibility_cutoff_xid);
		}
shouldn't test
		if (all_visible &&
			(!PageIsAllVisible(page) || !all_visible_according_to_vm))
instead.

Commentwise I am not totally content with the emphasis on memory ordering 
because some of the stuff is more locking than memory ordering. Except that I 
think it's a pretty clear improvement. I can reformulate the places where I 
find that relevant but I have the feeling that wouldn't help the legibility.
It's the big comment in vacuumlazy.c, the comment in nodeIndexonlyscan.c and the 
one in the header of visibilitymap_test. Should be 
s/memory-ordering/concurrency/ except in nodeIndexonlyscan.c.

The visibilitymap_clear/PageClearAllVisible in heap_multi_insert should be 
moved into the critical section, shouldn't it?

Regards,

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Could we replace SysV semaphores with latches?

2012-06-07 Thread Robert Haas
On Thu, Jun 7, 2012 at 10:13 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I would be more inclined to look into OS-specific primitives such as
 futexes on Linux.  (No idea whether those actually would be suitable,
 just pointing out that they exist.)  Our semaphore-based API was always
 both overcomplicated and underspecified, but I think we have got latch
 semantics nailed down well enough that implementations built on
 OS-specific primitives could be a reasonable thing.

I've been thinking about trying to futex-ify our spinlock
implementation, so that when we detect that the spinlock is contended
(or contended sufficiently badly?) we go into a long kernel sleep
(e.g. 10 s) and wait to be woken up.  This might perform better than
our current implementation in cases where the spinlock is badly
contended, since it would avoid yanking the cache line around between
all the CPUs on the box.  But I haven't yet, because (1) it'll only
work on Linux and (2) it's better to fix the problem that is causing
the contention rather than make the contention less expensive.  Still,
it might be worth looking into.
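
For illustration, here is a minimal, Linux-only sketch of the sort of
futex-backed lock being described (my own sketch, not PostgreSQL code; it
follows the well-known three-state futex mutex recipe: 0 = free, 1 = locked,
2 = locked with waiters).  The uncontended paths stay entirely in user space;
only contended acquirers park in the kernel, so the cache line stops bouncing
while they wait:

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long
futex_call(atomic_int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void
futex_lock(atomic_int *lock)
{
	int		c = 0;

	/* fast path: grab the lock if it is free (0 -> 1) */
	if (atomic_compare_exchange_strong(lock, &c, 1))
		return;

	/* slow path: advertise a waiter (state 2), then sleep in the kernel */
	if (c != 2)
		c = atomic_exchange(lock, 2);
	while (c != 0)
	{
		/* sleeps only while the lock word is still 2 */
		futex_call(lock, FUTEX_WAIT, 2);
		c = atomic_exchange(lock, 2);
	}
}

static void
futex_unlock(atomic_int *lock)
{
	/* release; if someone may be parked (state was 2), wake one waiter */
	if (atomic_exchange(lock, 0) == 2)
		futex_call(lock, FUTEX_WAKE, 1);
}

A real adaptation would presumably keep a short user-space spin before falling
back to FUTEX_WAIT, and would need a fallback path on platforms without
futexes.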

I'm not sure whether there's a sensible way to use this for LWLocks
directly.  It would be nice not to be doing the lock-within-a-lock
thing, but I don't know that I really want to maintain a completely
separate LWLock implementation just for Linux, and it's not obvious
how you're supposed to use a futex to implement a lock with more than
one lock mode.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Heikki Linnakangas

On 07.06.2012 17:18, Andres Freund wrote:

On Thursday, June 07, 2012 03:50:35 PM Heikki Linnakangas wrote:

3. Move the only field, xl_rem_len, from the continuation record header
straight to the xlog page header, eliminating XLogContRecord altogether.
This makes it easier to calculate in advance how much space a WAL record
requires, as it no longer depends on how many pages it has to be split
across. This wastes 4-8 bytes on every xlog page, but that's not much.

+1. I don't think this will waste a measurable amount in real-world
scenarios. A very big percentage of pages have continuation records.


Yeah, although the way I'm planning to do it, you'll waste 4 bytes (on 
64-bit architectures) even when there is a continuation record, because 
of alignment:


typedef struct XLogPageHeaderData
{
uint16  xlp_magic; /* magic value for correctness checks */
uint16  xlp_info;  /* flag bits, see below */
TimeLineID  xlp_tli;   /* TimeLineID of first record on page */
XLogRecPtr  xlp_pageaddr;  /* XLOG address of this page */

+   uint32  xlp_rem_len;   /* bytes remaining of continued record */
 } XLogPageHeaderData;

The page header is currently 16 bytes in length, so adding a 4-byte 
field to it bumps the aligned size to 24 bytes. Nevertheless, I think we 
can well live with that.



4. Allow WAL record header to be split across page boundaries.
Currently, if there are less than SizeOfXLogRecord bytes left on the
current WAL page, it is wasted, and the next record is inserted at the
beginning of the next page. The problem with that is again that it makes
it impossible to know in advance exactly how much space a WAL record
requires, because it depends on how many bytes need to be wasted at the
end of current page.

+0.5. It's somewhat convenient to be able to look at a record before you have
reassembled it over multiple pages. But it's probably not worth the
implementation complexity.


Looking at the code, I think it'll be about the same complexity for 
XLogInsert in its current form (it will help the patch I'm working on), 
and makes ReadRecord() a bit more complicated. But not much.



If we do that we can remove all the alignment padding as well. Which would be a
problem for you anyway, wouldn't it?


It's not a problem. You just MAXALIGN the size of the record when you 
calculate how much space it needs, and then all records become naturally 
MAXALIGNed. We could quite easily remove the alignment on-disk if we 
wanted to, ReadRecord() already always copies the record to an aligned 
buffer, but I wasn't planning to do that.



These changes will help the XLogInsert scaling patch, by making the
space calculations simpler. In essence, to reserve space for a WAL
record of size X, you just need to do bytepos += X.  There's a lot
more details with that, like mapping from the contiguous byte position
to an XLogRecPtr that takes page headers into account, and noticing
RedoRecPtr changes safely, but it's a start.

Hm. Wouldn't you need to remove short/long page headers for that as well?


No, those are ok because they're predictable. Although it would make the 
mapping simpler. To convert from a contiguous xlog byte position that 
excludes all headers, to XLogRecPtr, you need to do something like this 
(I just made this up, probably has bugs, but it's about this complex):


#define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD)
#define UsableBytesInSegment ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * 
UsableBytesInPage - (SizeOfXLogLongPHD - SizeOfXLogShortPHD))


uint64 xlogrecptr;
uint64 full_segments = bytepos / UsableBytesInSegment;
int offset_in_segment = bytepos % UsableBytesInSegment;

xlogrecptr = full_segments * XLOG_SEG_SIZE;
/* is it on the first page? */
if (offset_in_segment < XLOG_BLCKSZ - SizeOfXLogLongPHD)
   xlogrecptr += SizeOfXLogLongPHD + offset_in_segment;
else
{
   /* first page is fully used */
   xlogrecptr += XLOG_BLCKSZ;
   /* add other full pages */
   offset_in_segment -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
   xlogrecptr += (offset_in_segment / UsableBytesInPage) * XLOG_BLCKSZ;
   /* and finally offset within the last page */
   xlogrecptr += offset_in_segment % UsableBytesInPage;
}
/* finally convert the 64-bit xlogrecptr to a XLogRecPtr struct */
XLogRecPtr.xlogid = xlogrecptr >> 32;
XLogRecPtr.xrecoff = xlogrecptr & 0xFFFFFFFF;

Encapsulated in a function, that's not too bad. But if we want to make 
that simpler, one idea would be to allocate the whole 1st page in each 
WAL segment for metadata. That way all the actual xlog pages would hold 
the same amount of xlog data.
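
As a rough illustration of how much that buys (my sketch, not anything from
the patch; the adjusted macro definitions are assumptions for the sake of the
example): with the first page of every segment reserved for metadata, all
remaining pages carry the same usable payload, and the conversion collapses
to a division and a modulo:

#define UsableBytesInPage	(XLOG_BLCKSZ - SizeOfXLogShortPHD)
#define UsableBytesInSegment	((XLOG_SEG_SIZE / XLOG_BLCKSZ - 1) * UsableBytesInPage)

uint64 xlogrecptr;
uint64 full_segments = bytepos / UsableBytesInSegment;
uint64 offset_in_segment = bytepos % UsableBytesInSegment;

xlogrecptr = full_segments * XLOG_SEG_SIZE
	+ XLOG_BLCKSZ		/* skip the metadata-only first page */
	+ (offset_in_segment / UsableBytesInPage) * XLOG_BLCKSZ
	+ SizeOfXLogShortPHD
	+ offset_in_segment % UsableBytesInPage;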


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Ability to listen on two unix sockets

2012-06-07 Thread Honza Horak

On 06/06/2012 04:50 PM, Andres Freund wrote:

On Wednesday, June 06, 2012 04:38:42 PM Tom Lane wrote:

You might think we should design this exactly like the TCP-socket
multiple-listen-addresses case, ie just have a config variable
containing a list of directory names.  The sticking point there is
that the directories aren't really interchangeable.  In particular,
there is still going to be one directory that is the one hard-wired
into libpq.

I wonder if the whole issue doesn't require libpq to also try multiple
hardcoded socket locations.


I guess so. Let's say we add additional socket support and some server 
uses one e.g. at /var/run/postgresql. Then a client can either (1) 
specify the same path explicitly, and then we don't need to specify any 
additional sockets on the client side, or (2) stick to the default path, 
which is currently hard-coded to /tmp.


Going back to the original problem (inaccessible /tmp directory), it is 
case (2) -- a client uses the default path. So any additional 
client-side socket option probably won't help here; we would 
probably need a second hard-coded path, e.g. at /var/run/postgresql.


Regards,
Honza


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding adjacent checkpoint records

2012-06-07 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Jun 6, 2012 at 6:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 If we don't like that, I can think of a couple of other ways to get there,
 but they have their own downsides:
 
 * Instead of trying to detect after-the-fact whether any concurrent
 WAL activity happened during the last checkpoint, we could detect it
 during the checkpoint and then keep the info in a static variable in
 the checkpointer process until next time.  However, I don't see any
 bulletproof way to do this without adding at least one or two lines
 of code within XLogInsert, which I'm sure Robert will complain about.

 My main concern here is to avoid doing anything that will make things
 harder for Heikki's WAL insert scaling patch, which I'm hoping will
 get done for 9.3.

Yeah, I'm not very happy either with adding new requirements to
XLogInsert, even if it is just tracking one more bit of information.

 What do you have in mind, exactly?  I feel like ProcLastRecPtr might
 be enough information.  After logging running xacts, we can check
 whether ProcLastRecPtr is equal to the redo pointer.  If so, then
 nothing got written to WAL between the time we began the checkpoint
 and the time we wrote that record.  If, through further, similar
 gyrations, we can then work out whether the checkpoint record
 immediately follows the running-xacts record, we're there.  That's
 pretty ugly, I guess, but it seems possible.

It's fairly messy because of the issues around holes being left at
page ends etc --- so the fact that ProcLastRecPtr is different from the
previous insert pointer doesn't immediately prove whether something else
got inserted first, or we just had to leave some dead space.  Heikki
mentioned this morning that he'd like to remove some of those rules in
9.3, but that doesn't help us today.  Note also that a closer look at
LogStandbySnapshot shows it emits more than one WAL record, meaning
we'd have to do this dance more than once, and the changes to do that
would be pretty deadly to the modularity of the functions
LogStandbySnapshot calls.

The conclusion I'd come to yesterday was that we'd want XLogInsert to
do something like
if (Insert->PrevRecord is different from ProcLastRecPtr)
SomebodyElseWroteWAL = true;
where SomebodyElseWroteWAL is a process-local boolean that we reset
at the start of a checkpoint sequence, and then check after we've
written out the LogStandbySnapshot records and the checkpoint record.
(We'd also have to hack ProcLastRecPtr by initializing it to
Insert->PrevRecord at the time we reset SomebodyElseWroteWAL, which is
sort of ugly in that it messes up the relatively clean definition of
that variable.)  So that's not exactly a lot of new code in the critical
section, but it's still new code.

In the end I think I like the last idea I mentioned (keeping two
different REDO values during a checkpoint) the best.  It's a bit
complicated but the grottiness is localized in CreateCheckpoint.
Or at least I think it will be, without having written a patch yet.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Simon Riggs
On 7 June 2012 14:50, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 These changes will help the XLogInsert scaling patch

...and as I'm sure you're aware will junk much of the replication code
and almost certainly set back the other work that we have brewing for
9.3. So this is a very large curve ball you're throwing there.

Personally, I don't think we should do this until we have a better
regression test suite around replication and recovery because the
impact will be huge but I welcome the suggested changes themselves.

If you are going to do this in 9.3, then it has to be early in the
first Commit Fest and you'll need to be around to quickly follow
through on all of the other subsequent breakages it will cause,
otherwise every other piece of work in this area will be halted or
delayed.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Ability to listen on two unix sockets

2012-06-07 Thread Tom Lane
Honza Horak hho...@redhat.com writes:
 On 06/06/2012 04:50 PM, Andres Freund wrote:
 I wonder if the whole issue doesn't require libpq to also try multiple
 hardcoded socket locations.

 I guess so.

I don't really want to go there.  Some use cases have been shown in
this thread for having a server listen in multiple places, but that does
not translate to saying that clients need to support automatically
looking in multiple places.  I think that mainly introduces questions we
could do without, like which server did you actually end up contacting.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Andres Freund
Hi,

On Thursday, June 07, 2012 05:51:07 PM Simon Riggs wrote:
 On 7 June 2012 14:50, Heikki Linnakangas
 
 heikki.linnakan...@enterprisedb.com wrote:
  These changes will help the XLogInsert scaling patch
 
 ...and as I'm sure you're aware will junk much of the replication code
 and almost certainly set back the other work that we have brewing for
 9.3. So this is a very large curve ball you're throwing there.
It's not that bad. Most of that code is pretty abstracted, the changes to 
adapt to that should be less than 20 lines. And it would remove some of the 
complexity.

 Personally, I don't think we should do this until we have a better
 regression test suite around replication and recovery because the
 impact will be huge but I welcome the suggested changes themselves.
Hm. One could regard the logical rep stuff as a testsuite ;)

 If you are going to do this in 9.3, then it has to be early in the
 first Commit Fest and you'll need to be around to quickly follow
 through on all of the other subsequent breakages it will cause,
 otherwise every other piece of work in this area will be halted or
 delayed.
Yea, I would definitely welcome an early patch.

Greetings,

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Ability to listen on two unix sockets

2012-06-07 Thread Andres Freund
On Thursday, June 07, 2012 05:55:11 PM Tom Lane wrote:
 Honza Horak hho...@redhat.com writes:
  On 06/06/2012 04:50 PM, Andres Freund wrote:
  I wonder if the whole issue doesn't require libpq to also try multiple
  hardcoded socket locations.
  
  I guess so.
 
 I don't really want to go there.  Some use cases have been shown in
 this thread for having a server listen in multiple places, but that does
 not translate to saying that clients need to support automatically
 looking in multiple places.  I think that mainly introduces questions we
 could do without, like which server did you actually end up contacting.
It would be really nice to have a development psql connect to a distro 
installed psql and vice versa without having to specify -h /var/run/psql and
-h /tmp all the time...

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding adjacent checkpoint records

2012-06-07 Thread Simon Riggs
On 7 June 2012 14:59, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:

 there is no guarantee that we'll manage to reach a database state
 that is consistent with data already flushed out to disk during
 the last checkpoint.

 Robert Haas robertmh...@gmail.com wrote:

 I know of real customers who would have suffered real data loss
 had this code been present in the server version they were using.
 Checkpoints are the *only* mechanism by which SLRU pages get
 flushed to disk on a mostly-idle system. That means if something
 happens to your pg_xlog directory, and you haven't had a
 checkpoint, you're screwed.

If that is the concern, then it's a one-line fix to add the missing clog flush.

The other suggestions I've skim read seem fairly invasive at this
stage of the release.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Magnus Hagander
On Thu, Jun 7, 2012 at 5:56 PM, Andres Freund and...@2ndquadrant.com wrote:
 Hi,

 On Thursday, June 07, 2012 05:51:07 PM Simon Riggs wrote:
 On 7 June 2012 14:50, Heikki Linnakangas

 heikki.linnakan...@enterprisedb.com wrote:
  These changes will help the XLogInsert scaling patch

 ...and as I'm sure you're aware will junk much of the replication code
 and almost certainly set back the other work that we have brewing for
 9.3. So this is a very large curve ball you're throwing there.
 It's not that bad. Most of that code is pretty abstracted, the changes to
 adapt to that should be less than 20 lines. And it would remove some of the
 complexity.

 Personally, I don't think we should do this until we have a better
 regression test suite around replication and recovery because the
 impact will be huge but I welcome the suggested changes themselves.
 Hm. One could regard the logical rep stuff as a testsuite ;)

 If you are going to do this in 9.3, then it has to be early in the
 first Commit Fest and you'll need to be around to quickly follow
 through on all of the other subsequent breakages it will cause,
 otherwise every other piece of work in this area will be halted or
 delayed.
 Yea, I would definitely welcome an early patch.

Just as I'm sure everybody else would welcome *your* patches landing
in the first commitfest and that you all guarantee to be around to
quickly follow through on all potential breakages *that* can cause.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Andres Freund
On Thursday, June 07, 2012 06:02:12 PM Magnus Hagander wrote:
 On Thu, Jun 7, 2012 at 5:56 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  Hi,
  
  On Thursday, June 07, 2012 05:51:07 PM Simon Riggs wrote:
  On 7 June 2012 14:50, Heikki Linnakangas
  
  heikki.linnakan...@enterprisedb.com wrote:
   These changes will help the XLogInsert scaling patch
  
  ...and as I'm sure you're aware will junk much of the replication code
  and almost certainly set back the other work that we have brewing for
  9.3. So this is a very large curve ball you're throwing there.
  
  It's not that bad. Most of that code is pretty abstracted, the changes to
  adapt to that should be less than 20 lines. And it would remove some of
  the complexity.
  
  Personally, I don't think we should do this until we have a better
  regression test suite around replication and recovery because the
  impact will be huge but I welcome the suggested changes themselves.
  
  Hm. One could regard the logical rep stuff as a testsuite ;)
  
  If you are going to do this in 9.3, then it has to be early in the
  first Commit Fest and you'll need to be around to quickly follow
  through on all of the other subsequent breakages it will cause,
  otherwise every other piece of work in this area will be halted or
  delayed.
  
  Yea, I would definitely welcome an early patch.
 
 Just as I'm sure everybody else would welcome *your* patches landing
  in the first commitfest and that you all guarantee to be around to
  quickly follow through on all potential breakages *that* can cause.
Agreed.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] page is not marked all-visible warning in regression tests

2012-06-07 Thread Robert Haas
On Thu, Jun 7, 2012 at 11:04 AM, Andres Freund and...@2ndquadrant.com wrote:
 On Thursday, June 07, 2012 04:27:32 PM Robert Haas wrote:
 On Thu, Jun 7, 2012 at 9:41 AM, Andres Freund and...@2ndquadrant.com
 wrote:
  Proposed patch attached.  This adds some more comments in various
  places, and implements your suggestion of retesting the visibility-map
  bit when we detect a possible mismatch with the page-level bit.
 
  Thanks, will look at it in a bit.
 I wonder if
                /* mark page all-visible, if appropriate */
                if (all_visible && !PageIsAllVisible(page))
                {
                        PageSetAllVisible(page);
                        MarkBufferDirty(buf);
                        visibilitymap_set(onerel, blkno, InvalidXLogRecPtr, vmbuffer,
                                          visibility_cutoff_xid);
                }
 shouldn't test
                if (all_visible &&
                    (!PageIsAllVisible(page) || !all_visible_according_to_vm))
 instead.

Hmm, I think you're right.

 Commentwise I am not totally content with the emphasis on memory ordering
 because some of the stuff is more locking than memory ordering. Except that I
 think it's a pretty clear improvement. I can reformulate the places where I
 find that relevant but I have the feeling that wouldn't help the legibility.
 It's the big comment in vacuumlazy.c, the comment in nodeIndexonlyscan.c and the
 one in the header of visibilitymap_test. Should be
 s/memory-ordering/concurrency/ except in nodeIndexonlyscan.c.

Hmm, I see your point.

 The visibilitymap_clear/PageClearAllVisible in heap_multi_insert should be
 moved into the critical section, shouldn't it?

Yes, it should.  I was thinking maybe we could go the other way and
have heap_insert do it before starting the critical section, but
that's no good: clearing the visibility map bit is part of the
critical data change, and we can't do it and then forget to WAL-log
it.
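
To make the ordering concrete, the agreed-on shape is roughly this (a sketch
of the heap-insert path, illustrative only, not the actual patch): the
visibility-map clear happens inside the critical section, right alongside the
data change whose WAL record covers it.

	START_CRIT_SECTION();

	/* ... place the new tuple(s) on the heap page ... */
	MarkBufferDirty(buffer);

	/* clearing all-visible is part of the logged data change */
	if (PageIsAllVisible(BufferGetPage(buffer)))
	{
		PageClearAllVisible(BufferGetPage(buffer));
		visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
	}

	if (RelationNeedsWAL(relation))
	{
		/* ... build the rdata chain and XLogInsert() the record ... */
	}

	END_CRIT_SECTION();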

Updated patch attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


vm-test-cleanup-v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Heikki Linnakangas

On 07.06.2012 18:51, Simon Riggs wrote:

On 7 June 2012 14:50, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


These changes will help the XLogInsert scaling patch


...and as I'm sure you're aware will junk much of the replication code
and almost certainly set back the other work that we have brewing for
9.3. So this is a very large curve ball you're throwing there.


I don't think this has much impact on what you're doing (although it's a 
bit hard to tell without more details). The way WAL records work is the 
same, it's just the code that lays them out on a page, and reads back 
from a page, that's changed. And that's fairly isolated in xlog.c.



If you are going to do this in 9.3, then it has to be early in the
first Commit Fest and you'll need to be around to quickly follow
through on all of the other subsequent breakages it will cause,
otherwise every other piece of work in this area will be halted or
delayed.


Yeah, the plan is to get this in early, in the first commit fest. Not 
only because of possible breakage, but also because my ultimate goal is 
the XLogInsert refactoring, and I want do that early in the release 
cycle, too.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Ability to listen on two unix sockets

2012-06-07 Thread Robert Haas
On Thu, Jun 7, 2012 at 11:57 AM, Andres Freund and...@2ndquadrant.com wrote:
 On Thursday, June 07, 2012 05:55:11 PM Tom Lane wrote:
 Honza Horak hho...@redhat.com writes:
  On 06/06/2012 04:50 PM, Andres Freund wrote:
  I wonder if the whole issue doesn't require libpq to also try multiple
  hardcoded socket locations.
 
  I guess so.

 I don't really want to go there.  Some use cases have been shown in
 this thread for having a server listen in multiple places, but that does
 not translate to saying that clients need to support automatically
 looking in multiple places.  I think that mainly introduces questions we
 could do without, like which server did you actually end up contacting.
 It would be really nice to have a development psql connect to a distro
 installed psql and vice versa without having to specify -h /var/run/psql and
 -h /tmp all the time...

This is true, but you have this problem already.  It might be worth
fixing, but it seems like a separate issue from the topic of this
thread, which is where the server listens.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] page is not marked all-visible warning in regression tests

2012-06-07 Thread Andres Freund
On Thursday, June 07, 2012 06:08:34 PM Robert Haas wrote:
 On Thu, Jun 7, 2012 at 11:04 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  On Thursday, June 07, 2012 04:27:32 PM Robert Haas wrote:
  On Thu, Jun 7, 2012 at 9:41 AM, Andres Freund and...@2ndquadrant.com
  
  wrote:
   Proposed patch attached.  This adds some more comments in various
   places, and implements your suggestion of retesting the
   visibility-map bit when we detect a possible mismatch with the
   page-level bit.
   
   Thanks, will look at it in a bit.
  The visibilitymap_clear/PageClearAllVisible in heap_multi_insert should
  be moved into the critical section, shouldn't it?
 
 Yes, it should.  I was thinking maybe we could go the other way and
 have heap_insert do it before starting the critical section, but
 that's no good: clearing the visibility map bit is part of the
 critical data change, and we can't do it and then forget to WAL-log
 it.
You could do a visibilitymap_pin outside, but I don't really see the point as 
the page is already locked. There might be some slight benefit in doing so in 
multi_insert but that would be more complicated. And of doubtful benefit.

 Updated patch attached.
Looks good.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Ability to listen on two unix sockets

2012-06-07 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On Thursday, June 07, 2012 05:55:11 PM Tom Lane wrote:
 Honza Horak hho...@redhat.com writes:
 On 06/06/2012 04:50 PM, Andres Freund wrote:
 I wonder if the whole issue doesn't require libpq to also try multiple
 hardcoded socket locations.

 I don't really want to go there.

 It would be really nice to have a development psql connect to a distro 
  installed psql and vice versa without having to specify -h /var/run/psql and
  -h /tmp all the time...

I don't find that nice at all.  Which server did you actually connect
to?  How do you control it?  You're going to end up needing the -h
switch anyway.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Robert Haas
On Thu, Jun 7, 2012 at 11:51 AM, Simon Riggs si...@2ndquadrant.com wrote:
 So this is a very large curve ball you're throwing there.

This is not exactly unexpected.  At least the first two of these items
were previously discussed in the context of the XLOG scaling patch, many
months ago.  It shouldn't come as a surprise to anyone that Heikki is
planning to continue to work on that patch even though it didn't make
9.2.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Ability to listen on two unix sockets

2012-06-07 Thread Andres Freund
On Thursday, June 07, 2012 06:20:32 PM Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On Thursday, June 07, 2012 05:55:11 PM Tom Lane wrote:
  Honza Horak hho...@redhat.com writes:
  On 06/06/2012 04:50 PM, Andres Freund wrote:
  I wonder if the whole issue doesn't require libpq to also try multiple
  hardcoded socket locations.
  
  I don't really want to go there.
  
  It would be really nice to have a development psql connect to a distro
  installed psql and vice versa without having to specify -h /var/run/psql
   and -h /tmp all the time...
 
 I don't find that nice at all.  Which server did you actually connect
 to?  How do you control it?  You're going to end up needing the -h
 switch anyway.
They can't run on the same port anyway unless you disable listening on 
localhost. Changing a single port number is far less effort than typing -h 
/var/run/postgresql ;)

Anyway, I am not wed to this, and I don't plan to put work into it so I better 
shut up ;)

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] page is not marked all-visible warning in regression tests

2012-06-07 Thread Robert Haas
On Thu, Jun 7, 2012 at 12:19 PM, Andres Freund and...@2ndquadrant.com wrote:
 You could do a visibilitymap_pin outside, but I don't really see the point as
 the page is already locked. There might be some slight benefit in doing so in
 multi_insert but that would be more complicated. And of doubtful benefit.

Doesn't RelationGetBufferForTuple() already do exactly that?

 Updated patch attached.
 Looks good.

Thanks for the review.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas

2012-06-07 Thread Simon Riggs
On 7 June 2012 14:56, Tom Lane t...@sss.pgh.pa.us wrote:
 Sergey Koposov kopo...@ast.cam.ac.uk writes:
 On Tue, 5 Jun 2012, Simon Riggs wrote:
 Sounds less good and we'd need reasonable proof it actually did
 anything useful without being dangerous.

 Doing an initial unlocked test speeds things up another 2.69 fold (on
 top of 3.55 for your patch) for me, with 1GB of shared buffers.  That
 seems like it should be worthwhile.

 How do we go about getting reasonable proof that it is safe?

 That's enough for me.

 Say what?  That's a performance result and proves not a damn thing about
 safety.

Of course not.

Based on the rationale explained in the code comments in the patch, it
seems like a reasonable thing to me now.

The argument was that since we hold AccessExclusiveLock on the
relation, no other agent can be reading in new parts of the table into
new buffers, so the only change to a buffer would be away from the
dropping relation, in which case we wouldn't care. Which seems correct
to me.
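
For readers following along, the shape of the change being defended is
roughly the following (a sketch in the style of the DropRelFileNodeBuffers()
scan, not the submitted patch; the surrounding declarations are assumed from
bufmgr.c):

	for (i = 0; i < NBuffers; i++)
	{
		volatile BufferDesc *bufHdr = &BufferDescriptors[i];

		/*
		 * Unlocked pre-test.  A false negative is impossible here: with
		 * AccessExclusiveLock held on the relation, a buffer's tag can only
		 * change *away* from the relation being dropped, never towards it.
		 */
		if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
			continue;

		/* possible match: take the buffer header spinlock and re-check */
		LockBufHdr(bufHdr);
		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
			InvalidateBuffer(bufHdr);	/* releases spinlock */
		else
			UnlockBufHdr(bufHdr);
	}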

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding adjacent checkpoint records

2012-06-07 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 Robert Haas robertmh...@gmail.com wrote:
 I know of real customers who would have suffered real data loss
 had this code been present in the server version they were using.

 If that is the concern, then it's a one-line fix to add the missing clog flush.

To where, and what performance impact will that have?

 The other suggestions I've skim read seem fairly invasive at this
 stage of the release.

The issue here is that we committed a not-very-well-thought-out fix
to the original problem.  I think a reasonable argument could be made
for simply reverting commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724
and postponing any of these other ideas to 9.3.  The useless-checkpoints
problem has been there since 9.0 and can surely wait another release
cycle to get fixed.  But I concur with Robert that changing the system
behavior so that checkpointing of committed changes might be delayed
indefinitely is a high-risk choice.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] page is not marked all-visible warning in regression tests

2012-06-07 Thread Andres Freund
On Thursday, June 07, 2012 06:23:58 PM Robert Haas wrote:
 On Thu, Jun 7, 2012 at 12:19 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  You could do a visibilitymap_pin outside, but I don't really see the
  point as the page is already locked. There might be some slight benefit
  in doing so in multi_insert but that would be more complicated. And of
  doubtful benefit.
 
 Doesn't RelationGetBufferForTuple() already do exactly that?
Forget it, sorry. I mis(read|remembered) how multi_insert works and concluded 
from that on how insert works...


Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas

2012-06-07 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 7 June 2012 14:56, Tom Lane t...@sss.pgh.pa.us wrote:
 Say what?  That's a performance result and proves not a damn thing about
 safety.

 Of course not.

 Based on the rationale explained in the code comments in the patch, it
 seems like a reasonable thing to me now.

 The argument was that since we hold AccessExclusiveLock on the
 relation, no other agent can be reading in new parts of the table into
 new buffers, so the only change to a buffer would be away from the
 dropping relation, in which case we wouldn't care. Which seems correct
 to me.

Oh, I must be confused about which patch we are talking about --- I
thought this was in reference to some of the WIP ideas that were being
thrown about with respect to using lock-free access primitives.  Which
patch are you proposing for commit now, exactly?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Andres Freund
On Thursday, June 07, 2012 05:35:11 PM Heikki Linnakangas wrote:
 On 07.06.2012 17:18, Andres Freund wrote:
  On Thursday, June 07, 2012 03:50:35 PM Heikki Linnakangas wrote:
  3. Move the only field, xl_rem_len, from the continuation record header
  straight to the xlog page header, eliminating XLogContRecord altogether.
  This makes it easier to calculate in advance how much space a WAL record
  requires, as it no longer depends on how many pages it has to be split
  across. This wastes 4-8 bytes on every xlog page, but that's not much.
  
  +1. I don't think this will waste a measurable amount in real-world
  scenarios. A very big percentage of pages have continuation records.
 
 Yeah, although the way I'm planning to do it, you'll waste 4 bytes (on
 64-bit architectures) even when there is a continuation record, because
 of alignment:
 
 typedef struct XLogPageHeaderData
 {
  uint16  xlp_magic; /* magic value for correctness checks */
  uint16  xlp_info;  /* flag bits, see below */
 TimeLineID  xlp_tli;   /* TimeLineID of first record on page */
  XLogRecPtr  xlp_pageaddr;  /* XLOG address of this page */
 
 +   uint32  xlp_rem_len;   /* bytes remaining of continued record */
   } XLogPageHeaderData;
 
 The page header is currently 16 bytes in length, so adding a 4-byte
 field to it bumps the aligned size to 24 bytes. Nevertheless, I think we
 can well live with that.
At that point we can just do the
#define SizeofXLogPageHeaderData (offsetof(XLogPageHeaderData, xlp_pageaddr) + 
sizeof(uint32))
dance. If the record can be smeared over two pages there is no point in 
storing it aligned. Then we don't waste any additional space in comparison to 
the current state.

  If we do that we can remove all the aligment padding as well. Which would
  be a problem for you anyway, wouldn't it?
 It's not a problem. You just MAXALIGN the size of the record when you
 calculate how much space it needs, and then all records become naturally
 MAXALIGNed. We could quite easily remove the alignment on-disk if we
 wanted to, ReadRecord() already always copies the record to an aligned
 buffer, but I wasn't planning to do that.
What's the reasoning for having alignment on disk if the records aren't stored 
contiguously?

  These changes will help the XLogInsert scaling patch, by making the
  space calculations simpler. In essence, to reserve space for a WAL
  record of size X, you just need to do bytepos += X.  There's a lot
  more details with that, like mapping from the contiguous byte position
  to an XLogRecPtr that takes page headers into account, and noticing
  RedoRecPtr changes safely, but it's a start.
  
  Hm. Wouldn't you need to remove short/long page headers for that as well?
 
 No, those are ok because they're predictable.
I haven't read your scalability patch, so I am not really sure what you 
need...
The bytepos += X from above isn't as easy that way. But yes, it's not that 
complicated.

 Although it would make the
 mapping simpler. To convert from a contiguous xlog byte position that
 excludes all headers, to XLogRecPtr, you need to do something like this
 (I just made this up, probably has bugs, but it's about this complex):
 
 #define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD)
 #define UsableBytesInSegment ((XLOG_SEG_SIZE / XLOG_BLCKSZ) *
 UsableBytesInPage - (SizeOfXLogLongPHD - SizeOfXLogShortPHD))
 
 uint64 xlogrecptr;
 uint64 full_segments = bytepos / UsableBytesInSegment;
 int offset_in_segment = bytepos % UsableBytesInSegment;
 
 xlogrecptr = full_segments * XLOG_SEG_SIZE;
 /* is it on the first page? */
 if (offset_in_segment < XLOG_BLCKSZ - SizeOfXLogLongPHD)
 xlogrecptr += SizeOfXLogLongPHD + offset_in_segment;
 else
 {
 /* first page is fully used */
 xlogrecptr += XLOG_BLCKSZ;
 /* add other full pages */
 offset_in_segment -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
 xlogrecptr += (offset_in_segment / UsableBytesInPage) * XLOG_BLCKSZ;
 /* and finally offset within the last page */
 xlogrecptr += offset_in_segment % UsableBytesInPage;
 }
 /* finally convert the 64-bit xlogrecptr to a XLogRecPtr struct */
 XLogRecPtr.xlogid = xlogrecptr >> 32;
 XLogRecPtr.xrecoff = xlogrecptr & 0xFFFFFFFF;
It's a bit more complicated than that; records can span a good bit more than 
just two pages (even more than two segments), and you need to decide for each 
of those whether it has a long or a short header.

 Capsulated in a function, that's not too bad. But if we want to make
 that simpler, one idea would be to allocate the whole 1st page in each
 WAL segment for metadata. That way all the actual xlog pages would hold
 the same amount of xlog data.
It's a bit easier then, but you probably still need to loop over the size and 
subtract till you reach the final point. It's no problem to produce a 100MB 
WAL record. But then that's probably nothing to design for.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] page is not marked all-visible warning in regression tests

2012-06-07 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Updated patch attached.

The comments need a pass of copy-editing, eg here and here:

 + * so somebody else could be change the bit just after we look at it.  In 
 fact,
   ^^^

 + * got cleared after we checked it and before we got took the buffer


Seems reasonable beyond that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] page is not marked all-visible warning in regression tests

2012-06-07 Thread Robert Haas
On Thu, Jun 7, 2012 at 12:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Updated patch attached.

 The comments need a pass of copy-editing, eg here and here:

 + * so somebody else could be change the bit just after we look at it.  In 
 fact,
                       ^^^

 +         * got cleared after we checked it and before we got took the buffer
                                                            

 Seems reasonable beyond that.

Thanks, committed with those fixes and a correction for one more typo
I spotted elsewhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding adjacent checkpoint records

2012-06-07 Thread Simon Riggs
On 7 June 2012 17:27, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Robert Haas robertmh...@gmail.com wrote:
 I know of real customers who would have suffered real data loss
 had this code been present in the server version they were using.

 If that is the concern, then it's a one-line fix to add the missing clog 
 flush.

 To where, and what performance impact will that have?

To the point where we decide to skip the other parts of the
checkpoint. How would that cause a performance impact exactly? It's
less work than the original behaviour would be.


 The other suggestions I've skim read seem fairly invasive at this
 stage of the release.

 The issue here is that we committed a not-very-well-thought-out fix
 to the original problem.  I think a reasonable argument could be made
 for simply reverting commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724
 and postponing any of these other ideas to 9.3.  The useless-checkpoints
 problem has been there since 9.0 and can surely wait another release
 cycle to get fixed.  But I concur with Robert that changing the system
 behavior so that checkpointing of committed changes might be delayed
 indefinitely is a high-risk choice.

Clearly, delaying checkpoint indefinitely would be a high risk choice.
But they won't be delayed indefinitely, since changes cause WAL
records to be written and that would soon cause another checkpoint.

Robert has shown a bug exists, so it should be fixed directly,
especially if we believe in least invasive changes. If
not-fixing-the-actual-bug happened before, it's happening again here as
well, with some poor-sounding logic to justify it.

Please act as you see fit.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 dance. If the record can be smeared over two pages there is no point in 
 storing it aligned.

I think this is not true.  The value of requiring alignment is that you
can read the record-length field without first having to copy it somewhere.
In particular, it will get really ugly if the record length field itself
could cross a page boundary.  I think we want to be able to determine
the record length before we do any data copying, so that we can malloc
the record buffer and then just do one copy step.

The real reason for the current behavior of not letting the record
header get split across multiple pages is so that the length field is
guaranteed to be in the first page.  We can still guarantee that if
we (1) put the length field first and (2) require at least int32
alignment.  I think losing that property will be pretty bad though.
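
Spelled out, the invariant is simply this (illustrative only, not xlog.c code;
"pos" is taken as a 64-bit byte position per item 1, and "page" points to the
WAL block containing the record start): with the length stored first and
every record start at least int32-aligned, the four length bytes can never
straddle a page boundary, so the reader can size its buffer before copying
anything.

	uint32		off = pos % XLOG_BLCKSZ;

	Assert(off % sizeof(uint32) == 0);				/* (2): start is int32-aligned */
	Assert(XLOG_BLCKSZ - off >= sizeof(uint32));	/* so the length bytes fit on this page */

	uint32		total_len = *(uint32 *) (page + off);	/* (1): length is the first field */
	char	   *buf = palloc(total_len);				/* size known before any copying */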

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas

2012-06-07 Thread Simon Riggs
On 7 June 2012 17:34, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On 7 June 2012 14:56, Tom Lane t...@sss.pgh.pa.us wrote:
 Say what?  That's a performance result and proves not a damn thing about
 safety.

 Of course not.

 Based on the rationale explained in the code comments in the patch, it
 seems like a reasonable thing to me now.

 The argument was that since we hold AccessExclusiveLock on the
 relation, no other agent can be reading in new parts of the table into
 new buffers, so the only change to a buffer would be away from the
 dropping relation, in which case we wouldn't care. Which seems correct
 to me.

 Oh, I must be confused about which patch we are talking about --- I
 thought this was in reference to some of the WIP ideas that were being
 thrown about with respect to using lock-free access primitives.  Which
 patch are you proposing for commit now, exactly?

Both of these, as attached up thread.

Simon's patch - dropallforks.v1.patch
Jeff's patch - DropRelFileNodeBuffers_unlock_v1.patch
(needs a little tidyup)

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Andres Freund
On Thursday, June 07, 2012 06:53:58 PM Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  dance. If the record can be smeared over two pages there is no point in
  storing it aligned.
 
 I think this is not true.  The value of requiring alignment is that you
 can read the record-length field without first having to copy it somewhere.
 In particular, it will get really ugly if the record length field itself
 could cross a page boundary.  I think we want to be able to determine
 the record length before we do any data copying, so that we can malloc
 the record buffer and then just do one copy step.
Hm, I had assumed the record would get copied into a temp/static buffer first 
and only get reassembled together with the data afterwards.
But if that's not the way to go, sure, storing it aligned so that the length 
can always be read aligned within a page is sensible.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding adjacent checkpoint records

2012-06-07 Thread Robert Haas
On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Clearly, delaying checkpoint indefinitely would be a high risk choice.
 But they won't be delayed indefinitely, since changes cause WAL
 records to be written and that would soon cause another checkpoint.

But that's exactly the problem - it might not be soon at all.  We have
customers who process about one write transaction per day.  The
current state of play in 9.2 is that we'll checkpoint when we get to
the next WAL segment.  But at one write transaction per day, that
could take weeks or months.  Someone invented a setting called
'checkpoint_timeout' for a very good reason - people don't want huge
amounts of time to pass between checkpoints.  That setting is
currently capped at one hour; the proposition that someone who sets it
to 5 minutes on a low-write-volume system is OK with the effective
value being 5 months does not seem likely to me.

Your suggestion of just checkpointing CLOG seems like it would
mitigate the worst of the damage, but I think I agree with Tom that
just reverting the whole patch would be a better solution, if we can't
figure out anything cleaner.  It's better to have a few unnecessary
checkpoints than to risk losing somebody's data, especially since the
unnecessary checkpoints only happen with wal_level=hot_standby, but
the data loss risk exists for everyone.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Simon Riggs
On 7 June 2012 17:12, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 07.06.2012 18:51, Simon Riggs wrote:

 On 7 June 2012 14:50, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 These changes will help the XLogInsert scaling patch


 ...and as I'm sure you're aware will junk much of the replication code
 and almost certainly set back the other work that we have brewing for
 9.3. So this is a very large curve ball you're throwing there.


 I don't think this has much impact on what you're doing (although it's a bit
 hard to tell without more details). The way WAL records work is the same,
 it's just the code that lays them out on a page, and reads back from a page,
 that's changed. And that's fairly isolated in xlog.c.

I wasn't worried about the code overlap, but the subsidiary breakage
looks pretty enormous to me.

Anything changing filenames will break every HA config anybody has
anywhere. So you can pretty much kiss goodbye to the idea of
pg_upgrade. For me, this one thing alone is sufficient to force the next
release to be 10.0.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding adjacent checkpoint records

2012-06-07 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 7 June 2012 17:27, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 If that is the concern, then it's a one-line fix to add the missing clog 
 flush.

 To where, and what performance impact will that have?

 To the point where we decide to skip the other parts of the
 checkpoint. How would that cause a performance impact exactly? It's
 less work than the original behaviour would be.

It's not clear to me exactly which parts of the checkpoint process would
need to be duplicated here to be safe.  What you're proposing is
basically a partial checkpoint, which would need quite a lot of thought
to be sure it did everything that should be done and nothing that
shouldn't be.  And it would be a continuing risk spot for future bugs of
omission.

 The issue here is that we committed a not-very-well-thought-out fix
 to the original problem.  I think a reasonable argument could be made
 for simply reverting commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724
 and postponing any of these other ideas to 9.3.  The useless-checkpoints
 problem has been there since 9.0 and can surely wait another release
 cycle to get fixed.  But I concur with Robert that changing the system
 behavior so that checkpointing of committed changes might be delayed
 indefinitely is a high-risk choice.

 Clearly, delaying checkpoint indefinitely would be a high risk choice.
 But they won't be delayed indefinitely, since changes cause WAL
 records to be written and that would soon cause another checkpoint.

No, the situation of most concern is where we make some change and then
nothing happens for a very long time.  If there's a continuing flow of
updates, then yes a checkpoint will happen ... eventually.  Even with
the assumption of continuing updates, the time until a commit is
checkpointed might be vastly longer than the configured checkpoint
interval, and that's an interval in which loss of the WAL log is more
dangerous than it was in any previous release.  So basically, this fix
is giving up some hard-to-quantify amount of data security.  Per this
thread, there are other ways in which we can fix the useless-checkpoint
problem without giving up any such security.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Andres Freund
On Thursday, June 07, 2012 07:03:32 PM Simon Riggs wrote:
 On 7 June 2012 17:12, Heikki Linnakangas
 
 heikki.linnakan...@enterprisedb.com wrote:
  On 07.06.2012 18:51, Simon Riggs wrote:
  On 7 June 2012 14:50, Heikki Linnakangas
  
  heikki.linnakan...@enterprisedb.com  wrote:
  These changes will help the XLogInsert scaling patch
  
  ...and as I'm sure you're aware will junk much of the replication code
  and almost certainly set back the other work that we have brewing for
  9.3. So this is a very large curve ball you're throwing there.
  
  I don't think this has much impact on what you're doing (although it's a
  bit hard to tell without more details). The way WAL records work is the
  same, it's just the code that lays them out on a page, and reads back
  from a page, that's changed. And that's fairly isolated in xlog.c.
 I wasn't worried about the code overlap, but the subsidiary breakage
 looks pretty enormous to me.
The xlog arithmetic will still be encapsulated, so not much difference there. 
Removing reading of XLogContRecord isn't complicated and would result in less 
code. Shouldn't be much more than that.

 Anything changing filenames will break every HA config anybody has
 anywhere. So you can pretty much kiss goodbye to the idea of
 pg_upgrade. For me, this one thing alone is sufficient to force next
 release to be 10.0.
Hm? WAL isn't relevant for pg_upgrade. And HA setups should rely on 
archive_command and such, rather than computing the next/last file name 
themselves. I would guess removing that corner case actually fixes more tools 
than it breaks.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Kevin Grittner
Simon Riggs si...@2ndquadrant.com wrote:
 
 Anything changing filenames will break every HA config anybody has
 anywhere.
 
It will impact our scripts related to backup and archiving, but I
think we're talking about two or three staff days to cover it in our
shop.
 
We should definitely make sure that this change is conspicuously
noted.  The scariest part is that there will now be files that
matter with names that previously didn't exist, so lack of action
will cause failure to capture a usable backup.  I don't know that it
merits a bump to 10.0, though.  We test every backup for usability,
as I believe any shop should; failure to cover this should cause
pretty obvious errors pretty quickly if you are testing your
backups.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding adjacent checkpoint records

2012-06-07 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 ... It's better to have a few unnecessary
 checkpoints than to risk losing somebody's data, especially since the
 unnecessary checkpoints only happen with wal_level=hot_standby, but
 the data loss risk exists for everyone.

Yeah, that's another point here: the benefit of the patch accrues to
a different set of people than the ones paying the penalty.  If you've
got hot standby enabled, presumably you are replicating to at least one
slave and so the prospect of data loss via WAL loss is mitigated for you.

I also note that the other work done in 9.2 to reduce idle-system load
did not address replication configurations at all; I think we still have
time-driven wakeups in walsender and walreceiver for instance.  So I'd
rather revert the patch now, and consider that a better fix will be part
of a future round of work to reduce the idle-system load in replication
setups.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Could we replace SysV semaphores with latches?

2012-06-07 Thread Stefan Kaltenbrunner
On 06/07/2012 06:09 AM, Tom Lane wrote:
 There has been regular griping in this list about our dependence on SysV
 shared memory, but not so much about SysV semaphores, even though the
 latter cause their fair share of issues; as seen for example in
 buildfarm member spoonbill's recent string of failures:
 
 creating template1 database in 
 /home/pgbuild/pgbuildfarm/HEAD/pgsql.25563/src/test/regress/./tmp_check/data/base/1
  ... FATAL:  could not create semaphores: No space left on device
 DETAIL:  Failed system call was semget(1, 17, 03600).
 HINT:  This error does *not* mean that you have run out of disk space.  It 
 occurs when either the system limit for the maximum number of semaphore sets 
 (SEMMNI), or the system wide maximum number of semaphores (SEMMNS), would be 
 exceeded.  You need to raise the respective kernel parameter.  Alternatively, 
 reduce PostgreSQL's consumption of semaphores by reducing its max_connections 
 parameter.
   The PostgreSQL documentation contains more information about 
 configuring your system for PostgreSQL.
 child process exited with exit code 1

Hmm, now that you mention it - I had missed the issue completely. The
problem here is that spoonbill only has resources for one running
postmaster, and the failure on 24.5.2012 left behind a stray postmaster
instance - that should be fixed now.


Stefan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 But if you're just using regexp matching against pathnames, your
 tool will be just fine.  Do your tools actually rely on the
 occasional absence of a file in what would otherwise be the usual
 sequence of files?
 
To save snapshot backups for the long term, we generate a list of
the specific WAL files needed to reach a consistent recovery point
from a given base backup.  We keep monthly snapshot backups for a
year.  We currently determine the first and last file needed, and
then create a list of all the WAL files to save.  We error out if
any are missing, so we do skip the FF file.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Early hint bit setting

2012-06-07 Thread Robert Haas
On Wed, Jun 6, 2012 at 6:41 PM, Jim Nasby j...@nasby.net wrote:
 Except that's only true when there are no other transactions running. That's
 been one of the big sticking points about trying to proactively set hint
 bits; in a real system you're not going to gain very much unless you wait a
 while before setting them.

No, the committed hint bit just means that the transaction is
committed.  You don't have to wait for it to be all-visible.
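To make that concrete: setting the bit needs only the clog state of the
inserting transaction, nothing about other running transactions.  A simplified
sketch of the pattern (paraphrasing what tqual.c already does, not a proposal
to change it):

#include "postgres.h"
#include "access/htup.h"
#include "access/transam.h"
#include "storage/bufmgr.h"

/* Mark the inserting transaction committed as soon as clog says it is. */
static void
set_xmin_committed_hint(HeapTupleHeader tuple, Buffer buffer)
{
	TransactionId xmin = HeapTupleHeaderGetXmin(tuple);

	if (TransactionIdIsNormal(xmin) && TransactionIdDidCommit(xmin))
	{
		/* roughly what tqual.c's SetHintBits() boils down to,
		 * ignoring the async-commit LSN check */
		tuple->t_infomask |= HEAP_XMIN_COMMITTED;
		SetBufferCommitInfoNeedsSave(buffer);
	}
}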

I think my biggest concern about this is that it inevitably relies on
some assumption about how much latency there will be before the client
sends the next request.  That strikes me as impossible to tune.  On
system A, connected to the Internet via an overloaded 56k modem link,
you can get away with doing a huge amount of fiddling around while
waiting for the next request.  But on system B, which uses 10GE or
Infiniband or local sockets, the acceptable latency will be much less.
 Even given identical hardware, scheduler behavior may matter quite a
lot - rumor has it that FreeBSD's scheduling latency may be
significantly less than on Linux, although I have not verified it and
rumor may lie.  But the point is that whether or not this works out to
a win on any given system seems like it will depend on an awful lot of
stuff that we can't know or control.

I would be more inclined to look at trying to make this happen in a
background process, although that's not without its own challenges.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Robert Haas
On Thu, Jun 7, 2012 at 1:40 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:

 But if you're just using regexp matching against pathnames, your
 tool will be just fine.  Do your tools actually rely on the
 occasional absence of a file in what would otherwise be the usual
 sequence of files?

 To save snapshot backups for the long term, we generate a list of
 the specific WAL files needed to reach a consistent recovery point
 from a given base backup.  We keep monthly snapshot backups for a
 year.  We currently determine the first and last file needed, and
 then create a list of all the WAL files to save.  We error out if
 any are missing, so we do skip the FF file.

OK, I see.  Still, I think there are a lot of people who don't do
anything that complex, and won't be affected.  But I agree we had
better clearly release-note it as an incompatibility.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Robert Haas
On Thu, Jun 7, 2012 at 1:15 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Simon Riggs si...@2ndquadrant.com wrote:

 Anything changing filenames will break every HA config anybody has
 anywhere.

 It will impact our scripts related to backup and archiving, but I
 think we're talking about two or three staff days to cover it in our
 shop.

 We should definitely make sure that this change is conspicuously
 noted.  The scariest part is that there will now be files that
 matter with names that previously didn't exist, so lack of action
 will cause failure to capture a usable backup.

But if you're just using regexp matching against pathnames, your tool
will be just fine.  Do your tools actually rely on the occasional
absence of a file in what would otherwise be the usual sequence of
files?

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WalSndWakeup() and synchronous_commit=off

2012-06-07 Thread Simon Riggs
On 6 June 2012 20:11, Andres Freund and...@2ndquadrant.com wrote:
 On Tuesday, May 29, 2012 08:42:43 PM Andres Freund wrote:
 Hi,

 On Monday, May 28, 2012 07:11:53 PM Tom Lane wrote:
  Andres Freund and...@2ndquadrant.com writes:
   Does anybody have a better idea than to either call WalSndWakeup() at
   essentially the wrong places or calling it inside a critical section?
  
   Tom, what danger do you see from calling it in a critical section?
 
  My concern was basically that it might throw an error.  Looking at the
  current implementation of SetLatch, it seems that's not possible, but
  I wonder whether we want to lock ourselves into that assumption.

 The assumption is already made at several other places I think.
 XLogSetAsyncXactLSN does a SetLatch and is called from critical sections;
 several signal handlers call it without any attention to the context.

 Requiring it to be called outside would make its usage considerably less
 convenient and I don't really see what could change that would require to
 throw non-panic errors.

  Still, if the alternatives are worse, maybe that's the best answer.
  If we do that, though, let's add comments to WalSndWakeup and SetLatch
  mentioning that they mustn't throw error.

 Patch attached.
 I would like to invite some more review (+commit...) here ;). Imo this is an
 annoying bug which should be fixed before next point release or beta/rc comes
 out...

Moved the wakeup to a logical place outside a critical section.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 Anything changing filenames will break every HA config anybody has
 anywhere.

This seems like nonsense to me.  How many external scripts are likely to
know that we skip the FF page?  There might be some, but not many.

 So you can pretty much kiss goodbye to the idea of pg_upgrade.

And that is certainly nonsense.  I don't think pg_upgrade even knows
about this, and if it does we can surely fix it.

 For me, this one thing alone is sufficient to force next release to be
 10.0.

Huh?  We make incompatible changes in major versions all the time.
This one does not appear to me to be worse than many others.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Simon Riggs
On 7 June 2012 19:52, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Anything changing filenames will break every HA config anybody has
 anywhere.

 This seems like nonsense to me.  How many external scripts are likely to
 know that we skip the FF page?  There might be some, but not many.

If that is the only change in filenames, then all is forgiven.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLog changes for 9.3

2012-06-07 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 7 June 2012 19:52, Tom Lane t...@sss.pgh.pa.us wrote:
 This seems like nonsense to me.  How many external scripts are likely to
 know that we skip the FF page?  There might be some, but not many.

 If that is the only change in filenames, then all is forgiven.

Oh, now I see what you're on about.  Yes, I agree that we should
maintain the same formatting of WAL segment file names, even though
it will be rather artificial in the 64-bit-arithmetic world.  The
only externally visible change should be the creation of FF-numbered
files where formerly those were skipped.
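For anyone auditing scripts, the current convention amounts to the sketch
below (based on the 9.2 xlog_internal.h macros; with 16MB segments the
per-file segment number only runs 0x00..0xFE, so no "FF" name exists today):

#include <stdio.h>
#include <stdint.h>

#define XLOG_SEG_SIZE	(16 * 1024 * 1024)
#define XLogSegsPerFile	(((uint32_t) 0xffffffff) / XLOG_SEG_SIZE)	/* = 0xFF */

/* pre-9.3 style WAL segment file name: three 8-hex-digit fields */
static void
old_xlog_file_name(char *fname, size_t len,
				   uint32_t tli, uint32_t log, uint32_t seg)
{
	/* seg < XLogSegsPerFile, i.e. 0x00 .. 0xFE; "FF" is skipped */
	snprintf(fname, len, "%08X%08X%08X", tli, log, seg);
}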

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas

2012-06-07 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 7 June 2012 17:34, Tom Lane t...@sss.pgh.pa.us wrote:
 Oh, I must be confused about which patch we are talking about --- I
 thought this was in reference to some of the WIP ideas that were being
 thrown about with respect to using lock-free access primitives.  Which
 patch are you proposing for commit now, exactly?

 Both of these, as attached up thread.

 Simon's patch - dropallforks.v1.patch
 Jeff's patch - DropRelFileNodeBuffers_unlock_v1.patch
 (needs a little tidyup)

OK, will take a look.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WalSndWakeup() and synchronous_commit=off

2012-06-07 Thread Andres Freund
On Thursday, June 07, 2012 08:41:23 PM Simon Riggs wrote:
 On 6 June 2012 20:11, Andres Freund and...@2ndquadrant.com wrote:
  On Tuesday, May 29, 2012 08:42:43 PM Andres Freund wrote:
  Hi,
  
  On Monday, May 28, 2012 07:11:53 PM Tom Lane wrote:
   Andres Freund and...@2ndquadrant.com writes:
Does anybody have a better idea than to either call WalSndWakeup()
at essentially the wrong places or calling it inside a critical
section?

Tom, what danger do you see from calling it in a critical section?
   
   My concern was basically that it might throw an error.  Looking at the
   current implementation of SetLatch, it seems that's not possible, but
   I wonder whether we want to lock ourselves into that assumption.
  
  The assumption is already made at several other places I think.
  XLogSetAsyncXactLSN does a SetLatch and is called from critical
  sections; several signal handlers call it without any attention to the
  context.
  
  Requiring it to be called outside would make its usage considerably less
  convenient and I don't really see what could change that would require
  to throw non-panic errors.
  
   Still, if the alternatives are worse, maybe that's the best answer.
   If we do that, though, let's add comments to WalSndWakeup and SetLatch
   mentioning that they mustn't throw error.
  
  Patch attached.
  
  I would like to invite some more review (+commit...) here ;). Imo this is
  an annoying bug which should be fixed before next point release or
  beta/rc comes out...
 
 Moved the wakeup to a logical place outside a critical section.
Hm. I don't really like the way you implemented that. While it reduces the 
likelihood quite a bit it will still miss wakeups if an XLogInsert pushes out 
the data because of missing space or if any place does an XLogFlush(lsn).
The knowledge is really only available in XLogWrite...

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas

2012-06-07 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 On 30 May 2012 12:10, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Also, I wonder if DropRelFileNodeBuffers() could scan the pool without
 grabbing the spinlocks on every buffer? It could do an unlocked test first,
 and only grab the spinlock on buffers that need to be dropped.

 Sounds less good and we'd need reasonable proof it actually did
 anything useful without being dangerous.

 Doing an initial unlocked test speeds things up another 2.69 fold (on
 top of 3.55 for your patch) for me, with 1GB of shared buffers.  That
 seems like it should be worthwhile.

With shared_buffers set to 1GB, I see about a 2X reduction in the total
time to drop a simple table, ie
create table zit(f1 text primary key);
drop table zit;
(This table definition is chosen to ensure there's an index and a toast
table involved, so several scans of the buffer pool are needed.)  The
DROP goes from about 40ms to about 20ms on a fairly recent Xeon desktop.
So I'm convinced this is a win.
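
For the archives, the unlocked pre-test amounts to something like this
(a sketch against the 9.2 bufmgr internals, not the committed diff;
InvalidateBuffer is the static helper in bufmgr.c):

#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"

/* Scan the pool, taking each buffer's spinlock only if it looks relevant. */
static void
drop_rel_buffers_sketch(RelFileNodeBackend rnode)
{
	int			i;

	for (i = 0; i < NBuffers; i++)
	{
		volatile BufferDesc *bufHdr = &BufferDescriptors[i];

		/*
		 * Unlocked pre-test.  A stale read just costs one spinlock
		 * acquisition below; we cannot miss a buffer we care about,
		 * because nobody can be loading pages of a relation that is
		 * being dropped while we hold its lock.
		 */
		if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
			continue;

		LockBufHdr(bufHdr);
		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
			InvalidateBuffer(bufHdr);	/* releases spinlock */
		else
			UnlockBufHdr(bufHdr);
	}
}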

I extended the patch to also cover DropDatabaseBuffers,
FlushRelationBuffers, and FlushDatabaseBuffers, which have got the exact
same type of full-pool scan loop, and committed it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas

2012-06-07 Thread Tom Lane
I wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Both of these, as attached up thread.
 Simon's patch - dropallforks.v1.patch
 Jeff's patch - DropRelFileNodeBuffers_unlock_v1.patch
 (needs a little tidyup)

 OK, will take a look.

I didn't like dropallforks.v1.patch at all as presented, for several
reasons:

* Introducing an AllForks notation that only works in some contexts is
a foot-gun of large caliber.  This concern is not academic: you broke
dropping of temp relations entirely, in the patch as presented, because
for temp rels DropRelFileNodeBuffers would hand off to
DropRelFileNodeLocalBuffers and the latter had not been taught about
AllForks.

* Since we have found out in this thread that the inner loop of
DropRelFileNodeBuffers is performance-critical for the cases we're
dealing with, it seems inappropriate to me to make its tests more
complex.  We want simpler, and we can have simpler given that the
relation-drop case cares neither about fork nor block number.

* The patch modified behavior of XLogDropRelation, which has not been
shown to be performance-critical, and probably can't be because the
hashtable it searches should never be all that large.  It certainly
doesn't matter to the speed of normal execution.

I thought it would be a lot safer and probably a little bit quicker
if we just split DropRelFileNodeBuffers into two routines, one for
the specific-fork case and one for the all-forks case; and then the
same for its main caller smgrdounlink.  So I modified the patch along
those lines and committed it.

As committed, the smgrdounlinkfork case is actually dead code; it's
never called from anywhere.  I left it in place just in case we want
it someday.
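
For reference, the resulting interfaces look roughly like this (declarations
are approximate and from memory; check the tree for the exact spelling):

/* bufmgr: per-fork truncation case vs. whole-relation drop case */
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
								   BlockNumber firstDelBlock);
extern void DropRelFileNodeAllBuffers(RelFileNodeBackend rnode);

/* smgr: all-forks unlink, plus the per-fork form left in as dead code */
extern void smgrdounlink(SMgrRelation reln, bool isRedo);
extern void smgrdounlinkfork(SMgrRelation reln, ForkNumber forknum, bool isRedo);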

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] New Postgres committer: Kevin Grittner

2012-06-07 Thread Tom Lane
I am pleased to announce that Kevin Grittner has accepted the core
committee's invitation to become our newest committer.  As you all
know, Kevin's done a good deal of work on the project over the past
couple of years.  We judged that he has the requisite skills,
dedication to the project, and a suitable degree of caution to be
a good committer.  Please join me in welcoming him aboard.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New Postgres committer: Kevin Grittner

2012-06-07 Thread Michael Glaesemann

On Jun 7, 2012, at 18:15, Tom Lane wrote:

 I am pleased to announce that Kevin Grittner has accepted the core
 committee's invitation to become our newest committer.  As you all
 know, Kevin's done a good deal of work on the project over the past
 couple of years.  We judged that he has the requisite skills,
 dedication to the project, and a suitable degree of caution to be
 a good committer.  Please join me in welcoming him aboard.

Congratulations, Kevin! Well-deserved!

Michael Glaesemann
grzm seespotcode net




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New Postgres committer: Kevin Grittner

2012-06-07 Thread Thom Brown
On 7 June 2012 23:15, Tom Lane t...@sss.pgh.pa.us wrote:
 I am pleased to announce that Kevin Grittner has accepted the core
 committee's invitation to become our newest committer.  As you all
 know, Kevin's done a good deal of work on the project over the past
 couple of years.  We judged that he has the requisite skills,
 dedication to the project, and a suitable degree of caution to be
 a good committer.  Please join me in welcoming him aboard.

Kevin, with great power comes great responsibility.  In other words
your workload has just gone up. ;)

You'll be a great addition to the existing team of committers.
-- 
Thom

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New Postgres committer: Kevin Grittner

2012-06-07 Thread Peter Geoghegan
On 7 June 2012 23:15, Tom Lane t...@sss.pgh.pa.us wrote:
 Please join me in welcoming him aboard.

Congratulations, Kevin.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: parameterized function scan

2012-06-07 Thread Antonin Houska

On 05/24/2012 12:46 AM, Tom Lane wrote:

 Well, it's not per spec: what you did accepts queries that are invalid
 per spec and are very likely to be errors rather than intentional
 invocations of the LATERAL facility.  This might be all right for

I think I saw queries where a function is joined with no explicit LATERAL().

Nevertheless...

 Quite honestly, I think that something that only handles functions as
 LATERAL righthands is broken by design.  It doesn't meet the spec, and
 it's unlikely to represent much of a step towards a full implementation.
 Remember Polya's Inventor's Paradox: the more general problem may be
 easier to solve.

... sounds more serious. I'll keep it in mind in case I come up with a new
idea about this problem at some point. Thanks,

Tony H.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New Postgres committer: Kevin Grittner

2012-06-07 Thread Peter Geoghegan
On 7 June 2012 23:40, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 7 June 2012 23:15, Tom Lane t...@sss.pgh.pa.us wrote:
 Please join me in welcoming him aboard.

 Congratulations, Kevin.

Idle thought for the web team: Now might be a good time to take down
the blurb on .org in which Kevin describes the mailing list support as
amazing.  :-)

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New Postgres committer: Kevin Grittner

2012-06-07 Thread Jaime Casanova
On Thu, Jun 7, 2012 at 5:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Please join me in welcoming him aboard.


Congratulations Kevin

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: 24x7 support and training

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] log_newpage header comment

2012-06-07 Thread Robert Haas
It seems that in implementing ginbuildempty(), I falsified the first
note in the header comment for log_newpage():

 * Note: all current callers build pages in private memory and write them
 * directly to smgr, rather than using bufmgr.  Therefore there is no need
 * to pass a buffer ID to XLogInsert, nor to perform MarkBufferDirty within
 * the critical section.

Mea culpa, mea culpa, mea maxima culpa.

So, this leads to a couple of questions:

1. Considering that we're logging the entire page, is it necessary (or
even desirable) to include the buffer ID in the rdata structure?  If
so, why?  To put that another way, does my abuse of log_newpage
constitute a bug in gistbuildempty()?

2. Should we add a new function that does the same thing as
log_newpage for a shared buffer?  I'm imagining that the signature
would be:

XLogRecPtr log_newpage_buffer(RelFileNode *rnode, Buffer buffer);

Admittedly, it may seem that we don't quite need such a function to
cater to just one caller, but I think we might have more in the
future.  Among other things, it occurs to me to wonder whether I ought
to rewrite all the ambuildempty routines in the style of
ginbuildempty(), to avoid having to fsync in the foreground.  Even if
not, I think there might be other applications for this.
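
To make the proposal concrete, a hypothetical implementation might look like
the sketch below, modeled on the existing log_newpage() and its
xl_heap_newpage record (names and record type are just borrowed for
illustration; this is not committed code):

#include "postgres.h"
#include "access/heapam.h"		/* xl_heap_newpage, XLOG_HEAP_NEWPAGE in 9.2 */
#include "access/xlog.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"

XLogRecPtr
log_newpage_buffer(RelFileNode *rnode, Buffer buffer)
{
	Page		page = BufferGetPage(buffer);
	xl_heap_newpage xlrec;
	XLogRecPtr	recptr;
	XLogRecData rdata[2];
	RelFileNode tag_node;
	ForkNumber	forknum;
	BlockNumber blkno;

	BufferGetTag(buffer, &tag_node, &forknum, &blkno);

	xlrec.node = *rnode;
	xlrec.forknum = forknum;
	xlrec.blkno = blkno;

	rdata[0].data = (char *) &xlrec;
	rdata[0].len = SizeOfHeapNewpage;
	rdata[0].buffer = InvalidBuffer;
	rdata[0].next = &(rdata[1]);

	rdata[1].data = (char *) page;
	rdata[1].len = BLCKSZ;
	rdata[1].buffer = InvalidBuffer;
	rdata[1].next = NULL;

	START_CRIT_SECTION();

	/* unlike log_newpage()'s smgr-only callers, a shared buffer has to be
	 * marked dirty, and inside the critical section at that */
	MarkBufferDirty(buffer);

	recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_NEWPAGE, rdata);

	PageSetLSN(page, recptr);
	PageSetTLI(page, ThisTimeLineID);

	END_CRIT_SECTION();

	return recptr;
}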

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New Postgres committer: Kevin Grittner

2012-06-07 Thread Robert Haas
On Thu, Jun 7, 2012 at 7:09 PM, Jaime Casanova ja...@2ndquadrant.com wrote:
 On Thu, Jun 7, 2012 at 5:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Please join me in welcoming him aboard.


 Congratulations Kevin

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WalSndWakeup() and synchronous_commit=off

2012-06-07 Thread Simon Riggs
On 7 June 2012 21:08, Andres Freund and...@2ndquadrant.com wrote:

 Moved the wakeup to a logical place outside a critical section.

 Hm. I don't really like the way you implemented that. While it reduces the
 likelihood quite a bit it will still miss wakeups if an XLogInsert pushes out
 the data because of missing space or if any place does an XLogFlush(lsn).
 The knowledge is really only available in XLogWrite...

Right, but the placement inside the critical section was objected to.

This way, any caller of XLogFlush() will be swept up at least once per
wal_writer_delay, so missing a few calls doesn't mean we have spikes
in replication delay.

Doing it more frequently was also an objection from Fujii, to which we
must listen.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas

2012-06-07 Thread Simon Riggs
On 7 June 2012 22:54, Tom Lane t...@sss.pgh.pa.us wrote:

 I thought it would be a lot safer and probably a little bit quicker
 if we just split DropRelFileNodeBuffers into two routines, one for
 the specific-fork case and one for the all-forks case; and then the
 same for its main caller smgrdounlink.  So I modified the patch along
 those lines and committed it.

 As committed, the smgrdounlinkfork case is actually dead code; it's
 never called from anywhere.  I left it in place just in case we want
 it someday.

That's fine. The first version of the patch did it exactly that way.

I tried to second-guess objections and so recoded it the way it was submitted.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] log_newpage header comment

2012-06-07 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 It seems that in implementing ginbuildempty(), I falsified the first
 note in the header comment for log_newpage():

  * Note: all current callers build pages in private memory and write them
  * directly to smgr, rather than using bufmgr.  Therefore there is no need
  * to pass a buffer ID to XLogInsert, nor to perform MarkBufferDirty within
  * the critical section.

 1. Considering that we're logging the entire page, is it necessary (or
 even desirable) to include the buffer ID in the rdata structure?  If
 so, why?  To put that another way, does my abuse of log_newpage
 constitute a bug in gistbuildempty()?

AFAICS, not passing the buffer ID to XLogInsert is not an issue, since
we are logging the whole page in any case.  However, failing to perform
MarkBufferDirty within the critical section definitely is an issue.

 2. Should we add a new function that does the same thing as
 log_newpage for a shared buffer?  I'm imagining that the signature
 would be:

Either that or rethink building this data in shared buffers.  What's the
point of that, exactly, for a page that we are most certainly not going
to use in normal operation?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas

2012-06-07 Thread Sergey Koposov

On Thu, 7 Jun 2012, Tom Lane wrote:

I extended the patch to also cover DropDatabaseBuffers,
FlushRelationBuffers, and FlushDatabaseBuffers, which have got the exact
same type of full-pool scan loop, and committed it.


Thanks everybody for the patches/commits.

Sergey

*
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding adjacent checkpoint records

2012-06-07 Thread Simon Riggs
On 7 June 2012 18:03, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Clearly, delaying checkpoint indefinitely would be a high risk choice.
 But they won't be delayed indefinitely, since changes cause WAL
 records to be written and that would soon cause another checkpoint.

 But that's exactly the problem - it might not be soon at all.  We have
 customers who process about one write transaction per day.  The
 current state of play in 9.2 is that we'll checkpoint when we get to
 the next WAL segment.  But at one write transaction per day, that
 could take weeks or months.  Someone invented a setting called
 'checkpoint_timeout' for a very good reason - people don't want huge
 amounts of time to pass between checkpoints.  That setting is
 currently capped at one hour; the proposition that someone who sets it
 to 5 minutes on a low-write-volume system is OK with the effective
 value being 5 months does not seem likely to me.

We discussed that in the original thread.

The purpose of a checkpoint is to reduce recovery time. Nobody
actually wants a checkpoint for its own sake, so why would we care how
many months it takes to reach one? I grant that it takes a little while
to come to terms with that thought, but that doesn't make the patch wrong.

The only risk of data loss is the case where someone deletes their
pg_xlog and hasn't taken a backup in all that time, which is hardly
recommended behaviour. We're at exactly the same risk of data loss if
someone deletes their pg_clog. Too-frequent checkpoints actually make
the data loss risk from a deleted pg_clog greater, so the balance of
data loss risk doesn't seem to have altered.

 Your suggestion of just checkpointing CLOG seems like it would
 mitigate the worst of the damage, but I think I agree with Tom that
 just reverting the whole patch would be a better solution, if we can't
 figure out anything cleaner.  It's better to have a few unnecessary
 checkpoints than to risk losing somebody's data, especially since the
 unnecessary checkpoints only happen with wal_level=hot_standby, but
 the data loss risk exists for everyone.

We're not at risk of losing anyone's data. There is no aspect of the
normal running system that is at a higher risk of data loss.

I don't think it's a fair comment to claim the patch has issues because
you've found a guy with very, very, very low write rates, no backup
strategy, and a propensity to delete essential parts of their database
who will be adversely affected.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding adjacent checkpoint records

2012-06-07 Thread Peter Geoghegan
On 7 June 2012 18:03, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Clearly, delaying checkpoint indefinitely would be a high risk choice.
 But they won't be delayed indefinitely, since changes cause WAL
 records to be written and that would soon cause another checkpoint.

 But that's exactly the problem - it might not be soon at all.  We have
 customers who process about one write transaction per day.  The
 current state of play in 9.2 is that we'll checkpoint when we get to
 the next WAL segment.  But at one write transaction per day, that
 could take weeks or months.  Someone invented a setting called
 'checkpoint_timeout' for a very good reason - people don't want huge
 amounts of time to pass between checkpoints.  That setting is
 currently capped at one hour; the proposition that someone who sets it
 to 5 minutes on a low-write-volume system is OK with the effective
 value being 5 months does not seem likely to me.

ISTM that this low-volume system accepts approximately the same risk
of data loss as a high volume system that writes the same volume of
data, but in 5 minutes rather than 5 months. You might argue that the
scope of things that can go wrong in 5 months is far greater than the
scope of 5 minutes. I might counter that the mechanical wear attendant
to constant writes is a more important factor, or that flash memory that
has gone through one too many P/E cycles becomes unreliable.

The bottom line is that once you've flushed WAL, you're by definition
home-free - if you're relying on a checkpoint to save you from data
loss, you're in big trouble anyway. Checkpoints are theoretically just
for putting some cap on recovery time. That's pretty much how they're
described in the literature.

Your customer's use-case seems very narrow, and your complaint seems
unusual to me, but couldn't you just get the customer to force
checkpoints in a cronjob or something? CheckPointStmt will force,
provided !RecoveryInProgress() .

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding adjacent checkpoint records

2012-06-07 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 On 7 June 2012 18:03, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Clearly, delaying checkpoint indefinitely would be a high risk choice.
 But they won't be delayed indefinitely, since changes cause WAL
 records to be written and that would soon cause another checkpoint.

 But that's exactly the problem - it might not be soon at all.

 Your customer's use-case seems very narrow, and your complaint seems
 unusual to me, but couldn't you just get the customer to force
 checkpoints in a cronjob or something? CheckPointStmt will force,
 provided !RecoveryInProgress() .

I think both you and Simon have completely missed the point.  This
is not a use case in the sense of someone doing it deliberately.
This is about data redundancy, ie, if you lose your WAL through some
unfortunate mishap, are you totally screwed or is there a reasonable
chance that the data is on-disk in the main data store?  I would guess
that the incidents Robert has been talking about were cases where EDB
were engaged to do crash recovery, and were successful precisely because
PG wasn't wholly dependent on the WAL copy of the data.

This project has always put reliability first.  It's not clear to me why
we would compromise that across-the-board in order to slightly reduce
idle load in replication configurations.  Yeah, it's probably not a
*large* compromise ... but it is a compromise, and one that doesn't
seem to me to be very well-advised.  We can fix the idle-load issue
without compromising on this basic goal; it will just take more than
a ten-line patch to do it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers