Re: [HACKERS] Posix Shared Mem patch

2012-07-04 Thread Robert Haas
On Tue, Jul 3, 2012 at 1:46 PM, Josh Kupershmidt schmi...@gmail.com wrote:
 On Tue, Jul 3, 2012 at 6:57 AM, Robert Haas robertmh...@gmail.com wrote:
 Here's a patch that attempts to begin the work of adjusting the
 documentation for this brave new world.  I am guessing that there may
 be other places in the documentation that also require updating, and
 this page probably needs more work, but it's a start.

 I think the boilerplate warnings in config.sgml about needing to raise
 the SysV parameters can go away; patch attached.

Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-07-03 Thread Robert Haas
On Thu, Jun 28, 2012 at 11:26 AM, Robert Haas robertmh...@gmail.com wrote:
 Assuming things go well, there are a number of follow-on things that
 we need to do to finish this up:

 1. Update the documentation.  I skipped this for now, because I think
 that what we write there is going to be heavily dependent on how
 portable this turns out to be, which we don't know yet.  Also, it's
 not exactly clear to me what the documentation should say if this does
 turn out to work everywhere.  Much of section 17.4 will become
 irrelevant to most users, but I doubt we'd just want to remove it; it
 could still matter for people running EXEC_BACKEND or running many
 postmasters on the same machine or, of course, people running on
 platforms where this just doesn't work, if there are any.

Here's a patch that attempts to begin the work of adjusting the
documentation for this brave new world.  I am guessing that there may
be other places in the documentation that also require updating, and
this page probably needs more work, but it's a start.

 2. Update the HINT messages when shared memory allocation fails.
 Maybe the new most-common-failure mode there will be too many
 postmasters running on the same machine?  We might need to wait for
 some field reports before adjusting this.

I think this is mostly a matter of removing the text that says "fix
this by reducing shmem-related parameters" from the relevant hint
messages.

 3. Consider adjusting the logic inside initdb.  If this works
 everywhere, the code for determining how to set shared_buffers should
 become pretty much irrelevant.  Even if it only works some places, we
 could add 64MB or 128MB or whatever to the list of values we probe, so
 that people won't get quite such a sucky configuration out of the box.
  Of course there's no number here that will be good for everyone.

I posted a patch for this one last night.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


shmem-docs.patch
Description: Binary data



Re: [HACKERS] Posix Shared Mem patch

2012-07-03 Thread Andres Freund
On Wednesday, June 27, 2012 05:28:14 AM Robert Haas wrote:
 On Tue, Jun 26, 2012 at 6:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Josh Berkus j...@agliodbs.com writes:
  So let's fix the 80% case with something we feel confident in, and then
  revisit the no-sysv interlock as a separate patch.  That way if we can't
  fix the interlock issues, we still have a reduced-shmem version of
  Postgres.
  
  Yes.  Insisting that we have the whole change in one patch is a good way
  to prevent any forward progress from happening.  As Alvaro noted, there
  are plenty of issues to resolve without trying to change the interlock
  mechanism at the same time.
 
 So, here's a patch.  Instead of using POSIX shmem, I just took the
 expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
 memory.  The sysv shm is still allocated, but it's just a copy of
 PGShmemHeader; the real shared memory is the anonymous block.  This
 won't work if EXEC_BACKEND is defined so it just falls back on
 straight sysv shm in that case.
 
 There are obviously some portability issues here - this is documented
 not to work on Linux <= 2.4, but it's not clear whether it fails with
 some suitable error code or just pretends to work and does the wrong
 thing.  I tested that it does compile and work on both Linux 3.2.6 and
 MacOS X 10.6.8.  And the comments probably need work and... who knows
 what else is wrong.  But, thoughts?

Btw, RhodiumToad/Andrew Gierth on IRC talked about a reason why SysV shared
memory might be advantageous on some platforms. E.g. on FreeBSD there is the
kern.ipc.shm_use_phys setting, which prevents paging out shared memory and also
seems to make TLB translation cheaper. There does not seem to be an
equivalent for anonymous mmap.
So maybe we should make that a config option?

Greetings,

Andres
-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Posix Shared Mem patch

2012-07-03 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 Btw, RhodiumToad/Andrew Gierth on irc talked about a reason why sysv shared 
 memory might be advantageous on some platforms. E.g. on freebsd there is the 
 kern.ipc.shm_use_phys setting which prevents paging out shared memory and also
 seems to make tlb translation cheaper. There does not seem to exist an
 alternative for anonymous mmap.

Isn't that mlock()?

 So maybe we should make that a config option? 

I'd really rather not.  If we're going to go in this direction, we
should just go there.

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-07-03 Thread Robert Haas
On Tue, Jul 3, 2012 at 11:36 AM, Andres Freund and...@2ndquadrant.com wrote:
 Btw, RhodiumToad/Andrew Gierth on irc talked about a reason why sysv shared
 memory might be advantageous on some platforms. E.g. on freebsd there is the
 kern.ipc.shm_use_phys setting which prevents paging out shared memory and also
 seems to make tlb translation cheaper. There does not seem to exist an
 alternative for anonymous mmap.
 So maybe we should make that a config option?

Yeah, I was noticing some notes to that effect in the documentation
this morning.  I think the alternative for anonymous mmap is mlock().
However, that can hit kernel limits of its own.  I'm not sure what the
best thing to do about this is.  I think most users will want mlock...
but maybe not all?  So we end up with one option for whether to use
mlock and another for whether to use more or less System V shm?
Sounds confusing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Posix Shared Mem patch

2012-07-03 Thread Magnus Hagander
On Tue, Jul 3, 2012 at 5:36 PM, Andres Freund and...@2ndquadrant.com wrote:
 On Wednesday, June 27, 2012 05:28:14 AM Robert Haas wrote:
 On Tue, Jun 26, 2012 at 6:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Josh Berkus j...@agliodbs.com writes:
  So let's fix the 80% case with something we feel confident in, and then
  revisit the no-sysv interlock as a separate patch.  That way if we can't
  fix the interlock issues, we still have a reduced-shmem version of
  Postgres.
 
  Yes.  Insisting that we have the whole change in one patch is a good way
  to prevent any forward progress from happening.  As Alvaro noted, there
  are plenty of issues to resolve without trying to change the interlock
  mechanism at the same time.

 So, here's a patch.  Instead of using POSIX shmem, I just took the
 expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
 memory.  The sysv shm is still allocated, but it's just a copy of
 PGShmemHeader; the real shared memory is the anonymous block.  This
 won't work if EXEC_BACKEND is defined so it just falls back on
 straight sysv shm in that case.

 There are obviously some portability issues here - this is documented
 not to work on Linux <= 2.4, but it's not clear whether it fails with
 some suitable error code or just pretends to work and does the wrong
 thing.  I tested that it does compile and work on both Linux 3.2.6 and
 MacOS X 10.6.8.  And the comments probably need work and... who knows
 what else is wrong.  But, thoughts?
 Btw, RhodiumToad/Andrew Gierth on irc talked about a reason why sysv shared
 memory might be advantageous on some platforms. E.g. on freebsd there is the
 kern.ipc.shm_use_phys setting which prevents paging out shared memory and also
 seems to make tlb translation cheaper. There does not seem to exist an
 alternative for anonymous mmap.
 So maybe we should make that a config option?

Interesting to see that FreeBSD does this - while at the same time
refusing to fix the use of sysv shared memory under their own jails
system (afaik, at least). They seem to be quite undecided on if it's a
feature to remove or a feature to expand on :O Not sure I'd trust that
to stick around...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Posix Shared Mem patch

2012-07-03 Thread Andres Freund
On Tuesday, July 03, 2012 05:41:09 PM Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  Btw, RhodiumToad/Andrew Gierth on irc talked about a reason why sysv
  shared memory might be advantageous on some platforms. E.g. on freebsd
  there is the kern.ipc.shm_use_phys setting which prevents paging out
  shared memory and also seems to make tlb translation cheaper. There does
  not seem to exist an alternative for anonymous mmap.
 Isn't that mlock()?
Similar at least, yes. I think it might also make the virtual/physical
translation more direct, but that is just the impression from a very short
search.

  So maybe we should make that a config option?
 I'd really rather not.  If we're going to go in this direction, we
 should just go there.
I don't really care, just wanted to bring up that at least one experienced 
user would be disappointed ;). As the old implementation needs to stay around 
for EXEC_BACKEND anyway, the price doesn't seem to be too high.

Andres
-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Posix Shared Mem patch

2012-07-03 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On Tuesday, July 03, 2012 05:41:09 PM Tom Lane wrote:
 I'd really rather not.  If we're going to go in this direction, we
 should just go there.

 I don't really care, just wanted to bring up that at least one experienced 
 user would be disappointed ;). As the old implementation needs to stay around
 for EXEC_BACKEND anyway, the price doesn't seem to be too high.

Well, my feeling is that sooner or later, perhaps sooner, we are going
to want to be out from under SysV shmem (and semaphores) entirely.
The Linux kernel guys keep threatening to drop support for the feature.
So I think that exposing any knobs about this, or encouraging people
to rely on corner-case properties of SysV on their platform, is just
going to create more pain when we have to pull the plug.

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-07-03 Thread Josh Kupershmidt
On Tue, Jul 3, 2012 at 6:57 AM, Robert Haas robertmh...@gmail.com wrote:
 Here's a patch that attempts to begin the work of adjusting the
 documentation for this brave new world.  I am guessing that there may
 be other places in the documentation that also require updating, and
 this page probably needs more work, but it's a start.

I think the boilerplate warnings in config.sgml about needing to raise
the SysV parameters can go away; patch attached.

Josh


config.sgml.diff
Description: Binary data



Re: [HACKERS] Posix Shared Mem patch

2012-07-02 Thread Bruce Momjian
On Fri, Jun 29, 2012 at 04:03:40PM -0700, Daniel Farina wrote:
 On Fri, Jun 29, 2012 at 1:00 PM, Merlin Moncure mmonc...@gmail.com wrote:
  On Fri, Jun 29, 2012 at 2:52 PM, Andres Freund and...@2ndquadrant.com wrote:
  Hi All,
 
  In a *very* quick patch I tested using huge pages/MAP_HUGETLB for the mmap'ed
  memory.
  That gives around 9.5% performance benefit in a read-only pgbench run (-n -S
  -j 64 -c 64 -T 10 -M prepared, scale 200, 6GB s_b, 8 cores, 24GB mem).
 
  It also saves a bunch of memory per process due to the smaller page table
  (shared_buffers 6GB):
  cat /proc/$pid_of_pg_backend/status |grep VmPTE
  VmPTE:      6252 kB
  vs
  VmPTE:        60 kB
  ... those results are just spectacular (IMO). nice!
 
 That is super awesome.  Smallish databases with a high number of
 connections actually spend a considerable fraction of their
 otherwise-available-for-buffer-cache space on page tables in common
 cases currently.

I thought newer Linux kernels did huge pages automatically?  What Linux
kernel is this?

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Posix Shared Mem patch

2012-07-02 Thread Robert Haas
On Fri, Jun 29, 2012 at 2:31 PM, Josh Berkus j...@agliodbs.com wrote:
 My idea of "not dedicated" is "I can launch a dozen postmasters on this
 machine, and other services too, and it'll be okay as long as they're
 not doing too much".

 Oh, 128MB then?

Proposed patch attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


initdb-128MB.patch
Description: Binary data



Re: [HACKERS] Posix Shared Mem patch

2012-06-29 Thread Josh Berkus

 According to the Google, there is absolutely no way of getting MacOS X
 not to overcommit like crazy.  

Well, this is one of a long list of broken things about OSX.  If you
want to see *real* breakage, do some IO performance testing of HFS+

FWIW, I have this issue with Mac desktop applications on my MacBook,
which will happily memory leak until I run out of swap space.

 You can read the amount of system
 memory by using sysctl() to fetch hw.memsize, but it's not really
 clear how much that helps.  We could refuse to start up if the shared
 memory allocation is >= hw.memsize, but even an amount slightly less
 than that seems like enough to send the machine into a tailspin, so
 I'm not sure that really gets us anywhere.

I still think it would help.  User errors in allocating shmem are more
likely to be order-of-magnitude errors ("I meant 500MB, not 500GB!")
than a matter of going 20% of RAM over.
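
A rough sketch of the order-of-magnitude check discussed here. The function names are hypothetical. On macOS it reads hw.memsize through sysctlbyname(); elsewhere it falls back to sysconf(_SC_PHYS_PAGES), which is an assumption of this sketch: it is a common extension but is not required by POSIX, so the helper returns 0 ("unknown") where it is missing.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Best-effort physical memory size in bytes, or 0 if it is unknown. */
static uint64_t
physical_memory_bytes(void)
{
#ifdef __APPLE__
    uint64_t memsize = 0;
    size_t   len = sizeof(memsize);

    if (sysctlbyname("hw.memsize", &memsize, &len, NULL, 0) != 0)
        return 0;
    return memsize;
#elif defined(_SC_PHYS_PAGES) && defined(_SC_PAGE_SIZE)
    long pages = sysconf(_SC_PHYS_PAGES);
    long pagesize = sysconf(_SC_PAGE_SIZE);

    if (pages <= 0 || pagesize <= 0)
        return 0;
    return (uint64_t) pages * (uint64_t) pagesize;
#else
    return 0;
#endif
}

/*
 * Returns 1 if the request is at least as large as physical memory,
 * 0 if it is smaller or if physical memory could not be determined.
 */
static int
shmem_request_is_absurd(uint64_t request_bytes)
{
    uint64_t phys = physical_memory_bytes();

    return phys != 0 && request_bytes >= phys;
}
```

This only catches the "500GB instead of 500MB" class of mistake; a request slightly below physical memory still passes, which is the limitation pointed out in the quoted text.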

 One idea I had was to LOG the size of the shared memory allocation
 just before allocating it.  That way, if your system goes into the
 tank, there will at least be something in the log.  But that would be
 useless chatter for most users.

Yes, but it would allow quick answers on the mailing list, IRC and StackExchange.

"I started up PostgreSQL and my MacBook crashed."

"Find the file postgres.log.  What's the last 10 lines?"

So neither of those things *fixes* the problem ... ultimately, it's
Apple's problem and we can't fix it ... but both of them make it
somewhat better.

The other thing which will avoid the problem for most Mac users is if we
simply allocate 10% of RAM at initdb as a default.  If we do that, then
90% of users will never touch Shmem themselves, and not have the
opportunity to mess up.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com





Re: [HACKERS] Posix Shared Mem patch

2012-06-29 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 The other thing which will avoid the problem for most Mac users is if we
 simply allocate 10% of RAM at initdb as a default.  If we do that, then
 90% of users will never touch Shmem themselves, and not have the
 opportunity to mess up.

If we could do that on *all* platforms, I might be for it, but we only
know how to get that number on some platforms.  There's also the issue
of whether we really want to assume that the machine is dedicated to
Postgres, which IMO is an implicit assumption of any default that scales
itself to physical RAM.

For the moment I think we should just allow initdb to scale up a little
bit more than where it is now, perhaps 128MB instead of 32.
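
As an illustration of the "scale up a little more" idea, here is a sketch of probing candidate sizes from largest to smallest. Everything in it is hypothetical: the candidate list, the function names, and the can_start callback, which stands in for however initdb actually tests a setting.

```c
#include <stddef.h>

/* Hypothetical predicate: can the server start with this many MB? */
typedef int (*can_start_fn) (int shared_buffers_mb);

/*
 * Try candidate shared_buffers settings from largest to smallest and
 * return the first one that works, or 0 if none does. The list is
 * illustrative only, with 128MB placed on top as suggested above.
 */
static int
choose_shared_buffers_mb(can_start_fn can_start)
{
    static const int candidates_mb[] = {128, 64, 32, 16, 8, 4, 2, 1};
    size_t n = sizeof(candidates_mb) / sizeof(candidates_mb[0]);

    for (size_t i = 0; i < n; i++)
    {
        if (can_start(candidates_mb[i]))
            return candidates_mb[i];
    }
    return 0;
}

/* Demonstration predicate: pretend the platform allows at most 40MB. */
static int
limit_40mb(int shared_buffers_mb)
{
    return shared_buffers_mb <= 40;
}
```

With the demonstration predicate, choose_shared_buffers_mb(limit_40mb) settles on 32, the largest candidate that fits under the pretend limit.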

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-29 Thread Josh Berkus
Tom,

 If we could do that on *all* platforms, I might be for it, but we only
 know how to get that number on some platforms. 

I don't see what's wrong with using it where we can get it, and not
using it where we can't.

  There's also the issue
 of whether we really want to assume that the machine is dedicated to
 Postgres, which IMO is an implicit assumption of any default that scales
 itself to physical RAM.

10% isn't assuming dedicated.  Assuming dedicated would be 20% or 25%.

I was thinking 10%, with a ceiling of 512MB.
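
The proposed rule is simple arithmetic; a small sketch follows. The function name is hypothetical, and the 128MB fallback for an unknown RAM size is an assumption borrowed from the other figure floated in this thread, not part of the proposal itself.

```c
#include <stdint.h>

#define MB ((uint64_t) 1024 * 1024)

/*
 * 10% of physical RAM, capped at 512MB. A RAM size of 0 means
 * "unknown" and yields a fixed 128MB (an assumption of this sketch).
 */
static uint64_t
default_shared_buffers_bytes(uint64_t physical_ram_bytes)
{
    uint64_t ten_percent;

    if (physical_ram_bytes == 0)
        return 128 * MB;

    ten_percent = physical_ram_bytes / 10;
    return ten_percent < 512 * MB ? ten_percent : 512 * MB;
}
```

For example, a 2GB machine gets about 205MB, while the 24GB machine from the benchmark elsewhere in the thread hits the 512MB ceiling.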

 For the moment I think we should just allow initdb to scale up a little
 bit more than where it is now, perhaps 128MB instead of 32.

I wouldn't be opposed to that.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com





Re: [HACKERS] Posix Shared Mem patch

2012-06-29 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 If we could do that on *all* platforms, I might be for it, but we only
 know how to get that number on some platforms. 

 I don't see what's wrong with using it where we can get it, and not
 using it where we can't.

Because then we still need to define, and document, a sensible behavior
on the machines where we can't get it.  And document that we do it two
different ways, and document which machines we do it which way on.

 There's also the issue
 of whether we really want to assume that the machine is dedicated to
 Postgres, which IMO is an implicit assumption of any default that scales
 itself to physical RAM.

 10% isn't assuming dedicated.

Really?

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-29 Thread Josh Berkus

 10% isn't assuming dedicated.
 
 Really?

Yes.  As I said, the allocation for dedicated PostgreSQL servers is
usually 20% to 25%, up to 8GB.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com





Re: [HACKERS] Posix Shared Mem patch

2012-06-29 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 10% isn't assuming dedicated.

 Really?

 Yes.  As I said, the allocation for dedicated PostgreSQL servers is
 usually 20% to 25%, up to 8GB.

Any percentage is assuming dedicated, IMO.  25% might be the more common
number, but you're still assuming that you can have your pick of the
machine's resources.

My idea of "not dedicated" is "I can launch a dozen postmasters on this
machine, and other services too, and it'll be okay as long as they're
not doing too much".

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-29 Thread Josh Berkus

 My idea of "not dedicated" is "I can launch a dozen postmasters on this
 machine, and other services too, and it'll be okay as long as they're
 not doing too much".

Oh, 128MB then?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com





Re: [HACKERS] Posix Shared Mem patch

2012-06-29 Thread Andres Freund
Hi All,

In a *very* quick patch I tested using huge pages/MAP_HUGETLB for the mmap'ed 
memory.
That gives around 9.5% performance benefit in a read-only pgbench run (-n -S
-j 64 -c 64 -T 10 -M prepared, scale 200, 6GB s_b, 8 cores, 24GB mem).

It also saves a bunch of memory per process due to the smaller page table 
(shared_buffers 6GB):
cat /proc/$pid_of_pg_backend/status |grep VmPTE
VmPTE:      6252 kB
vs
VmPTE:        60 kB

Additionally it has the advantage that top/ps/... output under linux now looks 
like:
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
10603 andres    20   0 6381m 4924 1952 R    21  0.0   0:28.04 postgres

i.e. RES now actually shows something usable... Which is rather nice imo.

I don't have the time atm to make this into something usable; maybe somebody
else wants to pick it up? It looks well worth investing some time.

Because of the required setup we sure cannot make this the default but...

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index e040400..05bbdf6 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -54,7 +54,7 @@ typedef int IpcMemoryId;		/* shared memory ID returned by shmget(2) */
 #define MAP_HASSEMAPHORE		0
 #endif
 
-#define	PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define	PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE|MAP_HUGETLB)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
@@ -407,6 +407,10 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port)
 	{
 		long	pagesize = sysconf(_SC_PAGE_SIZE);
 
+		/* round up to hugetlb size on x86-64 linux */
+		if(pagesize < (1024*2048))
+			pagesize = 1024*2048;
+
 		/*
 		 * Ensure request size is a multiple of pagesize.
 		 *



Re: [HACKERS] Posix Shared Mem patch

2012-06-29 Thread Merlin Moncure
On Fri, Jun 29, 2012 at 2:52 PM, Andres Freund and...@2ndquadrant.com wrote:
 Hi All,

 In a *very* quick patch I tested using huge pages/MAP_HUGETLB for the mmap'ed
 memory.
 That gives around 9.5% performance benefit in a read-only pgbench run (-n -S
 -j 64 -c 64 -T 10 -M prepared, scale 200, 6GB s_b, 8 cores, 24GB mem).

 It also saves a bunch of memory per process due to the smaller page table
 (shared_buffers 6GB):
 cat /proc/$pid_of_pg_backend/status |grep VmPTE
 VmPTE:      6252 kB
 vs
 VmPTE:        60 kB

 Additionally it has the advantage that top/ps/... output under linux now looks
 like:
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 10603 andres    20   0 6381m 4924 1952 R    21  0.0   0:28.04 postgres

 i.e. RES now actually shows something usable... Which is rather nice imo.

 I don't have the time atm into making this something useable, maybe somebody
 else want to pick it up? Looks pretty worthwile investing some time.

 Because of the required setup we sure cannot make this the default but...

... those results are just spectacular (IMO). nice!

merlin



Re: [HACKERS] Posix Shared Mem patch

2012-06-29 Thread Daniel Farina
On Fri, Jun 29, 2012 at 1:00 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Fri, Jun 29, 2012 at 2:52 PM, Andres Freund and...@2ndquadrant.com wrote:
 Hi All,

 In a *very* quick patch I tested using huge pages/MAP_HUGETLB for the mmap'ed
 memory.
 That gives around 9.5% performance benefit in a read-only pgbench run (-n -S
 -j 64 -c 64 -T 10 -M prepared, scale 200, 6GB s_b, 8 cores, 24GB mem).

 It also saves a bunch of memory per process due to the smaller page table
 (shared_buffers 6GB):
 cat /proc/$pid_of_pg_backend/status |grep VmPTE
 VmPTE:      6252 kB
 vs
 VmPTE:        60 kB
 ... those results are just spectacular (IMO). nice!

That is super awesome.  Smallish databases with a high number of
connections actually spend a considerable fraction of their
otherwise-available-for-buffer-cache space on page tables in common
cases currently.

-- 
fdr



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Magnus Hagander
On Thu, Jun 28, 2012 at 7:00 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Would Posix shmem help with that at all?  Why did you choose not to
 use the Posix API, anyway?

 It seemed more complicated.  If we use the POSIX API, we've got to
 have code to find a non-colliding name for the shm, and we've got to
 arrange to clean it up at process exit.  Anonymous shm doesn't require
 a name and goes away automatically when it's no longer in use.

 I see.  Those are pretty good reasons ...

 So, should we do it this way?

 I did a little research and discovered that Linux 2.3.51 (released
 3/11/2000) apparently returns EINVAL for MAP_SHARED|MAP_ANONYMOUS.
 That combination is documented to work beginning in Linux 2.4.0.  How
 worried should we be about people trying to run PostgreSQL 9.3 on
 pre-2.4 kernels?  If we want to worry about it, we could try mapping a
 one-page shared MAP_SHARED|MAP_ANONYMOUS segment first.  If that
 works, we could assume that we have a working MAP_SHARED|MAP_ANONYMOUS
 facility and try to allocate the whole segment plus a minimal sysv
 shm.  If the single page allocation fails with EINVAL, we could fall
 back to allocating the entire segment as sysv shm.

Do we really need a runtime check for that? Isn't a configure check
enough? If they *do* deploy postgresql 9.3 on something that old,
they're building from source anyway...


 A related question is - if we do this - should we enable it only on
 ports where we've verified that it works, or should we just turn it on
 everywhere and fix breakage if/when it's reported?  I lean toward the
 latter.

Depends on the amount of expected breakage, but I'd lean towards the
latter as well.


 If we find that there are platforms where (a) mmap is not supported or
 (b) MAP_SHARED|MAP_ANON works but has the wrong semantics, we could
 either shut off this optimization on those platforms by fiat, or we
 could test not only that the call succeeds, but that it works
 properly: create a one-page mapping and fork a child process; in the
 child, write to the mapping and exit; in the parent, wait for the
 child to exit and then test that we can read back the correct
 contents.  This would protect against a hypothetical system where the
 flags are accepted but fail to produce the correct behavior.  I'm
 inclined to think this is over-engineering in the absence of evidence
 that there are platforms that work this way.

Could we actually turn *that* into a configure test, or will that be
too complex?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Robert Haas
On Thu, Jun 28, 2012 at 7:05 AM, Magnus Hagander mag...@hagander.net wrote:
 Do we really need a runtime check for that? Isn't a configure check
 enough? If they *do* deploy postgresql 9.3 on something that old,
 they're building from source anyway...
[...]

 Could we actually turn *that* into a configure test, or will that be
 too complex?

I don't see why we *couldn't* make either of those things into a
configure test, but it seems more complicated than a runtime test and
less accurate, so I guess I'd be in favor of doing them at runtime or
not at all.

Actually, the try-a-one-page-mapping-and-see-if-you-get-EINVAL test is
so simple that I really can't see any reason not to insert that
defense.  The fork-and-check-whether-it-really-works test is probably
excess paranoia until we determine whether that's really a danger
anywhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Jon Nelson
On Thu, Jun 28, 2012 at 6:05 AM, Magnus Hagander mag...@hagander.net wrote:
 On Thu, Jun 28, 2012 at 7:00 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Would Posix shmem help with that at all?  Why did you choose not to
 use the Posix API, anyway?

 It seemed more complicated.  If we use the POSIX API, we've got to
 have code to find a non-colliding name for the shm, and we've got to
 arrange to clean it up at process exit.  Anonymous shm doesn't require
 a name and goes away automatically when it's no longer in use.

 I see.  Those are pretty good reasons ...

 So, should we do it this way?

 I did a little research and discovered that Linux 2.3.51 (released
 3/11/2000) apparently returns EINVAL for MAP_SHARED|MAP_ANONYMOUS.
 That combination is documented to work beginning in Linux 2.4.0.  How
 worried should we be about people trying to run PostgreSQL 9.3 on
 pre-2.4 kernels?  If we want to worry about it, we could try mapping a
 one-page shared MAP_SHARED|MAP_ANONYMOUS segment first.  If that
 works, we could assume that we have a working MAP_SHARED|MAP_ANONYMOUS
 facility and try to allocate the whole segment plus a minimal sysv
 shm.  If the single page allocation fails with EINVAL, we could fall
 back to allocating the entire segment as sysv shm.

Why not just mmap /dev/zero (MAP_SHARED but not MAP_ANONYMOUS)?  I
seem to think that's what I did when I needed this functionality oh so
many moons ago.

-- 
Jon



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Robert Haas
On Thu, Jun 28, 2012 at 9:47 AM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 Why not just mmap /dev/zero (MAP_SHARED but not MAP_ANONYMOUS)?  I
 seem to think that's what I did when I needed this functionality oh so
 many moons ago.

From the reading I've done on this topic, that seems to be a trick
invented on Solaris that is considered grotty and awful by everyone
else.  The thing is that you want the mapping to be shared with the
processes that inherit the mapping from you.  You do *NOT* want the
mapping to be shared with EVERYONE who has mapped that file for any
reason, which is the usual meaning of MAP_SHARED on a file.  Maybe
this happens to work correctly on some or all platforms, but I would
want to have some convincing evidence that it's more widely supported
(with the correct semantics) than MAP_ANON before relying on it.
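
For illustration only: a minimal Python sketch of the semantics described
above, where the mapping is shared only with processes that inherit it across
fork(). It assumes a Unix-like system on which Python's mmap module exposes
MAP_SHARED and MAP_ANONYMOUS; the function name is made up for this example,
and PostgreSQL itself does the equivalent in C.

```python
import mmap
import os


def shared_anonymous_roundtrip(size=mmap.PAGESIZE):
    """Map anonymous shared memory, fork, and let the child write into it."""
    # fd -1 plus MAP_ANONYMOUS: there is no file and no name behind the mapping.
    region = mmap.mmap(-1, size, flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS)
    pid = os.fork()
    if pid == 0:
        # Child: the mapping was inherited across fork(), so this write
        # lands in the same pages the parent sees.
        region[:5] = b"hello"
        os._exit(0)
    os.waitpid(pid, 0)
    data = bytes(region[:5])
    # No name to unlink: the memory goes away once the last mapping is gone.
    region.close()
    return data
```

A process that did not inherit the mapping has no handle by which to attach
to it, which is the property being asked for here.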

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 On Thu, Jun 28, 2012 at 7:00 AM, Robert Haas robertmh...@gmail.com wrote:
 A related question is - if we do this - should we enable it only on
 ports where we've verified that it works, or should we just turn it on
 everywhere and fix breakage if/when it's reported?  I lean toward the
 latter.

 Depends on the amount of expected breakage, but I'd lean towards the
 latter as well.

If we don't turn it on, we won't find out whether it works.  I'd say try
it first and then back off if that proves necessary.  I'd just as soon
not see us write any fallback logic without evidence that it's needed.

FWIW, even my pet dinosaur HP-UX 10.20 box appears to support
mmap(MAP_SHARED|MAP_ANONYMOUS) --- at least the mmap man page documents
both flags.  I find it really pretty hard to believe that there are any
machines out there that haven't got this and yet might be expected to
run PG 9.3+.  We should not go into it with an expectation of failure,
anyway.

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Jon Nelson
On Thu, Jun 28, 2012 at 8:57 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jun 28, 2012 at 9:47 AM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 Why not just mmap /dev/zero (MAP_SHARED but not MAP_ANONYMOUS)?  I
 seem to think that's what I did when I needed this functionality oh so
 many moons ago.

 From the reading I've done on this topic, that seems to be a trick
 invented on Solaris that is considered grotty and awful by everyone
 else.  The thing is that you want the mapping to be shared with the
 processes that inherit the mapping from you.  You do *NOT* want the
 mapping to be shared with EVERYONE who has mapped that file for any
 reason, which is the usual meaning of MAP_SHARED on a file.  Maybe
 this happens to work correctly on some or all platforms, but I would
 want to have some convincing evidence that it's more widely supported
 (with the correct semantics) than MAP_ANON before relying on it.

When I did this (I admit, it was on Linux but it was a long time ago)
only the inherited file descriptor + mmap structure mattered -
modifications were private to the process and its children - other
apps always saw their own /dev/zero. A quick google suggests that -
according to qnx, sco, and some others - mmap'ing /dev/zero retains
the expected privacy. Given how /dev/zero works I'd be very surprised
if it was otherwise.
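
For illustration only: a small Python sketch of the behaviour claimed above,
assuming Linux semantics for a MAP_SHARED mapping of /dev/zero. Other
platforms may differ, which is exactly the open question in this thread. The
function name is hypothetical.

```python
import mmap
import os


def dev_zero_roundtrip(size=mmap.PAGESIZE):
    """Share a /dev/zero mapping with a forked child, then map it afresh."""
    fd = os.open("/dev/zero", os.O_RDWR)
    try:
        # MAP_SHARED without MAP_ANONYMOUS, backed by /dev/zero.
        inherited = mmap.mmap(fd, size, flags=mmap.MAP_SHARED)
        pid = os.fork()
        if pid == 0:
            inherited[:5] = b"hello"   # child writes through the inherited mapping
            os._exit(0)
        os.waitpid(pid, 0)
        seen_by_parent = bytes(inherited[:5])
        # A second, independent mapping of the same device.
        fresh = mmap.mmap(fd, size, flags=mmap.MAP_SHARED)
        seen_by_fresh = bytes(fresh[:5])
        fresh.close()
        inherited.close()
    finally:
        os.close(fd)
    return seen_by_parent, seen_by_fresh
```

On the Linux systems this sketch assumes, the parent sees the child's write
while the fresh mapping still reads as zeros.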

I would love to see links that suggest that /dev/zero is nasty (or, in
fact, in any way fundamentally different than mmap'ing /dev/zero) -
feel free to send them to me privately to avoid polluting the list.

-- 
Jon



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Tom Lane
... btw, I rather imagine that Robert has already noticed this, but OS X
(and presumably other BSDen) spells the flag MAP_ANON not
MAP_ANONYMOUS.  I also find this rather interesting flag there:

     MAP_HASSEMAPHORE  Notify the kernel that the region may contain sema-
                       phores and that special handling may be necessary.

By semaphore I suspect they mean spinlock, so we'd better turn this
flag on where it exists.
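
For illustration only: one way to cope with both spellings and with an
optional flag, sketched in Python. In C this would presumably be a
preprocessor check instead. The assumption here is that Python's mmap module
exposes MAP_ANONYMOUS or MAP_ANON but may not expose MAP_HASSEMAPHORE at all,
in which case the sketch simply contributes no extra bit. The function name
is hypothetical.

```python
import mmap


def anonymous_shared_flags():
    """Build mmap flags for anonymous shared memory across flag spellings."""
    # Prefer MAP_ANONYMOUS; fall back to the BSD spelling MAP_ANON.
    anon = getattr(mmap, "MAP_ANONYMOUS", None)
    if anon is None:
        anon = getattr(mmap, "MAP_ANON")
    # Assumption: MAP_HASSEMAPHORE is looked up by name and treated as 0
    # (no effect) on platforms or Python builds that do not define it.
    has_semaphore = getattr(mmap, "MAP_HASSEMAPHORE", 0)
    return mmap.MAP_SHARED | anon | has_semaphore
```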

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Robert Haas
On Thu, Jun 28, 2012 at 10:11 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 ... btw, I rather imagine that Robert has already noticed this, but OS X
 (and presumably other BSDen) spells the flag MAP_ANON not
 MAP_ANONYMOUS.  I also find this rather interesting flag there:

     MAP_HASSEMAPHORE  Notify the kernel that the region may contain sema-
                       phores and that special handling may be necessary.

 By semaphore I suspect they mean spinlock, so we'd better turn this
 flag on where it exists.

Sounds fine to me.  Since no one seems opposed to the basic approach,
and everyone (I assume) will be happier to reduce the impact of
dealing with shared memory limits, I went ahead and committed a
cleaned-up version of the previous patch.  Let's see what the
build-farm thinks.

Assuming things go well, there are a number of follow-on things that
we need to do to finish this up:

1. Update the documentation.  I skipped this for now, because I think
that what we write there is going to be heavily dependent on how
portable this turns out to be, which we don't know yet.  Also, it's
not exactly clear to me what the documentation should say if this does
turn out to work everywhere.  Much of section 17.4 will become
irrelevant to most users, but I doubt we'd just want to remove it; it
could still matter for people running EXEC_BACKEND or running many
postmasters on the same machine or, of course, people running on
platforms where this just doesn't work, if there are any.

2. Update the HINT messages when shared memory allocation fails.
Maybe the new most-common-failure mode there will be too many
postmasters running on the same machine?  We might need to wait for
some field reports before adjusting this.

3. Consider adjusting the logic inside initdb.  If this works
everywhere, the code for determining how to set shared_buffers should
become pretty much irrelevant.  Even if it only works some places, we
could add 64MB or 128MB or whatever to the list of values we probe, so
that people won't get quite such a sucky configuration out of the box.
Of course there's no number here that will be good for everyone.

and of course

4. Fix any platforms that are now horribly broken.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Thom Brown
On 28 June 2012 16:26, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jun 28, 2012 at 10:11 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 ... btw, I rather imagine that Robert has already noticed this, but OS X
 (and presumably other BSDen) spells the flag MAP_ANON not
 MAP_ANONYMOUS.  I also find this rather interesting flag there:

     MAP_HASSEMAPHORE  Notify the kernel that the region may contain sema-
                       phores and that special handling may be necessary.

 By semaphore I suspect they mean spinlock, so we'd better turn this
 flag on where it exists.

 Sounds fine to me.  Since no one seems opposed to the basic approach,
 and everyone (I assume) will be happier to reduce the impact of
 dealing with shared memory limits, I went ahead and committed a
 cleaned-up version of the previous patch.  Let's see what the
 build-farm thinks.

 Assuming things go well, there are a number of follow-on things that
 we need to do to finish this up:

 1. Update the documentation.  I skipped this for now, because I think
 that what we write there is going to be heavily dependent on how
 portable this turns out to be, which we don't know yet.  Also, it's
 not exactly clear to me what the documentation should say if this does
 turn out to work everywhere.  Much of section 17.4 will become
 irrelevant to most users, but I doubt we'd just want to remove it; it
 could still matter for people running EXEC_BACKEND or running many
 postmasters on the same machine or, of course, people running on
 platforms where this just doesn't work, if there are any.

 2. Update the HINT messages when shared memory allocation fails.
 Maybe the new most-common-failure mode there will be too many
 postmasters running on the same machine?  We might need to wait for
 some field reports before adjusting this.

 3. Consider adjusting the logic inside initdb.  If this works
 everywhere, the code for determining how to set shared_buffers should
 become pretty much irrelevant.  Even if it only works some places, we
 could add 64MB or 128MB or whatever to the list of values we probe, so
 that people won't get quite such a sucky configuration out of the box.
  Of course there's no number here that will be good for everyone.

 and of course

 4. Fix any platforms that are now horribly broken.

On 64-bit Linux, if I allocate more shared buffers than the system is
capable of reserving, it doesn't start.  This is expected, but there's
no error logged anywhere (actually, nothing logged at all), and the
postmaster.pid file is left behind after this failure.

-- 
Thom



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Jeff Janes
On Thu, Jun 28, 2012 at 8:26 AM, Robert Haas robertmh...@gmail.com wrote:

 3. Consider adjusting the logic inside initdb.  If this works
 everywhere, the code for determining how to set shared_buffers should
 become pretty much irrelevant.  Even if it only works some places, we
 could add 64MB or 128MB or whatever to the list of values we probe, so
 that people won't get quite such a sucky configuration out of the box.
  Of course there's no number here that will be good for everyone.

This seems independent of the type of shared memory used and the
limits on it.  If it tried 64MB or 128MB and discovered that it
couldn't obtain that much shared memory, it automatically climbs down
to smaller values until it finds one that works.  I think the
impediment to adopting larger defaults is not what happens if it can't
get that much shared memory, but rather what happens if the machine
doesn't have that much physical memory.  The test server will still
start (and so there will be no climb-down), leaving a default which is
valid but just has horrid performance.
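
For illustration only: the climb-down described above, reduced to a few
lines of Python. The candidate list and the can_start callback are
hypothetical stand-ins for initdb's test-server probe, not its actual code.

```python
def choose_shared_buffers(candidates_mb, can_start):
    """Return the largest candidate (in MB) for which the test server starts."""
    for size_mb in sorted(candidates_mb, reverse=True):
        if can_start(size_mb):
            return size_mb
    return None


# Pretend the kernel refuses anything above 32MB of shared memory.
picked = choose_shared_buffers([128, 64, 32, 16], lambda mb: mb <= 32)
```

The probe only learns whether the allocation succeeded. If the kernel grants
128MB on a machine with less physical memory than that, can_start still
returns true and nothing climbs down, which is the concern raised above.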

Cheers,

Jeff



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Robert Haas
On Thu, Jun 28, 2012 at 12:13 PM, Thom Brown t...@linux.com wrote:
 On 64-bit Linux, if I allocate more shared buffers than the system is
 capable of reserving, it doesn't start.  This is expected, but there's
 no error logged anywhere (actually, nothing logged at all), and the
 postmaster.pid file is left behind after this failure.

Fixed.

However, I discovered something unpleasant.  With the new code, on
MacOS X, if you set shared_buffers to say 3200GB, the server happily
starts up.  Or at least the shared memory allocation goes through just
fine.  The postmaster then sits there apparently forever without
emitting any log messages, which I eventually discovered was because
it's busy initializing a billion or so spinlocks.

I'm pretty sure that this machine does not have 3TB of virtual
memory, even counting swap.  So that means that MacOS X has absolutely
no common sense whatsoever as far as anonymous shared memory
allocations go.  Not sure exactly what to do about that.  Linux is
more sensible, at least on the system I tested, and fails cleanly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Magnus Hagander
On Thu, Jun 28, 2012 at 7:15 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jun 28, 2012 at 12:13 PM, Thom Brown t...@linux.com wrote:
 On 64-bit Linux, if I allocate more shared buffers than the system is
 capable of reserving, it doesn't start.  This is expected, but there's
 no error logged anywhere (actually, nothing logged at all), and the
 postmaster.pid file is left behind after this failure.

 Fixed.

 However, I discovered something unpleasant.  With the new code, on
 MacOS X, if you set shared_buffers to say 3200GB, the server happily
 starts up.  Or at least the shared memory allocation goes through just
 fine.  The postmaster then sits there apparently forever without
 emitting any log messages, which I eventually discovered was because
 it's busy initializing a billion or so spinlocks.

 I'm pretty sure that this machine does not have 3TB of virtual
 memory, even counting swap.  So that means that MacOS X has absolutely
 no common sense whatsoever as far as anonymous shared memory
 allocations go.  Not sure exactly what to do about that.  Linux is
 more sensible, at least on the system I tested, and fails cleanly.

What happens if you mlock() it into memory - does that fail quickly?
Is that not something we might want to do *anyway*?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Andres Freund
On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote:
 On Thu, Jun 28, 2012 at 7:15 PM, Robert Haas robertmh...@gmail.com wrote:
  On Thu, Jun 28, 2012 at 12:13 PM, Thom Brown t...@linux.com wrote:
  On 64-bit Linux, if I allocate more shared buffers than the system is
  capable of reserving, it doesn't start.  This is expected, but there's
  no error logged anywhere (actually, nothing logged at all), and the
  postmaster.pid file is left behind after this failure.
  
  Fixed.
  
  However, I discovered something unpleasant.  With the new code, on
  MacOS X, if you set shared_buffers to say 3200GB, the server happily
  starts up.  Or at least the shared memory allocation goes through just
  fine.  The postmaster then sits there apparently forever without
  emitting any log messages, which I eventually discovered was because
  it's busy initializing a billion or so spinlocks.
  
  I'm pretty sure that this machine does not have 3TB of virtual
  memory, even counting swap.  So that means that MacOS X has absolutely
  no common sense whatsoever as far as anonymous shared memory
  allocations go.  Not sure exactly what to do about that.  Linux is
  more sensible, at least on the system I tested, and fails cleanly.
 
 What happens if you mlock() it into memory - does that fail quickly?
 Is that not something we might want to do *anyway*?

You normally can only mlock() minor amounts of memory without changing 
settings. Requiring to change that setting (aside that mlocking would be a bad 
idea imo) would run contrary to the point of the patch, wouldn't it? ;)

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Magnus Hagander
On Thu, Jun 28, 2012 at 7:27 PM, Andres Freund and...@2ndquadrant.com wrote:
 On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote:
 On Thu, Jun 28, 2012 at 7:15 PM, Robert Haas robertmh...@gmail.com wrote:
  On Thu, Jun 28, 2012 at 12:13 PM, Thom Brown t...@linux.com wrote:
  On 64-bit Linux, if I allocate more shared buffers than the system is
  capable of reserving, it doesn't start.  This is expected, but there's
  no error logged anywhere (actually, nothing logged at all), and the
  postmaster.pid file is left behind after this failure.
 
  Fixed.
 
  However, I discovered something unpleasant.  With the new code, on
  MacOS X, if you set shared_buffers to say 3200GB, the server happily
  starts up.  Or at least the shared memory allocation goes through just
  fine.  The postmaster then sits there apparently forever without
  emitting any log messages, which I eventually discovered was because
  it's busy initializing a billion or so spinlocks.
 
  I'm pretty sure that this machine does not have 3TB of virtual
  memory, even counting swap.  So that means that MacOS X has absolutely
  no common sense whatsoever as far as anonymous shared memory
  allocations go.  Not sure exactly what to do about that.  Linux is
  more sensible, at least on the system I tested, and fails cleanly.

 What happens if you mlock() it into memory - does that fail quickly?
 Is that not something we might want to do *anyway*?
 You normally can only mlock() minor amounts of memory without changing
 settings. Requiring to change that setting (aside that mlocking would be a bad
 idea imo) would run contrary to the point of the patch, wouldn't it? ;)

It would. I wasn't aware of that limitation :)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 On Thu, Jun 28, 2012 at 7:27 PM, Andres Freund and...@2ndquadrant.com wrote:
 On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote:
 What happens if you mlock() it into memory - does that fail quickly?
 Is that not something we might want to do *anyway*?

 You normally can only mlock() minor amounts of memory without changing
 settings. Requiring to change that setting (aside that mlocking would be a
 bad idea imo) would run contrary to the point of the patch, wouldn't it? ;)

 It would. I wasn't aware of that limitation :)

The OSX man page says that mlock should give EAGAIN for a permissions
failure (ie, exceeding the rlimit) but

     [ENOMEM]           Some portion of the indicated address range is not
                        allocated.  There was an error faulting/mapping a
                        page.

It might be helpful to try mlock (if available, which it isn't
everywhere) and complain about ENOMEM but not other errors.  Of course,
if the kernel checks rlimit first, we won't learn anything ...

I think it *would* be a good idea to mlock if we could.  Setting shmem
large enough that it swaps has always been horrible for performance,
and in sysv-land there's no way to prevent that.  But we can't error
out on permissions failure.
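
For illustration only: the "complain about ENOMEM but not other errors" idea
sketched in Python via ctypes. It assumes a Unix libc that provides mlock();
the function name and the three result strings are made up for this example.
One caveat: Linux also documents ENOMEM for exceeding RLIMIT_MEMLOCK, so
errno alone may not separate the two cases on every platform.

```python
import ctypes
import errno
import mmap


def try_mlock(size=mmap.PAGESIZE):
    """Try to lock an anonymous shared mapping; classify the outcome."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int

    region = mmap.mmap(-1, size, flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS)
    view = ctypes.c_char.from_buffer(region)   # gives us the mapping's address
    try:
        if libc.mlock(ctypes.addressof(view), size) == 0:
            return "locked"
        err = ctypes.get_errno()
    finally:
        del view          # release the buffer export before closing
        region.close()    # unmapping also drops any lock
    if err == errno.ENOMEM:
        return "enomem"   # the case worth reporting to the user
    return "ignored"      # EAGAIN, EPERM, ...: carry on without locking
```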

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Andres Freund
On Thursday, June 28, 2012 07:43:16 PM Tom Lane wrote:
 Magnus Hagander mag...@hagander.net writes:
  On Thu, Jun 28, 2012 at 7:27 PM, Andres Freund and...@2ndquadrant.com wrote:
  On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote:
  What happens if you mlock() it into memory - does that fail quickly?
  Is that not something we might want to do *anyway*?
  
  You normally can only mlock() minor amounts of memory without changing
  settings. Requiring to change that setting (aside that mlocking would be
  a bad idea imo) would run contrary to the point of the patch, wouldn't
  it? ;)
  
  It would. I wasn't aware of that limitation :)
 
 The OSX man page says that mlock should give EAGAIN for a permissions
 failure (ie, exceeding the rlimit) but
 
      [ENOMEM]           Some portion of the indicated address range is not
                         allocated.  There was an error faulting/mapping a
                         page.
 
 It might be helpful to try mlock (if available, which it isn't
 everywhere) and complain about ENOMEM but not other errors.  Of course,
 if the kernel checks rlimit first, we won't learn anything ...
 
 I think it *would* be a good idea to mlock if we could.  Setting shmem
 large enough that it swaps has always been horrible for performance,
 and in sysv-land there's no way to prevent that.  But we can't error
 out on permissions failure.

It's also a very good way to get into hard-to-diagnose OOM situations,
though. Unless the machine is set up very carefully and only runs postgres, I
don't think it's acceptable to do that.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On Thursday, June 28, 2012 07:43:16 PM Tom Lane wrote:
 I think it *would* be a good idea to mlock if we could.  Setting shmem
 large enough that it swaps has always been horrible for performance,
 and in sysv-land there's no way to prevent that.  But we can't error
 out on permissions failure.

 It's also a very good way to get into hard-to-diagnose OOM situations,
 though. Unless the machine is set up very carefully and only runs postgres, I
 don't think it's acceptable to do that.

Well, the permissions angle is actually a good thing here.  There is
pretty much no risk of the mlock succeeding on a box that hasn't been
specially configured --- and, in most cases, I think you'd need root
cooperation to raise postgres' RLIMIT_MEMLOCK.  So I think we could try
to mlock without having any effect for 99% of users.  The 1% who are
smart enough to raise the rlimit to something suitable would get better,
or at least more predictable, performance.
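
For illustration only: a Python sketch of the check the rlimit imposes,
assuming a platform where the resource module defines RLIMIT_MEMLOCK. The
function name is hypothetical, and privileged processes may not be bound by
the limit at all.

```python
import resource


def memlock_allows(size_bytes):
    """Would the soft RLIMIT_MEMLOCK permit locking size_bytes of memory?"""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft == resource.RLIM_INFINITY:
        return True
    return size_bytes <= soft
```

On a box that has not been specially configured the soft limit is typically
small, so a shared_buffers-sized request returns False here and the mlock
attempt would simply be skipped or fail harmlessly.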

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Andres Freund
On Thursday, June 28, 2012 08:00:06 PM Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On Thursday, June 28, 2012 07:43:16 PM Tom Lane wrote:
  I think it *would* be a good idea to mlock if we could.  Setting shmem
  large enough that it swaps has always been horrible for performance,
  and in sysv-land there's no way to prevent that.  But we can't error
  out on permissions failure.
  
  It's also a very good way to get into hard-to-diagnose OOM situations,
  though. Unless the machine is set up very carefully and only runs postgres, I
  don't think it's acceptable to do that.
 
 Well, the permissions angle is actually a good thing here.  There is
 pretty much no risk of the mlock succeeding on a box that hasn't been
 specially configured --- and, in most cases, I think you'd need root
 cooperation to raise postgres' RLIMIT_MEMLOCK.  So I think we could try
 to mlock without having any effect for 99% of users.  The 1% who are
 smart enough to raise the rlimit to something suitable would get better,
 or at least more predictable, performance.

The raised limit might just as well be aimed at another application and be
set a bit too widely. I agree that it is useful, but I think it requires its
own setting, defaulting to off, especially as there is no experience with
running a larger pg instance that way.

Greetings,

Andres, for once the conservative one, Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On Thursday, June 28, 2012 08:00:06 PM Tom Lane wrote:
 Well, the permissions angle is actually a good thing here.  There is
 pretty much no risk of the mlock succeeding on a box that hasn't been
 specially configured --- and, in most cases, I think you'd need root
 cooperation to raise postgres' RLIMIT_MEMLOCK.  So I think we could try
 to mlock without having any effect for 99% of users.  The 1% who are
 smart enough to raise the rlimit to something suitable would get better,
 or at least more predictable, performance.

 The raised limit might just as well be aimed at another application and be
 set a bit too widely. I agree that it is useful, but I think it requires its
 own setting, defaulting to off, especially as there is no experience with
 running a larger pg instance that way.

[ shrug... ]  I think you're inventing things to be afraid of, and
ignoring a very real problem that mlock could fix.

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Robert Haas
On Thu, Jun 28, 2012 at 1:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 On Thu, Jun 28, 2012 at 7:27 PM, Andres Freund and...@2ndquadrant.com wrote:
 On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote:
 What happens if you mlock() it into memory - does that fail quickly?
 Is that not something we might want to do *anyway*?

 You normally can only mlock() minor amounts of memory without changing
 settings. Requiring to change that setting (aside that mlocking would be a
 bad idea imo) would run contrary to the point of the patch, wouldn't it? ;)

 It would. I wasn't aware of that limitation :)

 The OSX man page says that mlock should give EAGAIN for a permissions
 failure (ie, exceeding the rlimit) but

     [ENOMEM]           Some portion of the indicated address range is not
                        allocated.  There was an error faulting/mapping a
                        page.

 It might be helpful to try mlock (if available, which it isn't
 everywhere) and complain about ENOMEM but not other errors.  Of course,
 if the kernel checks rlimit first, we won't learn anything ...

I tried this.  At least on my fairly vanilla MacOS X desktop, an mlock
for a larger amount of memory than was conveniently on hand (4GB, on a
4GB box) neither succeeded nor failed in a timely fashion but instead
progressively hung the machine, apparently trying to progressively
push every available page out to swap.  After 5 minutes or so I could
no longer move the mouse.  After about 20 minutes I gave up and hit
the reset button.  So there's apparently no value to this as a
diagnostic tool, at least on this platform.

 I think it *would* be a good idea to mlock if we could.  Setting shmem
 large enough that it swaps has always been horrible for performance,
 and in sysv-land there's no way to prevent that.  But we can't error
 out on permissions failure.

I wouldn't mind having an option, but I think there'd have to be a way
to turn it off for people trying to cram as many lightly-used VMs as
possible onto a single server.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I tried this.  At least on my fairly vanilla MacOS X desktop, an mlock
 for a larger amount of memory than was conveniently on hand (4GB, on a
 4GB box) neither succeeded nor failed in a timely fashion but instead
 progressively hung the machine, apparently trying to progressively
 push every available page out to swap.  After 5 minutes or so I could
 no longer move the mouse.  After about 20 minutes I gave up and hit
 the reset button.  So there's apparently no value to this as a
 diagnostic tool, at least on this platform.

Fun.  I wonder if other BSDen are as brain-dead as OSX on this point.

It'd probably at least be worth filing a bug report with Apple about it.

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-28 Thread Robert Haas
On Thu, Jun 28, 2012 at 2:51 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I tried this.  At least on my fairly vanilla MacOS X desktop, an mlock
 for a larger amount of memory than was conveniently on hand (4GB, on a
 4GB box) neither succeeded nor failed in a timely fashion but instead
 progressively hung the machine, apparently trying to progressively
 push every available page out to swap.  After 5 minutes or so I could
 no longer move the mouse.  After about 20 minutes I gave up and hit
 the reset button.  So there's apparently no value to this as a
 diagnostic tool, at least on this platform.

 Fun.  I wonder if other BSDen are as brain-dead as OSX on this point.

 It'd probably at least be worth filing a bug report with Apple about it.

Just for fun, I tried writing a program that does power-of-two-sized
malloc requests.

The first one that failed - on my 4GB Mac, remember - was for
140737488355328 bytes.  Yeah, that's right: 128 TB.
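
For illustration only: a rough Python reconstruction of such a test, calling
the C library's malloc() through ctypes. This is not the program used above;
the function name, the starting size of 1MB and the 64-bit size_t are
assumptions of this sketch. Nothing is ever written to the memory, so it
measures only what the allocator is willing to promise.

```python
import ctypes


def first_failing_malloc(max_exponent=62):
    """Return the first power-of-two size malloc() refuses, or None."""
    libc = ctypes.CDLL(None)
    libc.malloc.argtypes = [ctypes.c_size_t]
    libc.malloc.restype = ctypes.c_void_p
    libc.free.argtypes = [ctypes.c_void_p]
    libc.free.restype = None

    for exponent in range(20, max_exponent + 1):   # start at 1MB
        size = 1 << exponent
        ptr = libc.malloc(size)
        if not ptr:                                # NULL: request refused
            return size
        libc.free(ptr)                             # never touched, just released
    return None
```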

According to the Google, there is absolutely no way of getting MacOS X
not to overcommit like crazy.  You can read the amount of system
memory by using sysctl() to fetch hw.memsize, but it's not really
clear how much that helps.  We could refuse to start up if the shared
memory allocation is >= hw.memsize, but even an amount slightly less
than that seems like enough to send the machine into a tailspin, so
I'm not sure that really gets us anywhere.
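
For illustration only: the startup check floated above, sketched in Python.
It reads physical memory through sysconf(SC_PHYS_PAGES) rather than the
sysctl hw.memsize call mentioned in the text, on the assumption that the
platform provides those sysconf names. Both function names are hypothetical.

```python
import os


def physical_memory_bytes():
    """Physical RAM as reported by sysconf (stand-in for hw.memsize)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def request_exceeds_ram(request_bytes):
    """True if a shared memory request is at least as large as physical RAM."""
    return request_bytes >= physical_memory_bytes()
```

As noted above, a request just under the threshold passes this check and may
still send the machine into a tailspin, so the test is at best a coarse
guard.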

One idea I had was to LOG the size of the shared memory allocation
just before allocating it.  That way, if your system goes into the
tank, there will at least be something in the log.  But that would be
useless chatter for most users.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Posix Shared Mem patch

2012-06-27 Thread Robert Haas
On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 So, here's a patch.  Instead of using POSIX shmem, I just took the
 expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
 memory.  The sysv shm is still allocated, but it's just a copy of
 PGShmemHeader; the real shared memory is the anonymous block.  This
 won't work if EXEC_BACKEND is defined so it just falls back on
 straight sysv shm in that case.

 Um.  I hadn't thought about the EXEC_BACKEND interaction, but that seems
 like a bit of a showstopper.  I would not like to give up the ability
 to debug EXEC_BACKEND mode on Unixen.

 Would Posix shmem help with that at all?  Why did you choose not to
 use the Posix API, anyway?

It seemed more complicated.  If we use the POSIX API, we've got to
have code to find a non-colliding name for the shm, and we've got to
arrange to clean it up at process exit.  Anonymous shm doesn't require
a name and goes away automatically when it's no longer in use.

With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to
make it continue to use a full-sized sysv shm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Posix Shared Mem patch

2012-06-27 Thread Magnus Hagander
On Wed, Jun 27, 2012 at 3:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 A.M. age...@themactionfaction.com writes:
 On 06/26/2012 07:30 PM, Tom Lane wrote:
 I solved this via fcntl locking.

 No, you didn't, because fcntl locks aren't inherited by child processes.
 Too bad, because they'd be a great solution otherwise.

 You claimed this last time and I replied:
 http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php

 I address this race condition by ensuring that a lock-holding violator
 is the postmaster or a postmaster child. If such a condition is
 detected, the child exits immediately without touching the shared
 memory. POSIX shmem is inherited via file descriptors.

 This is possible because the locking API allows one to request which PID
 violates the lock. The child expects the lock to be held and checks that
 the PID is the parent. If the lock is not held, that means that the
 postmaster is dead, so the child exits immediately.

 OK, I went back and re-read the original patch, and I now agree that
 something like this is possible --- but I don't like the way you did
 it. The dependence on particular PIDs seems both unnecessary and risky.

 The key concept here seems to be that the postmaster first stakes a
 claim on the data directory by exclusive-locking a lock file.  If
 successful, it reduces that lock to shared mode (which can be done
 atomically, according to the SUS fcntl specification), and then holds
 the shared lock until it exits.  Spawned children will not initially
 have a lock, but what they can do is attempt to acquire shared lock on
 the lock file.  If fail, exit.  If successful, *check to see that the
 parent postmaster is still alive* (ie, getppid() != 1).  If so, the
 parent must have been continuously holding the lock, and the child has
 successfully joined the pool of shared lock holders.  Otherwise bail
 out without having changed anything.  It is the "parent is still alive"
 check, not any test on individual PIDs, that makes this work.

 There are two concrete reasons why I don't care for the
 GetPIDHoldingLock() way.  Firstly, the fact that you can get a blocking
 PID from F_GETLK isn't an essential part of the concept of file locking
 IMO --- it's just an incidental part of this particular API.  May I
 remind you that the reason we're stuck on SysV shmem in the first place
 is that we decided to depend on an incidental part of that API, namely
 nattch?  I would like to not require file locking to have any semantics
 more specific than "a process can hold an exclusive or a shared lock on
 a file, which is auto-released at process exit".  Secondly, in an NFS
 world I don't believe that the returned l_pid value can be trusted for
 anything.  If it's a PID from a different machine then it might
 accidentally conflict with one on our machine, or not.

 Reflecting on this further, it seems to me that the main remaining
 failure modes are (1) file locking doesn't work, or (2) idiot DBA
 manually removes the lock file.  Both of these could be ameliorated
 with some refinements to the basic idea.  For (1), I suggest that
 we tweak the startup process (only) to attempt to acquire exclusive lock
 on the lock file.  If it succeeds, we know that file locking is broken,
 and we can complain.  (This wouldn't help for cases where cross-machine
 locking is broken, but I see no practical way to detect that.)
 For (2), the problem really is that the proposed patch conflates the PID
 file with the lock file, but people are conditioned to think that PID
 files are removable.  I suggest that we create a separate, permanently
 present file that serves only as the lock file and doesn't ever get
 modified (it need have no content other than the string "Don't remove
 this!").  It'd be created by initdb, not by individual postmaster runs;
 indeed the postmaster should fail if it doesn't find the lock file
 already present.  The postmaster PID file should still exist with its
 current contents, but it would serve mostly as documentation and as
 server-contact information for pg_ctl; it would not be part of the data
 directory locking mechanism.

 I wonder whether this design can be adapted to Windows?  IIRC we do
 not have a bulletproof data directory lock scheme for Windows.
 It seems like this makes few enough demands on the lock mechanism
 that there ought to be suitable primitives available there too.

I assume you're saying we need to make changes in the internal API,
right? Because we already have a windows native implementation of
shared memory that AFAIK works, so if the new Unix stuff can be done
with the same internal APIs, it shouldn't need to be changed. (Sorry,
haven't followed the thread in detail)

If so - can we define exactly what properties it is we *need*?

(A native API worth looking at is e.g.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa365203(v=vs.85).aspx
- but there are probably others as well if that one doesn't do)

-- 
 Magnus Hagander
 Me: 

Re: [HACKERS] Posix Shared Mem patch

2012-06-27 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 On Wed, Jun 27, 2012 at 3:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder whether this design can be adapted to Windows?  IIRC we do
 not have a bulletproof data directory lock scheme for Windows.
 It seems like this makes few enough demands on the lock mechanism
 that there ought to be suitable primitives available there too.

 I assume you're saying we need to make changes in the internal API,
 right? Because we already have a windows native implementation of
 shared memory that AFAIK works,

Right, but does it provide honest protection against starting two
postmasters in the same data directory?  Or more to the point,
does it prevent starting a new postmaster when the old postmaster
crashed but there are still orphaned backends making changes?
AFAIR we basically punted on those problems for the Windows port,
for lack of an equivalent to nattch.

regards, tom lane


Re: [HACKERS] Posix Shared Mem patch

2012-06-27 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Would Posix shmem help with that at all?  Why did you choose not to
 use the Posix API, anyway?

 It seemed more complicated.  If we use the POSIX API, we've got to
 have code to find a non-colliding name for the shm, and we've got to
 arrange to clean it up at process exit.  Anonymous shm doesn't require
 a name and goes away automatically when it's no longer in use.

I see.  Those are pretty good reasons ...

 With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to
 make it continue to use a full-sized sysv shm.

Well, if the ultimate objective is to get out from under the SysV APIs
entirely, we're not going to get there if we still have to have all that
code for the EXEC_BACKEND case.  Maybe it's time to decide that we don't
need to support EXEC_BACKEND on Unix.

regards, tom lane


Re: [HACKERS] Posix Shared Mem patch

2012-06-27 Thread Stephen Frost
All,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Would Posix shmem help with that at all?  Why did you choose not to
  use the Posix API, anyway?
 
  It seemed more complicated.  If we use the POSIX API, we've got to
  have code to find a non-colliding name for the shm, and we've got to
  arrange to clean it up at process exit.  Anonymous shm doesn't require
  a name and goes away automatically when it's no longer in use.
 
 I see.  Those are pretty good reasons ...

After talking to Magnus a bit this morning regarding this, it sounds
like what we're doing on Windows is closer to Anonymous shm, except that
they use an intentionally specific name, which also allows them to
detect if any children are still alive by using a create-if-not-exists
approach on the shm segment and failing if it still exists.  There were
some corner cases around restarts due to it taking a few seconds for the
Windows kernel to pick up on the fact that all the children are dead and
that the shm segment should go away, but they were able to work around
that, and failure to start is surely much better than possible
corruption.

What this all boils down to is- can you have a shm segment that goes
away when no one is still attached to it, but actually give it a name
and then detect if it already exists atomically on startup on
Linux/Unixes?  If so, perhaps we could use the same mechanism on both..

Thanks,

Stephen


Re: [HACKERS] Posix Shared Mem patch

2012-06-27 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Right, but does it provide honest protection against starting two
 postmasters in the same data directory?  Or more to the point,
 does it prevent starting a new postmaster when the old postmaster
 crashed but there are still orphaned backends making changes?
 AFAIR we basically punted on those problems for the Windows port,
 for lack of an equivalent to nattch.

See my other mail, but, after talking to Magnus, it's my understanding
that we had that problem initially, but it was later solved by using a
named shared memory segment which the kernel will clean up when all
children are gone.  That, combined with a 'create-if-not-exists' call,
allows detection of lost children to be done.

Thanks,

Stephen


Re: [HACKERS] Posix Shared Mem patch

2012-06-27 Thread Magnus Hagander
On Wed, Jun 27, 2012 at 3:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 On Wed, Jun 27, 2012 at 3:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder whether this design can be adapted to Windows?  IIRC we do
 not have a bulletproof data directory lock scheme for Windows.
 It seems like this makes few enough demands on the lock mechanism
 that there ought to be suitable primitives available there too.

 I assume you're saying we need to make changes in the internal API,
 right? Because we already have a windows native implementation of
 shared memory that AFAIK works,

 Right, but does it provide honest protection against starting two
 postmasters in the same data directory?  Or more to the point,
 does it prevent starting a new postmaster when the old postmaster
 crashed but there are still orphaned backends making changes?
 AFAIR we basically punted on those problems for the Windows port,
 for lack of an equivalent to nattch.

No, we spent a lot of time trying to *fix* it, and IIRC we did.

We create a shared memory segment with a fixed name based on the data
directory. This shared memory segment is inherited by all children. It
will automatically go away only when all processes that have an open
handle to it go away (in fact, it can even take a second or two more,
if they go away by crash and not by cleanup - we have a workaround in
the code for that). But as long as there is an orphaned backend
around, the shared memory segment stays around.

We don't have nattch. But we do have nattch>0. Or something like that.

You can work around it if you find two different paths to the same
data directory (e.g. using junctions), but you are really actively
trying to break the system if you do that...


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Posix Shared Mem patch

2012-06-27 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 On Wed, Jun 27, 2012 at 3:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 AFAIR we basically punted on those problems for the Windows port,
 for lack of an equivalent to nattch.

 No, we spent a lot of time trying to *fix* it, and IIRC we did.

OK, in that case this isn't as interesting as I thought.

If we do go over to a file-locking-based solution on Unix, it might be
worthwhile changing to something similar on Windows.  But it would be
more about reducing coding differences between the platforms than
plugging any real holes.

regards, tom lane


Re: [HACKERS] Posix Shared Mem patch

2012-06-27 Thread Robert Haas
On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Would Posix shmem help with that at all?  Why did you choose not to
 use the Posix API, anyway?

 It seemed more complicated.  If we use the POSIX API, we've got to
 have code to find a non-colliding name for the shm, and we've got to
 arrange to clean it up at process exit.  Anonymous shm doesn't require
 a name and goes away automatically when it's no longer in use.

 I see.  Those are pretty good reasons ...

 With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to
 make it continue to use a full-sized sysv shm.

 Well, if the ultimate objective is to get out from under the SysV APIs
 entirely, we're not going to get there if we still have to have all that
 code for the EXEC_BACKEND case.  Maybe it's time to decide that we don't
 need to support EXEC_BACKEND on Unix.

I don't personally see a need to do anything that drastic at this
point.  Admittedly, I rarely compile with EXEC_BACKEND, but I don't
think it's bad to have the option available.  Adjusting shared memory
limits isn't really a big problem for PostgreSQL developers; what
we're trying to avoid is the need for PostgreSQL *users* to concern
themselves with it.  And surely anyone who is using EXEC_BACKEND on
Unix is a developer, not a user.

If and when we come up with a substitute for the nattch interlock,
then this might be worth thinking a bit harder about.  At that point,
if we still want to support EXEC_BACKEND on Unix, then we'd need the
EXEC_BACKEND case at least to use POSIX shm rather than anonymous
shared mmap.  Personally I think that would be not that hard and
probably worth doing, but there doesn't seem to be any point in
writing that code now, because for the simple case of just reducing
the amount of shm that we allocate, an anonymous mapping seems better
all around.

We shouldn't overthink this.  Our shared memory code has accumulated a
bunch of crufty hacks over the years to work around various
platform-specific issues, but it's still not a lot of code, so I don't
see any reason to worry unduly about making a surgical fix without
having a master plan.  Nothing we want to do down the road will
require moving the earth.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Posix Shared Mem patch

2012-06-27 Thread Robert Haas
On Wed, Jun 27, 2012 at 9:52 AM, Stephen Frost sfr...@snowman.net wrote:
 What this all boils down to is- can you have a shm segment that goes
 away when no one is still attached to it, but actually give it a name
 and then detect if it already exists atomically on startup on
 Linux/Unixes?  If so, perhaps we could use the same mechanism on both..

As I understand it, no.  You can either have anonymous shared
mappings, which go away when no longer in use but do not have a name.
Or you can have POSIX or sysv shm, which have a name but do not
automatically go away when no longer in use.  There seems to be no
method for setting up a segment that both has a name and goes away
automatically.  POSIX shm in particular tries to look like a file,
whereas anonymous memory tries to look more like malloc (except that
you can share the mapping with child processes).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Posix Shared Mem patch

2012-06-27 Thread A.M.

On Jun 27, 2012, at 7:34 AM, Robert Haas wrote:

 On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 So, here's a patch.  Instead of using POSIX shmem, I just took the
 expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
 memory.  The sysv shm is still allocated, but it's just a copy of
 PGShmemHeader; the real shared memory is the anonymous block.  This
 won't work if EXEC_BACKEND is defined so it just falls back on
 straight sysv shm in that case.
 
 Um.  I hadn't thought about the EXEC_BACKEND interaction, but that seems
 like a bit of a showstopper.  I would not like to give up the ability
 to debug EXEC_BACKEND mode on Unixen.
 
 Would Posix shmem help with that at all?  Why did you choose not to
 use the Posix API, anyway?
 
 It seemed more complicated.  If we use the POSIX API, we've got to
 have code to find a non-colliding name for the shm, and we've got to
 arrange to clean it up at process exit.  Anonymous shm doesn't require
 a name and goes away automatically when it's no longer in use.
 
 With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to
 make it continue to use a full-sized sysv shm.
 

I solved this by unlinking the posix shared memory segment immediately after 
creation. The file descriptor to the shared memory is inherited, so, by 
definition, only the postmaster children can access the memory. This ensures 
that shared memory cleanup is immediate after the postmaster and all children 
close, as well. The fcntl locking is not required to protect the posix shared 
memory- it can protect itself.
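
For illustration, the create-then-unlink approach might be sketched as follows. This is not the code from the patch: the helper name and the "/pgshm-sketch-<pid>" naming scheme are assumptions made for the example. Older glibc versions need -lrt at link time for shm_open().

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a POSIX shm segment, unlink its name at
 * once, and map it.  Only holders of the descriptor (the creator and
 * whatever it forks or execs) can reach the memory afterwards, and the
 * kernel frees it once the last descriptor and mapping are gone.
 * Returns the mapping, or NULL on failure; *fd_out receives the fd.
 */
void *create_unlinked_shm(size_t size, int *fd_out)
{
    char  name[64];
    int   fd;
    void *addr;

    /* assumed naming scheme; O_EXCL below rejects a colliding name */
    snprintf(name, sizeof(name), "/pgshm-sketch-%ld", (long) getpid());

    fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return NULL;
    shm_unlink(name);           /* name disappears, segment lives on */

    if (ftruncate(fd, (off_t) size) != 0)
    {
        close(fd);
        return NULL;
    }
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
    {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return addr;
}
```

Because the descriptor survives exec(), an EXEC_BACKEND-style child could in principle re-map the segment from the inherited fd, which an anonymous mapping cannot offer.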

Cheers,
M





Re: [HACKERS] Posix Shared Mem patch

2012-06-27 Thread Robert Haas
On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Would Posix shmem help with that at all?  Why did you choose not to
 use the Posix API, anyway?

 It seemed more complicated.  If we use the POSIX API, we've got to
 have code to find a non-colliding name for the shm, and we've got to
 arrange to clean it up at process exit.  Anonymous shm doesn't require
 a name and goes away automatically when it's no longer in use.

 I see.  Those are pretty good reasons ...

So, should we do it this way?

I did a little research and discovered that Linux 2.3.51 (released
3/11/2000) apparently returns EINVAL for MAP_SHARED|MAP_ANONYMOUS.
That combination is documented to work beginning in Linux 2.4.0.  How
worried should we be about people trying to run PostgreSQL 9.3 on
pre-2.4 kernels?  If we want to worry about it, we could try mapping a
one-page shared MAP_SHARED|MAP_ANONYMOUS segment first.  If that
works, we could assume that we have a working MAP_SHARED|MAP_ANONYMOUS
facility and try to allocate the whole segment plus a minimal sysv
shm.  If the single page allocation fails with EINVAL, we could fall
back to allocating the entire segment as sysv shm.

A related question is - if we do this - should we enable it only on
ports where we've verified that it works, or should we just turn it on
everywhere and fix breakage if/when it's reported?  I lean toward the
latter.

If we find that there are platforms where (a) mmap is not supported or
(b) MAP_SHARED|MAP_ANON works but has the wrong semantics, we could
either shut off this optimization on those platforms by fiat, or we
could test not only that the call succeeds, but that it works
properly: create a one-page mapping and fork a child process; in the
child, write to the mapping and exit; in the parent, wait for the
child to exit and then test that we can read back the correct
contents.  This would protect against a hypothetical system where the
flags are accepted but fail to produce the correct behavior.  I'm
inclined to think this is over-engineering in the absence of evidence
that there are platforms that work this way.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Alvaro Herrera

Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012:
 Robert, all:
 
 Last I checked, we had a reasonably acceptable patch to use mostly Posix
 Shared mem with a very small sysv ram partition.  Is there anything
 keeping this from going into 9.3?  It would eliminate a major
 configuration headache for our users.

I don't think that patch was all that reasonable.  It needed work, and
in any case it needs a rebase because it was pretty old.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Robert Haas
On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012:
 Robert, all:

 Last I checked, we had a reasonably acceptable patch to use mostly Posix
 Shared mem with a very small sysv ram partition.  Is there anything
 keeping this from going into 9.3?  It would eliminate a major
 configuration headache for our users.

 I don't think that patch was all that reasonable.  It needed work, and
 in any case it needs a rebase because it was pretty old.

Yep, agreed.

I'd like to get this fixed too, but it hasn't made it up to the top of
my list of things to worry about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Josh Berkus
On 6/26/12 2:13 PM, Robert Haas wrote:
 On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012:
 Robert, all:

 Last I checked, we had a reasonably acceptable patch to use mostly Posix
 Shared mem with a very small sysv ram partition.  Is there anything
 keeping this from going into 9.3?  It would eliminate a major
 configuration headache for our users.

 I don't think that patch was all that reasonable.  It needed work, and
 in any case it needs a rebase because it was pretty old.
 
 Yep, agreed.
 
 I'd like to get this fixed too, but it hasn't made it up to the top of
 my list of things to worry about.

Was there a post-AgentM version of the patch, which incorporated the
small SysV RAM partition?  I'm not finding it.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Daniel Farina
On Tue, Jun 26, 2012 at 2:18 PM, Josh Berkus j...@agliodbs.com wrote:
 On 6/26/12 2:13 PM, Robert Haas wrote:
 On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012:
 Robert, all:

 Last I checked, we had a reasonably acceptable patch to use mostly Posix
 Shared mem with a very small sysv ram partition.  Is there anything
 keeping this from going into 9.3?  It would eliminate a major
 configuration headache for our users.

 I don't think that patch was all that reasonable.  It needed work, and
 in any case it needs a rebase because it was pretty old.

 Yep, agreed.

 I'd like to get this fixed too, but it hasn't made it up to the top of
 my list of things to worry about.

 Was there a post-AgentM version of the patch, which incorporated the
 small SysV RAM partition?  I'm not finding it.

On that, I used to be of the opinion that this is a good compromise (a
small amount of interlock space, plus mostly posix shmem), but I've
heard since then (I think via AgentM indirectly, but I'm not sure)
that there are cases where even the small SysV segment can cause
problems -- notably when other software tweaks shared memory settings
on behalf of a user, but only leaves just-enough for the software
being installed.  This is most likely on platforms that don't have a
high SysV shmem limit by default, so installers all feel the
prerogative to increase the limit, but there's no great answer for how
to compose a series of such installations.  It only takes one
installer that says "whatever, I'm just catenating stuff to
sysctl.conf that works for me" to sabotage Postgres' ability to start.

So there may be a benefit in finding a way to have no SysV memory at
all.  I wouldn't let perfect be the enemy of good to make progress
here, but it appears this was a witnessed real problem, so it may be
worth reconsidering if there is a way we can safely remove all SysV by
finding an alternative to the nattch mechanic.

-- 
fdr


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Robert Haas
On Tue, Jun 26, 2012 at 5:18 PM, Josh Berkus j...@agliodbs.com wrote:
 On 6/26/12 2:13 PM, Robert Haas wrote:
 On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012:
 Robert, all:

 Last I checked, we had a reasonably acceptable patch to use mostly Posix
 Shared mem with a very small sysv ram partition.  Is there anything
 keeping this from going into 9.3?  It would eliminate a major
 configuration headache for our users.

 I don't think that patch was all that reasonable.  It needed work, and
 in any case it needs a rebase because it was pretty old.

 Yep, agreed.

 I'd like to get this fixed too, but it hasn't made it up to the top of
 my list of things to worry about.

 Was there a post-AgentM version of the patch, which incorporated the
 small SysV RAM partition?  I'm not finding it.

To my knowledge, no.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Josh Berkus

 On that, I used to be of the opinion that this is a good compromise (a
 small amount of interlock space, plus mostly posix shmem), but I've
 heard since then (I think via AgentM indirectly, but I'm not sure)
 that there are cases where even the small SysV segment can cause
 problems -- notably when other software tweaks shared memory settings
 on behalf of a user, but only leaves just-enough for the software
 being installed.  This is most likely on platforms that don't have a
 high SysV shmem limit by default, so installers all feel the
 prerogative to increase the limit, but there's no great answer for how
 to compose a series of such installations.  It only takes one
 installer that says "whatever, I'm just catenating stuff to
 sysctl.conf that works for me" to sabotage Postgres' ability to start.

Personally, I see this as rather an extreme case, and aside from AgentM
himself, have never run into it before.  Certainly it would be useful to
not need SysV RAM at all, but it's more important to get a working patch
for 9.3.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Robert Haas
On Tue, Jun 26, 2012 at 5:44 PM, Josh Berkus j...@agliodbs.com wrote:

 On that, I used to be of the opinion that this is a good compromise (a
 small amount of interlock space, plus mostly posix shmem), but I've
 heard since then (I think via AgentM indirectly, but I'm not sure)
 that there are cases where even the small SysV segment can cause
 problems -- notably when other software tweaks shared memory settings
 on behalf of a user, but only leaves just-enough for the software
 being installed.  This is most likely on platforms that don't have a
 high SysV shmem limit by default, so installers all feel the
 prerogative to increase the limit, but there's no great answer for how
 to compose a series of such installations.  It only takes one
 installer that says "whatever, I'm just catenating stuff to
 sysctl.conf that works for me" to sabotage Postgres' ability to start.

 Personally, I see this as rather an extreme case, and aside from AgentM
 himself, have never run into it before.  Certainly it would be useful to
 not need SysV RAM at all, but it's more important to get a working patch
 for 9.3.

+1.

I'd sort of given up on finding a solution that doesn't involve system
V shmem anyway, but now that I think about it... what about using a
FIFO?  The man page for open on MacOS X says:

  [ENXIO]  O_NONBLOCK and O_WRONLY are set, the file is a FIFO,
           and no process has it open for reading.

And Linux says:

  ENXIO    O_NONBLOCK | O_WRONLY is set, the named file is a FIFO and no
           process has the file open for reading.  Or, the file is a device
           special file and no corresponding device exists.

And HP/UX says:

  [ENXIO]  O_NDELAY is set, the named file is a FIFO,
           O_WRONLY is set, and no process has the file open
           for reading.

So, what about keeping a FIFO in the data directory?  When the
postmaster starts up, it tries to open the file with O_NONBLOCK |
O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather
than O_NONBLOCK).  If that succeeds, it bails out.  If it fails with
anything other than ENXIO, it bails out.  If it fails with exactly
ENXIO, then it opens the pipe with O_RDONLY and arranges to pass the
file descriptor down to all of its children, so that a subsequent open
will fail if it or any of its children are still alive.
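
As a rough illustration of the probe described above, here is a minimal
Python sketch (the postmaster itself is C; Python is used only for brevity).
The names try_claim, fifo_interlock_demo and interlock.fifo are hypothetical.
One assumption that is not part of the proposal as written: the read side is
opened with O_NONBLOCK too, because a plain blocking O_RDONLY open of a FIFO
waits until a writer shows up.  The sketch does nothing about the window
between seeing ENXIO and opening the read side.

```python
import errno
import os
import tempfile


def try_claim(fifo_path):
    """Return a read-side fd if nobody holds the FIFO open, else None."""
    try:
        probe = os.open(fifo_path, os.O_WRONLY | os.O_NONBLOCK)
    except OSError as exc:
        if exc.errno != errno.ENXIO:
            raise  # any failure other than ENXIO: bail out
    else:
        # The open succeeded, so some process still has the read side open.
        os.close(probe)
        return None
    # ENXIO: no reader exists.  Keep this fd open; forked children inherit it.
    # O_NONBLOCK is needed because a blocking O_RDONLY open waits for a writer.
    return os.open(fifo_path, os.O_RDONLY | os.O_NONBLOCK)


def fifo_interlock_demo():
    with tempfile.TemporaryDirectory() as data_dir:
        fifo_path = os.path.join(data_dir, "interlock.fifo")  # hypothetical name
        os.mkfifo(fifo_path)
        first = try_claim(fifo_path)   # no reader yet, so the claim succeeds
        second = try_claim(fifo_path)  # a reader exists, so the claim is refused
        os.close(first)                # simulate postmaster and children exiting
        third = try_claim(fifo_path)   # the claim succeeds again
        os.close(third)
        return first is not None, second is None, third is not None
```

Calling fifo_interlock_demo() exercises the three cases in one process; a
real test would of course need separate processes and a crashed postmaster.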

This might even be more reliable than what we do right now, because
our current system appears not to be robust against the removal of
postmaster.pid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Alvaro Herrera

Excerpts from Daniel Farina's message of mar jun 26 17:40:16 -0400 2012:

 On that, I used to be of the opinion that this is a good compromise (a
 small amount of interlock space, plus mostly posix shmem), but I've
 heard since then (I think via AgentM indirectly, but I'm not sure)
 that there are cases where even the small SysV segment can cause
 problems -- notably when other software tweaks shared memory settings
 on behalf of a user, but only leaves just-enough for the software
 being installed.

This argument is what killed the original patch.  If you want to get
anything done *at all* I think it needs to be dropped.  Changing shmem
implementation is already difficult enough --- you don't need to add the
requirement that the interlocking mechanism be changed simultaneously.
You (or whoever else) can always work on that as a followup patch.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Daniel Farina
On Tue, Jun 26, 2012 at 2:53 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:

 Excerpts from Daniel Farina's message of mar jun 26 17:40:16 -0400 2012:

 On that, I used to be of the opinion that this is a good compromise (a
 small amount of interlock space, plus mostly posix shmem), but I've
 heard since then (I think via AgentM indirectly, but I'm not sure)
 that there are cases where even the small SysV segment can cause
 problems -- notably when other software tweaks shared memory settings
 on behalf of a user, but only leaves just-enough for the software
 being installed.

 This argument is what killed the original patch.  If you want to get
 anything done *at all* I think it needs to be dropped.  Changing shmem
 implementation is already difficult enough --- you don't need to add the
 requirement that the interlocking mechanism be changed simultaneously.
 You (or whoever else) can always work on that as a followup patch.

True, but then again, I did very intentionally write:

 Excerpts from Daniel Farina's message of mar jun 26 17:40:16 -0400 2012:
 *I wouldn't let perfect be the enemy of good* to make progress
 here, but it appears this was a witnessed real problem, so it may
 be worth reconsidering if there is a way we can safely remove all
 SysV by finding an alternative to the nattach mechanic.

(Emphasis mine).

I don't think that -hackers at the time gave the zero-shmem rationale
much weight (I also was not that happy about the safety mechanism of
that patch), but upon more reflection (and taking into account *other*
software that may mangle shmem settings) I think it's something at
 least worth thinking about one more time.  What killed the patch
 was an attachment to the deemed-less-safe strategy for avoiding bogus
shmem attachments already in it, but I don't seem to recall anyone
putting a whole lot of thought at the time into the zero-shmem case
from what I could read on the list, because a small interlock with
nattach seemed good-enough.

I'm simply suggesting that for additional benefits it may be worth
thinking about getting around nattach and thus SysV shmem, especially
with regard to safety, in an open-ended way.  Maybe there's a solution
(like Robert's FIFO suggestion?) that is not too onerous and can
satisfy everyone.

-- 
fdr



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread A.M.

On Jun 26, 2012, at 5:44 PM, Josh Berkus wrote:

 
 On that, I used to be of the opinion that this is a good compromise (a
 small amount of interlock space, plus mostly posix shmem), but I've
 heard since then (I think via AgentM indirectly, but I'm not sure)
 that there are cases where even the small SysV segment can cause
 problems -- notably when other software tweaks shared memory settings
 on behalf of a user, but only leaves just-enough for the software
 being installed.  This is most likely on platforms that don't have a
 high SysV shmem limit by default, so installers all feel the
 prerogative to increase the limit, but there's no great answer for how
 to compose a series of such installations.  It only takes one
 installer that says "whatever, I'm just catenating stuff to
 sysctl.conf that works for me" to sabotage Postgres' ability to start.
 
 Personally, I see this as rather an extreme case, and aside from AgentM
 himself, have never run into it before.  Certainly it would be useful to
 not need SysV RAM at all, but it's more important to get a working patch
 for 9.3.


This can be trivially reproduced if one runs an old (SysV shared memory-based) 
postgresql alongside a potentially newer postgresql with a smaller SysV 
segment. This can occur with applications that bundle postgresql as part of the 
app.

Cheers,
M






Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Josh Berkus

 This can be trivially reproduced if one runs an old (SysV shared 
 memory-based) postgresql alongside a potentially newer postgresql with a 
 smaller SysV segment. This can occur with applications that bundle postgresql 
 as part of the app.

I'm not saying it doesn't happen at all.  I'm saying it's not the 80%
case.

So let's fix the 80% case with something we feel confident in, and then
revisit the no-sysv interlock as a separate patch.  That way if we can't
fix the interlock issues, we still have a reduced-shmem version of Postgres.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com





Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 So, what about keeping a FIFO in the data directory?

Hm, does that work if the data directory is on NFS?  Or some other weird
not-really-Unix file system?

 When the
 postmaster starts up, it tries to open the file with O_NONBLOCK |
 O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather
 than O_NONBLOCK).  If that succeeds, it bails out.  If it fails with
 anything other than ENXIO, it bails out.  If it fails with exactly
 ENXIO, then it opens the pipe with O_RDONLY

... race condition here ...

 and arranges to pass the
 file descriptor down to all of its children, so that a subsequent open
 will fail if it or any of its children are still alive.

This might be made to work, but that doesn't sound quite right in
detail.

I remember we speculated about using an fcntl lock on some file in the
data directory, but that fails because child processes don't inherit
fcntl locks.

In the modern world, it'd be really a step forward if the lock mechanism
worked on shared storage, ie a data directory on NFS or similar could be
locked against all comers not just those on the same node as the
original postmaster.  I don't know how to do that though.

In the meantime, insisting that we solve this problem before we do
anything is a good recipe for ensuring that nothing happens, just
like it hasn't happened for the last half dozen years.  (I see Alvaro
just made the same point.)

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread A.M.

On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote:
 
 (Emphasis mine).
 
 I don't think that -hackers at the time gave the zero-shmem rationale
 much weight (I also was not that happy about the safety mechanism of
 that patch), but upon more reflection (and taking into account *other*
 software that may mangle shmem settings) I think it's something at
 least worth thinking about one more time.  What killed the patch
 was an attachment to the deemed-less-safe strategy for avoiding bogus
 shmem attachments already in it, but I don't seem to recall anyone
 putting a whole lot of thought at the time into the zero-shmem case
 from what I could read on the list, because a small interlock with
 nattach seemed good-enough.
 
 I'm simply suggesting that for additional benefits it may be worth
 thinking about getting around nattach and thus SysV shmem, especially
 with regard to safety, in an open-ended way.  Maybe there's a solution
 (like Robert's FIFO suggestion?) that is not too onerous and can
 satisfy everyone.


I solved this via fcntl locking. I also set up gdb to break in critical regions 
to test the interlock and I found no flaw in the design. More eyes would be 
welcome, of course.
https://github.com/agentm/postgres/tree/posix_shmem

Cheers,
M






Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 So let's fix the 80% case with something we feel confident in, and then
 revisit the no-sysv interlock as a separate patch.  That way if we can't
 fix the interlock issues, we still have a reduced-shmem version of Postgres.

Yes.  Insisting that we have the whole change in one patch is a good way
to prevent any forward progress from happening.  As Alvaro noted, there
are plenty of issues to resolve without trying to change the interlock
mechanism at the same time.

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 In the meantime, insisting that we solve this problem before we do
 anything is a good recipe for ensuring that nothing happens, just
 like it hasn't happened for the last half dozen years.  (I see
 Alvaro just made the same point.)
 
And now so has Josh.
 
+1 from me, too.
 
-Kevin



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
A.M. age...@themactionfaction.com writes:
 This can be trivially reproduced if one runs an old (SysV shared 
 memory-based) postgresql alongside a potentially newer postgresql with a 
 smaller SysV segment. This can occur with applications that bundle postgresql 
 as part of the app.

I don't believe that that case is a counterexample to what's being
proposed (namely, grabbing a minimum-size shmem segment, perhaps 1K).
It would only fail if the old postmaster ate up *exactly* SHMMAX worth
of shmem, which is not real likely.  As a data point, on my Mac laptop
with SHMMAX set to 32MB, 9.2 will by default eat up 31624KB, leaving
more than a meg available.  Sure, that isn't enough to start another
old-style postmaster, but it would be plenty of room for one that only
wants 1K.
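
Taking the figures in the paragraph above at face value, the headroom works
out as follows.  This is only the arithmetic from the message spelled out;
the variable names are made up for the illustration.

```python
SHMMAX_KB = 32 * 1024        # SHMMAX of 32MB, expressed in KB
OLD_POSTMASTER_KB = 31624    # what 9.2 is reported to use by default
INTERLOCK_KB = 1             # the proposed minimum-size interlock segment

leftover_kb = SHMMAX_KB - OLD_POSTMASTER_KB
interlock_fits = INTERLOCK_KB <= leftover_kb

print(leftover_kb, interlock_fits)
```

That leaves 1144KB, which is the "more than a meg" mentioned above.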

Even if you actively try to configure the shmem settings to exactly
fill shmmax (which I concede some installation scripts might do),
it's going to be hard to do because of the 8K granularity of the main
knob, shared_buffers.  Moreover, an installation script that did that
would soon learn not to, because of the fact that we don't worry too
much about changing small details of shared memory consumption in minor
releases.

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Alvaro Herrera

Excerpts from Tom Lane's message of mar jun 26 18:58:45 -0400 2012:

 Even if you actively try to configure the shmem settings to exactly
 fill shmmax (which I concede some installation scripts might do),
 it's going to be hard to do because of the 8K granularity of the main
 knob, shared_buffers.

Actually it's very easy -- just try to start postmaster on a system with
not enough shmmax and it will tell you how much shmem it wants.  Then
copy that number verbatim in the config file.  This might fail on picky
systems such as MacOSX that require some exact multiple or power of some
other parameter, but it works fine on Linux.

I think the minimum you can request, at least on Linux, is 1 byte.

 Moreover, an installation script that did that
 would soon learn not to, because of the fact that we don't worry too
 much about changing small details of shared memory consumption in minor
 releases.

+1

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
A.M. age...@themactionfaction.com writes:
 On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote:
 I'm simply suggesting that for additional benefits it may be worth
 thinking about getting around nattach and thus SysV shmem, especially
 with regard to safety, in an open-ended way.

 I solved this via fcntl locking.

No, you didn't, because fcntl locks aren't inherited by child processes.
Too bad, because they'd be a great solution otherwise.

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread A.M.

On 06/26/2012 07:30 PM, Tom Lane wrote:

A.M. age...@themactionfaction.com writes:

On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote:

I'm simply suggesting that for additional benefits it may be worth
thinking about getting around nattach and thus SysV shmem, especially
with regard to safety, in an open-ended way.



I solved this via fcntl locking.


No, you didn't, because fcntl locks aren't inherited by child processes.
Too bad, because they'd be a great solution otherwise.



You claimed this last time and I replied:
http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php

I address this race condition by ensuring that a lock-holding violator 
is the postmaster or a postmaster child. If such a condition is 
detected, the child exits immediately without touching the shared 
memory. POSIX shmem is inherited via file descriptors.


This is possible because the locking API allows one to request which PID 
violates the lock. The child expects the lock to be held and checks that 
the PID is the parent. If the lock is not held, that means that the 
postmaster is dead, so the child exits immediately.


Cheers,
M



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread A.M.

On 06/26/2012 07:15 PM, Alvaro Herrera wrote:


Excerpts from Tom Lane's message of mar jun 26 18:58:45 -0400 2012:


Even if you actively try to configure the shmem settings to exactly
fill shmmax (which I concede some installation scripts might do),
it's going to be hard to do because of the 8K granularity of the main
knob, shared_buffers.


Actually it's very easy -- just try to start postmaster on a system with
not enough shmmax and it will tell you how much shmem it wants.  Then
copy that number verbatim in the config file.  This might fail on picky
systems such as MacOSX that require some exact multiple or power of some
other parameter, but it works fine on Linux.



Except that we have to account for other installers. A user can install 
an application in the future which clobbers the value and then the 
original application will fail to run. The options to get the first app 
working are:


a) to re-install the first app (potentially preventing the second app 
from running)
b) to have the first app detect the failure and readjust the value 
(guessing what it should be) and potentially forcing a reboot
c) to have the user manually adjust the value and potentially force 
a reboot


The failure usually gets blamed on the first application.

That's why we had to nuke SysV shmem.

Cheers,
M





Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Robert Haas
On Tue, Jun 26, 2012 at 6:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 So, what about keeping a FIFO in the data directory?

 Hm, does that work if the data directory is on NFS?  Or some other weird
 not-really-Unix file system?

I would expect NFS to work in general.  We could test that.  Of
course, it's more than possible that there's some bizarre device out
there that purports to be NFS but doesn't actually support mkfifo.
It's difficult to prove a negative.

 When the
 postmaster starts up, it tries to open the file with O_NONBLOCK |
 O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather
 than O_NONBLOCK).  If that succeeds, it bails out.  If it fails with
 anything other than ENXIO, it bails out.  If it fails with exactly
 ENXIO, then it opens the pipe with O_RDONLY

 ... race condition here ...

Oh, if someone tries to start two postmasters at the same time?  Hmm.

 and arranges to pass the
 file descriptor down to all of its children, so that a subsequent open
 will fail if it or any of its children are still alive.

 This might be made to work, but that doesn't sound quite right in
 detail.

 I remember we speculated about using an fcntl lock on some file in the
 data directory, but that fails because child processes don't inherit
 fcntl locks.

 In the modern world, it'd be really a step forward if the lock mechanism
 worked on shared storage, ie a data directory on NFS or similar could be
 locked against all comers not just those on the same node as the
 original postmaster.  I don't know how to do that though.

Well, I think that in theory that DOES work.  But I also think it's
often misconfigured.  Which could also be said of NFS in general.

 In the meantime, insisting that we solve this problem before we do
 anything is a good recipe for ensuring that nothing happens, just
 like it hasn't happened for the last half dozen years.  (I see Alvaro
 just made the same point.)

Agreed all around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
A.M. age...@themactionfaction.com writes:
 On 06/26/2012 07:30 PM, Tom Lane wrote:
 I solved this via fcntl locking.

 No, you didn't, because fcntl locks aren't inherited by child processes.
 Too bad, because they'd be a great solution otherwise.

 You claimed this last time and I replied:
 http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php

 I address this race condition by ensuring that a lock-holding violator 
 is the postmaster or a postmaster child. If such a condition is 
 detected, the child exits immediately without touching the shared 
 memory. POSIX shmem is inherited via file descriptors.

 This is possible because the locking API allows one to request which PID 
 violates the lock. The child expects the lock to be held and checks that 
 the PID is the parent. If the lock is not held, that means that the 
 postmaster is dead, so the child exits immediately.

OK, I went back and re-read the original patch, and I now agree that
something like this is possible --- but I don't like the way you did
it. The dependence on particular PIDs seems both unnecessary and risky.

The key concept here seems to be that the postmaster first stakes a
claim on the data directory by exclusive-locking a lock file.  If
successful, it reduces that lock to shared mode (which can be done
atomically, according to the SUS fcntl specification), and then holds
the shared lock until it exits.  Spawned children will not initially
have a lock, but what they can do is attempt to acquire shared lock on
the lock file.  If that fails, exit.  If successful, *check to see that the
parent postmaster is still alive* (ie, getppid() != 1).  If so, the
parent must have been continuously holding the lock, and the child has
successfully joined the pool of shared lock holders.  Otherwise bail
out without having changed anything.  It is the "parent is still alive"
check, not any test on individual PIDs, that makes this work.
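
The protocol in the paragraph above could be sketched roughly like this, in
Python rather than C and purely as an illustration.  All function names and
the lock file name are hypothetical.  One deliberate deviation, flagged here
as an assumption: instead of testing getppid() != 1, the child compares
getppid() with the postmaster PID it saw before the fork, since orphans are
not reparented to PID 1 on every system (and the demo itself might be running
as PID 1 in a container).

```python
import fcntl
import os
import tempfile


def postmaster_claim(lock_path):
    """Take an exclusive lock, then reduce it to shared.  None if refused."""
    fd = os.open(lock_path, os.O_RDWR)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    fcntl.lockf(fd, fcntl.LOCK_SH)  # convert the existing lock to shared mode
    return fd


def child_join(lock_path, postmaster_pid):
    """Child side: acquire a shared lock, then check the parent is alive."""
    try:
        fd = os.open(lock_path, os.O_RDWR)  # left open on purpose: holds the lock
        fcntl.lockf(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
    except OSError:
        return False
    return os.getppid() == postmaster_pid


def run_in_child(fn):
    """Run fn() in a forked process and report whether it returned True."""
    pid = os.fork()
    if pid == 0:
        try:
            ok = fn()
        except BaseException:
            ok = False
        os._exit(0 if ok else 1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


def shared_lock_demo():
    with tempfile.TemporaryDirectory() as data_dir:
        lock_path = os.path.join(data_dir, "data_dir.lock")  # hypothetical name
        with open(lock_path, "w") as f:
            f.write("Don't remove this!\n")
        fd = postmaster_claim(lock_path)
        claimed = fd is not None
        me = os.getpid()
        joined = run_in_child(lambda: child_join(lock_path, me))
        rival = run_in_child(lambda: postmaster_claim(lock_path) is not None)
        if claimed:
            os.close(fd)
        return claimed, joined, rival
```

In shared_lock_demo() the first process claims the directory, a forked child
joins the pool of shared lock holders, and a rival's exclusive claim is
refused while the shared lock is held.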

There are two concrete reasons why I don't care for the
GetPIDHoldingLock() way.  Firstly, the fact that you can get a blocking
PID from F_GETLK isn't an essential part of the concept of file locking
IMO --- it's just an incidental part of this particular API.  May I
remind you that the reason we're stuck on SysV shmem in the first place
is that we decided to depend on an incidental part of that API, namely
nattch?  I would like to not require file locking to have any semantics
more specific than "a process can hold an exclusive or a shared lock on
a file, which is auto-released at process exit".  Secondly, in an NFS
world I don't believe that the returned l_pid value can be trusted for
anything.  If it's a PID from a different machine then it might
accidentally conflict with one on our machine, or not.

Reflecting on this further, it seems to me that the main remaining
failure modes are (1) file locking doesn't work, or (2) idiot DBA
manually removes the lock file.  Both of these could be ameliorated
with some refinements to the basic idea.  For (1), I suggest that
we tweak the startup process (only) to attempt to acquire exclusive lock
on the lock file.  If it succeeds, we know that file locking is broken,
and we can complain.  (This wouldn't help for cases where cross-machine
locking is broken, but I see no practical way to detect that.)
For (2), the problem really is that the proposed patch conflates the PID
file with the lock file, but people are conditioned to think that PID
files are removable.  I suggest that we create a separate, permanently
present file that serves only as the lock file and doesn't ever get
modified (it need have no content other than the string "Don't remove
this!").  It'd be created by initdb, not by individual postmaster runs;
indeed the postmaster should fail if it doesn't find the lock file
already present.  The postmaster PID file should still exist with its
current contents, but it would serve mostly as documentation and as
server-contact information for pg_ctl; it would not be part of the data
directory locking mechanism.

I wonder whether this design can be adapted to Windows?  IIRC we do
not have a bulletproof data directory lock scheme for Windows.
It seems like this makes few enough demands on the lock mechanism
that there ought to be suitable primitives available there too.

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
I wrote:
 Reflecting on this further, it seems to me that the main remaining
 failure modes are (1) file locking doesn't work, or (2) idiot DBA
 manually removes the lock file.

Oh, wait, I just remembered the really fatal problem here: to quote from
the SUS fcntl spec,

All locks associated with a file for a given process are removed
when a file descriptor for that file is closed by that process
or the process holding that file descriptor terminates.

That carefully says "a file descriptor", not "the file descriptor
through which the lock was acquired".  Any close() referencing the lock
file will do.  That means that it is possible for perfectly innocent
code --- for example, something that scans all files in the data
directory, as say pg_basebackup might do --- to cause a backend process
to lose its lock.  When we looked at this before, it seemed like a
showstopper.  Even if we carefully taught every directory-scanning loop
in postgres not to touch the lock file, we cannot expect that for
instance a pl/perl function wouldn't accidentally break things.  And
99.999% of the time nobody would notice ... it would just be that last
0.001% of people that would be screwed.
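
The hazard described above is easy to reproduce.  The following Python
sketch (hypothetical names throughout, and assuming traditional POSIX record
locks, which is what fcntl.lockf uses) takes a shared lock, then opens and
closes the same file through an unrelated descriptor, the way a
directory-scanning loop might.  A rival process is refused an exclusive lock
before the scan and granted one afterwards.

```python
import fcntl
import os
import tempfile


def rival_can_lock(path):
    """Fork a process and report whether it can take an exclusive lock."""
    pid = os.fork()
    if pid == 0:
        try:
            fd = os.open(path, os.O_RDWR)
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os._exit(1)  # lock refused: somebody else still holds one
        os._exit(0)      # lock granted (released again when this child exits)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


def lock_lost_on_unrelated_close_demo():
    with tempfile.TemporaryDirectory() as data_dir:
        path = os.path.join(data_dir, "data_dir.lock")  # hypothetical name
        with open(path, "w") as f:
            f.write("Don't remove this!\n")

        lock_fd = os.open(path, os.O_RDWR)
        fcntl.lockf(lock_fd, fcntl.LOCK_SH)   # the lock we intend to keep
        before = rival_can_lock(path)         # refused while we hold the lock

        # "Innocent" code that merely reads the file through its own descriptor.
        with open(path) as scan:
            scan.read()
        # Closing that descriptor dropped every lock this process held on the
        # file, even though lock_fd itself is still open.
        after = rival_can_lock(path)

        os.close(lock_fd)
        return before, after
```

lock_lost_on_unrelated_close_demo() returns the rival's result before and
after the unrelated close.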

Still, this discussion has yielded a useful advance, which is that we
now see how we might safely make use of lock mechanisms that don't
inherit across fork().  We just need something less broken than fcntl().

regards, tom lane



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Robert Haas
On Tue, Jun 26, 2012 at 6:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Josh Berkus j...@agliodbs.com writes:
 So let's fix the 80% case with something we feel confident in, and then
 revisit the no-sysv interlock as a separate patch.  That way if we can't
 fix the interlock issues, we still have a reduced-shmem version of Postgres.

 Yes.  Insisting that we have the whole change in one patch is a good way
 to prevent any forward progress from happening.  As Alvaro noted, there
 are plenty of issues to resolve without trying to change the interlock
 mechanism at the same time.

So, here's a patch.  Instead of using POSIX shmem, I just took the
expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
memory.  The sysv shm is still allocated, but it's just a copy of
PGShmemHeader; the real shared memory is the anonymous block.  This
won't work if EXEC_BACKEND is defined so it just falls back on
straight sysv shm in that case.
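
For readers unfamiliar with the technique, here is a minimal Python sketch
of an anonymous shared mapping being inherited across fork().  It is not the
attached patch; the function name is made up, and the real code would call
mmap() from C.

```python
import mmap
import os


def share_via_anonymous_mmap(message):
    """Have a forked child write into an anonymous shared mapping."""
    # fileno -1 plus MAP_ANONYMOUS: memory with no backing file and no name.
    region = mmap.mmap(-1, mmap.PAGESIZE,
                       flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS)
    pid = os.fork()
    if pid == 0:
        region[:len(message)] = message  # child writes into the shared block
        os._exit(0)
    os.waitpid(pid, 0)
    data = bytes(region[:len(message)])  # parent sees the child's write
    region.close()
    return data


print(share_via_anonymous_mmap(b"hello"))
```

Because the block has no name, the only way to reach it is to inherit the
mapping through fork().  A process that goes through exec() loses the mapping
and has nothing to reattach to, which is why the EXEC_BACKEND case has to
fall back on something else.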

There are obviously some portability issues here - this is documented
not to work on Linux <= 2.4, but it's not clear whether it fails with
some suitable error code or just pretends to work and does the wrong
thing.  I tested that it does compile and work on both Linux 3.2.6 and
MacOS X 10.6.8.  And the comments probably need work and... who knows
what else is wrong.  But, thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


anonymous-shmem.patch
Description: Binary data



Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 So, here's a patch.  Instead of using POSIX shmem, I just took the
 expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
 memory.  The sysv shm is still allocated, but it's just a copy of
 PGShmemHeader; the real shared memory is the anonymous block.  This
 won't work if EXEC_BACKEND is defined so it just falls back on
 straight sysv shm in that case.

Um.  I hadn't thought about the EXEC_BACKEND interaction, but that seems
like a bit of a showstopper.  I would not like to give up the ability
to debug EXEC_BACKEND mode on Unixen.

Would Posix shmem help with that at all?  Why did you choose not to
use the Posix API, anyway?

regards, tom lane
