Re: [HACKERS] POSIX shared memory redux

2011-04-14 Thread Florian Weimer
* Tom Lane:

 Well, the fundamental point is that ignoring NFS is not the real
 world.  We can't tell people not to put data directories on NFS,
 and even if we did tell them not to, they'd still do it.  And NFS
 locking is not trustworthy, because the remote lock daemon can crash
 and restart (forgetting everything it ever knew) while your own machine
 and the postmaster remain blissfully awake.

Is this still the case with NFSv4?  Does the local daemon still keep
the lock state?

 None of this is to say that an fcntl lock might not be a useful addition
 to what we do already.  It is to say that fcntl can't just replace what
 we do already, because there are real-world failure cases that the
 current solution handles and fcntl alone wouldn't.

If it requires NFS misbehavior (possibly in an older version), and you
have to start postmasters on separate nodes (which you normally
wouldn't do), doesn't this make it increasingly unlikely that it's
going to be triggered in the wild?

-- 
Florian Weimerfwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-14 Thread A.M.

On Apr 13, 2011, at 9:30 PM, Robert Haas wrote:

 On Wed, Apr 13, 2011 at 6:11 PM, A.M. age...@themactionfaction.com wrote:
 I don't see why we need to get rid of SysV shared memory; needing less
 of it seems just as good.
 
 1. As long one keeps SysV shared memory around, the postgresql project has 
 to maintain the annoying platform-specific document on how to configure the 
 poorly named kernel parameters. If the SysV region is very small, that means 
 I can run more postgresql instances within the same kernel limits, but one 
 can still hit the limits. My patch allows the postgresql project to delete 
 that page and the hassles with it.
 
 2. My patch proves that SysV is wholly unnecessary. Are you attached to it? 
 (Pun intended.)
 
 With all due respect, I think this is an unproductive conversation.
 Your patch proves that SysV is wholly unnecessary only if we also
 agree that fcntl() locking is just as reliable as the nattch
 interlock, and Tom and I are trying to explain why we don't believe
 that's the case.  Saying that we're just wrong without responding to
 our points substantively doesn't move the conversation forward.

Sorry- it wasn't meant to be an attack- just a dumb pun. I am trying to argue 
that, even if the fcntl is unreliable, the startup procedure is just as 
reliable as it is now. The reasons being:

1) the SysV nattch method's primary purpose is to protect the shmem region. 
This is no longer necessary in my patch because the shared memory in unlinked 
immediately after creation, so only the initial postmaster and its children 
have access.

2) the standard postgresql lock file remains the same

Furthermore, there is indeed a case where the SysV nattch cannot work while the 
fcntl locking can indeed catch: if two separate machines have a postgresql data 
directory mounted over NFS, postgresql will currently allow both machines to 
start a postmaster in that directory because the SysV nattch check fails and 
then the pid in the lock file is the pid on the first machine, so postgresql 
will say starting anyway. With fcntl locking, this can be fixed. SysV only 
has presence on one kernel.


 
 In case it's not clear, here again is what we're concerned about: A
 System V shm *cannot* be removed until nobody is attached to it.  A
 lock file can be removed, or the lock can be accidentally released by
 the apparently innocuous operation of closing a file descriptor.
 
 Both you and Tom have somehow assumed that the patch alters current 
 postgresql behavior. In fact, the opposite is true. I haven't changed any of 
 the existing behavior. The robust behavior remains. I merely added fcntl 
 interlocking on top of the lock file to replace the SysV shmem check.
 
 This seems contradictory.  If you replaced the SysV shmem check, then
 it's not there, which means you altered the behavior.

From what I understood, the primary purpose of the SysV check was to protect 
the shared memory from multiple stompers. The interlock was a neat 
side-effect. 

The lock file contents are currently important to get the pid of a potential, 
conflicting postmaster. With the fcntl API, we can return a live conflicting 
PID (whether a postmaster or a stuck child), so that's an improvement. This 
could be used, for example, for STONITH, to reliably kill a dying replication 
clone- just loop on the pids returned from the lock.

Even if the fcntl check passes, the pid in the lock file is checked, so the 
lock file behavior remains the same.

If you were to implement a daemon with a shared data directory but no shared 
memory, how would implement the interlock? Would you still insist on SysV 
shmem? Unix daemons generally rely on lock files alone. Perhaps there is a 
different API on which we can agree.

Cheers,
M
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-14 Thread A.M.

On Apr 14, 2011, at 8:22 AM, Florian Weimer wrote:

 * Tom Lane:
 
 Well, the fundamental point is that ignoring NFS is not the real
 world.  We can't tell people not to put data directories on NFS,
 and even if we did tell them not to, they'd still do it.  And NFS
 locking is not trustworthy, because the remote lock daemon can crash
 and restart (forgetting everything it ever knew) while your own machine
 and the postmaster remain blissfully awake.
 
 Is this still the case with NFSv4?  Does the local daemon still keep
 the lock state?

The lock handling has been fixed in NFSv4.

http://nfs.sourceforge.net/
NFS Version 4 introduces support for byte-range locking and share reservation. 
Locking in NFS Version 4 is lease-based, so an NFS Version 4 client must 
maintain contact with an NFS Version 4 server to continue extending its open 
and lock leases.

http://linux.die.net/man/2/flock
flock(2) does not lock files over NFS. Use fcntl(2) instead: that does work 
over NFS, given a sufficiently recent version of Linux and a server which 
supports locking.

I would need some more time to dig up what recent version of Linux specifies, 
but NFSv4 is likely required.

 
 None of this is to say that an fcntl lock might not be a useful addition
 to what we do already.  It is to say that fcntl can't just replace what
 we do already, because there are real-world failure cases that the
 current solution handles and fcntl alone wouldn't.
 
 If it requires NFS misbehavior (possibly in an older version), and you
 have to start postmasters on separate nodes (which you normally
 wouldn't do), doesn't this make it increasingly unlikely that it's
 going to be triggered in the wild?

With the patch I offer, it would be possible to use shared storage and failover 
postgresql nodes on different machines over NFS. (The second postmaster blocks 
and waits for the lock to be released.) Obviously, such as a setup isn't as 
strong as using replication, but given a sufficiently fail-safe shared storage 
setup, it could be made reliable.

Cheers,
M



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-14 Thread A.M.

On Apr 13, 2011, at 11:37 PM, Tom Lane wrote:

 A.M. age...@themactionfaction.com writes:
 1. As long one keeps SysV shared memory around, the postgresql project
 has to maintain the annoying platform-specific document on how to
 configure the poorly named kernel parameters.
 
 No, if it's just a small area, I don't see that that's an issue.
 You're going to max out on other things (like I/O bandwidth) long before
 you run into the limit on how many postmasters you can have from this.
 The reason that those parameters are problematic now is that people tend
 to want *large* shmem segments and the typical defaults aren't friendly
 to that.

That's assuming that no other processes on the system are using up the 
available segments (such as older postgresql instances).


 2. My patch proves that SysV is wholly unnecessary. Are you attached to it? 
 (Pun intended.)
 
 You were losing this argument already, but ad hominem attacks are pretty
 much guaranteed to get people to tune you out.  

I apologized to Robert Haas in another post- no offense was intended.

 There are real,
 substantive, unfixable reasons to not trust fcntl locking completely.

...on NFS which the postgresql community doesn't recommend anyway. But even in 
that case, the existing lock file (even without the fcntl lock), can catch that 
case via the PID in the file contents. That is what I meant when I claimed that 
the behavior remains the same.

 
 I would encourage you to take a look at the patch.
 
 Just to be perfectly clear: I have not read your patch, and am not
 likely to before the next commitfest starts, because I have
 approximately forty times too many things to do already.  I'm just going
 off your own abbreviated description of the patch.  But from what I know
 about fcntl locking, it's not a sufficient substitute for the SysV shmem
 interlock, because it has failure modes that the SysV interlock doesn't,
 and those failure modes occur in real-world cases.  Yeah, it'd be nice
 to also be able to lock against other postmasters on other NFS clients,
 but we hardly ever hear of somebody getting burnt by the lack of that
 (and fcntl wouldn't be a bulletproof defense anyway).  On the other
 hand, accidentally trying to start a duplicate postmaster on the same
 machine is an everyday occurrence.

I really do appreciate the time you have put into feedback. I posed this 
question also to Robert Haas: is there a different API which you would find 
acceptable? I chose fcntl because it seemed well-suited for this task, but the 
feedback has been regarding NFS v4 concerns.

Cheers,
M
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-14 Thread Martijn van Oosterhout
On Thu, Apr 14, 2011 at 10:26:33AM -0400, A.M. wrote:
 1) the SysV nattch method's primary purpose is to protect the shmem
 region. This is no longer necessary in my patch because the shared
 memory in unlinked immediately after creation, so only the initial
 postmaster and its children have access.

Umm, you don't unlink SysV shared memory. All the flag does is make
sure it goes away when the last user goes away. In the mean time people
can still connect to it.

 The lock file contents are currently important to get the pid of a
 potential, conflicting postmaster. With the fcntl API, we can return
 a live conflicting PID (whether a postmaster or a stuck child), so
 that's an improvement. This could be used, for example, for STONITH,
 to reliably kill a dying replication clone- just loop on the pids
 returned from the lock.

SysV shared memory also gives you a PID, that's the point.
 
 Even if the fcntl check passes, the pid in the lock file is checked, so the 
 lock file behavior remains the same.

The interlock is to make sure there are no living postmaster children.
The lockfile won't tell you that. So the issue is that while fcntl can
work, sysv can do better.

Also, I think you underestimate the value of the current interlock.
Before this people did manage to trash their databases regularly this
way. Lockfiles can be deleted and yes, people do it all the time.

Actually, it occurs to me you can solve NFS problem by putting the
lockfile in the socket dir. That can't possibly be on NFS.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Patriotism is when love of your own people comes first; nationalism,
 when hate for people other than your own comes first. 
   - Charles de Gaulle


signature.asc
Description: Digital signature


Re: [HACKERS] POSIX shared memory redux

2011-04-14 Thread Robert Haas
On Thu, Apr 14, 2011 at 7:26 AM, A.M. age...@themactionfaction.com wrote:
 From what I understood, the primary purpose of the SysV check was to protect 
 the shared memory from multiple stompers. The interlock was a neat 
 side-effect.

Not really - the purpose of the interlock is to protect the underlying
data files.  The nattch interlock allows us to be very confident that
there isn't another postmaster running on the same data directory on
the same machine, and that is extremely important.

You've just about convinced me that it might not be a bad idea to add
the fcntl() interlock in addition because, as you say, that has a
chance of working even over NFS.  But the interlock we have now is
*extremely* reliable, and I think we'd need to get some other
amazingly compelling benefit to consider changing it (even if we were
convinced that the alternate method was also reliable).  I don't see
that there is one.  Anyone running an existing version of PostgreSQL
in an environment where they care *at all* about performance has
already adjusted their SysV shm settings way up.  The benefit of using
POSIX shm is that in, say, PostgreSQL 9.2, it might be possible for
shared buffers to have a somewhat higher default setting out of the
box, and be further increased from there without kernel parameter
changes.  And there might be more benefits besides that, but certainly
those by themselves seem pretty worthwhile.  SysV shm is extremely
portable, so I don't think that we're losing anything by continuing to
allocate a small amount of it (a few kilobytes, perhaps) and just push
everything else out into POSIX shm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-13 Thread Tom Lane
A.M. age...@themactionfaction.com writes:
 On Apr 11, 2011, at 7:13 PM, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
 I mean I'm not convinced that fcntl() locking will be as reliable.

 I'm not either.  Particularly not on NFS.

 Is there an example of a recent system where fcntl is broken (ignoring NFS)?

Well, the fundamental point is that ignoring NFS is not the real
world.  We can't tell people not to put data directories on NFS,
and even if we did tell them not to, they'd still do it.  And NFS
locking is not trustworthy, because the remote lock daemon can crash
and restart (forgetting everything it ever knew) while your own machine
and the postmaster remain blissfully awake.

None of this is to say that an fcntl lock might not be a useful addition
to what we do already.  It is to say that fcntl can't just replace what
we do already, because there are real-world failure cases that the
current solution handles and fcntl alone wouldn't.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-13 Thread A . M .

On Apr 13, 2011, at 2:06 AM, Tom Lane wrote:

 A.M. age...@themactionfaction.com writes:
 On Apr 11, 2011, at 7:13 PM, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
 I mean I'm not convinced that fcntl() locking will be as reliable.
 
 I'm not either.  Particularly not on NFS.
 
 Is there an example of a recent system where fcntl is broken (ignoring NFS)?
 
 Well, the fundamental point is that ignoring NFS is not the real
 world.  We can't tell people not to put data directories on NFS,
 and even if we did tell them not to, they'd still do it.  And NFS
 locking is not trustworthy, because the remote lock daemon can crash
 and restart (forgetting everything it ever knew) while your own machine
 and the postmaster remain blissfully awake.
 
 None of this is to say that an fcntl lock might not be a useful addition
 to what we do already.  It is to say that fcntl can't just replace what
 we do already, because there are real-world failure cases that the
 current solution handles and fcntl alone wouldn't.


The goal of this patch is to eliminate SysV shared memory, not to implement 
NFS-capable locking which, as you point out, is virtually impossible.

As far as I can tell, in the worst case, my patch does not change how 
postgresql handles the NFS case. SysV shared memory won't work across NFS, so 
that interlock won't catch, so postgresql is left with looking at a lock file 
with PID of process on another machine, so that won't catch either. This patch 
does not alter the lock file semantics, but merely augments the file with file 
locking.

At least with this patch, there is a chance the lock might work across NFS. In 
the best case, it can allow for shared-storage postgresql failover, which is a 
new feature.

Furthermore, there is an improvement in shared memory handling in that it is 
unlinked immediately after creation, so only the postmaster and its children 
have access to it (through file descriptor inheritance). This means shared 
memory cannot be stomped on by any other process.

Considering that possibly working NFS locking is a side-effect of this patch 
and not its goal and, in the worst possible scenario, it doesn't change current 
behavior, I don't see how this can be a ding against this patch.

Cheers,
M
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-13 Thread Robert Haas
On Wed, Apr 13, 2011 at 7:20 AM, A.M. age...@themactionfaction.com wrote:
 The goal of this patch is to eliminate SysV shared memory, not to implement 
 NFS-capable locking which, as you point out, is virtually impossible.

 As far as I can tell, in the worst case, my patch does not change how 
 postgresql handles the NFS case. SysV shared memory won't work across NFS, so 
 that interlock won't catch, so postgresql is left with looking at a lock file 
 with PID of process on another machine, so that won't catch either. This 
 patch does not alter the lock file semantics, but merely augments the file 
 with file locking.

 At least with this patch, there is a chance the lock might work across NFS. 
 In the best case, it can allow for shared-storage postgresql failover, which 
 is a new feature.

 Furthermore, there is an improvement in shared memory handling in that it is 
 unlinked immediately after creation, so only the postmaster and its children 
 have access to it (through file descriptor inheritance). This means shared 
 memory cannot be stomped on by any other process.

 Considering that possibly working NFS locking is a side-effect of this patch 
 and not its goal and, in the worst possible scenario, it doesn't change 
 current behavior, I don't see how this can be a ding against this patch.

I don't see why we need to get rid of SysV shared memory; needing less
of it seems just as good.

In answer to your off-list question, one of the principle ways I've
seen fcntl() locking fall over and die is when someone removes the
lock file.  You might think that this could be avoided by picking
something important like pg_control as the log file, but it turns out
that doesn't really work:

http://0pointer.de/blog/projects/locking.html

Tom's point is valid too.  Many storage appliances present themselves
as an NFS server, so it's very plausible for the data directory to be
on an NFS server, and there's no guarantee that flock() won't be
broken there.  If our current interlock were known to be unreliable
also maybe we wouldn't care very much, but AFAICT it's been extremely
robust.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-13 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 In answer to your off-list question, one of the principle ways I've
 seen fcntl() locking fall over and die is when someone removes the
 lock file.  You might think that this could be avoided by picking
 something important like pg_control as the log file, but it turns out
 that doesn't really work:

 http://0pointer.de/blog/projects/locking.html

Hm, I wasn't aware of the fact that you lose an fcntl() lock if you
close a *different* file descriptor for the same file.  My goodness,
that's horrid :-(.  It basically means that any third-party code running
in a backend could accidentally defeat the locking, and most of the time
you'd never even know it had happened ... as opposed to what would
happen if you detached from shmem for instance.  You could run with such
code for years, and probably never know why sometimes the interlock
failed to work when you needed it to.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-13 Thread A.M.

On Apr 13, 2011, at 8:36 PM, Robert Haas wrote:
 
 I don't see why we need to get rid of SysV shared memory; needing less
 of it seems just as good.

1. As long one keeps SysV shared memory around, the postgresql project has to 
maintain the annoying platform-specific document on how to configure the poorly 
named kernel parameters. If the SysV region is very small, that means I can run 
more postgresql instances within the same kernel limits, but one can still hit 
the limits. My patch allows the postgresql project to delete that page and the 
hassles with it.

2. My patch proves that SysV is wholly unnecessary. Are you attached to it? 
(Pun intended.)

 
 In answer to your off-list question, one of the principle ways I've
 seen fcntl() locking fall over and die is when someone removes the
 lock file.  You might think that this could be avoided by picking
 something important like pg_control as the log file, but it turns out
 that doesn't really work:
 
 http://0pointer.de/blog/projects/locking.html
 
 Tom's point is valid too.  Many storage appliances present themselves
 as an NFS server, so it's very plausible for the data directory to be
 on an NFS server, and there's no guarantee that flock() won't be
 broken there.  If our current interlock were known to be unreliable
 also maybe we wouldn't care very much, but AFAICT it's been extremely
 robust.

Both you and Tom have somehow assumed that the patch alters current postgresql 
behavior. In fact, the opposite is true. I haven't changed any of the existing 
behavior. The robust behavior remains. I merely added fcntl interlocking on 
top of the lock file to replace the SysV shmem check. If someone deletes the 
postgresql lock file over NFS, the data directory is equally screwed, but with 
my patch there is chance that two machines sharing a properly-configured NFS 
mount can properly interlock- postgresql cannot offer that today, so this is a 
feature upgrade with no loss. The worst case scenario is today's behavior.

My original goal remains to implement POSIX shared memory, but Tom Lane was 
right to point out that the current interlocking check relies on SysV, so, even 
though the startup locking is really orthogonal to shared memory, I implemented 
what could be considered a separate patch for that and rolled it into one.

I would encourage you to take a look at the patch.

Cheers,
M
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-13 Thread Robert Haas
On Wed, Apr 13, 2011 at 6:11 PM, A.M. age...@themactionfaction.com wrote:
 I don't see why we need to get rid of SysV shared memory; needing less
 of it seems just as good.

 1. As long one keeps SysV shared memory around, the postgresql project has to 
 maintain the annoying platform-specific document on how to configure the 
 poorly named kernel parameters. If the SysV region is very small, that means 
 I can run more postgresql instances within the same kernel limits, but one 
 can still hit the limits. My patch allows the postgresql project to delete 
 that page and the hassles with it.

 2. My patch proves that SysV is wholly unnecessary. Are you attached to it? 
 (Pun intended.)

With all due respect, I think this is an unproductive conversation.
Your patch proves that SysV is wholly unnecessary only if we also
agree that fcntl() locking is just as reliable as the nattch
interlock, and Tom and I are trying to explain why we don't believe
that's the case.  Saying that we're just wrong without responding to
our points substantively doesn't move the conversation forward.

In case it's not clear, here again is what we're concerned about: A
System V shm *cannot* be removed until nobody is attached to it.  A
lock file can be removed, or the lock can be accidentally released by
the apparently innocuous operation of closing a file descriptor.

 Both you and Tom have somehow assumed that the patch alters current 
 postgresql behavior. In fact, the opposite is true. I haven't changed any of 
 the existing behavior. The robust behavior remains. I merely added fcntl 
 interlocking on top of the lock file to replace the SysV shmem check.

This seems contradictory.  If you replaced the SysV shmem check, then
it's not there, which means you altered the behavior.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-13 Thread Tom Lane
A.M. age...@themactionfaction.com writes:
 1. As long one keeps SysV shared memory around, the postgresql project
 has to maintain the annoying platform-specific document on how to
 configure the poorly named kernel parameters.

No, if it's just a small area, I don't see that that's an issue.
You're going to max out on other things (like I/O bandwidth) long before
you run into the limit on how many postmasters you can have from this.
The reason that those parameters are problematic now is that people tend
to want *large* shmem segments and the typical defaults aren't friendly
to that.

 2. My patch proves that SysV is wholly unnecessary. Are you attached to it? 
 (Pun intended.)

You were losing this argument already, but ad hominem attacks are pretty
much guaranteed to get people to tune you out.  There are real,
substantive, unfixable reasons to not trust fcntl locking completely.

 I would encourage you to take a look at the patch.

Just to be perfectly clear: I have not read your patch, and am not
likely to before the next commitfest starts, because I have
approximately forty times too many things to do already.  I'm just going
off your own abbreviated description of the patch.  But from what I know
about fcntl locking, it's not a sufficient substitute for the SysV shmem
interlock, because it has failure modes that the SysV interlock doesn't,
and those failure modes occur in real-world cases.  Yeah, it'd be nice
to also be able to lock against other postmasters on other NFS clients,
but we hardly ever hear of somebody getting burnt by the lack of that
(and fcntl wouldn't be a bulletproof defense anyway).  On the other
hand, accidentally trying to start a duplicate postmaster on the same
machine is an everyday occurrence.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-11 Thread Robert Haas
On Sun, Apr 10, 2011 at 5:03 PM, A.M. age...@themactionfaction.com wrote:
 To ensure that no two postmasters can startup in the same data directory, I 
 use fcntl range locking on the data directory lock file, which also works 
 properly on (properly configured) NFS volumes. Whenever a postmaster or 
 postmaster child starts, it acquires a read (non-exclusive) lock on the data 
 directory's lock file. When a new postmaster starts, it queries if anything 
 would block a write (exclusive) lock on the lock file which returns a 
 lock-holding PID in the case when other postgresql processes are running.

This seems a lot leakier than what we do now (imagine, for example,
shared storage) and I'm not sure what the advantage is.  I was
imagining keeping some portion of the data in sysv shm, and moving the
big stuff to a POSIX shm that would operate alongside it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-11 Thread A.M.

On Apr 11, 2011, at 6:06 PM, Robert Haas wrote:

 On Sun, Apr 10, 2011 at 5:03 PM, A.M. age...@themactionfaction.com wrote:
 To ensure that no two postmasters can startup in the same data directory, I 
 use fcntl range locking on the data directory lock file, which also works 
 properly on (properly configured) NFS volumes. Whenever a postmaster or 
 postmaster child starts, it acquires a read (non-exclusive) lock on the data 
 directory's lock file. When a new postmaster starts, it queries if anything 
 would block a write (exclusive) lock on the lock file which returns a 
 lock-holding PID in the case when other postgresql processes are running.
 
 This seems a lot leakier than what we do now (imagine, for example,
 shared storage) and I'm not sure what the advantage is.  I was
 imagining keeping some portion of the data in sysv shm, and moving the
 big stuff to a POSIX shm that would operate alongside it.

What do you mean by leakier? The goal here is to extinguish SysV shared 
memory for portability and convenience benefits. The mini-SysV proposal was 
implemented and shot down by Tom Lane.

Cheers,
M
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-11 Thread Robert Haas
On Mon, Apr 11, 2011 at 3:11 PM, A.M. age...@themactionfaction.com wrote:

 On Apr 11, 2011, at 6:06 PM, Robert Haas wrote:

 On Sun, Apr 10, 2011 at 5:03 PM, A.M. age...@themactionfaction.com wrote:
 To ensure that no two postmasters can startup in the same data directory, I 
 use fcntl range locking on the data directory lock file, which also works 
 properly on (properly configured) NFS volumes. Whenever a postmaster or 
 postmaster child starts, it acquires a read (non-exclusive) lock on the 
 data directory's lock file. When a new postmaster starts, it queries if 
 anything would block a write (exclusive) lock on the lock file which 
 returns a lock-holding PID in the case when other postgresql processes are 
 running.

 This seems a lot leakier than what we do now (imagine, for example,
 shared storage) and I'm not sure what the advantage is.  I was
 imagining keeping some portion of the data in sysv shm, and moving the
 big stuff to a POSIX shm that would operate alongside it.

 What do you mean by leakier? The goal here is to extinguish SysV shared 
 memory for portability and convenience benefits. The mini-SysV proposal was 
 implemented and shot down by Tom Lane.

I mean I'm not convinced that fcntl() locking will be as reliable.

I know Tom shot that down before, but I still think it's probably the
best way forward.  The advantage I see is that we would be able to
more easily allocate larger chunks of shared memory with changing
kernel parameters, and perhaps even to dynamically resize shared
memory chunks.  That'd be worth the price of admission even if we
didn't get all those benefits in one commit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-11 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Apr 11, 2011 at 3:11 PM, A.M. age...@themactionfaction.com wrote:
 What do you mean by leakier? The goal here is to extinguish SysV shared 
 memory for portability and convenience benefits. The mini-SysV proposal was 
 implemented and shot down by Tom Lane.

 I mean I'm not convinced that fcntl() locking will be as reliable.

I'm not either.  Particularly not on NFS.  (Although on NFS you have
other issues to worry about too, like postmasters on different machines
being able to reach the same data directory.  I wonder if we should do
both SysV and fcntl locking ...)

 I know Tom shot that down before, but I still think it's probably the
 best way forward.

Did I?  I think I pointed out that there's zero gain in portability as
long as we still depend on SysV shmem to work.  However, if you're doing
it for other reasons than portability, it might make sense anyway.  The
question is whether there are adequate other reasons.

 The advantage I see is that we would be able to
 more easily allocate larger chunks of shared memory with changing
 kernel parameters,

Yes, getting out from under the SHMMAX bugaboo would be awfully nice.

 and perhaps even to dynamically resize shared memory chunks.

This I don't really believe will ever work reliably, especially not in
32-bit machines.  Whatever your kernel API is, you still have the
problem of finding address space contiguous to what you were already
using.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-11 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Apr 10, 2011 at 5:03 PM, A.M. age...@themactionfaction.com wrote:
 To ensure that no two postmasters can startup in the same data directory, I 
 use fcntl range locking on the data directory lock file, which also works 
 properly on (properly configured) NFS volumes. Whenever a postmaster or 
 postmaster child starts, it acquires a read (non-exclusive) lock on the data 
 directory's lock file. When a new postmaster starts, it queries if anything 
 would block a write (exclusive) lock on the lock file which returns a 
 lock-holding PID in the case when other postgresql processes are running.

 This seems a lot leakier than what we do now (imagine, for example,
 shared storage) and I'm not sure what the advantage is.

BTW, the above-described solution flat out doesn't work anyway, because
it has a race condition.  Postmaster children have to reacquire the lock
after forking, because fcntl locks aren't inherited during fork().  And
that means you can't tell whether there's a just-started backend that
hasn't yet acquired the lock.  It's really critical for our purposes
that SysV shmem segments are inherited at fork() and so there's no
window where a just-forked backend isn't visible to somebody checking
the state of the shmem segment.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-11 Thread A.M.

On Apr 11, 2011, at 7:25 PM, Tom Lane wrote:

 Robert Haas robertmh...@gmail.com writes:
 On Sun, Apr 10, 2011 at 5:03 PM, A.M. age...@themactionfaction.com wrote:
 To ensure that no two postmasters can startup in the same data directory, I 
 use fcntl range locking on the data directory lock file, which also works 
 properly on (properly configured) NFS volumes. Whenever a postmaster or 
 postmaster child starts, it acquires a read (non-exclusive) lock on the 
 data directory's lock file. When a new postmaster starts, it queries if 
 anything would block a write (exclusive) lock on the lock file which 
 returns a lock-holding PID in the case when other postgresql processes are 
 running.
 
 This seems a lot leakier than what we do now (imagine, for example,
 shared storage) and I'm not sure what the advantage is.
 
 BTW, the above-described solution flat out doesn't work anyway, because
 it has a race condition.  Postmaster children have to reacquire the lock
 after forking, because fcntl locks aren't inherited during fork().  And
 that means you can't tell whether there's a just-started backend that
 hasn't yet acquired the lock.  It's really critical for our purposes
 that SysV shmem segments are inherited at fork() and so there's no
 window where a just-forked backend isn't visible to somebody checking
 the state of the shmem segment.

Then you haven't looked at my patch because I address this race condition by 
ensuring that a lock-holding violator is the postmaster or a postmaster child. 
If such as condition is detected, the child exits immediately without touching 
the shared memory. POSIX shmem is inherited via file descriptors.

Cheers,
M
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2011-04-11 Thread A.M.

On Apr 11, 2011, at 7:13 PM, Tom Lane wrote:

 Robert Haas robertmh...@gmail.com writes:
 On Mon, Apr 11, 2011 at 3:11 PM, A.M. age...@themactionfaction.com wrote:
 What do you mean by leakier? The goal here is to extinguish SysV shared 
 memory for portability and convenience benefits. The mini-SysV proposal was 
 implemented and shot down by Tom Lane.
 
 I mean I'm not convinced that fcntl() locking will be as reliable.
 
 I'm not either.  Particularly not on NFS.  (Although on NFS you have
 other issues to worry about too, like postmasters on different machines
 being able to reach the same data directory.  I wonder if we should do
 both SysV and fcntl locking ...)

Is there an example of a recent system where fcntl is broken (ignoring NFS)? I 
believe my patch addresses all potential race conditions and uses the APIs 
properly to guarantee single-postmaster data directory usage and I tested on 
Darwin and a two-year-old Linux kernel. In the end, fcntl locking relies on the 
same kernel which provides the SysV user count, so I'm not sure what makes it 
less reliable, but I have heard that twice now, so I am open to hearing about 
your experiences.

 I know Tom shot that down before, but I still think it's probably the
 best way forward.
 
 Did I?  I think I pointed out that there's zero gain in portability as
 long as we still depend on SysV shmem to work.  However, if you're doing
 it for other reasons than portability, it might make sense anyway.  The
 question is whether there are adequate other reasons.

I provided an example of postmaster-failover relying on F_SETLKW in the email 
with the patch. Also, as you point out above, fcntl locking at least has a 
chance of working over NFS.

 
 The advantage I see is that we would be able to
 more easily allocate larger chunks of shared memory with changing
 kernel parameters,
 
 Yes, getting out from under the SHMMAX bugaboo would be awfully nice.

Yes, please! That is my primary motivation for this patch.

 
 and perhaps even to dynamically resize shared memory chunks.
 
 This I don't really believe will ever work reliably, especially not in
 32-bit machines.  Whatever your kernel API is, you still have the
 problem of finding address space contiguous to what you were already
 using.

Even if expanding shmem involves copying large regions of memory, it could be 
at least useful to adjust buffer sizes live without a restart.

Cheers,
M


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2010-11-14 Thread Martijn van Oosterhout
On Sat, Nov 13, 2010 at 08:07:52PM -0500, Tom Lane wrote:
 A.M. age...@themactionfaction.com writes:
  The goal of this work is to address all of the shortcomings of previous 
  POSIX shared memory patches as pointed out mostly by Tom Lane.
 
 It seems like you've failed to understand the main shortcoming of this
 whole idea, which is the loss of ability to detect pre-existing backends
 still running in a cluster whose postmaster has crashed.  The nattch
 variable of SysV shmem segments is really pretty critical to us, and
 AFAIK there simply is no substitute for it in POSIX-land.

I've been looking and there really doesn't appear to be. This is
consistant as there is nothing else in POSIX where you can determine
how many other people have the same file, pipe, tty, etc open.

I asked a few people for ideas and got answers like: just walk through
/proc and check. Apart from the portability issues, this won't work if
there are different user-IDs in play.

The only real solution seems to me to be to keep a small SysV shared
memory segment for the locking and allocate the rest of the shared
memory some other way. If all backends map the SysV memory before the
other way, then you can use the non-existance of the SysV SHM to
determine the non-existance of the other segment.

Quite a bit more work, ISTM.

Haveva nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Patriotism is when love of your own people comes first; nationalism,
 when hate for people other than your own comes first. 
   - Charles de Gaulle


signature.asc
Description: Digital signature


Re: [HACKERS] POSIX shared memory redux

2010-11-14 Thread Tom Lane
Martijn van Oosterhout klep...@svana.org writes:
 The only real solution seems to me to be to keep a small SysV shared
 memory segment for the locking and allocate the rest of the shared
 memory some other way.

Yeah, that's been discussed.  It throws all the portability gains out
the window.  It might get you out from under the need to readjust a
machine's SHMMAX setting before you can use a large amount of shared
memory, but it's not clear that's enough of a win to be worth the
trouble.

The other direction that we could possibly go is to find some other way
entirely of interlocking access to the data directory.  If for example
we could rely on a file lock held by the postmaster and all backends,
we could check that instead of having to rely on a shmem behavior.
The killer objection to that so far is that file locking is unreliable
in some environments, particularly NFS.  But it'd have some advantages
too --- in particular, in the NFS context, the fact that the lock is
visible to would-be postmasters on different machines might be thought
a huge safety improvement over what we do now.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2010-11-14 Thread Robert Haas
On Sun, Nov 14, 2010 at 11:06 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Martijn van Oosterhout klep...@svana.org writes:
 The only real solution seems to me to be to keep a small SysV shared
 memory segment for the locking and allocate the rest of the shared
 memory some other way.

 Yeah, that's been discussed.  It throws all the portability gains out
 the window.  It might get you out from under the need to readjust a
 machine's SHMMAX setting before you can use a large amount of shared
 memory, but it's not clear that's enough of a win to be worth the
 trouble.

One of the things that would be really nice to be able to do is resize
our shm after startup, in response to changes in configuration
parameters.  That's not so easy to make work, of course, but I feel
like this might be going in the right direction, since POSIX shms can
be resized using ftruncate().

 The other direction that we could possibly go is to find some other way
 entirely of interlocking access to the data directory.  If for example
 we could rely on a file lock held by the postmaster and all backends,
 we could check that instead of having to rely on a shmem behavior.
 The killer objection to that so far is that file locking is unreliable
 in some environments, particularly NFS.  But it'd have some advantages
 too --- in particular, in the NFS context, the fact that the lock is
 visible to would-be postmasters on different machines might be thought
 a huge safety improvement over what we do now.

I've never had a lot of luck making filesystem locks work reliably,
but I don't discount the possibility that I was doing it wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX shared memory redux

2010-11-13 Thread Tom Lane
A.M. age...@themactionfaction.com writes:
 The goal of this work is to address all of the shortcomings of previous POSIX 
 shared memory patches as pointed out mostly by Tom Lane.

It seems like you've failed to understand the main shortcoming of this
whole idea, which is the loss of ability to detect pre-existing backends
still running in a cluster whose postmaster has crashed.  The nattch
variable of SysV shmem segments is really pretty critical to us, and
AFAIK there simply is no substitute for it in POSIX-land.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers