Re: [PATCHES] Nameless IPC on POSIX systems

2005-05-06 Thread Tom Lane
Dag-Erling Smørgrav <[EMAIL PROTECTED]> writes:
> Tom Lane <[EMAIL PROTECTED]> writes:
>> The check we need is "are there any other processes (still) attached to
>> this shmem" and AFAIK that is not available in the mmap API.  Do you
>> know how to get it?

> You can hack something up with fcntl() locks.  If a process has a
> shared lock on the shm file, F_GETLK will get you its pid.  Then grab
> your own shared lock.

Seems fairly race-condition-prone: what about recently spawned child
processes that haven't yet taken their own locks?  If I read the fork()
page correctly, a forked child doesn't inherit any file locks.
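
The race window can be shown with a small test. The sketch below is illustrative only and is not part of the patch: the function names and the temporary file path are made up for the example. The parent takes a shared fcntl() lock and forks, and a write-lock probe then finds no conflicting lock even though the child is alive and shares the open file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a whole-file fcntl() lock command (F_SETLK or F_GETLK) of the given type. */
static int whole_file_lock(int fd, int cmd, struct flock *fl, short type)
{
    assert(fd >= 0 && fl != NULL);
    fl->l_type = type;
    fl->l_whence = SEEK_SET;
    fl->l_start = 0;
    fl->l_len = 0;              /* 0 = through end of file */
    return fcntl(fd, cmd, fl);
}

/*
 * Hypothetical demo: returns 1 if a live forked child is invisible to a
 * lock probe, 0 if the probe sees it, -1 on setup failure.
 */
int child_is_invisible_to_probe(void)
{
    char path[] = "/tmp/shm_fork_demoXXXXXX";
    struct flock fl = {0};
    int hold[2];
    char c;
    int fd = mkstemp(path);

    if (fd < 0 || pipe(hold) != 0)
        return -1;
    unlink(path);

    /* The parent takes its shared lock before forking. */
    if (whole_file_lock(fd, F_SETLK, &fl, F_RDLCK) == -1)
        return -1;

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        close(hold[1]);
        /* Stay alive until the parent closes its end of the pipe. */
        if (read(hold[0], &c, 1) < 0)
            _exit(1);
        _exit(0);
    }
    close(hold[0]);

    /* F_GETLK never reports the caller's own locks, only other processes'. */
    int rc = whole_file_lock(fd, F_GETLK, &fl, F_WRLCK);
    int invisible = (rc == 0 && fl.l_type == F_UNLCK);

    close(hold[1]);             /* EOF lets the child exit */
    waitpid(child, NULL, 0);
    close(fd);
    return invisible;
}
```

Until the child takes its own lock, nothing distinguishes "no other process is attached" from "a freshly forked process has not locked yet".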

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PATCHES] Nameless IPC on POSIX systems

2005-05-06 Thread Dag-Erling Smørgrav
Tom Lane <[EMAIL PROTECTED]> writes:
> The check we need is "are there any other processes (still) attached to
> this shmem" and AFAIK that is not available in the mmap API.  Do you
> know how to get it?

You can hack something up with fcntl() locks.  If a process has a
shared lock on the shm file, F_GETLK will get you its pid.  Then grab
your own shared lock.
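
A minimal sketch of the idea, assuming traditional POSIX record locks on a scratch file standing in for the shm file; the helper names and the temporary path are hypothetical. Note that the probe has to ask about a write lock: a read-lock probe does not conflict with other processes' shared locks and would report nothing.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Build a whole-file lock request of the given type. */
static struct flock whole_file(short type)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 = through end of file */
    return fl;
}

/* Take a shared lock without blocking; 0 on success, -1 on failure. */
int take_shared_lock(int fd)
{
    struct flock fl = whole_file(F_RDLCK);
    return fcntl(fd, F_SETLK, &fl);
}

/* Pid of another process holding a lock on fd, 0 if none, -1 on error. */
pid_t other_lock_holder(int fd)
{
    struct flock fl = whole_file(F_WRLCK);
    assert(fd >= 0);
    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;
    return (fl.l_type == F_UNLCK) ? 0 : fl.l_pid;
}

/* Hypothetical demo: 1 if the probe reports the locking child's pid. */
int lock_probe_demo(void)
{
    char path[] = "/tmp/shm_lock_demoXXXXXX";
    int ready[2], done[2];
    char c = 0;
    int fd = mkstemp(path);

    if (fd < 0 || pipe(ready) != 0 || pipe(done) != 0)
        return -1;
    unlink(path);

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        c = (take_shared_lock(fd) == 0) ? 'y' : 'n';
        if (write(ready[1], &c, 1) != 1)
            _exit(1);
        /* Hold the lock until the parent has finished probing. */
        if (read(done[0], &c, 1) < 0)
            _exit(1);
        _exit(0);
    }

    int ok = read(ready[0], &c, 1) == 1 && c == 'y'
             && other_lock_holder(fd) == child;
    if (write(done[1], "x", 1) != 1)
        ok = -1;
    waitpid(child, NULL, 0);
    close(fd);
    close(ready[0]); close(ready[1]);
    close(done[0]); close(done[1]);
    return ok;
}
```

F_GETLK reports only one conflicting lock, so this answers "is anyone attached" rather than "who is attached".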

DES
-- 
Dag-Erling Smørgrav - [EMAIL PROTECTED]

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PATCHES] Nameless IPC on POSIX systems

2005-05-06 Thread Tom Lane
Dag-Erling Smørgrav <[EMAIL PROTECTED]> writes:
> You can use file-backed shared memory instead.  You need a directory
> which you know is writeable and unique to this instance, on a file
> system with enough free space to accommodate the full size of the
> shared memory segment.  DataDir is probably a good choice.  If the
> file does not exist, you create it at startup.  If it does exist, you
> map it in and perform the same checks as in the SysV case.

The check we need is "are there any other processes (still) attached to
this shmem" and AFAIK that is not available in the mmap API.  Do you
know how to get it?

> Anyway, I'm not sure you fully understand the problem this patch
> addresses.

Yes, I do.  I'm not interested in substituting a risk of data corruption
for them.

regards, tom lane



Re: [PATCHES] Nameless IPC on POSIX systems

2005-05-06 Thread Dag-Erling Smørgrav
Tom Lane <[EMAIL PROTECTED]> writes:
> This is not acceptable in the slightest, because it offers no protection
> against the situation where the old postmaster has crashed but there are
> still live backends.  If a new postmaster and new backends are allowed
> to start in that situation, using a new shared memory segment, you
> *will* have major database corruption (eg, duplicate use of transaction
> IDs).

I assumed the backends would terminate if postmaster crashed, and that
"reattach" was only necessary for the EXEC_BACKEND case.

You can use file-backed shared memory instead.  You need a directory
which you know is writeable and unique to this instance, on a file
system with enough free space to accommodate the full size of the
shared memory segment.  DataDir is probably a good choice.  If the
file does not exist, you create it at startup.  If it does exist, you
map it in and perform the same checks as in the SysV case.
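
As a rough sketch of the create-or-map step (not code from the patch; map_shared_file and its parameters are hypothetical names), the file is created exclusively if absent and otherwise reopened, and the caller learns which case occurred so it can decide whether to run the in-use checks:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map `size` bytes of the file at `path` as shared memory.
 * Sets *created to 1 if this call created the file, 0 if it already existed.
 * Returns NULL on failure.
 */
void *map_shared_file(const char *path, size_t size, int *created)
{
    struct stat st;
    void *addr;
    int fd;

    assert(path != NULL && created != NULL && size > 0);

    /* O_EXCL makes "create" and "already exists" distinguishable. */
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    *created = (fd >= 0);
    if (fd < 0)
        fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    if (*created) {
        /* Size the new file so the whole mapping is backed. */
        if (ftruncate(fd, (off_t) size) != 0) {
            close(fd);
            unlink(path);
            return NULL;
        }
    } else if (fstat(fd, &st) != 0 || (size_t) st.st_size < size) {
        /* An existing file that is too short would fault on access. */
        close(fd);
        return NULL;
    }

    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    return (addr == MAP_FAILED) ? NULL : addr;
}
```

When *created comes back 0, the segment may belong to a previous instance, which is where the "is anyone still attached" check would have to go.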

> The semaphore code may be functionally OK, but I'm not thrilled with the
> fact that it requires two open file descriptors per semaphore, which
> have to be passed down to each postmaster child process.  That's a lot
> of files if MaxBackends is large; not only does it constrain the number
> of file slots available for fd.c to use, but you run the risk of
> overflowing what an fd_set can handle, which I notice breaks this code
> :-(.

#define FD_SETSIZE BIG_NUMBER

Anyway, I'm not sure you fully understand the problem this patch
addresses.  It is currently impractical if not impossible to run
PostgreSQL in jails on FreeBSD, because:

 - SysV IPC is normally not allowed in jails, and must be explicitly
   enabled.

 - the namespace is global, not per-jail, so separate instances in
   separate jails risk collision (I believe there is a workaround for
   this in 8.0, but I haven't tested it)

 - even if collision is avoided, SysV IPC breaches the separation
   between jails, allowing anyone who manages to compromise one jail
   to crash or corrupt any process using SysV IPC in any other jail on
   the system.

DES
-- 
Dag-Erling Smørgrav - [EMAIL PROTECTED]



Re: [PATCHES] Nameless IPC on POSIX systems

2005-05-06 Thread Tom Lane
Dag-Erling Smørgrav <[EMAIL PROTECTED]> writes:
> The attached patch implements new semaphore and shared memory
> mechanisms for POSIX systems.

I'm afraid we'll have to reject this out of hand:

> +bool
> +PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2)
> +{
> +/*
> + * This is never the case when using mmap(), since the segments will
> + * vanish into thin air when postmaster exits or crashes.
> + */
> + return false;
> +}

This is not acceptable in the slightest, because it offers no protection
against the situation where the old postmaster has crashed but there are
still live backends.  If a new postmaster and new backends are allowed
to start in that situation, using a new shared memory segment, you
*will* have major database corruption (eg, duplicate use of transaction
IDs).  We need the SysV ability to detect whether any backends are still
connected to the old shared memory segment in order to be safe against
this scenario.
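
For reference, the SysV facility in question is the attach count that shmctl() reports. The following is a simplified sketch, not the actual sysv_shmem.c logic: the function names are made up, and any segment that cannot be examined is treated as not in use, which real code would need to handle more carefully.

```c
#include <assert.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Number of attachments to the segment, or -1 if it cannot be examined. */
long shm_attach_count(int shmid)
{
    struct shmid_ds ds;

    assert(shmid >= 0);
    if (shmctl(shmid, IPC_STAT, &ds) == -1)
        return -1;
    return (long) ds.shm_nattch;
}

/* Nonzero if at least one process is still attached (simplified). */
int shm_is_in_use(int shmid)
{
    return shm_attach_count(shmid) > 0;
}
```

The kernel maintains shm_nattch itself, including across fork(), so a backend that outlives the postmaster still counts as attached.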

The semaphore code may be functionally OK, but I'm not thrilled with the
fact that it requires two open file descriptors per semaphore, which
have to be passed down to each postmaster child process.  That's a lot
of files if MaxBackends is large; not only does it constrain the number
of file slots available for fd.c to use, but you run the risk of
overflowing what an fd_set can handle, which I notice breaks this code
:-(.  For comparison, the Darwin implementation needs one descriptor per
semaphore, and we have seen performance issues with that.
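
One way around the fd_set limit, sketched here as an assumption rather than as what the patch does, is to wait with poll(), which takes an array of descriptors and is not bounded by FD_SETSIZE. It does nothing about the number of descriptors consumed. The function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 * A signal restarts the wait with the full timeout.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc == -1 && errno == EINTR);

    if (rc <= 0)
        return rc;
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/*
 * Try to take one token from a pipe-based semaphore.
 * Returns 1 if a token was taken, 0 if none was available, -1 on error.
 * Caveat: another process can grab the token between poll() and read(),
 * in which case the read() blocks until the next post.
 */
int pipe_sem_trywait(int read_fd)
{
    char token;
    int rc = wait_readable(read_fd, 0);

    if (rc != 1)
        return rc;
    return (read(read_fd, &token, 1) == 1) ? 1 : -1;
}
```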

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


[PATCHES] Nameless IPC on POSIX systems

2005-05-06 Thread Dag-Erling Smørgrav
The attached patch implements new semaphore and shared memory
mechanisms for POSIX systems.

Semaphores are implemented using unnamed pipes.  A semaphore is
incremented by writing a single character to the pipe, and decremented
by reading a single character.  The only semaphore operation we can't
reliably simulate in this manner is sem_getvalue(), but PostgreSQL
doesn't use it.
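
The scheme can be sketched as follows. This is an illustration of the technique, not the code in pipe_sema.c; the PipeSem type and function names are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical semaphore handle: the two ends of an unnamed pipe. */
typedef struct {
    int rfd;                    /* read end: decrement */
    int wfd;                    /* write end: increment */
} PipeSem;

/* Increment: put one token (byte) into the pipe. 0 on success, -1 on error. */
int pipe_sem_post(PipeSem *sem)
{
    char token = 'x';
    ssize_t n;

    assert(sem != NULL);
    do {
        n = write(sem->wfd, &token, 1);
    } while (n == -1 && errno == EINTR);
    return (n == 1) ? 0 : -1;
}

/* Decrement: block until a token can be read. 0 on success, -1 on error. */
int pipe_sem_wait(PipeSem *sem)
{
    char token;
    ssize_t n;

    assert(sem != NULL);
    do {
        n = read(sem->rfd, &token, 1);
    } while (n == -1 && errno == EINTR);
    return (n == 1) ? 0 : -1;
}

/*
 * Create the pipe and preload `initial` tokens.
 * The initial value must fit in the pipe buffer, or the write would block.
 */
int pipe_sem_init(PipeSem *sem, unsigned initial)
{
    int fds[2];

    assert(sem != NULL);
    if (pipe(fds) != 0)
        return -1;
    sem->rfd = fds[0];
    sem->wfd = fds[1];
    while (initial-- > 0) {
        if (pipe_sem_post(sem) != 0)
            return -1;
    }
    return 0;
}
```

Both descriptors are inherited across fork(), which is how child processes come to share the semaphore and also why each semaphore costs two file descriptors.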

Shared memory is implemented using file-less (swap-backed) mmap(),
either with MAP_ANON on systems which support it, or with /dev/zero
(SysV-style).  Note that I've only tested this on systems which
support MAP_ANON, so there may be bugs in the /dev/zero code.
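
A minimal sketch of the two mapping paths, assuming nothing about mmap_shmem.c beyond what is described above; create_anon_shared and shared_write_demo are hypothetical names. The segment is visible only to the creating process and to children forked afterwards.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#if !defined(MAP_ANON) && defined(MAP_ANONYMOUS)
#define MAP_ANON MAP_ANONYMOUS
#endif

/* Create a swap-backed shared mapping of `size` bytes; NULL on failure. */
void *create_anon_shared(size_t size)
{
    void *addr;

    assert(size > 0);
#ifdef MAP_ANON
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANON, -1, 0);
#else
    /* Fallback: a shared mapping of /dev/zero behaves the same way. */
    int fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return NULL;
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
#endif
    return (addr == MAP_FAILED) ? NULL : addr;
}

/* Hypothetical demo: returns the value the parent sees after the child writes. */
int shared_write_demo(void)
{
    int *cell = create_anon_shared(sizeof *cell);
    int seen;
    pid_t child;

    if (cell == NULL)
        return -1;
    *cell = 0;

    child = fork();
    if (child < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (child == 0) {
        *cell = 42;             /* written through the shared mapping */
        _exit(0);
    }

    seen = (waitpid(child, NULL, 0) == child) ? *cell : -1;
    munmap(cell, sizeof *cell);
    return seen;
}
```

Because the mapping has no name, an unrelated process cannot attach to it later, which is what rules out EXEC_BACKEND and also what makes the "still attached" check hard.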

One system which will definitely benefit from this is FreeBSD.
FreeBSD has both SysV and POSIX semaphores and shared memory, but
unnamed POSIX semaphores can't be shared between processes, and POSIX
shared memory is implemented using plain files, so the POSIX
primitives can't be used.  The SysV primitives use a global namespace,
which causes problems when multiple PostgreSQL instances run in
separate jails (they can't run on the same port, and a compromised
postmaster in one jail can be used to crash postmasters in other
jails).

The patch was developed and tested on FreeBSD 6, and has also been
tested cursorily on SuSE Linux 9.2.  It passes 'make check', and osdb
(for what it's worth) shows no difference in performance between
patched and unpatched postmasters built from the same source.

Remember to run autoconf and configure before testing, as the patch
modifies configure.in and the FreeBSD and Linux templates.

DES
-- 
Dag-Erling Smørgrav - [EMAIL PROTECTED]

Index: configure.in
===
RCS file: /home/pqcvs/pgsql/configure.in,v
retrieving revision 1.409
diff -u -u -r1.409 configure.in
--- configure.in	5 May 2005 19:15:54 -	1.409
+++ configure.in	6 May 2005 12:03:26 -
@@ -1240,20 +1240,26 @@
 if test x"$USE_NAMED_POSIX_SEMAPHORES" = x"1" ; then
   AC_DEFINE(USE_NAMED_POSIX_SEMAPHORES, 1, [Define to select named POSIX semaphores.])
   SEMA_IMPLEMENTATION="src/backend/port/posix_sema.c"
+elif test x"$USE_UNNAMED_POSIX_SEMAPHORES" = x"1" ; then
+  AC_DEFINE(USE_UNNAMED_POSIX_SEMAPHORES, 1, [Define to select unnamed POSIX semaphores.])
+  SEMA_IMPLEMENTATION="src/backend/port/posix_sema.c"
+elif test x"$USE_PIPE_SEMAPHORES" = x"1" ; then
+  AC_DEFINE(USE_PIPE_SEMAPHORES, 1, [Define to select pipe()-based semaphores.])
+  SEMA_IMPLEMENTATION="src/backend/port/pipe_sema.c"
 else
-  if test x"$USE_UNNAMED_POSIX_SEMAPHORES" = x"1" ; then
-AC_DEFINE(USE_UNNAMED_POSIX_SEMAPHORES, 1, [Define to select unnamed POSIX semaphores.])
-SEMA_IMPLEMENTATION="src/backend/port/posix_sema.c"
-  else
-AC_DEFINE(USE_SYSV_SEMAPHORES, 1, [Define to select SysV-style semaphores.])
-SEMA_IMPLEMENTATION="src/backend/port/sysv_sema.c"
-  fi
+  AC_DEFINE(USE_SYSV_SEMAPHORES, 1, [Define to select SysV-style semaphores.])
+  SEMA_IMPLEMENTATION="src/backend/port/sysv_sema.c"
 fi
 
 
 # Select shared-memory implementation type.
-AC_DEFINE(USE_SYSV_SHARED_MEMORY, 1, [Define to select SysV-style shared memory.])
-SHMEM_IMPLEMENTATION="src/backend/port/sysv_shmem.c"
+if test x"$USE_MMAP_SHARED_MEMORY" = x"1" ; then
+  AC_DEFINE(USE_MMAP_SHARED_MEMORY, 1, [Define to select mmap()-based shared memory.])
+  SHMEM_IMPLEMENTATION="src/backend/port/mmap_shmem.c"
+else
+  AC_DEFINE(USE_SYSV_SHARED_MEMORY, 1, [Define to select SysV-style shared memory.])
+  SHMEM_IMPLEMENTATION="src/backend/port/sysv_shmem.c"
+fi
 
 
 if test "$enable_nls" = yes ; then
Index: src/backend/port/mmap_shmem.c
===
RCS file: src/backend/port/mmap_shmem.c
diff -N src/backend/port/mmap_shmem.c
--- /dev/null	1 Jan 1970 00:00:00 -
+++ src/backend/port/mmap_shmem.c	6 May 2005 12:09:45 -
@@ -0,0 +1,138 @@
+/*-
+ *
+ * mmap_shmem.c
+ *	  Implement shared memory using mmap()
+ *
+ * Portions Copyright (c) 2005 Dag-Erling Coïdan Smørgrav
+ * Portions Copyright (c) 1996-2005, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  $PostgreSQL$
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include 
+
+#include 
+#include 
+#include 
+
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/pg_shmem.h"
+
+#ifdef EXEC_BACKEND
+/* there is no way to reattach to a segment after exec */
+#error mmap()-based shared memory can not be combined with EXEC_BACKEND
+#endif /* EXEC_BACKEND */
+
+#if !defined(MAP_ANON)
+#if defined(MAP_ANONYMOUS)
+#define MAP_ANON MAP_ANONYMOUS
+#else
+static const char  *DevZero = "/dev/zero";
+#endif
+#endif
+
+static int			ShmemFileDescriptor = -1;
+static void		   *ShmemSegmentAddress = NULL;
+static size_t		ShmemSegmentSize = 0;
+
+bool
+PGSh