Re: [HACKERS] making use of large TLB pages

2002-09-29 Thread Bruce Momjian

Neil Conway wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Is TLB Linux-only?
 
 Well, the TLB is a feature of the CPU, so no. Many modern processors
 support large TLB pages in some fashion.
 
 However, the specific API for using large TLB pages differs between
 operating systems. The API I'm planning to implement is the one
 provided by recent versions of Linux (2.5.38+).
 
 I've only looked briefly at enabling the usage of large pages on other
 operating systems. On Solaris, we already use large pages (due to
 using Intimate Shared Memory). On HP-UX, you apparently need to call
 chatr on the executable for it to use large pages. AFAIK the BSDs
 don't support large pages for user-land apps -- if I'm incorrect, let
 me know.
 
  Why use it and non SysV memory?
 
 It's faster, at least in theory. I posted these links at the start of
 the thread:
 
 http://lwn.net/Articles/6535/
 http://lwn.net/Articles/10293/
 
  Is it a lot of code?
 
 I haven't implemented it yet, so I'm not sure. However, I don't think
 it will be a lot of code.

OK, personally, I would like to see an actual speedup of PostgreSQL
queries before I would apply such an OS-specific, version-specific patch.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] making use of large TLB pages

2002-09-29 Thread Neil Conway

Bruce Momjian [EMAIL PROTECTED] writes:
 OK, personally, I would like to see an actual speedup of PostgreSQL
 queries before I would apply such an OS-specific, version-specific
 patch.

Don't be silly. A performance improvement is a performance
improvement. According to your logic, using assembly-optimized locking
primitives shouldn't be done unless we've exhausted every possible
optimization in every other part of the system (a process which will
likely never be finished).

If the optimization was for some obscure UNIX variant and/or an
obscure processor, I would agree that it wouldn't be worth the
bother. But given that

(a) Linux on IA32 is likely our most popular platform [1]

(b) In theory, this will help performance where we need it
most, IMHO (high-end systems using large shared buffers)

I think it's at least worth implementing -- if it doesn't provide a
noticeable performance improvement, then we don't need to merge it.

Cheers,

Neil

[1] It's worth noting that the huge TLB patch currently works on IA64
and SPARC, and may well be ported to additional architectures in the
future.

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC


---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] making use of large TLB pages

2002-09-29 Thread Tom Lane

Neil Conway [EMAIL PROTECTED] writes:
 Bruce Momjian [EMAIL PROTECTED] writes:
 OK, personally, I would like to see an actual speedup of PostgreSQL
 queries before I would apply such an OS-specific, version-specific
 patch.

 Don't be silly. A performance improvement is a performance
 improvement.

No, Bruce was saying that he wanted to see demonstrable improvement
*due to this specific change* before committing to support a
platform-specific API.  I agree with him, actually.  If you do the
TLB code and can't measure any meaningful performance improvement
when using it vs. when not, I'd not be excited about cluttering the
distribution with it.

 I think it's at least worth implementing -- if it doesn't provide a
 noticeable performance improvement, then we don't need to merge it.

You're on the same page, you just don't realize it...

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] making use of large TLB pages

2002-09-29 Thread Jonah H. Harris

Neil,

I agree with Bruce and Tom.  In my experience, I don't think it will
yield a significantly measurable increase.  Not only that, but the
portability issue itself tends to make it less desirable.  I recently
ported SAP DB and the corresponding DevTools over to OpenBSD and
learned again first-hand what a pain in the ass platform-specific code
is.  I guess it's up to you, Neil.  If you want to spend the time
trying to implement it, and it does prove to have a significant
performance increase, I'd say maybe.  IMHO, that time could be better
spent improving the current system rather than adding to it in a
singular way.  Sorry if my comments are out-of-line on this one, but
this has been a thread for some time and I'm just kinda tired of
reading theory vs. proof.

Since you are so set on trying to implement this, I'm just wondering:
what documentation has tested evidence of measurable increases in
similar situations?  I just like arguments to be backed by proof...
and I'm sure there is documentation on this somewhere.

-Jonah

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Bruce Momjian
Sent: Sunday, September 29, 2002 3:30 PM
To: Tom Lane
Cc: Neil Conway; PostgreSQL Hackers
Subject: Re: [HACKERS] making use of large TLB pages


Tom Lane wrote:
 Neil Conway [EMAIL PROTECTED] writes:
  Bruce Momjian [EMAIL PROTECTED] writes:
  OK, personally, I would like to see an actual speedup of PostgreSQL
  queries before I would apply such an OS-specific, version-specific
  patch.

  Don't be silly. A performance improvement is a performance
  improvement.

 No, Bruce was saying that he wanted to see demonstrable improvement
 *due to this specific change* before committing to support a
 platform-specific API.  I agree with him, actually.  If you do the
 TLB code and can't measure any meaningful performance improvement
 when using it vs. when not, I'd not be excited about cluttering the
 distribution with it.

  I think it's at least worth implementing -- if it doesn't provide a
  noticeable performance improvement, then we don't need to merge it.

 You're on the same page, you just don't realize it...

I see what he thought I said, I just can't figure out how he read it
that way.

--
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] making use of large TLB pages

2002-09-29 Thread Neil Conway

Jonah H. Harris [EMAIL PROTECTED] writes:
 I agree with Bruce and Tom.

AFAIK Bruce and Tom (and myself) agree that this is a good idea,
provided it makes a noticeable performance difference (and if it
doesn't, it's not worth applying).

 In my experience, I don't think it will yield a significantly
 measurable increase.

Can you elaborate on this experience?
  
 Not only that, but the portability issue itself tends to make it
 less desirable.

Well, that's obvious: code that improves PostgreSQL on *all* platforms
is clearly superior to code that only improves it on a couple. That's
not to say that the latter code is absolutely without merit, however.

 Sorry if my comments are out-of-line on this one, but this has been a
 thread for some time and I'm just kinda tired of reading theory vs.
 proof.

Well, ISTM the easiest way to get some proof is to implement it and
benchmark the results. IMHO any claims about performance prior to that
are mostly hand waving.

 Since you are so set on trying to implement this, I'm just wondering
 what documentation has tested evidence of measurable increases in
 similar situations?

(/me wonders if people bother reading the threads they reply to)

http://lwn.net/Articles/10293/

According to the HP guys, Oracle saw an 8% performance improvement in
TPC-C when they started using large pages.

To be perfectly honest, I really have no idea if that will translate
into an 8% performance gain for PostgreSQL, or whether the performance
gain only applies if you're using a machine with 16GB of RAM, or
whether the speedup from large pages is really just a correction of
some Oracle deficiency that we don't suffer from, etc. However, I do
think it's worth finding out.

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] making use of large TLB pages

2002-09-28 Thread Tom Lane

Neil Conway [EMAIL PROTECTED] writes:
 If we used a key that would remain the same between runs of the
 postmaster, this should ensure that there isn't a possibility of two
 independent sets of backends operating on the same data dir. The most
 logical way to do this IMHO would be to just hash the data dir, but I
 suppose the current method of using the port number should work as
 well.

You should stick as closely as possible to the key logic currently used
for SysV shmem keys.  That logic is intended to cope with the case where
someone else is already using the key# that we initially generate, as
well as the case where we discover a collision with a pre-existing
backend set.  (We tell the difference by looking for a magic number at
the start of the shmem segment.)

Note that we do not assume the key is the same on each run; that's why
we store it in postmaster.pid.

 (1) call sys_alloc_hugepages() without IPC_EXCL. If it returns
 an error, we're in the clear: there's no page matching
 that key. If it returns a pointer to a previously existing
 segment, panic: it is very likely that there are some
 orphaned backends still active.

s/panic/and the PG magic number appears in the segment header, panic/

 - if we're compiling on a Linux system but the kernel headers
   don't define the syscalls we need, use some reasonable
   defaults (e.g. the syscall numbers for the current hugepage
   syscalls in Linux 2.5)

I think this is overkill, and quite possibly dangerous.  If we don't see
the symbols then don't try to compile the code.

On the whole it seems that this allows a very nearly one-to-one mapping
to the existing SysV functionality.  We don't have the "number of
connected processes" syscall, perhaps, but we don't need it: if a
hugepages segment exists we can assume the number of connected processes
is greater than 0, and that's all we really need to know.

I think it's okay to stuff this support into the existing
port/sysv_shmem.c file, rather than make a separate file (particularly
given your point that we have to be able to fall back to SysV calls at
runtime).  I'd suggest reorganizing the code in that file slightly to
separate the actual syscalls from the controlling logic in
PGSharedMemoryCreate().  Probably also will have to extend the API for
PGSharedMemoryIsInUse() and RecordSharedMemoryInLockFile() to allow
three fields to be recorded in postmaster.pid, not two --- you'll want
a boolean indicating whether the stored key is for a SysV or hugepage
segment.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] making use of large TLB pages

2002-09-28 Thread Bruce Momjian


I haven't been following this thread.  Can someone answer:

Is TLB Linux-only?
Why use it and non SysV memory?
Is it a lot of code?

---

Tom Lane wrote:
 Neil Conway [EMAIL PROTECTED] writes:
  If we used a key that would remain the same between runs of the
  postmaster, this should ensure that there isn't a possibility of two
  independent sets of backends operating on the same data dir. The most
  logical way to do this IMHO would be to just hash the data dir, but I
  suppose the current method of using the port number should work as
  well.
 
 You should stick as closely as possible to the key logic currently used
 for SysV shmem keys.  That logic is intended to cope with the case where
 someone else is already using the key# that we initially generate, as
 well as the case where we discover a collision with a pre-existing
 backend set.  (We tell the difference by looking for a magic number at
 the start of the shmem segment.)
 
 Note that we do not assume the key is the same on each run; that's why
 we store it in postmaster.pid.
 
  (1) call sys_alloc_hugepages() without IPC_EXCL. If it returns
  an error, we're in the clear: there's no page matching
  that key. If it returns a pointer to a previously existing
  segment, panic: it is very likely that there are some
  orphaned backends still active.
 
 s/panic/and the PG magic number appears in the segment header, panic/
 
  - if we're compiling on a Linux system but the kernel headers
don't define the syscalls we need, use some reasonable
defaults (e.g. the syscall numbers for the current hugepage
syscalls in Linux 2.5)
 
 I think this is overkill, and quite possibly dangerous.  If we don't see
 the symbols then don't try to compile the code.
 
 On the whole it seems that this allows a very nearly one-to-one mapping
 to the existing SysV functionality.  We don't have the "number of
 connected processes" syscall, perhaps, but we don't need it: if a
 hugepages segment exists we can assume the number of connected processes
 is greater than 0, and that's all we really need to know.
 
 I think it's okay to stuff this support into the existing
 port/sysv_shmem.c file, rather than make a separate file (particularly
 given your point that we have to be able to fall back to SysV calls at
 runtime).  I'd suggest reorganizing the code in that file slightly to
 separate the actual syscalls from the controlling logic in
 PGSharedMemoryCreate().  Probably also will have to extend the API for
 PGSharedMemoryIsInUse() and RecordSharedMemoryInLockFile() to allow
 three fields to be recorded in postmaster.pid, not two --- you'll want
 a boolean indicating whether the stored key is for a SysV or hugepage
 segment.
 
   regards, tom lane
 
 ---(end of broadcast)---
 TIP 4: Don't 'kill -9' the postmaster
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] making use of large TLB pages

2002-09-28 Thread Neil Conway

Bruce Momjian [EMAIL PROTECTED] writes:
   Is TLB Linux-only?

Well, the TLB is a feature of the CPU, so no. Many modern processors
support large TLB pages in some fashion.

However, the specific API for using large TLB pages differs between
operating systems. The API I'm planning to implement is the one
provided by recent versions of Linux (2.5.38+).

I've only looked briefly at enabling the usage of large pages on other
operating systems. On Solaris, we already use large pages (due to
using Intimate Shared Memory). On HP-UX, you apparently need to call
chatr on the executable for it to use large pages. AFAIK the BSDs
don't support large pages for user-land apps -- if I'm incorrect, let
me know.

   Why use it and non SysV memory?

It's faster, at least in theory. I posted these links at the start of
the thread:

http://lwn.net/Articles/6535/
http://lwn.net/Articles/10293/

   Is it a lot of code?

I haven't implemented it yet, so I'm not sure. However, I don't think
it will be a lot of code.

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC


---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] making use of large TLB pages

2002-09-27 Thread Neil Conway

Okay, I did some more research into this area. It looks like it will
be feasible to use large TLB pages for PostgreSQL.

Tom Lane [EMAIL PROTECTED] writes:
 It wasn't clear from your description whether large-TLB shmem segments
 even have IDs that one could use to determine whether the segment still
 exists.

There are two types of hugepages:

(a) private: Not shared on fork(), not accessible to processes
other than the one that allocates the pages.

(b) shared: Shared across a fork(), accessible to other
processes: different processes can access the same segment
if they call sys_alloc_hugepages() with the same key.

So for a standalone backend, we can just use private pages (probably
worth using private hugepages rather than malloc, although I doubt it
matters much either way).

  Another possibility might be to still allocate a small SysV shmem
  area, and use that to provide the interlock, while we allocate the
  buffer area using sys_alloc_hugepages. That's somewhat of a hack, but
  I think it would resolve the interlock problem, at least.
 
 Not a bad idea ... I have not got a better one offhand ... but watch
 out for SHMMIN settings.

As it turns out, this will be completely unnecessary. Since hugepages
are an in-kernel data structure, the kernel takes care of ensuring
that dying processes don't orphan any unused hugepage segments. The
logic works like this (for shared hugepages):

(a) sys_alloc_hugepages() without IPC_EXCL will return a
pointer to an existing segment, if there is one that
matches the key. If an existing segment is found, the
usage counter for that segment is incremented. If no
matching segment exists, an error is returned. (I'm pretty
sure the usage counter is also incremented after a fork(),
but I'll double-check that.)

(b) sys_free_hugepages() decrements the usage counter

(c) when a process that has allocated a shared hugepage dies
for *any reason* (even kill -9), the usage counter is
decremented

(d) if the usage counter for a given segment ever reaches
zero, the segment is deleted and the memory is free'd.

If we used a key that would remain the same between runs of the
postmaster, this should ensure that there isn't a possibility of two
independent sets of backends operating on the same data dir. The most
logical way to do this IMHO would be to just hash the data dir, but I
suppose the current method of using the port number should work as
well.

To elaborate on (a) a bit, we'd want to use this logic when allocating
a new set of hugepages on postmaster startup:

(1) call sys_alloc_hugepages() without IPC_EXCL. If it returns
an error, we're in the clear: there's no page matching
that key. If it returns a pointer to a previously existing
segment, panic: it is very likely that there are some
orphaned backends still active.

(2) If the previous call didn't find anything, call
sys_alloc_hugepages() again, specifying IPC_EXCL to create
a new segment.

Now, the question is: how should this be implemented? You recently
did some of the legwork toward supporting different APIs for shared
memory / semaphores, which makes this work easier -- unfortunately,
some additional stuff is still needed. Specifically, support for
hugepages is a configuration option, that may or may not be enabled
(if it's disabled, the syscall returns a specific error). So I believe
the logic is something like:

- if compiling on a Linux system, enable support for hugepages
  (the regular SysV stuff is still needed as a backup)

- if we're compiling on a Linux system but the kernel headers
  don't define the syscalls we need, use some reasonable
  defaults (e.g. the syscall numbers for the current hugepage
  syscalls in Linux 2.5)

- at runtime, try to make one of these syscalls. If it fails,
  fall back to the SysV stuff.

Does that sound reasonable?

Any other comments would be appreciated.

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] making use of large TLB pages

2002-09-25 Thread Neil Conway

Tom Lane [EMAIL PROTECTED] writes:
 Neil Conway [EMAIL PROTECTED] writes:
  I'd like to enable PostgreSQL to use large TLB pages, if the OS
  and processor support them.
 
 Hmm ... it seems interesting, but I'm hesitant to do a lot of work
 to support something that's only available on one hardware-and-OS
 combination.

True; further, I personally find the current API a little
cumbersome. For example, we get 4MB pages on Solaris with a few lines
of code:

#if defined(solaris) && defined(__sparc__)
    /* use intimate shared memory on SPARC Solaris */
    memAddress = shmat(shmid, 0, SHM_SHARE_MMU);
#endif

But given that

(a) Linux on x86 is probably our most popular platform

(b) Every x86 since the Pentium has supported large pages

(c) Other archs, like IA64 and SPARC, also support large pages

I think it's worthwhile implementing this, if possible.

 I trust it at least supports inheriting the page mapping over a
 fork()?

I'll check on this, but I'm pretty sure that it does.

 The SysV API provides a reliable interlock to prevent this scenario:
 we read the old shared memory block ID from the old postmaster's
 postmaster.pid file, and look to see if that block (a) still exists
 and (b) still has attached processes (presumably backends).

If the postmaster is starting up and the segment still exists, could
we assume that's an error condition, and force the admin to manually
fix it? It does make the system less robust, but I'm suspicious of any
attempts to automagically fix a situation in which we *know* something
has gone seriously wrong...

Another possibility might be to still allocate a small SysV shmem
area, and use that to provide the interlock, while we allocate the
buffer area using sys_alloc_hugepages. That's somewhat of a hack, but
I think it would resolve the interlock problem, at least.

 Any ideas for better answers?

Still scratching my head on this one, and I'll let you know if I think
of anything better.

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] making use of large TLB pages

2002-09-25 Thread Tom Lane

Neil Conway [EMAIL PROTECTED] writes:
 I think it's worthwhile implementing this, if possible.

I wasn't objecting (I work for Red Hat, remember ;-)).  I was just
saying there's a limit to the messiness I think we should accept.

 The SysV API provides a reliable interlock to prevent this scenario:
 we read the old shared memory block ID from the old postmaster's
 postmaster.pid file, and look to see if that block (a) still exists
 and (b) still has attached processes (presumably backends).

 If the postmaster is starting up and the segment still exists, could
 we assume that's an error condition, and force the admin to manually
 fix it?

It wasn't clear from your description whether large-TLB shmem segments
even have IDs that one could use to determine whether the segment still
exists.  If the segments are anonymous then how do you do that?

 It does make the system less robust, but I'm suspicious of any
 attempts to automagically fix a situation in which we *know* something
 has gone seriously wrong...

We've spent a lot of effort on trying to ensure that we (a) start up
when it's safe and (b) refuse to start up when it's not safe.  While (b)
is clearly the more critical point, backsliding on (a) isn't real nice
either.  People don't like postmasters that randomly fail to start.

 Another possibility might be to still allocate a small SysV shmem
 area, and use that to provide the interlock, while we allocate the
 buffer area using sys_alloc_hugepages. That's somewhat of a hack, but
 I think it would resolve the interlock problem, at least.

Not a bad idea ... I have not got a better one offhand ... but watch
out for SHMMIN settings.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



[HACKERS] making use of large TLB pages

2002-09-24 Thread Neil Conway

Rohit Seth recently added support for the use of large TLB pages on
Linux if the processor architecture supports them (I believe the
SPARC, IA32, and IA64 have hugetlb support, more archs will probably
be added). The patch was merged into Linux 2.5.36, so it will more
than likely be in Linux 2.6. For more information on large TLB pages
and why they are generally viewed to improve database performance, see
here:

http://lwn.net/Articles/6535/ (the patch this refers to is an
earlier implementation, I believe, but the idea is the same)
http://lwn.net/Articles/10293/ (item #4)

I'd like to enable PostgreSQL to use large TLB pages, if the OS and
processor support them. In talking to the author of the TLB patches
for Linux (Rohit Seth), he described the current API:

==
1) Only two system calls. These are:

sys_alloc_hugepages(int key, unsigned long addr, unsigned long len,
int prot, int flag)

sys_free_hugepages(unsigned long addr)

Key will be equal to zero if user wants these huge pages as private.
A positive int value will be used for unrelated apps to share the same
physical huge pages.

addr is the user's preferred address.  The kernel may decide to allocate
a different virtual address (depending on availability and alignment
factors).

len is the requested size of memory wanted by user app.

prot could get the value of PROT_READ, PROT_WRITE, PROT_EXEC

flag: The only allowed value right now is IPC_CREAT, which in the case
of shared hugepages (across processes) tells the kernel to create a new
segment if none is already created.  If this flag is not provided and
there is no hugepage segment corresponding to the key, then ENOENT is
returned.  Much like the IPC_CREAT flag for the shmget routine.

On success sys_alloc_hugepages returns the virtual address allocated
by kernel.
=

So as I understand it, we would basically replace the calls to
shmget(), shmdt(), etc. with these system calls. The behavior will be
slightly different, however -- I'm not sure if this API supports
everything we expect the SysV IPC API to support (e.g. telling the #
of clients attached to a given segment). Can anyone comment on
exactly what functionality we expect when dealing with the storage
mechanism of the shared buffer?

Any comments would be appreciated.

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] making use of large TLB pages

2002-09-24 Thread Tom Lane

Neil Conway [EMAIL PROTECTED] writes:
 I'd like to enable PostgreSQL to use large TLB pages, if the OS and
 processor support them.

Hmm ... it seems interesting, but I'm hesitant to do a lot of work
to support something that's only available on one hardware-and-OS
combination.  (If we were talking about a Windows-specific hack,
you'd already have lost the audience, no?  But I digress.)

 So as I understand it, we would basically replace the calls to
 shmget(), shmdt(), etc. with these system calls. The behavior will be
 slightly different, however -- I'm not sure if this API supports
 everything we expect the SysV IPC API to support (e.g. telling the #
 of clients attached to a given segment).

I trust it at least supports inheriting the page mapping over a fork()?

 Can anyone comment on
 exactly what functionality we expect when dealing with the storage
 mechanism of the shared buffer?

The only thing we use beyond the obvious "here's some memory accessible
by both parent and child processes" is the #-of-clients functionality
you mentioned.  The reason that is interesting is that it provides a
safety interlock against the case where a postmaster has crashed but
left child backends running.  If a new postmaster is started and starts
its own collection of children then we are in very bad hot water,
because the old and new backend sets will be modifying the same database
files without any mutual awareness or interlocks.  This *will* lead to
serious, possibly unrecoverable database corruption.

The SysV API provides a reliable interlock to prevent this scenario:
we read the old shared memory block ID from the old postmaster's
postmaster.pid file, and look to see if that block (a) still exists
and (b) still has attached processes (presumably backends).  If it's
gone or has no attached processes, it's safe for the new postmaster
to continue startup.

I have little love for the SysV shmem API, but I haven't thought of
an equivalently reliable interlock for this scenario without it.
(For example, something along the lines of requiring each backend
to write its PID into a file isn't very reliable at all: it leaves
a window at each backend start where the backend hasn't yet written
its PID, and it increases by a large factor the risk we've already
seen wherein stale PID entries in lockfiles might by chance match the
PIDs of other, unrelated processes.)

Any ideas for better answers?

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html