Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management

2017-03-09 Thread Peter Geoghegan
On Thu, Mar 9, 2017 at 6:25 PM, Robert Haas  wrote:
> On Thu, Mar 9, 2017 at 7:29 PM, Thomas Munro
>  wrote:
>> Suppressing ENOENT because it's not clear which backend actually owns
>> a file is a bit different from using it to detect that there are no
>> more segments...
>
> +1, emphatically.

I don't feel strongly either way on this, FWIW.

>> Obviously I screwed some things up in the code I
>> posted, but I think the general idea that the DSM segment owns the
>> files on disk is a good one.
>
> +1 to that, too.  The DSM has exactly the lifetime that we want here;
> no backend or resowner involved in the transaction does.

I actually agree with you, from a theoretical perspective. But the
consequences of that theoretical point seem pretty extensive.
Following through with that by pushing much more file state into
shared memory has significant complexity and risk, if it's possible at
all without something like "palloc(), but for shared memory". I would
like to see a practical benefit.

>> I figure that it (via the detach hook)
>> should be able to delete the files using only data in shmem, without
>> reference to any backend-local memory, so that file cleanup is
>> entirely disconnected from backend-local resource cleanup (fds and
>> memory).
>
> That's one approach.  Another approach is to somehow tie the two
> together; for example, I thought about teaching FileClose that where
> it currently calls unlink() to get rid of the temporary file, it would
> first go check some shared reference count and decrement it, skipping
> the unlink if the result was >0.  But that only works if you have a
> separate chunk of shared memory per File, which seems like it won't
> work for what you need.

It won't work for what Thomas needs because that decision is made per
segment. But the decision to decrement the refcount really does need to
be made at the BufFile level, so Thomas is still not wrong to
structure things that way.

What somewhat justifies the idea of an on_dsm_detach() callback using
a pointer located in shared memory to get to local memory in one
backend (for error handling purposes) is the fact that it's already
tacitly assumed that the local memory used for the BufFile is released
after the resowner clean-up of BufFiles. Otherwise, somebody might get
in big trouble if they called BufFileClose() or something in an error
path. Arguably, the reliance on ordering already exists today.

I'm not saying that that's a good plan, or even an acceptable
trade-off. Pick your poison.

-- 
Peter Geoghegan




Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management

2017-03-09 Thread Robert Haas
On Thu, Mar 9, 2017 at 7:29 PM, Thomas Munro
 wrote:
> Suppressing ENOENT because it's not clear which backend actually owns
> a file is a bit different from using it to detect that there are no
> more segments...

+1, emphatically.

> Obviously I screwed some things up in the code I
> posted, but I think the general idea that the DSM segment owns the
> files on disk is a good one.

+1 to that, too.  The DSM has exactly the lifetime that we want here;
no backend or resowner involved in the transaction does.

> I figure that it (via the detach hook)
> should be able to delete the files using only data in shmem, without
> reference to any backend-local memory, so that file cleanup is
> entirely disconnected from backend-local resource cleanup (fds and
> memory).

That's one approach.  Another approach is to somehow tie the two
together; for example, I thought about teaching FileClose that where
it currently calls unlink() to get rid of the temporary file, it would
first go check some shared reference count and decrement it, skipping
the unlink if the result was >0.  But that only works if you have a
separate chunk of shared memory per File, which seems like it won't
work for what you need.
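
A minimal sketch of that idea, with hypothetical names (today's FileClose()
has no such hook, and the per-File shared chunk is exactly the part that
doesn't exist yet):

#include "postgres.h"
#include "storage/spin.h"

/* Hypothetical per-File chunk of shared memory holding the refcount. */
typedef struct SharedFileRef
{
    slock_t     mutex;          /* protects refcount */
    int         refcount;       /* backends still referencing the file */
} SharedFileRef;

/*
 * Drop one reference.  Returns true if the caller was the last holder and
 * should therefore perform the unlink().
 */
static bool
SharedFileRefRelease(SharedFileRef *ref)
{
    int         remaining;

    SpinLockAcquire(&ref->mutex);
    remaining = --ref->refcount;
    SpinLockRelease(&ref->mutex);

    return remaining == 0;
}

/*
 * In FileClose(), where a temporary file is currently unlinked
 * unconditionally, the unlink would be skipped unless the shared count
 * dropped to zero:
 *
 *      if (vfdP->sharedRef == NULL || SharedFileRefRelease(vfdP->sharedRef))
 *          unlink(vfdP->fileName);
 */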

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management

2017-03-09 Thread Robert Haas
On Thu, Mar 9, 2017 at 2:19 PM, Peter Geoghegan  wrote:
> Quite a lot of thought seems to have gone into making the
> fd.c/resowner mechanism leak-proof in the face of errors. So, quite
> apart from what that approach misses out on, I really don't want to
> change resource management too much. As I went into in my "recap", it
> seems useful to change as little as possible about temp files.

I think this is just wishful thinking.  resowners are backend-private,
and they are robust as long as they are managing files accessed by a
single backend.  Trying to bend them into some kind of a state where
they can manage a resource shared across multiple cooperating backends
is fundamentally changing the model.  What matters is not whether the
old model was any good but whether the new one is any good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management

2017-03-09 Thread Peter Geoghegan
On Thu, Mar 9, 2017 at 4:29 PM, Thomas Munro
 wrote:
> On Fri, Mar 10, 2017 at 8:19 AM, Peter Geoghegan  wrote:
>> by having state for each segment, it ends up actually *relying on*
>> ENOENT-on-unlink() as a condition that terminates execution:
>
> Yeah, this seems to fall out of the requirement to manage a growable
> number of partition files in a fixed space.  I wonder how this could
> go wrong.  One way would be for a crash-restart to happen (which
> leaves all temporary files in place by design, though it could clean
> them up like a normal restart if it wanted to), followed by a very
> long running cluster eventually generating the same (pid, set number)
> pair.  I think I see a simple way to defend against that, which I'll
> write about in the PHJ thread.

I am not expressing any real opinion about the idea of relying on or
suppressing ENOENT-on-unlink() just yet. What's clear is that that's
unorthodox. I seldom have any practical reason to make a distinction
between unorthodox and unacceptable. It's usually easier to just not
do the unorthodox thing. Maybe this is one of the rare exceptions.

> Thanks.  I will respond with code and comments over on the PHJ thread.
> Aside from the broken extendBufFile behaviour you mentioned, I'll look
> into the general modularity complaints I'm hearing about fd.c and
> buffile.c interaction.

buffile.c should stop pretending to care about anything other than
temp files, IMV. 100% of all clients that want temporary files go
through buffile.c. 100% of all clients that want non-temp files (files
which are not marked FD_TEMPORARY) access fd.c directly, rather than
going through buffile.c.

>> But my observations illustrate the difficulty with tying
>> resource manager resources in local memory (temp segments, and
>> associated BufFile(s)) with clean-up at shared memory segment
>> detachment. It's possible, but ticklish enough that you'll end up with
>> some wart or other when you're done.

> Suppressing ENOENT because it's not clear which backend actually owns
> a file is a bit different from using it to detect that there are no
> more segments...  Obviously I screwed some things up in the code I
> posted, but I think the general idea that the DSM segment owns the
> files on disk is a good one.  I figure that it (via the detach hook)
> should be able to delete the files using only data in shmem, without
> reference to any backend-local memory, so that file cleanup is
> entirely disconnected from backend-local resource cleanup (fds and
> memory).

The problem with that is that it kind of implies that we have to
invent a mechanism that 100% supersedes fd.c resource management. We
still want to free backend local vFDs and so on, using fd.c cleanup,
so that file descriptors are not leaked, but it's not clear we can do
that by half measures. Wouldn't even that have to be duplicated to deal
with shared memory state by way of your on_dsm_detach() callback? And,
if it did, it would presumably need to keep track of every single
segment, much like fd.c. But the number of segments isn't predictable
in advance, so you're forced to confront that problem, which you hoped
to avoid. And, where does all of this leave max_files_per_process
enforcement?

> Looking forward to seeing your v9.  We need to figure out what common
> ground we can find here.  Quality issues with the code I posted aside
> (which I'm sorting out), it seems that the underlying reason we're not
> converging yet is my requirement for a variable number of partitions
> managed in a fixed shmem space, resulting in the ENOENT-as-terminator
> behaviour that you don't like.

I agree that that's the snag we're getting caught on here. In even
simpler terms, what you need to do with BufFiles is somewhat more
difficult than what I need to do (for either of us), and now that time
grows short, I feel the need to draw a line under those differences.
At the same time, I would at the very least like to avoid doing you a
disservice by not at least anticipating your needs.

> I'm 100% happy to switch parts of what
> I'm doing to using anything you come up with if it can support a
> dynamic set of N partitions from fixed set of M participants.  I'm
> fairly sure that people won't be happy with two separate ways to
> manage shared temporary files.  I also don't want patch A to be
> derailed by patch B, but even if these things finish up in different
> Postgres releases we'll still have to solve the problem.

That's all fair. And, in case it isn't clear, I think that all of this
BufFile + fd.c + resowner.c + on_dsm_detach() stuff might be the
single biggest yak shaving expedition I've ever found myself on. It
just seems like the kind of thing where the only workable approach is
to try to converge on a good plan iteratively. That isn't anybody's
fault.

So, yes, there is a bit of friction between the requirements for
parallel tuplesort, and for parallel hash join, but I see that as
useful, 

Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management

2017-03-09 Thread Thomas Munro
On Fri, Mar 10, 2017 at 8:19 AM, Peter Geoghegan  wrote:
> by having state for each segment, it ends up actually *relying on*
> ENOENT-on-unlink() as a condition that terminates execution:

Yeah, this seems to fall out of the requirement to manage a growable
number of partition files in a fixed space.  I wonder how this could
go wrong.  One way would be for a crash-restart to happen (which
leaves all temporary files in place by design, though it could clean
them up like a normal restart if it wanted to), followed by a very
long running cluster eventually generating the same (pid, set number)
pair.  I think I see a simple way to defend against that, which I'll
write about in the PHJ thread.

On Fri, Mar 10, 2017 at 10:01 AM, Peter Geoghegan  wrote:
> IOW, I cannot see how 0007-hj-shared-buf-file-v6.patch leaves parallel
> hash join with BufFiles that are even capable of being expanded past
> 1GiB (1 segment), which is surely not acceptable -- it's a bug. Have I
> just missed something?

Bleugh, yeah that's completely busted.  Fixing.

> segments). I suggest that the MAX_PHYSICAL_FILESIZE stress-test I
> sometimes perform [1] be incorporated in Thomas' testing.

Right, will do.

> (Looks more carefully...)
>
> I now see that PathNameCreateFile() is being called, a new function
> which is largely just a new interface to the existing
> OpenTemporaryFileInTablespace() temp file machinery. So, if you look carefully,
> you notice that you do in fact end up with FD_TEMPORARY fd.c segments
> here...so maybe temp_file_limit isn't broken, because, it turns out,
> 0007-hj-shared-buf-file-v6.patch is still *halfway* buying into the
> conventional idea of file temp-ness. But buffile.c is left behind
> ("file->isTemp == false"), so AFAICT it must still be broken simply
> because new segments cannot be written, per BufFileCreate() comments
> quoted above.
>
> This has become more of a PHJ review, which is off-topic for this
> thread.

Thanks.  I will respond with code and comments over on the PHJ thread.
Aside from the broken extendBufFile behaviour you mentioned, I'll look
into the general modularity complaints I'm hearing about fd.c and
buffile.c interaction.

> But my observations illustrate the difficulty with tying
> resource manager resources in local memory (temp segments, and
> associated BufFile(s)) with clean-up at shared memory segment
> detachment. It's possible, but ticklish enough that you'll end up with
> some wart or other when you're done. The question, then, is what wart
> is the most acceptable for parallel tuplesort? I see two plausible
> paths forward right now:
>
> 1. Selectively suppress spurious ENOENT-on-unlink() LOG message in the
> brief window where it might be spurious. This is the easiest option.
> (There is less code this way.)
>
> 2. Do more or less what I've already drafted as my V9, which is
> described in the opening mail to this thread, while accepting the ugliness
> that that approach brings with it.

Suppressing ENOENT because it's not clear which backend actually owns
a file is a bit different from using it to detect that there are no
more segments...  Obviously I screwed some things up in the code I
posted, but I think the general idea that the DSM segment owns the
files on disk is a good one.  I figure that it (via the detach hook)
should be able to delete the files using only data in shmem, without
reference to any backend-local memory, so that file cleanup is
entirely disconnected from backend-local resource cleanup (fds and
memory).
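
A rough sketch of that shape of detach hook, using only shared state
(make_shareable_path() is abbreviated here relative to the real helper,
PathNameDelete() is the patch's deletion function, and the struct layout is
illustrative only):

#include "postgres.h"
#include "storage/dsm.h"

/* Illustrative shared state: just enough to reconstruct every path. */
typedef struct SharedBufFileSetState
{
    Oid         tablespace;
    pid_t       creator_pid;
    int         set_number;
    int         npartitions;
    int         nparticipants;
} SharedBufFileSetState;

/*
 * Delete all files belonging to the set using only data reachable from
 * shared memory, probing segment numbers until deletion fails (ENOENT
 * terminates each inner loop).
 */
static void
shared_buf_file_on_dsm_detach(dsm_segment *seg, Datum arg)
{
    SharedBufFileSetState *state = (SharedBufFileSetState *) DatumGetPointer(arg);
    int         participant;
    int         partition;

    for (participant = 0; participant < state->nparticipants; participant++)
    {
        for (partition = 0; partition < state->npartitions; partition++)
        {
            int         segment = 0;
            char        path[MAXPGPATH];

            for (;;)
            {
                make_shareable_path(path, state->tablespace,
                                    state->creator_pid, state->set_number,
                                    partition, participant, segment);
                if (!PathNameDelete(path))
                    break;
                ++segment;
            }
        }
    }
}

/*
 * Registered once against the operation's DSM segment:
 *
 *      on_dsm_detach(seg, shared_buf_file_on_dsm_detach,
 *                    PointerGetDatum(state));
 */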

Looking forward to seeing your v9.  We need to figure out what common
ground we can find here.  Quality issues with the code I posted aside
(which I'm sorting out), it seems that the underlying reason we're not
converging yet is my requirement for a variable number of partitions
managed in a fixed shmem space, resulting in the ENOENT-as-terminator
behaviour that you don't like.  I'm 100% happy to switch parts of what
I'm doing to using anything you come up with if it can support a
dynamic set of N partitions from fixed set of M participants.  I'm
fairly sure that people won't be happy with two separate ways to
manage shared temporary files.  I also don't want patch A to be
derailed by patch B, but even if these things finish up in different
Postgres releases we'll still have to solve the problem.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management

2017-03-09 Thread Peter Geoghegan
On Thu, Mar 9, 2017 at 11:19 AM, Peter Geoghegan  wrote:
> That patch seems to be solving the problem by completely taking over
> management of temp files from fd.c. That is, these temp files are not
> marked as temp files in the way ordinary temp BufFiles are, with
> explicit buy-in from resowner.c about their temp-ness. It adds
> "#include "catalog/pg_tablespace.h" to buffile.c, even though that
> kind of thing generally lives within fd.c. It does use
> PG_TEMP_FILE_PREFIX, but never manages to use temp_tablespaces if
> that's set by the user. It also doesn't do anything about temp_file_limit
> enforcement.

Actually, it's a bigger departure than I suggest here, even.
Currently, literally no BufFile caller ever gets anything other than
fd.c-wise temp files (those marked FD_TEMPORARY). This is the closest
that buffile.c comes to allowing this:

#ifdef NOT_USED
/*
 * Create a BufFile and attach it to an already-opened virtual File.
 *
 * This is comparable to fdopen() in stdio.  This is the only way at present
 * to attach a BufFile to a non-temporary file.  Note that BufFiles created
 * in this way CANNOT be expanded into multiple files.
 */
BufFile *
BufFileCreate(File file)
{
return makeBufFile(file);
}
#endif

IOW, I cannot see how 0007-hj-shared-buf-file-v6.patch leaves parallel
hash join with BufFiles that are even capable of being expanded past
1GiB (1 segment), which is surely not acceptable -- it's a bug. Have I
just missed something?

(This is not to be confused with BufFiles that are interXact -- those
are allowed, but won't work here either. This is why Thomas changed
PHJ to not do it that way some months back.)

At first I thought that it was okay because BufFiles are
immutable/"readonly" once handed over from workers, but of course the
PHJ SharedBufFile mechanism is up-front, and does all resource
management in shared memory. Thus, it must even create BufFiles in
workers that have this only-one-segment restriction as a consequence
of not being temp files (in the conventional sense: FD_TEMPORARY fd.c
segments). I suggest that the MAX_PHYSICAL_FILESIZE stress-test I
sometimes perform [1] be incorporated in Thomas' testing.
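
(Roughly, that stress test just shrinks buffile.c's segment size so that
multi-segment code paths are exercised by even modest sorts -- something
like the following, followed by re-running the tests; see [1] for the
details:)

/* In src/backend/storage/file/buffile.c, temporarily, for testing only: */
#define MAX_PHYSICAL_FILESIZE   (BLCKSZ * 8)    /* normally 0x40000000 (1GB) */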

(Looks more carefully...)

I now see that PathNameCreateFile() is being called, a new function
which is largely just a new interface to the existing
OpenTemporaryFileInTablespace() temp file machinery. So, if you look carefully,
you notice that you do in fact end up with FD_TEMPORARY fd.c segments
here...so maybe temp_file_limit isn't broken, because, it turns out,
0007-hj-shared-buf-file-v6.patch is still *halfway* buying into the
conventional idea of file temp-ness. But buffile.c is left behind
("file->isTemp == false"), so AFAICT it must still be broken simply
because new segments cannot be written, per BufFileCreate() comments
quoted above.

This has become more of a PHJ review, which is off-topic for this
thread. But my observations illustrate the difficulty with tying
resource manager resources in local memory (temp segments, and
associated BufFile(s)) with clean-up at shared memory segment
detachment. It's possible, but ticklish enough that you'll end up with
some wart or other when you're done. The question, then, is what wart
is the most acceptable for parallel tuplesort? I see two plausible
paths forward right now:

1. Selectively suppress spurious ENOENT-on-unlink() LOG message in the
brief window where it might be spurious. This is the easiest option.
(There is less code this way.)

2. Do more or less what I've already drafted as my V9, which is
described in the opening mail to this thread, while accepting the ugliness
that that approach brings with it.

[1] 
postgr.es/m/cam3swzrwdntkhig0gyix_1muaypik3dv6-6542pye2iel-f...@mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management

2017-03-09 Thread Peter Geoghegan
On Wed, Mar 8, 2017 at 5:23 PM, Robert Haas  wrote:
> I think something like 0007-hj-shared-buf-file-v6.patch from
> https://www.postgresql.org/message-id/CAEepm=3g=dMG+84083fkFzLvgMJ7HdhbGB=aezabnukbzm3...@mail.gmail.com
> is probably a good approach to this problem.  In essence, it dodges
> the problem of trying to transfer ownership by making ownership be
> common from the beginning.  That's what I've been recommending, or
> trying to, anyway.

I think that I'm already doing that, to at least the same extent that
0007-hj-shared-buf-file-v6.patch is.

That patch seems to be solving the problem by completely taking over
management of temp files from fd.c. That is, these temp files are not
marked as temp files in the way ordinary temp BufFiles are, with
explicit buy-in from resowner.c about their temp-ness. It adds
"#include "catalog/pg_tablespace.h" to buffile.c, even though that
kind of thing generally lives within fd.c. It does use
PG_TEMP_FILE_PREFIX, but never manages to use temp_tablespaces if
that's set by the user. It also doesn't do anything about temp_file_limit
enforcement.

Quite a lot of thought seems to have gone into making the
fd.c/resowner mechanism leak-proof in the face of errors. So, quite
apart from what that approach misses out on, I really don't want to
change resource management too much. As I went into in my "recap", it
seems useful to change as little as possible about temp files.
Besides, because parallel hash join cannot know the size of the
BufFiles from workers in advance, and thus cannot replicate fd.c fully
by having state for each segment, it ends up actually *relying on*
ENOENT-on-unlink() as a condition that terminates execution:

> +bool
> +BufFileDeleteShared(Oid tablespace, pid_t pid, int set, int partition,
> +   int participant)
> +{
> +   char        tempdirpath[MAXPGPATH];
> +   char        tempfilepath[MAXPGPATH];
> +   int         segment = 0;
> +   bool        found = false;
> +
> +   /*
> +* We don't know if the BufFile really exists, because SharedBufFile
> +* tracks only the range of file numbers.  If it does exists, we don't
> +* khow many 1GB segments it has, so we'll delete until we hit ENOENT or
> +* an IO error.
> +*/
> +   for (;;)
> +   {
> +   make_shareable_path(tempdirpath, tempfilepath,
> +   tablespace, pid, set, partition,
> +   participant, segment);
> +   if (!PathNameDelete(tempfilepath))
> +   break;
> +   found = true;
> +   ++segment;
> +   }
> +
> +   return found;
> +}

However, the whole point of my fortifying the refcount mechanism for
V9 of parallel tuplesort is to not have to ignore an ENOENT like this,
on the grounds that that's ugly/weird (I pointed this out when I
posted my V8, actually). Obviously I could very easily teach fd.c not
to LOG a complaint about an ENOENT on unlink() for the relevant
parallel case, on the assumption that it's only down to an error
during the brief period of co-ownership of a BufFile at the start of
leader unification. Is that acceptable?
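
A sketch of what that might look like (the flag and function here are
placeholders, not actual fd.c code; only the ENOENT-tolerance is the point):

#include "postgres.h"

#include <errno.h>
#include <unistd.h>

/*
 * Unlink one temp segment, tolerating ENOENT silently when the segment was
 * briefly co-owned by a worker and the leader, since the other side may
 * legitimately have unlinked it first during error cleanup.
 */
static void
unlink_temp_segment(const char *path, bool was_coowned)
{
    if (unlink(path) < 0)
    {
        if (was_coowned && errno == ENOENT)
            return;             /* expected during leader unification */
        elog(LOG, "could not unlink temporary file \"%s\": %m", path);
    }
}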

Anyway, the only advantage I immediately see with the approach
0007-hj-shared-buf-file-v6.patch takes (over what I've provisionally
written as my V9) is that by putting everything in shared memory all
along, there is no weirdness with tying local memory clean-up to a
shared memory on_dsm_detach() callback. As I said, stashing a local
pointer for the leader in shared memory is weird, even if it
accomplishes the stated goals for V9.

-- 
Peter Geoghegan




Re: [HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management

2017-03-08 Thread Robert Haas
On Mon, Mar 6, 2017 at 7:33 PM, Peter Geoghegan  wrote:
> Recap
> =
>
> A design goal of parallel tuplesort is that the parallel case be as
> close as possible to the existing serial case. Very little new code
> is needed in tuplesort.c and logtape.c, most of which is about using a
> new lower-level facility which is very well encapsulated. And, even
> buffile.c "unification", the guts of everything, doesn't have many new
> special cases.
>
> So, workers start with "unifiable" BufFiles, which differ very little
> from conventional temp BufFiles -- they just create
> filenames in a deterministic fashion, making them discoverable to the
> leader process, through a process of unification (plus you have the
> refcount state in shared memory, though very little). In principle,
> workers could decide to not go through with unification long after the
> fact, and close their unifiable temp files without doing anything for
> the leader, and that would be just fine. By the same token, the
> unified BufFile owned by the leader can be extended if needs be, in a
> way that is virtually indistinguishable from just extending the end of
> any other BufFile. logtape.c recycling for future randomAccess cases
> works just the same as before.

I think something like 0007-hj-shared-buf-file-v6.patch from
https://www.postgresql.org/message-id/CAEepm=3g=dMG+84083fkFzLvgMJ7HdhbGB=aezabnukbzm3...@mail.gmail.com
is probably a good approach to this problem.  In essence, it dodges
the problem of trying to transfer ownership by making ownership be
common from the beginning.  That's what I've been recommending, or
trying to, anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] on_dsm_detach() callback and parallel tuplesort BufFile resource management

2017-03-06 Thread Peter Geoghegan
Recap
=

A design goal of parallel tuplesort is that the parallel case be as
close as possible to the existing serial case. Very little new code
is needed in tuplesort.c and logtape.c, most of which is about using a
new lower-level facility which is very well encapsulated. And, even
buffile.c "unification", the guts of everything, doesn't have many new
special cases.

So, workers start with "unifiable" BufFiles, which differ very little
from conventional temp BufFiles -- they just create
filenames in a deterministic fashion, making them discoverable to the
leader process, through a process of unification (plus you have the
refcount state in shared memory, though very little). In principle,
workers could decide to not go through with unification long after the
fact, and close their unifiable temp files without doing anything for
the leader, and that would be just fine. By the same token, the
unified BufFile owned by the leader can be extended if needs be, in a
way that is virtually indistinguishable from just extending the end of
any other BufFile. logtape.c recycling for future randomAccess cases
works just the same as before.
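
To make "deterministic fashion" concrete: a worker's unifiable temp
filenames are a pure function of identifiers the leader already knows, along
these lines (illustrative only -- neither the helper nor the exact format is
the patch's code):

#include "postgres.h"
#include "storage/fd.h"

/*
 * Build the name of a worker's temp file segment from an operation
 * identifier established at parallel-context setup time plus the worker and
 * segment numbers.  The leader can therefore open the very same files
 * without any filenames being passed through shared memory.
 */
static void
make_unifiable_filename(char *path, size_t pathlen,
                        uint64 operation_id, int worker, int segment)
{
    snprintf(path, pathlen, "base/%s/%s%llu.%d.%d",
             PG_TEMP_FILES_DIR, PG_TEMP_FILE_PREFIX,
             (unsigned long long) operation_id, worker, segment);
}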

I think that this is a nice state of affairs, since having no special
cases will tend to make the code robust. Applying leader-wise offsets
within logtape.c is done at one precise choke-point (one function),
and so it seems very likely that logtape.c routines that CREATE INDEX
happens to not use will "just work" when parallel tuplesort has more
clients at some point in the future. A logical tapeset doesn't even
remember specifically that it is using unification (unless you count
that it has a non-zero offset to apply).
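
That choke-point amounts to little more than the following (names and fields
are placeholders, not the actual logtape.c ones):

/* Placeholder; the real LogicalTapeSet has many more fields. */
typedef struct LogicalTapeSet
{
    long        offsetBlockNumber;  /* start block within unified BufFile */
} LogicalTapeSet;

/*
 * The single translation point: every block number passed down to buffile.c
 * gets the tapeset's offset added, and nothing else in logtape.c needs to
 * know about unification.  The offset is simply zero in the serial case.
 */
static long
ltsTranslateBlockNum(LogicalTapeSet *lts, long blocknum)
{
    return blocknum + lts->offsetBlockNumber;
}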

V8 of the parallel tuplesort patch added a refcount mechanism that is
associated with a "unified" BufFile, consisting of $nworker unifiable
BufFiles, with metadata in shared memory per worker/unifiable BufFile.
We have a refcount per unifiable BufFile, not per fd.c segment. The
refcount thing lets worker processes go away as soon as the leader
begins its final on-the-fly merge -- the leader assumes ownership of
the BufFiles/segments initially owned only by workers. There is a very
brief period of co-ownership, which occurs during unification in the
leader.
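
The per-worker shared metadata is tiny -- on the order of the following
(field names are illustrative, not the V8 code verbatim):

#include "storage/spin.h"

/*
 * One slot per worker/unifiable BufFile (not one per 1GB fd.c segment).
 * The worker holds a reference while its BufFile is open; the leader takes
 * one during unification.  Whoever drops the count to zero is responsible
 * for unlinking the underlying segment files.  Whatever little else the
 * leader needs (such as how much the worker wrote) lives alongside this.
 */
typedef struct UnifiableBufFileShared
{
    slock_t     mutex;          /* protects refcount */
    int         refcount;       /* owners: the worker and/or the leader */
} UnifiableBufFileShared;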

Concern
===

My pending V9 of the patch has to solve a problem that V8 had, namely
that it's not very nice that an error in the leader (e.g. palloc()
failure) during this brief period of co-ownership will make the
resource manager attempt a double unlink() (one from the worker,
another from the leader). There is no danger of data loss, but it's
ugly. The general consensus is that the way to handle it is with an
on_dsm_detach() callback registered on the top-level parallel context's shared
memory segment. I have already written some rough code that does all
this, and demonstrably removes the double unlink() hazard, but I have
some outstanding concerns, that I'd like to clear up.

We have at least 3 different "resource contexts" in play in this rough code:

1. The DSM segment.

2. fd.c segment clean-up (i.e., cleanup that falls on resourceOwner
stored within buffile.c for BufFile segments).

3. The memory context in which BufFile stuff is allocated.

I am bridging the gap between local memory and shared memory here, in
essence. On the one hand, the local state (such as virtual FDs) is what
we need to mark as non-temp, so that the second unlink() is never
attempted during fd.c resource cleanup, which comes last of all (per
resource manager). We do need to touch refcounts in shared memory too,
but shared memory holds only some of the relevant state, which seems
like something
to avoid changing, because we want to keep all of those nice
advantages I went into above. On the other hand, it seems wrong to
jump from shared memory to backend-local memory in my on_dsm_detach()
callback, though that's what works for me right now. The callback is
only registered in the leader process, and only really needs to be
around for as long as there is a brief period of co-ownership of
BufFiles.

Say the caller never gets around to a BufFileClose() call. That's a
bug anyway, and will produce a temp file reference leak warning today.
But this change turns that bug into a hard crash, which is a more
serious bug. This happens because the on_dsm_detach() callback ends up
calling BufFileClose() against a pointer to garbage (freed local
memory). Had BufFileClose() been called before resource cleanup called
the callback, which should happen anyway, then it would have marked
the pointer to local memory (stored in *shared* memory) NULL. No hard
crash would happen from the callback, which returns early, never
spuriously calling BufFileClose() a second time within the leader.
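
In code form, the invariant relied on here looks roughly like this (names
are hypothetical; the real slot lives in the parallel tuplesort shared
state):

#include "postgres.h"
#include "storage/buffile.h"
#include "storage/dsm.h"

/* Hypothetical slot in shared memory holding a leader-local pointer. */
typedef struct LeaderBufFileSlot
{
    BufFile    *leaderBufFile;  /* meaningful only within the leader */
} LeaderBufFileSlot;

/* Wrapper the leader uses instead of calling BufFileClose() directly. */
static void
unified_buffile_close(LeaderBufFileSlot *slot)
{
    if (slot->leaderBufFile != NULL)
    {
        BufFileClose(slot->leaderBufFile);
        slot->leaderBufFile = NULL;     /* makes the detach callback a no-op */
    }
}

/* on_dsm_detach() callback, registered only in the leader. */
static void
unified_buffile_on_dsm_detach(dsm_segment *seg, Datum arg)
{
    LeaderBufFileSlot *slot = (LeaderBufFileSlot *) DatumGetPointer(arg);

    /* Normal path: BufFileClose() already ran and cleared the pointer. */
    if (slot->leaderBufFile == NULL)
        return;

    /* Error path: close (and thereby unlink) on behalf of the leader. */
    BufFileClose(slot->leaderBufFile);
    slot->leaderBufFile = NULL;
}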

ISTM that even today, a buffile.c caller's memory context tacitly shares
the lifetime of the caller's resourceOwner. Otherwise, everything would
break. It doesn't seem that much of a stretch to assume that the DSM
seg callback implicitly has the opportunity to "go first", since ISTM
that things