Re: [HACKERS] background workers, round three

2013-10-16 Thread Robert Haas
On Tue, Oct 15, 2013 at 4:02 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 Hmm. It would probably allow cleaning up a smaller fraction of the data
 structures constructed on a dynamic shared memory segment, if we map /
 unmap for each transaction.

I think the primary use of dynamic shared memory will be for segments
that get created for a single transaction, used, and then destroyed.
Perhaps people will find other uses for them that involve keeping them
mapped for longer periods of time, and that is fine.  But whether for
long or short, it seems clear to me that shared memory data structures
will need cleanup on detach, just as backends must clean up the main
shared memory segment before they detach.
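That single-transaction lifecycle might be sketched as follows. This is only an illustration against the dsm API from the dynamic shared memory patch (dsm_create, dsm_segment_address, dsm_segment_handle, dsm_detach); exact signatures may differ across versions, and it is not a drop-in example:

```
/* Sketch of the create/use/destroy lifecycle described above. */
static void
run_parallel_operation(Size nbytes)
{
    dsm_segment *seg = dsm_create(nbytes);  /* one segment per operation */
    char       *base = dsm_segment_address(seg);

    /* ... pass dsm_segment_handle(seg) to workers, exchange data ... */

    dsm_detach(seg);    /* the last process to detach destroys it */
}
```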

 Based on my own use case, I assumed that smaller chunks allocated on a
 static or dynamic shared memory segment would be used to communicate
 between the main process and worker processes.
 When we move a chunk of data to a co-processor using an asynchronous
 DMA transfer, the API requires the source buffer to be mlock()'ed, to
 prevent it from being unintentionally swapped out during the transfer.
 On the other hand, the cost of an mlock() operation is not negligible,
 so it may be a reasonable design to lock a shared memory segment at
 start-up time and then continue to use it, without unmapping.
 So, I wondered how to handle the situation where an extension tries
 to manage a resource at a smaller granularity than the one managed
 by the PostgreSQL core.

I'm still not sure exactly what your concern is, but I agree that
there are some thorny issues around resource management here.  I've
spent a lot of the last couple months trying to sort through them.  As
it seems to me, the problem is basically that we're introducing major
new types of resources (background workers, dynamic shared memory
segments, and perhaps other things) and they all require management -
just as our existing types of resources (locks, buffer pins, etc.)
require management.  But the code to manage existing resources has
been around for so long that we just rely on it without thinking about
it, so when we suddenly DO need to think about it, it's an unpleasant
surprise.

As far as shared memory resources specifically, one consideration is
that some systems (e.g. some versions of HP-UX) have fairly small
limits (< 15) on the number of shared memory segments that can be
mapped, and all 32-bit systems are prone to running out of usable
address space.  So portable code is going to have to use these
resources judiciously.  On the other hand, I think that people wanting
to write code that only needs to run on 64-bit Linux will be able to
pretty much go nuts.

Unless there are further objections to the terminate patch as written,
I'm going to go ahead and commit that, with the addition of
documentation (as pointed out by Michael) and a change to the lock
mode (as pointed out by KaiGai).  The other patches, at least for the
time being, are withdrawn.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] background workers, round three

2013-10-15 Thread Kohei KaiGai
2013/10/14 Robert Haas robertmh...@gmail.com:
 * ephemeral-precious-v1.patch
 AtEOXact_BackgroundWorker() is located around other AtEOXact_*
 routines. Doesn't it make resource management complicated?
 In the case where the main process goes into its error handler but a
 worker process is still running normally, the worker may continue to
 calculate something and put its results on the shared memory segment,
 even though the main process has asked the postmaster to kill it.

 Since I wrote this patch set, I've been thinking a lot more about
 error recovery.  Obviously, one of the big problems as we think about
 parallel query is that you've now got multiple backends floating
 around, and if the transaction aborts (in any backend), the other
 backends don't automatically know that; they need some way to know
 that they, too, should abort processing.  There are a lot of details to
 get right here, and the time I've spent on it so far convinces me that
 the problem is anything but easy.

 Having said that, I'm not too concerned about the particular issue
 that you raise here.  The resources that need to be cleaned up during
 transaction abort are backend-private resources.  If, for example, the
 user backend detaches a dynamic shared memory segment that is being
 used for a parallel computation, they're not actually *destroying* the
 segment; they are just detaching it *from their address space*.  The
 last process to detach it will also destroy it.  So the ordering in
 which the various processes detach it doesn't matter much.

 One of the things I think is necessary is a set of on_dsm_detach
 callbacks that work pretty much the way that on_shmem_exit callbacks
 work today.  Just as we can't detach from the main shared memory
 segment without releasing locks and buffer pins and lwlocks and our
 PGXACT, we can't detach from a dynamic shared memory segment without
 performing any similar cleanup that is needed.  I'm currently working
 on a patch for that.

Hmm. It would probably allow cleaning up a smaller fraction of the data
structures constructed on a dynamic shared memory segment, if we map /
unmap for each transaction.

 All the ResourceOwnerRelease() callbacks are invoked prior to
 AtEOXact_BackgroundWorker(), so it is hard to release resources still
 in use by a background worker: the worker keeps running normally until
 it receives the termination signal, which is sent later.
 In addition, it complicates the implementation if background workers
 must be designed to release resources whenever they are terminated.
 I don't think it is good coding style to have to release resources in
 different locations depending on the context.

 Which specific resources are you concerned about?

Based on my own use case, I assumed that smaller chunks allocated on a
static or dynamic shared memory segment would be used to communicate
between the main process and worker processes.
When we move a chunk of data to a co-processor using an asynchronous
DMA transfer, the API requires the source buffer to be mlock()'ed, to
prevent it from being unintentionally swapped out during the transfer.
On the other hand, the cost of an mlock() operation is not negligible,
so it may be a reasonable design to lock a shared memory segment at
start-up time and then continue to use it, without unmapping.
So, I wondered how to handle the situation where an extension tries
to manage a resource at a smaller granularity than the one managed
by the PostgreSQL core.

 So, I'd like to propose to add a new invocation point of
 ResourceOwnerRelease() after all AtEOXact_* jobs, with
 new label something like RESOURCE_RELEASE_FINAL.

 In addition, AtEOXact_BackgroundWorker() does not wait for the
 background worker processes being killed to actually terminate.
 Although it depends on the situation, I think it would be a good idea
 to wait for the worker processes to finish terminating, to ensure that
 the resources to be released are handed back to the main process if
 the above ResourceOwnerRelease() does that job.

 Again, which resources are we talking about here?  I tend to think
 it's an essential property of the system that we *shouldn't* have to
 care about the order in which processes are terminated.  First, that
 will be difficult to control; if an ERROR or FATAL condition has
 occurred and we need to terminate, then there are real limits to what
 guarantees we can provide after that point.  Second, it's also
 *expensive*.  The point of parallelism is to make things faster; any
 steps we add that involve waiting for other processes to do things
 will eat away at the available gains.  For a query that'll run for an
 hour that hardly matters, but for short queries it's important to
 avoid unnecessary overhead.

Indeed, you are right. The error path has to terminate soon.
If I implement it, the ResourceOwnerRelease() callback will probably
need some way to inform a worker process that is still running normally
that the transaction was aborted, so that it need not return its
calculation results.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp



Re: [HACKERS] background workers, round three

2013-10-14 Thread Robert Haas
On Fri, Oct 11, 2013 at 9:27 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Few comments about the code:
 1) In postmaster.c, what about adding a comment here telling that we
 can forget about this bgworker as it has already been requested for a
 termination:
 +   if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART
 +   || rw->rw_terminate)

I think that'd be just repeating what the code already says.

 2) Trying to think about this patch as an independent feature, I think
 that it would have been nice to demonstrate the capacity of
 TerminateBackgroundWorker directly with an example in worker_spi to
 show a direct application of it, with for example the addition of a
 function like worker_spi_terminate. However this needs:
 - either an additional API taking a PID as input, the PID being used
 to fetch a bgworker handle with which to terminate the worker
 afterwards. This is basically what I did in the patch attached when
 playing with your patch. You might find it useful... or not! It
 introduces as well

I think this is kind of missing the point of this interface.  If
you've already got the PID, you can just kill the PID yourself.
But suppose you've just registered a background worker in the parent
process, and then there's an ERROR.  You want the background worker to
die, since you don't need it any more, but the postmaster might not
have even started it yet, so there's no PID.  That's the case in which
this is really useful.  It's not a good match for what worker_spi
needs, because in the worker_spi, you're intending to start a process
that you *expect* to outlive the user backend's transaction, and even
the user backend's lifetime.

I agree we need better testing for this code, but the problem is that
this is all pretty low-level plumbing.  This patch set was prompted by
problems that came up when I tried to build an application using this
facility; and I don't know that this is the last patch that will be
needed to solve all of those problems.  I'm working on another patch
set that adds a shared-memory message queue through which messages can
be pushed, and I think that will lend itself well to testing of both
background workers and dynamic shared memory.  I hope to have that
ready in time for the next CommitFest.

 3) Documentation is needed, following comment 2).

Good point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] background workers, round three

2013-10-14 Thread Robert Haas
On Sat, Oct 12, 2013 at 4:53 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 I briefly checked these patches. Let me add some comments.

Thanks!

 * terminate-worker-v1.patch
 TerminateBackgroundWorker() turns on the slot->terminate flag under
 an LW_SHARED lock. Is it reasonable, given that the only possible
 caller is the background worker process itself?

Hmm.  It's probably harmless the way it is, just because if two
processes do that at the same time, it's not going to change the
outcome.  But it might be better to use LW_EXCLUSIVE just to make it
easy to reason about the logic.  I think I cut-and-pasted without
thinking carefully about that.

 * ephemeral-precious-v1.patch
 AtEOXact_BackgroundWorker() is located around other AtEOXact_*
 routines. Doesn't it make resource management complicated?
 In the case where the main process goes into its error handler but a
 worker process is still running normally, the worker may continue to
 calculate something and put its results on the shared memory segment,
 even though the main process has asked the postmaster to kill it.

Since I wrote this patch set, I've been thinking a lot more about
error recovery.  Obviously, one of the big problems as we think about
parallel query is that you've now got multiple backends floating
around, and if the transaction aborts (in any backend), the other
backends don't automatically know that; they need some way to know
that they, too, should abort processing.  There are a lot of details to
get right here, and the time I've spent on it so far convinces me that
the problem is anything but easy.

Having said that, I'm not too concerned about the particular issue
that you raise here.  The resources that need to be cleaned up during
transaction abort are backend-private resources.  If, for example, the
user backend detaches a dynamic shared memory segment that is being
used for a parallel computation, they're not actually *destroying* the
segment; they are just detaching it *from their address space*.  The
last process to detach it will also destroy it.  So the ordering in
which the various processes detach it doesn't matter much.

One of the things I think is necessary is a set of on_dsm_detach
callbacks that work pretty much the way that on_shmem_exit callbacks
work today.  Just as we can't detach from the main shared memory
segment without releasing locks and buffer pins and lwlocks and our
PGXACT, we can't detach from a dynamic shared memory segment without
performing any similar cleanup that is needed.  I'm currently working
on a patch for that.
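The analogy to on_shmem_exit suggests an interface along these lines. This is only a sketch of the idea described here: the callback name is invented for illustration, and the registration signature is an assumption modeled on on_shmem_exit, not a committed API:

```
/* Illustrative cleanup callback; runs when this backend detaches. */
static void
release_parallel_state(dsm_segment *seg, Datum arg)
{
    /* release locks, pins, etc. that reference state inside 'seg' */
}

    ...
    dsm_segment *seg = dsm_attach(handle);

    on_dsm_detach(seg, release_parallel_state, (Datum) 0);
    /* any detach path -- normal, error, or process exit -- now runs
     * the callback before the segment is unmapped */
```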

 All the ResourceOwnerRelease() callbacks are invoked prior to
 AtEOXact_BackgroundWorker(), so it is hard to release resources still
 in use by a background worker: the worker keeps running normally until
 it receives the termination signal, which is sent later.
 In addition, it complicates the implementation if background workers
 must be designed to release resources whenever they are terminated.
 I don't think it is good coding style to have to release resources in
 different locations depending on the context.

Which specific resources are you concerned about?

 So, I'd like to propose to add a new invocation point of
 ResourceOwnerRelease() after all AtEOXact_* jobs, with
 new label something like RESOURCE_RELEASE_FINAL.

 In addition, AtEOXact_BackgroundWorker() does not wait for the
 background worker processes being killed to actually terminate.
 Although it depends on the situation, I think it would be a good idea
 to wait for the worker processes to finish terminating, to ensure that
 the resources to be released are handed back to the main process if
 the above ResourceOwnerRelease() does that job.

Again, which resources are we talking about here?  I tend to think
it's an essential property of the system that we *shouldn't* have to
care about the order in which processes are terminated.  First, that
will be difficult to control; if an ERROR or FATAL condition has
occurred and we need to terminate, then there are real limits to what
guarantees we can provide after that point.  Second, it's also
*expensive*.  The point of parallelism is to make things faster; any
steps we add that involve waiting for other processes to do things
will eat away at the available gains.  For a query that'll run for an
hour that hardly matters, but for short queries it's important to
avoid unnecessary overhead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] background workers, round three

2013-10-12 Thread Kohei KaiGai
I briefly checked these patches. Let me add some comments.

* terminate-worker-v1.patch
TerminateBackgroundWorker() turns on the slot->terminate flag under
an LW_SHARED lock. Is it reasonable, given that the only possible
caller is the background worker process itself?

* ephemeral-precious-v1.patch
AtEOXact_BackgroundWorker() is located around other AtEOXact_*
routines. Doesn't it make resource management complicated?
In the case where the main process goes into its error handler but a
worker process is still running normally, the worker may continue to
calculate something and put its results on the shared memory segment,
even though the main process has asked the postmaster to kill it.

All the ResourceOwnerRelease() callbacks are invoked prior to
AtEOXact_BackgroundWorker(), so it is hard to release resources still
in use by a background worker: the worker keeps running normally until
it receives the termination signal, which is sent later.
In addition, it complicates the implementation if background workers
must be designed to release resources whenever they are terminated.
I don't think it is good coding style to have to release resources in
different locations depending on the context.

So, I'd like to propose to add a new invocation point of
ResourceOwnerRelease() after all AtEOXact_* jobs, with
new label something like RESOURCE_RELEASE_FINAL.

In addition, AtEOXact_BackgroundWorker() does not wait for the
background worker processes being killed to actually terminate.
Although it depends on the situation, I think it would be a good idea
to wait for the worker processes to finish terminating, to ensure that
the resources to be released are handed back to the main process if
the above ResourceOwnerRelease() does that job.

Thanks,

2013/10/11 Robert Haas robertmh...@gmail.com:
 On Fri, Oct 11, 2013 at 9:27 AM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 Finally I got the chance to put my hands on this code. Really sorry
 for the late reply.

 Thanks for the review.  I'll respond to this in more detail later, but
 to make a long story short, I'm looking to apply
 terminate-worker-v1.patch (possibly with modifications after going
 over your review comments), but I'm not feeling too good any more
 about ephemeral-precious-v1.patch, because my experience with those
 facilities has so far proved unsatisfying.  I think I'd like to
 withdraw the latter patch pending further study.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] background workers, round three

2013-10-11 Thread Michael Paquier
Hi,

Finally I got the chance to put my hands on this code. Really sorry
for the late reply.

On Fri, Sep 13, 2013 at 10:42 AM, Robert Haas robertmh...@gmail.com wrote:
 Last week, I attempted to write some code to perform a trivial
 operation in parallel by launching background workers.  Despite my
 earlier convictions that I'd built enough infrastructure to make this
 possible, the experiment turned out badly - so, patches!

 It's been pretty obvious to me from the beginning that any sort of
 parallel computation would need a way to make sure that if any worker
 dies, everybody dies.  Conversely, if the parent aborts, all the
 workers should die.  My thought was that this could be implemented
 using PG_TRY()/PG_CATCH() blocks over the existing infrastructure, but
 this turned out to be naive.  If the original backend encounters an
 error before the child manages to start up, there's no good recovery
 strategy.  The parent can't kill the child because it doesn't exist
 yet, and the parent can't stop the postmaster from starting the child
 later.  The parent could try waiting until the child starts up and
 THEN killing it, but that's probably unreliable and surely
 mind-numbingly lame.  The first attached patch,
 terminate-worker-v1.patch, rectifies this problem by providing a
 TerminateBackgroundWorker() function.  When this function is invoked,
 the postmaster will become unwilling to restart the worker and will
 send it a SIGTERM if it's already running.
And here are some comments after reading the first patch... The patch
looks good, creates no warnings and is not that invasive in the
existing code.

Few comments about the code:
1) In postmaster.c, what about adding a comment here telling that we
can forget about this bgworker, as it has already been requested to
terminate:
+   if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART
+   || rw->rw_terminate)
2) Trying to think about this patch as an independent feature, I think
that it would have been nice to demonstrate the capacity of
TerminateBackgroundWorker directly with an example in worker_spi to
show a direct application of it, with for example the addition of a
function like worker_spi_terminate. However this needs:
- either an additional API taking a PID as input, the PID being used
to fetch a bgworker handle with which to terminate the worker
afterwards. This is basically what I did in the patch attached when
playing with your patch. You might find it useful... or not! It
introduces as well worker_spi_terminate, a small API scanning the
array of bgworker slots. Not sure that this is much compatible with
the concept of generation though...
- return not only the PID of the worker started dynamically in
worker_spi_launch, but also the generation number of the worker to be
able to generate a bgworker handle that could afterwards be used
with TerminateBackgroundWorker.
Note that this comment is based on my personal experience with
bgworkers, and I think that it is important to provide examples of
what is implemented so that users are not led off track, even if
playing with bgworkers is high-level hacking. TerminateBackgroundWorker
can offer users a more natural way to kill a bgworker than
pg_terminate_backend, for example, especially if they implement a
bgworker structure of the launcher/workers type, like what autovacuum
does.
3) Documentation is needed, following comment 2).

 It's important that the
 SIGTERM be sent by the postmaster, because if the original backend
 tries to send it, there's a race condition: the process might not be
 started at the time the original backend tries to send the signal, but
 the postmaster might start it before it sees the terminate request.)
That's interesting. Nothing more to say about that (perhaps because of
my lack of knowledge of the code in this area).

 By itself, this is useful, but painful.  The pain comes from the fact
 that all of the house-keeping is left to the programmer.
If this is necessary, so be it. But I think that newcomers to
background workers would appreciate at least an example in worker_spi
(using as a base what I provided in the patch attached) they could
refer to when beginning to implement a new feature. This is a thing
that people could, and for sure would refer to when implementing a
complex thing using this infrastructure.

This is all I have about the 1st patch... It is already late here, so
I'll have a look at the 2nd/3rd patch hopefully tomorrow or the day
after.
Regards,
-- 
Michael


20131011_bgworker_terminate_pid.patch
Description: Binary data



Re: [HACKERS] background workers, round three

2013-10-11 Thread Robert Haas
On Fri, Oct 11, 2013 at 9:27 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Finally I got the chance to put my hands on this code. Really sorry
 for the late reply.

Thanks for the review.  I'll respond to this in more detail later, but
to make a long story short, I'm looking to apply
terminate-worker-v1.patch (possibly with modifications after going
over your review comments), but I'm not feeling too good any more
about ephemeral-precious-v1.patch, because my experience with those
facilities has so far proved unsatisfying.  I think I'd like to
withdraw the latter patch pending further study.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] background workers, round three

2013-09-12 Thread Robert Haas
Last week, I attempted to write some code to perform a trivial
operation in parallel by launching background workers.  Despite my
earlier convictions that I'd built enough infrastructure to make this
possible, the experiment turned out badly - so, patches!

It's been pretty obvious to me from the beginning that any sort of
parallel computation would need a way to make sure that if any worker
dies, everybody dies.  Conversely, if the parent aborts, all the
workers should die.  My thought was that this could be implemented
using PG_TRY()/PG_CATCH() blocks over the existing infrastructure, but
this turned out to be naive.  If the original backend encounters an
error before the child manages to start up, there's no good recovery
strategy.  The parent can't kill the child because it doesn't exist
yet, and the parent can't stop the postmaster from starting the child
later.  The parent could try waiting until the child starts up and
THEN killing it, but that's probably unreliable and surely
mind-numbingly lame.  The first attached patch,
terminate-worker-v1.patch, rectifies this problem by providing a
TerminateBackgroundWorker() function.  When this function is invoked,
the postmaster will become unwilling to restart the worker and will
send it a SIGTERM if it's already running.  (It's important that the
SIGTERM be sent by the postmaster, because if the original backend
tries to send it, there's a race condition: the process might not be
started at the time the original backend tries to send the signal, but
the postmaster might start it before it sees the terminate request.)
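In use, the pattern this patch enables might look roughly like the following. This is a sketch only: error handling is elided, and the handle-based registration API is as described in this thread:

```
/* Sketch: register a dynamic worker and, on error, ask the postmaster
 * to terminate it -- safe even if the worker has not been started yet. */
BackgroundWorkerHandle *handle;

if (!RegisterDynamicBackgroundWorker(&worker, &handle))
    ereport(ERROR,
            (errmsg("could not register background worker")));

PG_TRY();
{
    /* ... communicate with the worker ... */
}
PG_CATCH();
{
    /* the postmaster will SIGTERM the worker if it is running, and
     * will refuse to (re)start it otherwise */
    TerminateBackgroundWorker(handle);
    PG_RE_THROW();
}
PG_END_TRY();
```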

By itself, this is useful, but painful.  The pain comes from the fact
that all of the house-keeping is left to the programmer.  It is
possible but not elegant to use something like
PG_ENSURE_ERROR_CLEANUP() to ensure that all background workers are
terminated even on an error exit from the affected code.  The other
half of the problem is harder: how do we ensure not only that the
untimely demise of a worker process aborts the original backend's
transaction, but that it does so in a relatively timely fashion?  If
the original backend is executing only a very limited amount of code
while the parallel workers remain in existence, it would be possible
to add explicit checks for the demise of a worker at periodic
intervals through all of that code.  But this seems a very limiting
approach.  In the hope of making things better, the second patch
attached herewith, ephemeral-precious-v1.patch, adds two new flags,
BGWORKER_EPHEMERAL and BGWORKER_PRECIOUS.

Setting the BGWORKER_EPHEMERAL flag causes the background worker to be
killed when the registrant's (sub)transaction ends.  This eliminates
the need to catch errors and explicitly invoke
TerminateBackgroundWorker() in the error path.  You can simply
register an ephemeral worker, write code to do stuff with it, and then
terminate it.  If an error occurs part-way through, the worker will be
terminated as part of the abort path.
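With the proposed flag (an interface from this patch set, not a committed API), the explicit error-path termination could be dropped at registration time; a sketch:

```
/* Sketch using the BGWORKER_EPHEMERAL flag proposed in this patch set:
 * no PG_CATCH cleanup is needed, because abort of the registering
 * (sub)transaction terminates the worker automatically. */
worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_EPHEMERAL;

RegisterDynamicBackgroundWorker(&worker, &handle);
/* ... on any (sub)transaction abort, the worker is killed for us */
```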

Setting the BGWORKER_PRECIOUS flag causes the unexpected death of the
worker to abort the registrant's current (sub)transaction.  This
eliminates the need to sprinkle the code with checks for a deceased
worker.  Instead, you can simply register a precious worker, and then
just remember to CHECK_FOR_INTERRUPTS().  There were a couple of
awkward cases here.  First, all the existing stuff that hooks into
ProcessInterrupts() makes provision for handling the
ImmediateInterruptOK case.  I felt that was superfluous here, so
instead simply prohibited leaving a precious background worker running
beyond the end of the statement.  The point is to enable parallel
computation, which will, I think, begin and end within the lifespan of
one query.  Second, odd things happen if the original backend launches
precious workers and then begins a subtransaction.  Throwing an error
in the subtransaction will not do the right thing; the subtransaction
may easily be something like a PL/pgsql exception block and an error
will be caught and result in unexpected behavior.  So we don't.  I
just added a comment saying that if you do decide to start a
subtransaction while you've got precious workers outstanding, you'd
better insert an explicit check for whether they're still alive after
unwinding the subtransaction (there's a function to make that easy).
We could probably build a mechanism to allow an error to be thrown
against a (sub)xact other than the innermost one, implicitly aborting
everything below that with extreme prejudice, but it seems like
overkill.  I can't help but imagine that early versions of
parallel-anything will include starting subtransactions on the list
of activities which are prohibited in parallel mode.

Using the infrastructure provided by those patches, I was able to
write some test code, attached as pingpong-v1.patch.  You can make a
backend fire up a background worker, and the two will take turns
setting each other's latches for a number of iterations you specify.
This could