Re: [HACKERS] dynamic background workers, round two
On 2013-08-28 14:04:59 -0400, Robert Haas wrote:
> >> + RegisterDynamicBackgroundWorker(BackgroundWorker
> >> + *worker, BackgroundWorkerHandle **handle). Unlike
> >> + RegisterBackgroundWorker, which can only be called from within
> >> + the postmaster, RegisterDynamicBackgroundWorker must be
> >> + called from a regular backend.
> >
> > s/which can only be called from within the postmaster/during initial
> > startup, via shared_preload_libraries/?
>
> This seems like a possible doc clarification not intimately related to
> the patch at hand. The only reason that paragraph is getting reflowed
> is because of the additional argument to
> RegisterDynamicBackgroundWorker(). On the substance, I could go
> either way on whether your proposed text is better than what's there
> now.

I agree it's unrelated. I just noticed because it was in the diff ;)

> It seems like it might need a bit of wordsmithing, anyway.

Yes, looks like it could use some. It's not going to be me doing it,
though...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic background workers, round two
On Wed, Aug 28, 2013 at 2:04 PM, Robert Haas wrote:
>> Hm. Not this patch's fault, but we seem to allow bgw_start_time ==
>> BgWorkerStart_PostmasterStart here which doesn't make sense...
>
> I can add a check for that. I agree that it's a separate patch.

On third thought, is there really any point to such a check? I mean, no
background worker is going to start before it's registered, and by the
time any background worker is registered, we'll be past the time
indicated by all of those constants: BgWorkerStart_PostmasterStart,
BgWorkerStart_ConsistentState, BgWorkerStart_RecoveryFinished. I think
we should view that field as fixing the earliest time at which the
worker should be started, rather than the exact time at which it must
be started. Otherwise no value is sensible. And if we take that
approach, then for a dynamic background worker, any value is OK.

...Robert
Re: [HACKERS] dynamic background workers, round two
On Wed, Aug 28, 2013 at 2:04 PM, Robert Haas wrote:
> I certainly can't promise that the code is bug-free. But I think it's
> probably better to get this into the tree and let people start playing
> around with it than to continue to maintain it in my private sandbox.
> At this point it's just infrastructure anyway, though there seems to
> be a good deal of interest in it from more than just myself... and I
> do think it's a useful step forward from where we are today.

Committed with fixes for the issues you mentioned, plus some updates to
out-of-date comments.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] dynamic background workers, round two
On Tue, Aug 27, 2013 at 9:50 AM, Andres Freund wrote:
>> > BgwHandleStatus GetBackgroundWorkerPid(BackgroundWorkerHandle *handle,
>> >                                        pid_t *pid);
>> > BgwHandleStatus WaitForBackgroundWorkerStartup(BackgroundWorkerHandle
>> >                                                *handle, pid_t *pid);
>>
>> OK, here's a patch with that API. I renamed the constants a bit,
>> because a process that has stopped is not necessarily gone; it could
>> be configured for restart. But we can say that it is stopped, at the
>> moment.
>>
>> I'm not sure that this API is an improvement. But I think it's OK, if
>> you prefer it.
>
> Thanks, looks more readable to me. I like code that tells me what it
> does without having to look in other places ;)

Well, fair enough. But you might have to look around a bit to figure
out that you now have two functions each of which can return 3 out of a
possible 4 values. But whatever.

>> + RegisterDynamicBackgroundWorker(BackgroundWorker
>> + *worker, BackgroundWorkerHandle **handle). Unlike
>> + RegisterBackgroundWorker, which can only be called from within
>> + the postmaster, RegisterDynamicBackgroundWorker must be
>> + called from a regular backend.
>
> s/which can only be called from within the postmaster/during initial
> startup, via shared_preload_libraries/?

This seems like a possible doc clarification not intimately related to
the patch at hand. The only reason that paragraph is getting reflowed
is because of the additional argument to
RegisterDynamicBackgroundWorker(). On the substance, I could go either
way on whether your proposed text is better than what's there now. It
seems like it might need a bit of wordsmithing, anyway.

> s/STATED/STARTED/

Good catch.

>> bool
>> -RegisterDynamicBackgroundWorker(BackgroundWorker *worker)
>> +RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
>> +                                BackgroundWorkerHandle **handle)
>> {
>
> Hm. Not this patch's fault, but we seem to allow bgw_start_time ==
> BgWorkerStart_PostmasterStart here which doesn't make sense...
I can add a check for that. I agree that it's a separate patch.

> Maybe add something like Assert(handle->slot < max_worker_processes);?

Sure.

> Shouldn't that loop have a CHECK_FOR_INTERRUPTS()?

Yep.

> Theoretically this would unset set_latch_on_sigusr1 if it was
> previously already set to 'true'. If you feel defensive you could save
> the previous value...

Probably a good plan.

> So, besides those I don't see much left to be done. I haven't tested
> EXEC_BACKEND, but I don't see anything specific to that here.

I certainly can't promise that the code is bug-free. But I think it's
probably better to get this into the tree and let people start playing
around with it than to continue to maintain it in my private sandbox.
At this point it's just infrastructure anyway, though there seems to be
a good deal of interest in it from more than just myself... and I do
think it's a useful step forward from where we are today.
Re: [HACKERS] dynamic background workers, round two
On 2013-07-26 08:50:51 -0400, Robert Haas wrote:
> > > > Btw, you seem to want to support this in bgworkers started by a
> > > > bgworker. That's not going to work without some changes if the
> > > > "intermediate" bgworker is one without a backend since those
> > > > don't use procsignal_sigusr1_handler.
> > > Right. I think it's OK for now to limit it to cases where the
> > > intermediate bgworker has a backend. If someone else finds that
> > > restriction unacceptable, they can fix it.
> > I don't have a problem with the restriction, but I'd like to see a
> > check against it. Maybe check for MyBackendId != InvalidBackendId in
> > RegisterDynamicBackgroundWorker()? That would also prevent starting
> > further bgworkers before BackgroundWorkerInitializeConnection() is
> > done in a connected bgworker which seems to be a good thing.
> Well, that's easy enough to fix. Should we Assert() or elog() or what?

I think that's not in the patch yet either.

Greetings,

Andres Freund
Re: [HACKERS] dynamic background workers, round two
Hi Robert,

On 2013-08-17 12:08:17 -0400, Robert Haas wrote:
> On Sun, Aug 11, 2013 at 1:31 AM, Andres Freund wrote:
> > So, I'd suggest something like:
> >
> > typedef enum BgwHandleStatus {
> >    BGWH_SUCCESS, /* successfully got status */
> >    BGWH_NOT_YET, /* worker hasn't started yet */
> >    BGWH_GONE, /* worker had been started, but shut down already */
> >    BGWH_POSTMASTER_DIED /* well, there you go */
> > } BgwHandleStatus;
> >
> > BgwHandleStatus GetBackgroundWorkerPid(BackgroundWorkerHandle *handle,
> >                                        pid_t *pid);
> > BgwHandleStatus WaitForBackgroundWorkerStartup(BackgroundWorkerHandle
> >                                                *handle, pid_t *pid);
>
> OK, here's a patch with that API. I renamed the constants a bit,
> because a process that has stopped is not necessarily gone; it could
> be configured for restart. But we can say that it is stopped, at the
> moment.
>
> I'm not sure that this API is an improvement. But I think it's OK, if
> you prefer it.

Thanks, looks more readable to me. I like code that tells me what it
does without having to look in other places ;)

> + RegisterDynamicBackgroundWorker(BackgroundWorker
> + *worker, BackgroundWorkerHandle **handle). Unlike
> + RegisterBackgroundWorker, which can only be called from within
> + the postmaster, RegisterDynamicBackgroundWorker must be
> + called from a regular backend.

s/which can only be called from within the postmaster/during initial
startup, via shared_preload_libraries/?

> + When a background worker is registered using the
> + RegisterDynamicBackgroundWorker function, it is
> + possible for the backend performing the registration to obtain
> + information about the status of the worker. Backends wishing to do
> + this should pass the address of a BackgroundWorkerHandle * as the
> + second argument to RegisterDynamicBackgroundWorker. If the worker
> + is successfully registered, this pointer will be initialized with an
> + opaque handle that can subsequently be passed to
> + GetBackgroundWorkerPid(BackgroundWorkerHandle *, pid_t *).
> + This function can be used to poll the status of the worker: a return
> + value of BGWH_NOT_YET_STARTED indicates that the worker has not
> + yet been started by the postmaster; BGWH_STOPPED
> + indicates that it has been started but is no longer running; and
> + BGWH_STATED indicates that it is currently running.
> + In this last case, the PID will also be returned via the second
> + argument.

s/STATED/STARTED/

> diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
> index f7ebd1a..6414291 100644
> --- a/src/backend/commands/async.c
> +++ b/src/backend/commands/async.c

> -#define InvalidPid (-1)
> -

Hrmpf ;)

> bool
> -RegisterDynamicBackgroundWorker(BackgroundWorker *worker)
> +RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
> +                                BackgroundWorkerHandle **handle)
> {

Hm. Not this patch's fault, but we seem to allow bgw_start_time ==
BgWorkerStart_PostmasterStart here which doesn't make sense...

> +/*
> + * Get the PID of a dynamically-registered background worker.
> + *
> + * If the return value is InvalidPid, it indicates that the worker has
> + * not yet been started by the postmaster.
> + *
> + * If the return value is 0, it indicates that the worker was started
> + * by the postmaster but is no longer running.
> + *
> + * Any other return value is the PID of the running worker.
> + */
> +BgwHandleStatus
> +GetBackgroundWorkerPid(BackgroundWorkerHandle *handle, pid_t *pidp)
> +{
> +    BackgroundWorkerSlot *slot;
> +    pid_t pid;
> +
> +    slot = &BackgroundWorkerData->slot[handle->slot];

Maybe add something like Assert(handle->slot < max_worker_processes);?

> +/*
> + * Wait for a background worker to start up.
> + *
> + * The return value is the same as for GetBackgroundWorkerPID, except
> + * that we only return InvalidPid if the postmaster has died (and
> + * therefore we know that the worker will never be started).
> + */
> +BgwHandleStatus
> +WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle, pid_t *pidp)
> +{
> +    BgwHandleStatus status;
> +    pid_t pid;
> +    int rc;
> +
> +    set_latch_on_sigusr1 = true;
> +
> +    PG_TRY();
> +    {
> +        for (;;)
> +        {
> +            status = GetBackgroundWorkerPid(handle, &pid);
> +            if (status != BGWH_NOT_YET_STARTED)
> +                break;
> +            rc = WaitLatch(&MyProc->procLatch,
> +                           WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
> +            if (rc & WL_POSTMASTER_DEATH)
> +            {
> +                status = BGWH_POSTMASTER_DIED;
> +                break;
> +            }
> +            ResetLatch(&MyProc->procLatch);
> +        }
> +    }
> +
Re: [HACKERS] dynamic background workers, round two
On Sun, Aug 11, 2013 at 1:31 AM, Andres Freund wrote:
> So, I'd suggest something like:
>
> typedef enum BgwHandleStatus {
>    BGWH_SUCCESS, /* successfully got status */
>    BGWH_NOT_YET, /* worker hasn't started yet */
>    BGWH_GONE, /* worker had been started, but shut down already */
>    BGWH_POSTMASTER_DIED /* well, there you go */
> } BgwHandleStatus;
>
> BgwHandleStatus GetBackgroundWorkerPid(BackgroundWorkerHandle *handle,
>                                        pid_t *pid);
> BgwHandleStatus WaitForBackgroundWorkerStartup(BackgroundWorkerHandle
>                                                *handle, pid_t *pid);

OK, here's a patch with that API. I renamed the constants a bit,
because a process that has stopped is not necessarily gone; it could be
configured for restart. But we can say that it is stopped, at the
moment.

I'm not sure that this API is an improvement. But I think it's OK, if
you prefer it.

...Robert

bgworker-feedback-v2.patch
Description: Binary data
Re: [HACKERS] dynamic background workers, round two
[sent again, previously sent as reply instead of reply-all, thanks Robert]

On 2013-08-09 09:09:08 -0400, Robert Haas wrote:
> On Fri, Jul 26, 2013 at 8:50 AM, Robert Haas wrote:
> > On Fri, Jul 26, 2013 at 5:34 AM, Andres Freund wrote:
> >> It doesn't need to be the postmaster, but I think we need to
> >> provide central infrastructure for that. I don't want this to end
> >> up being redone poorly in multiple places.
> >> I just wanted to mention it, it obviously doesn't need to be
> >> implemented at the same time.
> >
> > OK, makes sense. I'm still working out a design for this part of it;
> > I think I want to get the dynamic shared memory stuff working first.

Makes sense. I just wanted to mention that we should keep it in the
back of our minds while working on this.

> >>> >> +#define InvalidPid (-1)
> >>> >> +
> >>> >
> >>> > That value strikes me as a rather bad idea. As a pid that
> >>> > represents init's process group. I.e. kill(InvalidPid) will try
> >>> > to kill init... I'd rather not spread that outside async.c.
> >>>
> >>> Well, if we ever try to kill(InvalidPid), that's a PostgreSQL bug.
> >>> And if the postgres process has permission to send signals to
> >>> init, that's an OS bug (or a PostgreSQL bug, since we refuse to
> >>> run as root). So I'm not very worried about it.
> >>
> >> I don't think somebody mistakenly kill()ing somebody with -1 is all
> >> too likely, but it's a valid value to pass. That's why I dislike
> >> using it as a constant for an invalid value.
> >
> > Well, so is any negative number, but we should never be trying to
> > kill process groups.

Maybe it's just my dislike for constants that are easy to misuse - even
if it's not really relevant for the specific case.

> >>> I've got two functions that
> >>> need distinguishable return values for the not-yet-started case
> >>> and the already-dead case, neither of which can be real PIDs.
> >>> If we don't use 0 and -1, the alternative is presumably to
> >>> complicate the calling signature with another out parameter. That
> >>> does not seem like a net improvement.
> >>
> >> I actually think those cases would be better served by an error
> >> code mechanism which is extensible.
> >
> > Hmm. What signature would you suggest?

So, the two functions are:

* int GetBackgroundWorkerPid(BackgroundWorkerHandle *handle);
* int WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle);

The return value means about the same with:
* If the return value is InvalidPid, it indicates that the worker has
  not yet been started by the postmaster.
* If the return value is 0, it indicates that the worker was started by
  the postmaster but is no longer running.
* Any other return value is the PID of the running worker.

So, I'd suggest something like:

typedef enum BgwHandleStatus {
    BGWH_SUCCESS, /* successfully got status */
    BGWH_NOT_YET, /* worker hasn't started yet */
    BGWH_GONE, /* worker had been started, but shut down already */
    BGWH_POSTMASTER_DIED /* well, there you go */
} BgwHandleStatus;

BgwHandleStatus GetBackgroundWorkerPid(BackgroundWorkerHandle *handle,
                                       pid_t *pid);
BgwHandleStatus WaitForBackgroundWorkerStartup(BackgroundWorkerHandle
                                               *handle, pid_t *pid);

> >> I can see the point in the argument, but it still seems to be
> >> grotty architecturally.
> >
> > Suck it up. :-)

That I am able to do ;) Holidays help :P

> > I considered the idea of using SIGUSR2, but it didn't seem worth it.
> >
> >>> I don't understand what you mean by this. Why does it ignore
> >>> postmaster death?
> >>
> >> The other callers that check for WL_POSTMASTER_DEATH actually
> >> perform drastic measures like exit()ing upon postmaster death. Here
> >> you just return. With the unix latch implementation it seems to be
> >> guaranteed that the next WaitLatch() done somewhere else will also
> >> return WL_POSTMASTER_DEATH.
> >> I am not sure whether it might actually "consume" the death in the
> >> windows implementation.
> >
> > Well, IMHO, every backend should exit as quickly as possible upon
> > discovering that the postmaster has died. However, for complex
> > political reasons (ah-Tom-Lane-choo!), our policy is to have the
> > auxiliary processes exit and regular backends soldier on. This
> > seems completely stupid to me, but it's not this patch's job to
> > reverse that policy decision.

While I'd phrase it more politely/complex, agreed. And agreed with not
changing policy here.

I am not trying to argue for a different behaviour, but basically
wondering whether we need something like "LatchRearmPostmasterDeath()"
for the benefit of the windows implementation.

Maybe I am just dumb or years of not really touching windows got to me,
but I find the documentation around windows specific stuff utterly
lacking. There doesn't seem to be any comment on how the windows signal
queue stuff works on a high level.

> > To put that another way, one could equally well ask why WaitLatch(),
> > or for that matter
Re: [HACKERS] dynamic background workers, round two
On Fri, Jul 26, 2013 at 5:34 AM, Andres Freund wrote:
> It doesn't need to be the postmaster, but I think we need to provide
> central infrastructure for that. I don't want this to end up being
> redone poorly in multiple places.
> I just wanted to mention it, it obviously doesn't need to be
> implemented at the same time.

OK, makes sense. I'm still working out a design for this part of it; I
think I want to get the dynamic shared memory stuff working first.

>> >> +#define InvalidPid (-1)
>> >> +
>> >
>> > That value strikes me as a rather bad idea. As a pid that
>> > represents init's process group. I.e. kill(InvalidPid) will try to
>> > kill init... I'd rather not spread that outside async.c.
>>
>> Well, if we ever try to kill(InvalidPid), that's a PostgreSQL bug.
>> And if the postgres process has permission to send signals to init,
>> that's an OS bug (or a PostgreSQL bug, since we refuse to run as
>> root). So I'm not very worried about it.
>
> I don't think somebody mistakenly kill()ing somebody with -1 is all
> too likely, but it's a valid value to pass. That's why I dislike using
> it as a constant for an invalid value.

Well, so is any negative number, but we should never be trying to kill
process groups.

>> I've got two functions that
>> need distinguishable return values for the not-yet-started case and
>> the already-dead case, neither of which can be real PIDs. If we don't
>> use 0 and -1, the alternative is presumably to complicate the calling
>> signature with another out parameter. That does not seem like a net
>> improvement.
>
> I actually think those cases would be better served by an error code
> mechanism which is extensible.

Hmm. What signature would you suggest?

> I can see the point in the argument, but it still seems to be grotty
> architecturally.

Suck it up. :-) I considered the idea of using SIGUSR2, but it didn't
seem worth it.

>> I don't understand what you mean by this. Why does it ignore
>> postmaster death?
> The other callers that check for WL_POSTMASTER_DEATH actually perform
> drastic measures like exit()ing upon postmaster death. Here you just
> return. With the unix latch implementation it seems to be guaranteed
> that the next WaitLatch() done somewhere else will also return
> WL_POSTMASTER_DEATH. I am not sure whether it might actually "consume"
> the death in the windows implementation.

Well, IMHO, every backend should exit as quickly as possible upon
discovering that the postmaster has died. However, for complex
political reasons (ah-Tom-Lane-choo!), our policy is to have the
auxiliary processes exit and regular backends soldier on. This seems
completely stupid to me, but it's not this patch's job to reverse that
policy decision.

To put that another way, one could equally well ask why WaitLatch(), or
for that matter PostmasterIsAlive(), doesn't contain ereport(FATAL)
upon discovering that the postmaster has given up the ghost. And the
answer is that it's the job of those functions to provide information,
not make policy decisions.

> I don't have a problem with the restriction, but I'd like to see a
> check against it. Maybe check for MyBackendId != InvalidBackendId in
> RegisterDynamicBackgroundWorker()? That would also prevent starting
> further bgworkers before BackgroundWorkerInitializeConnection() is
> done in a connected bgworker which seems to be a good thing.

Well, that's easy enough to fix. Should we Assert() or elog() or what?
Re: [HACKERS] dynamic background workers, round two
On 2013-07-25 12:35:30 -0400, Robert Haas wrote:
> On Wed, Jul 24, 2013 at 5:34 PM, Andres Freund wrote:
> > This seems like a sensible idea to me. But, in the context of
> > dynamic query, don't we also need the reverse infrastructure of
> > notifying a bgworker that the client, that requested it to be
> > started, has died? Ending up with a dozen bgworkers running because
> > the normal backend FATALed doesn't seem nice.
>
> I think that will be desirable for many, although not all, uses of
> background workers. For example, you might want to be able to do
> something like SELECT pg_background('CREATE INDEX CONCURRENTLY ...')
> and then exit your session and have the background worker continue to
> run. Or maybe somebody writes a trigger that starts a background
> worker (if there's not one already working) and queues up work for
> the trigger, and the background worker continues processing the queue
> until it is empty, and then exits.
>
> But for parallel query specifically, yeah, we'll probably want the
> untimely demise of either the original backend or any of its workers
> to result in the death of all the workers and an abort of the
> parent's transaction. However, there's no need for the postmaster to
> be involved in any of that. For example, once the worker process is
> up and running and the parent has its PID, it can use an
> on_shmem_exit hook or a PG_ENSURE_ERROR_CLEANUP() block to send
> SIGTERM to any still-living workers.

It doesn't need to be the postmaster, but I think we need to provide
central infrastructure for that. I don't want this to end up being
redone poorly in multiple places. I just wanted to mention it, it
obviously doesn't need to be implemented at the same time.

>> +#define InvalidPid (-1)
>> +

That value strikes me as a rather bad idea. As a pid that represents
init's process group. I.e. kill(InvalidPid) will try to kill init...
I'd rather not spread that outside async.c.
> Well, if we ever try to kill(InvalidPid), that's a PostgreSQL bug.
> And if the postgres process has permission to send signals to init,
> that's an OS bug (or a PostgreSQL bug, since we refuse to run as
> root). So I'm not very worried about it.

I don't think somebody mistakenly kill()ing somebody with -1 is all too
likely, but it's a valid value to pass. That's why I dislike using it
as a constant for an invalid value.

> I've got two functions that
> need distinguishable return values for the not-yet-started case and
> the already-dead case, neither of which can be real PIDs. If we don't
> use 0 and -1, the alternative is presumably to complicate the calling
> signature with another out parameter. That does not seem like a net
> improvement.

I actually think those cases would be better served by an error code
mechanism which is extensible.

> +/*
> + * Report the PID of a newly-launched background worker in shared memory.
> + *
> + * This function should only be called from the postmaster.
> + */
> +void
> +ReportBackgroundWorkerPID(RegisteredBgWorker *rw)
> +{
> +    BackgroundWorkerSlot *slot;
> +
> +    Assert(rw->rw_shmem_slot < max_worker_processes);
> +    slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
> +    slot->pid = rw->rw_pid;
> +
> +    if (rw->rw_worker.bgw_notify_pid != 0)
> +        kill(rw->rw_worker.bgw_notify_pid, SIGUSR1);
> +}

Any reason not to use SendProcSignal() properly here? Just that you
don't know the BackendId? It seems unclean to reuse SIGUSR1 without
using the infrastructure built for that.

> We're in the postmaster here. The postmaster currently uses kill
> directly in all cases, rather than SendProcSignal(), and I'd rather
> not change that. Any code that the postmaster calls has got to be
> safe against arbitrary shared memory corruption, and I'd rather keep
> that footprint as small as possible.
> For example, if we added bgw_backend_id, the postmaster would have to
> bounds check it to make sure it didn't seg fault. The more of that
> kind of stuff we add, the more chances there are to destabilize the
> postmaster. I'd like to keep it as minimal as possible.

I can see the point in the argument, but it still seems to be grotty
architecturally.

> +    for (;;)
> +    {
> +        pid = GetBackgroundWorkerPid(handle);
> +        if (pid != InvalidPid)
> +            break;
> +        rc = WaitLatch(&MyProc->procLatch,
> +                       WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
> +        if (rc & WL_POSTMASTER_DEATH)
> +            return InvalidPid;

This basically ignores postmaster death. With unix_latch.c that
shouldn't be a problem because afaics its implementation correctly
should report an error in the next WaitLatch without waiti
Re: [HACKERS] dynamic background workers, round two
On Wed, Jul 24, 2013 at 5:34 PM, Andres Freund wrote:
> This seems like a sensible idea to me. But, in the context of dynamic
> query, don't we also need the reverse infrastructure of notifying a
> bgworker that the client, that requested it to be started, has died?
> Ending up with a dozen bgworkers running because the normal backend
> FATALed doesn't seem nice.

I think that will be desirable for many, although not all, uses of
background workers. For example, you might want to be able to do
something like SELECT pg_background('CREATE INDEX CONCURRENTLY ...')
and then exit your session and have the background worker continue to
run. Or maybe somebody writes a trigger that starts a background worker
(if there's not one already working) and queues up work for the
trigger, and the background worker continues processing the queue until
it is empty, and then exits.

But for parallel query specifically, yeah, we'll probably want the
untimely demise of either the original backend or any of its workers to
result in the death of all the workers and an abort of the parent's
transaction. However, there's no need for the postmaster to be involved
in any of that. For example, once the worker process is up and running
and the parent has its PID, it can use an on_shmem_exit hook or a
PG_ENSURE_ERROR_CLEANUP() block to send SIGTERM to any still-living
workers.

In the other direction, the parent needs to know not only whether or
not the child has died after startup, but most likely also needs to
receive errors, notices, tuples, or other data from the child. If the
parent receives an ERROR or FATAL indication from a parallel-worker
child, it should most likely kill all other workers and then abort the
transaction. But in neither of those cases does the postmaster need to
be involved.

The reason why this particular mechanism is needed is because the
worker isn't able to report its own failure to start. Only the
postmaster can do that.
Once both processes are up and running, it's up to them to handle their
own IPC needs.

> It really should be pid_t even though we're not consistently doing
> that in the code.

Hmm, I thought I saw we weren't using that type. But I guess we are, so
I can go make this consistent.

>> +#define InvalidPid (-1)
>> +
>
> That value strikes me as a rather bad idea. As a pid that represents
> init's process group. I.e. kill(InvalidPid) will try to kill init...
> I'd rather not spread that outside async.c.

Well, if we ever try to kill(InvalidPid), that's a PostgreSQL bug. And
if the postgres process has permission to send signals to init, that's
an OS bug (or a PostgreSQL bug, since we refuse to run as root). So I'm
not very worried about it. I've got two functions that need
distinguishable return values for the not-yet-started case and the
already-dead case, neither of which can be real PIDs. If we don't use 0
and -1, the alternative is presumably to complicate the calling
signature with another out parameter. That does not seem like a net
improvement.

>> +/*
>> + * Report the PID of a newly-launched background worker in shared memory.
>> + *
>> + * This function should only be called from the postmaster.
>> + */
>> +void
>> +ReportBackgroundWorkerPID(RegisteredBgWorker *rw)
>> +{
>> +    BackgroundWorkerSlot *slot;
>> +
>> +    Assert(rw->rw_shmem_slot < max_worker_processes);
>> +    slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
>> +    slot->pid = rw->rw_pid;
>> +
>> +    if (rw->rw_worker.bgw_notify_pid != 0)
>> +        kill(rw->rw_worker.bgw_notify_pid, SIGUSR1);
>> +}
>
> Any reason not to use SendProcSignal() properly here? Just that you
> don't know the BackendId? It seems unclean to reuse SIGUSR1 without
> using the infrastructure built for that.

We're in the postmaster here. The postmaster currently uses kill
directly in all cases, rather than SendProcSignal(), and I'd rather not
change that.
Any code that the postmaster calls has got to be safe against arbitrary
shared memory corruption, and I'd rather keep that footprint as small
as possible. For example, if we added bgw_backend_id, the postmaster
would have to bounds check it to make sure it didn't seg fault. The
more of that kind of stuff we add, the more chances there are to
destabilize the postmaster. I'd like to keep it as minimal as possible.

>> + * If handle != NULL, we'll set *handle to a pointer that can
>> + * subsequently be used as an argument to GetBackgroundWorkerPid().
>> + * The caller can free this pointer using pfree(), if desired.
>> + */
>> bool
>> -RegisterDynamicBackgroundWorker(BackgroundWorker *worker)
>> +RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
>> +                                BackgroundWorkerHandle **handle)
>
> I'd just let the caller provide the memory, but whatever ;)

It could, but then the structure couldn't be opaque. That wouldn't be a
disaster but this provides better protection against future API
violations.

> None of the containing code
Re: [HACKERS] dynamic background workers, round two
On Thu, Jul 25, 2013 at 8:05 AM, Andres Freund wrote:
> On 2013-07-25 08:03:17 +0900, Michael Paquier wrote:
> > On Thu, Jul 25, 2013 at 6:34 AM, Andres Freund wrote:
> > > > --- a/contrib/worker_spi/worker_spi.c
> > > > +++ b/contrib/worker_spi/worker_spi.c
> > >
> > > Btw, I've posted a minimal regression test for bgworkers/worker_spi
> > > in 20130724175742.gd10...@alap2.anarazel.de - I'd like to see some
> > > coverage of this...
> >
> > I could not find this email ID in the archives. A link perhaps?
>
> Hm. Works for me:
> http://www.postgresql.org/message-id/20130724175742.gd10...@alap2.anarazel.de

Oops. Thanks, the search field does not like it if spaces are added to
the ID.

--
Michael
Re: [HACKERS] dynamic background workers, round two
On 2013-07-25 08:03:17 +0900, Michael Paquier wrote: > On Thu, Jul 25, 2013 at 6:34 AM, Andres Freund wrote: > > > > --- a/contrib/worker_spi/worker_spi.c > > > +++ b/contrib/worker_spi/worker_spi.c > > > > Btw, I've posted a minimal regression test for bworkers/worker_spi in > > 20130724175742.gd10...@alap2.anarazel.de - I'd like to see some coverage > > of this... > > > I could not find this email ID in the archives. A link perhaps? Hm. Works for me: http://www.postgresql.org/message-id/20130724175742.gd10...@alap2.anarazel.de Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] dynamic background workers, round two
On Thu, Jul 25, 2013 at 6:34 AM, Andres Freund wrote: > > --- a/contrib/worker_spi/worker_spi.c > > +++ b/contrib/worker_spi/worker_spi.c > > Btw, I've posted a minimal regression test for bworkers/worker_spi in > 20130724175742.gd10...@alap2.anarazel.de - I'd like to see some coverage > of this... > I could not find this email ID in the archives. A link perhaps? -- Michael
Re: [HACKERS] dynamic background workers, round two
On Thu, Jul 25, 2013 at 6:34 AM, Andres Freund wrote: > Hi, > > On 2013-07-24 12:46:21 -0400, Robert Haas wrote: > > The attached patch attempts to remedy this problem. When you register > > a background worker, you can obtain a "handle" that can subsequently > > be used to query for the worker's PID. If you additionally initialize > > bgw_notify_pid = MyProcPid, then the postmaster will send you SIGUSR1 > > when worker startup has been attempted (successfully or not). You can > > wait for this signal by passing your handle to > > WaitForBackgroundWorkerStartup(), which will return only when either > > (1) an attempt has been made to start the worker or (2) the postmaster > > is determined to have died. This interface seems adequate for > > something like worker_spi, where it's useful to know whether the child > > was started or not (and return the PID if so) but that's about all > > that's really needed. > > This seems like a sensible idea to me. But, in the context of dynamic > query, don't we also need the reverse infrastructure of notifying a > bgworker that the client, that requested it to be started, has died? > Ending up with a dozen bgworkers running because the normal backend > FATALed doesn't seem nice. > Yes, this would be necessary to have, but it should definitely be a separate patch. In order to satisfy that, you could register for each worker the PID of the parent process that started it, if it was started by another bgworker, and have the postmaster scan the list of bgworkers and stop those whose parent, itself a bgworker, has died. By the way, the PID registered in this case should not be bgw_notify_pid but something saved in BackgroundWorkerSlot. My 2c on that though, feel free to comment. > > More complex notification interfaces can also be coded using the > > primitives introduced by this patch.
Setting bgw_notify_pid will > > cause the postmaster to send SIGUSR1 every time it starts the worker > > and every time the worker dies. Every time you receive that signal, > > you can call GetBackgroundWorkerPid() for each background worker to > > find out which ones have started or terminated. The only slight > > difficulty is in waiting for receipt of that signal, which I > > implemented by adding a new global called set_latch_on_sigusr1. > > WaitForBackgroundWorkerStartup() uses that flag internally, but > > there's nothing to prevent a caller from using it as part of their own > > event loop. I find the set_latch_on_sigusr1 flag to be slightly > > ugly, but it seems better and far safer than having the postmaster try > > to actually set the latch itself - for example, it'd obviously be > > unacceptable to pass a Latch * to the postmaster and have the > > postmaster assume that pointer valid. > > I am with you in finding it somewhat ugly. > After a quick look at the code, yeah, using a boolean here looks like overkill. -- Michael
Re: [HACKERS] dynamic background workers, round two
Hi, On 2013-07-24 12:46:21 -0400, Robert Haas wrote: > The attached patch attempts to remedy this problem. When you register > a background worker, you can obtain a "handle" that can subsequently > be used to query for the worker's PID. If you additionally initialize > bgw_notify_pid = MyProcPid, then the postmaster will send you SIGUSR1 > when worker startup has been attempted (successfully or not). You can > wait for this signal by passing your handle to > WaitForBackgroundWorkerStartup(), which will return only when either > (1) an attempt has been made to start the worker or (2) the postmaster > is determined to have died. This interface seems adequate for > something like worker_spi, where it's useful to know whether the child > was started or not (and return the PID if so) but that's about all > that's really needed. This seems like a sensible idea to me. But, in the context of dynamic query, don't we also need the reverse infrastructure of notifying a bgworker that the client, that requested it to be started, has died? Ending up with a dozen bgworkers running because the normal backend FATALed doesn't seem nice. > More complex notification interfaces can also be coded using the > primitives introduced by this patch. Setting bgw_notify_pid will > cause the postmaster to send SIGUSR1 every time it starts the worker > and every time the worker dies. Every time you receive that signal, > you can call GetBackgroundWorkerPid() for each background worker to > find out which ones have started or terminated. The only slight > difficulty is in waiting for receipt of that signal, which I > implemented by adding a new global called set_latch_on_sigusr1. > WaitForBackgroundWorkerStartup() uses that flag internally, but > there's nothing to prevent a caller from using it as part of their own > event loop. 
I find the set_latch_on_sigusr1 flag to be slightly > ugly, but it seems better and far safer than having the postmaster try > to actually set the latch itself - for example, it'd obviously be > unacceptable to pass a Latch * to the postmaster and have the > postmaster assume that pointer valid. I am with you in finding it somewhat ugly. > --- a/contrib/worker_spi/worker_spi.c > +++ b/contrib/worker_spi/worker_spi.c Btw, I've posted a minimal regression test for bworkers/worker_spi in 20130724175742.gd10...@alap2.anarazel.de - I'd like to see some coverage of this... > +int bgw_notify_pid; It really should be pid_t even though we're not consistently doing that in the code. > diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c > index f7ebd1a..6414291 100644 > --- a/src/backend/commands/async.c > +++ b/src/backend/commands/async.c > @@ -207,8 +207,6 @@ typedef struct QueueBackendStatus > QueuePosition pos; /* backend has read queue up to > here */ > } QueueBackendStatus; > > -#define InvalidPid (-1) > - > diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h > index edced29..0aa540a 100644 > --- a/src/include/miscadmin.h > +++ b/src/include/miscadmin.h > @@ -28,6 +28,8 @@ > > #define PG_BACKEND_VERSIONSTR "postgres (PostgreSQL) " PG_VERSION "\n" > > +#define InvalidPid (-1) > + That value strikes me as a rather bad idea: as a pid it represents init's process group, i.e. kill(InvalidPid) will try to kill init... I'd rather not spread that outside async.c. > +/* > + * Report the PID of a newly-launched background worker in shared memory. > + * > + * This function should only be called from the postmaster.
> + */ > +void > +ReportBackgroundWorkerPID(RegisteredBgWorker *rw) > +{ > + BackgroundWorkerSlot *slot; > + > + Assert(rw->rw_shmem_slot < max_worker_processes); > + slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot]; > + slot->pid = rw->rw_pid; > + > + if (rw->rw_worker.bgw_notify_pid != 0) > + kill(rw->rw_worker.bgw_notify_pid, SIGUSR1); > +} Any reason not to use SendProcSignal() properly here? Just that you don't know the BackendId? It seems unclean to reuse SIGUSR1 without using the infrastructure built for that. I first thought you didn't want to do it because it touches shmem, but given that ReportBackgroundWorkerPID() does so itself... > + * If handle != NULL, we'll set *handle to a pointer that can subsequently > + * be used as an argument to GetBackgroundWorkerPid(). The caller can > + * free this pointer using pfree(), if desired. > */ > bool > -RegisterDynamicBackgroundWorker(BackgroundWorker *worker) > +RegisterDynamicBackgroundWorker(BackgroundWorker *worker, > + > BackgroundWorkerHandle **handle) I'd just let the caller provide the memory, but whatever ;) > +/* > + * Wait for a background worker to start up. > + * > + * The return value is the same as for GetBackgroundWorkerPID, except that > + * we only return InvalidPid if the postmaster has died (and therefore we > + * know that the worker will
[HACKERS] dynamic background workers, round two
The dynamic background workers patch that I submitted for CF1 was generally well-received, but several people commented on a significant limitation: there's currently no way for a backend that requests a new background worker to know whether that background worker was successfully started. If you're using the background worker mechanism to run daemon processes of some sort, this is probably not a huge problem. You most likely don't start and stop those processes very frequently, and when you do manually start them, you can always examine the server log to see whether everything worked as planned. Maybe not ideal, but not unbearably awful, either. However, the goal I'm working towards here is parallel query, and for that application, and some others, things don't look so good. In that case, you are probably launching a background worker flagged as BGW_NEVER_RESTART, and so the postmaster is going to try to start it just once, and if it doesn't work, you really need some way of knowing that. Of course, if the worker is launched successfully, you can have it notify the process that started it via any mechanism you choose: creating a sentinel file, inserting data into a table, setting the process latch. Sky's the limit. However, if the worker isn't launched successfully (fork fails, or the process crashes before it reaches your code) you have no way of knowing. If you don't receive the agreed-upon notification from the child, it means that EITHER the process was never started in the first place OR the postmaster just hasn't gotten around to starting it yet. Of course, you could wait for a long time (5s ?) and then give up, but there's no good way to set the wait time. If you make it long, then it takes a long time for you to report errors to the client even when those errors happen quickly. If you make it short, you may time out spuriously on a busy system. The attached patch attempts to remedy this problem. 
When you register a background worker, you can obtain a "handle" that can subsequently be used to query for the worker's PID. If you additionally initialize bgw_notify_pid = MyProcPid, then the postmaster will send you SIGUSR1 when worker startup has been attempted (successfully or not). You can wait for this signal by passing your handle to WaitForBackgroundWorkerStartup(), which will return only when either (1) an attempt has been made to start the worker or (2) the postmaster is determined to have died. This interface seems adequate for something like worker_spi, where it's useful to know whether the child was started or not (and return the PID if so) but that's about all that's really needed. More complex notification interfaces can also be coded using the primitives introduced by this patch. Setting bgw_notify_pid will cause the postmaster to send SIGUSR1 every time it starts the worker and every time the worker dies. Every time you receive that signal, you can call GetBackgroundWorkerPid() for each background worker to find out which ones have started or terminated. The only slight difficulty is in waiting for receipt of that signal, which I implemented by adding a new global called set_latch_on_sigusr1. WaitForBackgroundWorkerStartup() uses that flag internally, but there's nothing to prevent a caller from using it as part of their own event loop. I find the set_latch_on_sigusr1 flag to be slightly ugly, but it seems better and far safer than having the postmaster try to actually set the latch itself - for example, it'd obviously be unacceptable to pass a Latch * to the postmaster and have the postmaster assume that pointer valid. I'm hopeful that this is about as much fiddling with the background worker mechanism per se as will be needed for parallel query. 
Once we have this, I think the next hurdle will be designing suitable IPC mechanisms, so that there's a relatively easy way for the "parent" process to pass information to its background workers and vice versa. I expect the main tool for that to be a dynamic shared memory facility; but I'm hoping that won't be closely tied to background workers, though they may be heavy users of it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company Attachment: bgworker-feedback-v1.patch (binary)