Re: [openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?
On Thu, Sep 14, 2017 at 7:33 PM, Adam Spiers wrote:
>
> Hi Ken,
>
> Thanks a lot for the analysis, and sorry for the slow reply!
> Comments inline...
>
> Ken Giusti wrote:
> > Hi Adam,
> >
> > I think there's a couple of problems here.
> >
> > Regardless of worker count, the service.wait() is called before
> > service.start(). And from looking at the oslo.service code, the 'wait()'
> > method is called after start(), then again after stop(). This doesn't match
> > up with the intended use of oslo.messaging.server.wait(), which should only
> > be called after .stop().
>
> Hmm, so are you saying that there might be a bug in oslo.service's
> usage of oslo.messaging, and that this Sahara bugfix was the wrong
> approach too?
>
> https://review.openstack.org/#/c/280741/1/sahara/cli/sahara_engine.py
>

Well, I don't think the explicit call to start() is going to help,
especially if the number of workers is > 1, since the workers are forked
and need to call start() from their own process space.

In fact, if the number of workers is > 1 then you not only get an RPC
server in each worker process, you'll also end up with an extra RPC
server in the calling thread.

Take a look at a test service I've created for oslo.messaging:

https://pastebin.com/rSA6AD82

If you change the main code to call the new sequence, you'll end up
with 3 RPC servers (2 in the workers, one in the main process).

In that code I've made the wait() call a no-op if the server hasn't
been started first. And the stop method will call stop and wait on the
RPC server, which is the expected sequence as far as oslo.messaging is
concerned.

It seems to me that the bug is in oslo.service: calling wait() before
start() doesn't make sense.

> > Perhaps a bigger issue is that in the multi threaded case all threads
> > appear to be calling start, wait, and stop on the same instance of the
> > service (oslo.messaging rpc server). At least that's what I'm seeing in my
> > muchly reduced test code:

I was wrong about this - I failed to notice that each service had
forked and was dealing with its own copy of the server.

> >
> > https://paste.fedoraproject.org/paste/-73zskccaQvpSVwRJD11cA
> >
> > The log trace shows multiple calls to start, wait, stop via different
> > threads to the same TaskServer instance:
> >
> > https://paste.fedoraproject.org/paste/dyPq~lr26sQZtMzHn5w~Vg
> >
> > Is that expected?
>
> Unfortunately in the interim, your pastes seem to have vanished - any
> chance you could repaste them?
>

Ugh - didn't keep a copy. If you pull down that test code you can use
it to generate those traces.

> Thanks,
> Adam
>
> > On Mon, Jul 31, 2017 at 9:32 PM, Adam Spiers wrote:
> > > Ken Giusti wrote:
> > >> On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers wrote:
> > >>> I recently discovered a bug where barbican-worker would hang on
> > >>> shutdown if queue.asynchronous_workers was changed from 1 to 2:
> > >>>
> > >>>    https://bugs.launchpad.net/barbican/+bug/1705543
> > >>>
> > >>> resulting in a warning like this:
> > >>>
> > >>>    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for start to complete
> > >>>
> > >>> I found a similar bug in Sahara:
> > >>>
> > >>>    https://bugs.launchpad.net/sahara/+bug/1546119
> > >>>
> > >>> where the fix was to call start() on the RPC service before making the
> > >>> launcher wait() on it, so I ported the fix to Barbican, and it seems
> > >>> to work fine:
> > >>>
> > >>>    https://review.openstack.org/#/c/485755
> > >>>
> > >>> I noticed that both projects use ProcessLauncher; barbican uses
> > >>> oslo_service.service.launch() which has:
> > >>>
> > >>>    if workers is None or workers == 1:
> > >>>        launcher = ServiceLauncher(conf, restart_method=restart_method)
> > >>>    else:
> > >>>        launcher = ProcessLauncher(conf, restart_method=restart_method)
> > >>>
> > >>> However, I'm not an expert in oslo.service or oslo.messaging, and one
> > >>> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
> > >>> other projects start the task before calling wait() on the launcher,
> > >>> so I thought I'd check here whether that is the correct fix, or
> > >>> whether there's something else odd going on.
> > >>>
> > >>> Any oslo gurus able to shed light on this?
> > >>>
> > >>
> > >> As far as an oslo.messaging server is concerned, the order of operations
> > >> is:
> > >>
> > >> server.start()
> > >> # do stuff until ready to stop the server...
> > >> server.stop()
> > >> server.wait()
> > >>
> > >> The final wait blocks until all requests that are in progress when stop()
> > >> is called finish and clean up.
> > >
> > > Thanks - that makes sense. So the question is, why would
> > > barbican-worker only hang on shutdown when there are multiple workers?
> > > Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
> > > and it's not calling start() correctly?

--
Ken
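The guard described in this reply (wait() as a no-op before start(), and stop() performing stop-then-wait on the RPC server) can be sketched roughly as follows. This is not the pastebin code, which is not reproduced here. `FakeRPCServer` and `GuardedService` are hypothetical names invented for the illustration, the fake server only records calls, and what wait() should do once the service is running is not spelled out in the thread, so delegating in that case is an assumption of the sketch.

```python
class FakeRPCServer:
    """Hypothetical stand-in for an RPC server; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def stop(self):
        self.calls.append("stop")

    def wait(self):
        self.calls.append("wait")


class GuardedService:
    """Hypothetical service wrapper that tolerates wait() being called at odd times."""

    def __init__(self, rpc_server):
        self._server = rpc_server
        self._started = False

    def start(self):
        self._server.start()
        self._started = True

    def wait(self):
        # No-op when the server was never started (or was already stopped).
        if not self._started:
            return
        # Assumption of this sketch: once running, wait() simply delegates.
        self._server.wait()

    def stop(self):
        if not self._started:
            return
        # The sequence the thread describes as expected: stop(), then wait().
        self._server.stop()
        self._server.wait()
        self._started = False


rpc = FakeRPCServer()
service = GuardedService(rpc)
service.wait()    # launcher calls wait() before start(): ignored
service.start()
service.stop()    # stop() followed by wait() on the RPC server
service.wait()    # launcher calls wait() again after stop(): ignored
print(rpc.calls)
```

With this guard, the underlying server only ever sees start, stop, wait in that order, whatever order the launcher uses on the wrapper.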
Re: [openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?
Hi Ken,

Thanks a lot for the analysis, and sorry for the slow reply!
Comments inline...

Ken Giusti wrote:
> Hi Adam,
>
> I think there's a couple of problems here.
>
> Regardless of worker count, the service.wait() is called before
> service.start(). And from looking at the oslo.service code, the 'wait()'
> method is called after start(), then again after stop(). This doesn't match
> up with the intended use of oslo.messaging.server.wait(), which should only
> be called after .stop().

Hmm, so are you saying that there might be a bug in oslo.service's
usage of oslo.messaging, and that this Sahara bugfix was the wrong
approach too?

https://review.openstack.org/#/c/280741/1/sahara/cli/sahara_engine.py

> Perhaps a bigger issue is that in the multi threaded case all threads
> appear to be calling start, wait, and stop on the same instance of the
> service (oslo.messaging rpc server). At least that's what I'm seeing in my
> muchly reduced test code:
>
> https://paste.fedoraproject.org/paste/-73zskccaQvpSVwRJD11cA
>
> The log trace shows multiple calls to start, wait, stop via different
> threads to the same TaskServer instance:
>
> https://paste.fedoraproject.org/paste/dyPq~lr26sQZtMzHn5w~Vg
>
> Is that expected?

Unfortunately in the interim, your pastes seem to have vanished - any
chance you could repaste them?

Thanks,
Adam

> On Mon, Jul 31, 2017 at 9:32 PM, Adam Spiers wrote:
> > Ken Giusti wrote:
> >> On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers wrote:
> >>> I recently discovered a bug where barbican-worker would hang on
> >>> shutdown if queue.asynchronous_workers was changed from 1 to 2:
> >>>
> >>>    https://bugs.launchpad.net/barbican/+bug/1705543
> >>>
> >>> resulting in a warning like this:
> >>>
> >>>    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for start to complete
> >>>
> >>> I found a similar bug in Sahara:
> >>>
> >>>    https://bugs.launchpad.net/sahara/+bug/1546119
> >>>
> >>> where the fix was to call start() on the RPC service before making the
> >>> launcher wait() on it, so I ported the fix to Barbican, and it seems
> >>> to work fine:
> >>>
> >>>    https://review.openstack.org/#/c/485755
> >>>
> >>> I noticed that both projects use ProcessLauncher; barbican uses
> >>> oslo_service.service.launch() which has:
> >>>
> >>>    if workers is None or workers == 1:
> >>>        launcher = ServiceLauncher(conf, restart_method=restart_method)
> >>>    else:
> >>>        launcher = ProcessLauncher(conf, restart_method=restart_method)
> >>>
> >>> However, I'm not an expert in oslo.service or oslo.messaging, and one
> >>> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
> >>> other projects start the task before calling wait() on the launcher,
> >>> so I thought I'd check here whether that is the correct fix, or
> >>> whether there's something else odd going on.
> >>>
> >>> Any oslo gurus able to shed light on this?
> >>>
> >>
> >> As far as an oslo.messaging server is concerned, the order of operations
> >> is:
> >>
> >> server.start()
> >> # do stuff until ready to stop the server...
> >> server.stop()
> >> server.wait()
> >>
> >> The final wait blocks until all requests that are in progress when stop()
> >> is called finish and clean up.
> >
> > Thanks - that makes sense. So the question is, why would
> > barbican-worker only hang on shutdown when there are multiple workers?
> > Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
> > and it's not calling start() correctly?

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
[openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?
Oops - didn't reply all

-- Forwarded message --
From: Ken Giusti <kgiu...@gmail.com>
Date: Tue, Aug 1, 2017 at 12:51 PM
Subject: Re: [openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?
To: Adam Spiers <aspi...@suse.com>

Hi Adam,

I think there's a couple of problems here.

Regardless of worker count, the service.wait() is called before
service.start(). And from looking at the oslo.service code, the 'wait()'
method is called after start(), then again after stop(). This doesn't match
up with the intended use of oslo.messaging.server.wait(), which should only
be called after .stop().

Perhaps a bigger issue is that in the multi threaded case all threads
appear to be calling start, wait, and stop on the same instance of the
service (oslo.messaging rpc server). At least that's what I'm seeing in my
muchly reduced test code:

https://paste.fedoraproject.org/paste/-73zskccaQvpSVwRJD11cA

The log trace shows multiple calls to start, wait, stop via different
threads to the same TaskServer instance:

https://paste.fedoraproject.org/paste/dyPq~lr26sQZtMzHn5w~Vg

Is that expected?

On Mon, Jul 31, 2017 at 9:32 PM, Adam Spiers <aspi...@suse.com> wrote:
> Ken Giusti <kgiu...@gmail.com> wrote:
>
>> On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers <aspi...@suse.com> wrote:
>>
>>> I recently discovered a bug where barbican-worker would hang on
>>> shutdown if queue.asynchronous_workers was changed from 1 to 2:
>>>
>>>    https://bugs.launchpad.net/barbican/+bug/1705543
>>>
>>> resulting in a warning like this:
>>>
>>>    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for start to complete
>>>
>>> I found a similar bug in Sahara:
>>>
>>>    https://bugs.launchpad.net/sahara/+bug/1546119
>>>
>>> where the fix was to call start() on the RPC service before making the
>>> launcher wait() on it, so I ported the fix to Barbican, and it seems
>>> to work fine:
>>>
>>>    https://review.openstack.org/#/c/485755
>>>
>>> I noticed that both projects use ProcessLauncher; barbican uses
>>> oslo_service.service.launch() which has:
>>>
>>>    if workers is None or workers == 1:
>>>        launcher = ServiceLauncher(conf, restart_method=restart_method)
>>>    else:
>>>        launcher = ProcessLauncher(conf, restart_method=restart_method)
>>>
>>> However, I'm not an expert in oslo.service or oslo.messaging, and one
>>> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
>>> other projects start the task before calling wait() on the launcher,
>>> so I thought I'd check here whether that is the correct fix, or
>>> whether there's something else odd going on.
>>>
>>> Any oslo gurus able to shed light on this?
>>>
>>
>> As far as an oslo.messaging server is concerned, the order of operations
>> is:
>>
>> server.start()
>> # do stuff until ready to stop the server...
>> server.stop()
>> server.wait()
>>
>> The final wait blocks until all requests that are in progress when stop()
>> is called finish and clean up.
>>
>
> Thanks - that makes sense. So the question is, why would
> barbican-worker only hang on shutdown when there are multiple workers?
> Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
> and it's not calling start() correctly?
>

--
Ken Giusti (kgiu...@gmail.com)
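The shared-instance reading in this message is retracted in Ken's September reply elsewhere in this thread: each worker had forked and was working on its own copy of the server. A minimal sketch of why a forked worker cannot affect the parent's copy, assuming a POSIX system where `os.fork()` is available. `TaskServer` here is a hypothetical stand-in class, not the one from the vanished paste, and `start_in_child` is a helper invented for the illustration.

```python
import os


class TaskServer:
    """Hypothetical stand-in for the server object; tracks only a started flag."""

    def __init__(self):
        self.started = False

    def start(self):
        self.started = True


def start_in_child(server):
    """Fork, call start() in the child, and return the child's view of the flag."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: it operates on its own copy of `server`.
        os.close(read_fd)
        server.start()
        os.write(write_fd, b"1" if server.started else b"0")
        os.close(write_fd)
        os._exit(0)
    # Parent process: read what the child saw, then reap the child.
    os.close(write_fd)
    child_view = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return child_view == b"1"


server = TaskServer()
child_started = start_in_child(server)
print("child saw started:", child_started)
print("parent sees started:", server.started)
```

The child's start() changes only the child's copy, which is also why a start() made in the parent before forking does not stand in for a start() made inside each worker.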
Re: [openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?
Ken Giusti wrote:
> On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers wrote:
>> I recently discovered a bug where barbican-worker would hang on
>> shutdown if queue.asynchronous_workers was changed from 1 to 2:
>>
>>    https://bugs.launchpad.net/barbican/+bug/1705543
>>
>> resulting in a warning like this:
>>
>>    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for start to complete
>>
>> I found a similar bug in Sahara:
>>
>>    https://bugs.launchpad.net/sahara/+bug/1546119
>>
>> where the fix was to call start() on the RPC service before making the
>> launcher wait() on it, so I ported the fix to Barbican, and it seems
>> to work fine:
>>
>>    https://review.openstack.org/#/c/485755
>>
>> I noticed that both projects use ProcessLauncher; barbican uses
>> oslo_service.service.launch() which has:
>>
>>    if workers is None or workers == 1:
>>        launcher = ServiceLauncher(conf, restart_method=restart_method)
>>    else:
>>        launcher = ProcessLauncher(conf, restart_method=restart_method)
>>
>> However, I'm not an expert in oslo.service or oslo.messaging, and one
>> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
>> other projects start the task before calling wait() on the launcher,
>> so I thought I'd check here whether that is the correct fix, or
>> whether there's something else odd going on.
>>
>> Any oslo gurus able to shed light on this?
>
> As far as an oslo.messaging server is concerned, the order of operations
> is:
>
> server.start()
> # do stuff until ready to stop the server...
> server.stop()
> server.wait()
>
> The final wait blocks until all requests that are in progress when stop()
> is called finish and clean up.

Thanks - that makes sense. So the question is, why would
barbican-worker only hang on shutdown when there are multiple workers?
Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
and it's not calling start() correctly?
Re: [openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?
On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers wrote:
> Hi all,
>
> I recently discovered a bug where barbican-worker would hang on
> shutdown if queue.asynchronous_workers was changed from 1 to 2:
>
>    https://bugs.launchpad.net/barbican/+bug/1705543
>
> resulting in a warning like this:
>
>    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for start to complete
>
> I found a similar bug in Sahara:
>
>    https://bugs.launchpad.net/sahara/+bug/1546119
>
> where the fix was to call start() on the RPC service before making the
> launcher wait() on it, so I ported the fix to Barbican, and it seems
> to work fine:
>
>    https://review.openstack.org/#/c/485755
>
> I noticed that both projects use ProcessLauncher; barbican uses
> oslo_service.service.launch() which has:
>
>    if workers is None or workers == 1:
>        launcher = ServiceLauncher(conf, restart_method=restart_method)
>    else:
>        launcher = ProcessLauncher(conf, restart_method=restart_method)
>
> However, I'm not an expert in oslo.service or oslo.messaging, and one
> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
> other projects start the task before calling wait() on the launcher,
> so I thought I'd check here whether that is the correct fix, or
> whether there's something else odd going on.
>
> Any oslo gurus able to shed light on this?
>
> Thanks!
> Adam
>

As far as an oslo.messaging server is concerned, the order of operations is:

server.start()
# do stuff until ready to stop the server...
server.stop()
server.wait()

The final wait blocks until all requests that are in progress when stop()
is called finish and clean up.

-K

--
Ken Giusti (kgiu...@gmail.com)
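The start / stop / wait ordering above can be illustrated with a toy server built only on the standard library. This is a sketch of the semantics described in the message (stop() ends the intake of new requests, wait() blocks until in-flight ones finish), not of oslo.messaging internals. `ToyServer`, `submit` and the request name are hypothetical and exist only for the illustration.

```python
import threading
import time


class ToyServer:
    """Hypothetical toy server showing the start -> stop -> wait ordering."""

    def __init__(self):
        self._accepting = False
        self._threads = []
        self._lock = threading.Lock()
        self.completed = []

    def start(self):
        self._accepting = True

    def submit(self, name, duration):
        """Handle one request on its own thread; refused once stop() was called."""
        if not self._accepting:
            raise RuntimeError("server is not accepting requests")
        thread = threading.Thread(target=self._handle, args=(name, duration))
        self._threads.append(thread)
        thread.start()

    def _handle(self, name, duration):
        time.sleep(duration)  # simulate work on the request
        with self._lock:
            self.completed.append(name)

    def stop(self):
        # Returns immediately: only stops the intake of new requests.
        self._accepting = False

    def wait(self):
        # Blocks until every request that was in progress has finished.
        for thread in self._threads:
            thread.join()


server = ToyServer()
server.start()
server.submit("request-1", 0.05)
server.stop()   # request-1 may still be running here
server.wait()   # returns only after request-1 has completed
print(server.completed)
```

In this toy model a wait() issued before start() has nothing to block on, which mirrors the point made elsewhere in the thread that wait() is only meaningful after stop().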
[openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?
Hi all,

I recently discovered a bug where barbican-worker would hang on
shutdown if queue.asynchronous_workers was changed from 1 to 2:

    https://bugs.launchpad.net/barbican/+bug/1705543

resulting in a warning like this:

    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for start to complete

I found a similar bug in Sahara:

    https://bugs.launchpad.net/sahara/+bug/1546119

where the fix was to call start() on the RPC service before making the
launcher wait() on it, so I ported the fix to Barbican, and it seems
to work fine:

    https://review.openstack.org/#/c/485755

I noticed that both projects use ProcessLauncher; barbican uses
oslo_service.service.launch() which has:

    if workers is None or workers == 1:
        launcher = ServiceLauncher(conf, restart_method=restart_method)
    else:
        launcher = ProcessLauncher(conf, restart_method=restart_method)

However, I'm not an expert in oslo.service or oslo.messaging, and one
of Barbican's core reviewers (thanks Kaitlin!) noted that not many
other projects start the task before calling wait() on the launcher,
so I thought I'd check here whether that is the correct fix, or
whether there's something else odd going on.

Any oslo gurus able to shed light on this?

Thanks!
Adam
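The branch quoted from oslo_service.service.launch() explains why changing the worker count from 1 to 2 changes behaviour: it switches the launcher type. A minimal runnable sketch of just that selection, using empty stand-in classes. `FakeServiceLauncher`, `FakeProcessLauncher` and `choose_launcher` are hypothetical names for the illustration and do not reproduce the real oslo.service classes or their constructor arguments.

```python
class FakeServiceLauncher:
    """Hypothetical stand-in for the launcher used with a single worker."""


class FakeProcessLauncher:
    """Hypothetical stand-in for the launcher used with several workers."""


def choose_launcher(workers=None):
    """Mirror the quoted condition: one worker (or None) selects the service
    launcher, any other value selects the process launcher."""
    if workers is None or workers == 1:
        return FakeServiceLauncher()
    return FakeProcessLauncher()


for count in (None, 1, 2):
    print(count, type(choose_launcher(count)).__name__)
```

So a deployment with one worker never exercises the process-launcher path at all, which fits the report that the hang only appears once the worker count goes from 1 to 2.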