Re: [openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?

2017-09-18 Thread Ken Giusti
On Thu, Sep 14, 2017 at 7:33 PM, Adam Spiers  wrote:
>
> Hi Ken,
>
> Thanks a lot for the analysis, and sorry for the slow reply!
> Comments inline...
>
> Ken Giusti  wrote:
> > Hi Adam,
> >
> > I think there's a couple of problems here.
> >
> > Regardless of worker count, the service.wait() is called before
> > service.start().  And from looking at the oslo.service code, the 'wait()'
> > method is called after start(), then again after stop().  This doesn't match
> > up with the intended use of oslo.messaging.server.wait(), which should only
> > be called after .stop().
>
> Hmm, so are you saying that there might be a bug in oslo.service's
> usage of oslo.messaging, and that this Sahara bugfix was the wrong
> approach too?
>
> https://review.openstack.org/#/c/280741/1/sahara/cli/sahara_engine.py
>

Well, I don't think the explicit call to start() is going to help,
especially if the number of workers is > 1, since the workers are forked
and need to call start() from their own process space.
In fact, if the number of workers is > 1 then you not only get an RPC
server in each worker process, you'll also end up with an extra RPC
server in the calling thread.

Take a look at a test service I've created for oslo.messaging:

https://pastebin.com/rSA6AD82

If you change the main code to call the new sequence, you'll end up
with 3 RPC servers (2 in the workers, one in the main process).

In that code I've made the wait() call a no-op if the server hasn't
been started first.  And the stop method will call stop() and then
wait() on the RPC server, which is the expected sequence as far as
oslo.messaging is concerned.
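
For illustration only, here is a minimal sketch of that kind of guard.
It is not the code from the pastebin: GuardedService and RecordingServer
are hypothetical names, and RecordingServer is a stand-in that only
records calls rather than a real oslo.messaging server.  The behaviour
of wait() while the server is running (block until stop() has finished)
is an assumption of this sketch.

```python
import threading


class RecordingServer:
    """Stand-in for an RPC server; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def stop(self):
        self.calls.append("stop")

    def wait(self):
        self.calls.append("wait")


class GuardedService:
    """Hypothetical wrapper that tolerates wait() being called too early."""

    def __init__(self, server):
        self._server = server
        self._started = False
        self._stopped = threading.Event()

    def start(self):
        self._stopped.clear()
        self._server.start()
        self._started = True

    def wait(self):
        # No-op if the server has not been started (or is already stopped).
        if not self._started:
            return
        # Assumption of this sketch: otherwise block until stop() is done.
        self._stopped.wait()

    def stop(self):
        if not self._started:
            return
        # The server itself only ever sees stop() followed by wait().
        self._server.stop()
        self._server.wait()
        self._started = False
        self._stopped.set()


server = RecordingServer()
service = GuardedService(server)
service.wait()   # called before start(): ignored
service.start()
service.stop()
service.wait()   # called after stop(): returns immediately
print(server.calls)  # ['start', 'stop', 'wait']
```

Whatever order the launcher uses, the wrapped server is only ever
driven through start(), stop(), wait().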

It seems to me that the bug is in oslo.service - calling wait() before
start() doesn't make sense.

> > Perhaps a bigger issue is that in the multi threaded case all threads
> > appear to be calling start, wait, and stop on the same instance of the
> > service (oslo.messaging rpc server).  At least that's what I'm seeing in my
> > muchly reduced test code:

I was wrong about this - I failed to notice that each service had
forked and was dealing with its own copy of the server.
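
A small sketch of why that is, assuming a POSIX system where os.fork()
is available.  FakeServer and start_in_child are hypothetical names
used only for this demonstration and have nothing to do with
oslo.service itself: the child calls start() on its own copy of the
object, and the parent's copy is left untouched.

```python
import os


class FakeServer:
    """Stand-in object with a single piece of state."""

    def __init__(self):
        self.started = False

    def start(self):
        self.started = True


def start_in_child(server):
    """Fork, call start() in the child only, and report both states."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: works on its own copy of `server`.
        try:
            os.close(read_fd)
            server.start()
            os.write(write_fd, b"1" if server.started else b"0")
        finally:
            os._exit(0)
    # Parent process: read what the child saw, then reap the child.
    os.close(write_fd)
    child_started = os.read(read_fd, 1) == b"1"
    os.close(read_fd)
    os.waitpid(pid, 0)
    return child_started, server.started


child_started, parent_started = start_in_child(FakeServer())
print(child_started, parent_started)  # True False
```

So a start() made in the parent before forking and a start() made in
each worker are calls on different server objects.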

> >
> > https://paste.fedoraproject.org/paste/-73zskccaQvpSVwRJD11cA
> >
> > The log trace shows multiple calls to start, wait, stop via different
> > threads to the same TaskServer instance:
> >
> > https://paste.fedoraproject.org/paste/dyPq~lr26sQZtMzHn5w~Vg
> >
> > Is that expected?
>
> Unfortunately in the interim, your pastes seem to have vanished - any
> chance you could repaste them?
>

Ugh - didn't keep a copy.  If you pull down that test code you can use
it to generate those traces.


> Thanks,
> Adam
>
> > On Mon, Jul 31, 2017 at 9:32 PM, Adam Spiers  wrote:
> > > Ken Giusti  wrote:
> > >> On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers  wrote:
> > >>> I recently discovered a bug where barbican-worker would hang on
> > >>> shutdown if queue.asynchronous_workers was changed from 1 to 2:
> > >>>
> > >>>    https://bugs.launchpad.net/barbican/+bug/1705543
> > >>>
> > >>> resulting in a warning like this:
> > >>>
> > >>>    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for
> > >>>    start to complete
> > >>>
> > >>> I found a similar bug in Sahara:
> > >>>
> > >>>    https://bugs.launchpad.net/sahara/+bug/1546119
> > >>>
> > >>> where the fix was to call start() on the RPC service before making the
> > >>> launcher wait() on it, so I ported the fix to Barbican, and it seems
> > >>> to work fine:
> > >>>
> > >>>    https://review.openstack.org/#/c/485755
> > >>>
> > >>> I noticed that both projects use ProcessLauncher; barbican uses
> > >>> oslo_service.service.launch() which has:
> > >>>
> > >>>    if workers is None or workers == 1:
> > >>>        launcher = ServiceLauncher(conf, restart_method=restart_method)
> > >>>    else:
> > >>>        launcher = ProcessLauncher(conf, restart_method=restart_method)
> > >>>
> > >>> However, I'm not an expert in oslo.service or oslo.messaging, and one
> > >>> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
> > >>> other projects start the task before calling wait() on the launcher,
> > >>> so I thought I'd check here whether that is the correct fix, or
> > >>> whether there's something else odd going on.
> > >>>
> > >>> Any oslo gurus able to shed light on this?
> > >>>
> > >>
> > >> As far as an oslo.messaging server is concerned, the order of operations
> > >> is:
> > >>
> > >> server.start()
> > >> # do stuff until ready to stop the server...
> > >> server.stop()
> > >> server.wait()
> > >>
> > >> The final wait blocks until all requests that are in progress when stop()
> > >> is called finish and clean up.
> > >
> > > Thanks - that makes sense.  So the question is, why would
> > > barbican-worker only hang on shutdown when there are multiple workers?
> > > Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
> > > and it's not calling start() correctly?




-- 
Ken 

Re: [openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?

2017-09-14 Thread Adam Spiers
Hi Ken,

Thanks a lot for the analysis, and sorry for the slow reply!
Comments inline...

Ken Giusti  wrote:
> Hi Adam,
> 
> I think there's a couple of problems here.
> 
> Regardless of worker count, the service.wait() is called before
> service.start().  And from looking at the oslo.service code, the 'wait()'
> method is called after start(), then again after stop().  This doesn't match
> up with the intended use of oslo.messaging.server.wait(), which should only
> be called after .stop().

Hmm, so are you saying that there might be a bug in oslo.service's
usage of oslo.messaging, and that this Sahara bugfix was the wrong
approach too?

https://review.openstack.org/#/c/280741/1/sahara/cli/sahara_engine.py

> Perhaps a bigger issue is that in the multi threaded case all threads
> appear to be calling start, wait, and stop on the same instance of the
> service (oslo.messaging rpc server).  At least that's what I'm seeing in my
> muchly reduced test code:
> 
> https://paste.fedoraproject.org/paste/-73zskccaQvpSVwRJD11cA
> 
> The log trace shows multiple calls to start, wait, stop via different
> threads to the same TaskServer instance:
> 
> https://paste.fedoraproject.org/paste/dyPq~lr26sQZtMzHn5w~Vg
> 
> Is that expected?

Unfortunately in the interim, your pastes seem to have vanished - any
chance you could repaste them?

Thanks,
Adam

> On Mon, Jul 31, 2017 at 9:32 PM, Adam Spiers  wrote:
> > Ken Giusti  wrote:
> >> On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers  wrote:
> >>> I recently discovered a bug where barbican-worker would hang on
> >>> shutdown if queue.asynchronous_workers was changed from 1 to 2:
> >>>
> >>>    https://bugs.launchpad.net/barbican/+bug/1705543
> >>>
> >>> resulting in a warning like this:
> >>>
> >>>    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for
> >>>    start to complete
> >>>
> >>> I found a similar bug in Sahara:
> >>>
> >>>    https://bugs.launchpad.net/sahara/+bug/1546119
> >>>
> >>> where the fix was to call start() on the RPC service before making the
> >>> launcher wait() on it, so I ported the fix to Barbican, and it seems
> >>> to work fine:
> >>>
> >>>    https://review.openstack.org/#/c/485755
> >>>
> >>> I noticed that both projects use ProcessLauncher; barbican uses
> >>> oslo_service.service.launch() which has:
> >>>
> >>>    if workers is None or workers == 1:
> >>>        launcher = ServiceLauncher(conf, restart_method=restart_method)
> >>>    else:
> >>>        launcher = ProcessLauncher(conf, restart_method=restart_method)
> >>>
> >>> However, I'm not an expert in oslo.service or oslo.messaging, and one
> >>> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
> >>> other projects start the task before calling wait() on the launcher,
> >>> so I thought I'd check here whether that is the correct fix, or
> >>> whether there's something else odd going on.
> >>>
> >>> Any oslo gurus able to shed light on this?
> >>>
> >>
> >> As far as an oslo.messaging server is concerned, the order of operations
> >> is:
> >>
> >> server.start()
> >> # do stuff until ready to stop the server...
> >> server.stop()
> >> server.wait()
> >>
> >> The final wait blocks until all requests that are in progress when stop()
> >> is called finish and clean up.
> >
> > Thanks - that makes sense.  So the question is, why would
> > barbican-worker only hang on shutdown when there are multiple workers?
> > Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
> > and it's not calling start() correctly?

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?

2017-08-02 Thread Ken Giusti
Oops - didn't reply all.
-- Forwarded message --
From: Ken Giusti <kgiu...@gmail.com>
Date: Tue, Aug 1, 2017 at 12:51 PM
Subject: Re: [openstack-dev] [oslo][barbican][sahara] start RPC service
before launcher wait?
To: Adam Spiers <aspi...@suse.com>


Hi Adam,

I think there's a couple of problems here.

Regardless of worker count, the service.wait() is called before
service.start().  And from looking at the oslo.service code, the 'wait()'
method is called after start(), then again after stop().  This doesn't match
up with the intended use of oslo.messaging.server.wait(), which should only
be called after .stop().

Perhaps a bigger issue is that in the multi threaded case all threads
appear to be calling start, wait, and stop on the same instance of the
service (oslo.messaging rpc server).  At least that's what I'm seeing in my
muchly reduced test code:

https://paste.fedoraproject.org/paste/-73zskccaQvpSVwRJD11cA

The log trace shows multiple calls to start, wait, stop via different
threads to the same TaskServer instance:

https://paste.fedoraproject.org/paste/dyPq~lr26sQZtMzHn5w~Vg

Is that expected?

On Mon, Jul 31, 2017 at 9:32 PM, Adam Spiers <aspi...@suse.com> wrote:

> Ken Giusti <kgiu...@gmail.com> wrote:
>
>> On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers <aspi...@suse.com> wrote:
>>
>>> I recently discovered a bug where barbican-worker would hang on
>>> shutdown if queue.asynchronous_workers was changed from 1 to 2:
>>>
>>>    https://bugs.launchpad.net/barbican/+bug/1705543
>>>
>>> resulting in a warning like this:
>>>
>>>    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for
>>>    start to complete
>>>
>>> I found a similar bug in Sahara:
>>>
>>>    https://bugs.launchpad.net/sahara/+bug/1546119
>>>
>>> where the fix was to call start() on the RPC service before making the
>>> launcher wait() on it, so I ported the fix to Barbican, and it seems
>>> to work fine:
>>>
>>>    https://review.openstack.org/#/c/485755
>>>
>>> I noticed that both projects use ProcessLauncher; barbican uses
>>> oslo_service.service.launch() which has:
>>>
>>>    if workers is None or workers == 1:
>>>        launcher = ServiceLauncher(conf, restart_method=restart_method)
>>>    else:
>>>        launcher = ProcessLauncher(conf, restart_method=restart_method)
>>>
>>> However, I'm not an expert in oslo.service or oslo.messaging, and one
>>> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
>>> other projects start the task before calling wait() on the launcher,
>>> so I thought I'd check here whether that is the correct fix, or
>>> whether there's something else odd going on.
>>>
>>> Any oslo gurus able to shed light on this?
>>>
>>
>> As far as an oslo.messaging server is concerned, the order of operations
>> is:
>>
>> server.start()
>> # do stuff until ready to stop the server...
>> server.stop()
>> server.wait()
>>
>> The final wait blocks until all requests that are in progress when stop()
>> is called finish and clean up.
>>
>
> Thanks - that makes sense.  So the question is, why would
> barbican-worker only hang on shutdown when there are multiple workers?
> Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
> and it's not calling start() correctly?
>



-- 
Ken Giusti  (kgiu...@gmail.com)





Re: [openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?

2017-07-31 Thread Adam Spiers

Ken Giusti  wrote:

> On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers  wrote:
>
>> I recently discovered a bug where barbican-worker would hang on
>> shutdown if queue.asynchronous_workers was changed from 1 to 2:
>>
>>    https://bugs.launchpad.net/barbican/+bug/1705543
>>
>> resulting in a warning like this:
>>
>>    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for
>>    start to complete
>>
>> I found a similar bug in Sahara:
>>
>>    https://bugs.launchpad.net/sahara/+bug/1546119
>>
>> where the fix was to call start() on the RPC service before making the
>> launcher wait() on it, so I ported the fix to Barbican, and it seems
>> to work fine:
>>
>>    https://review.openstack.org/#/c/485755
>>
>> I noticed that both projects use ProcessLauncher; barbican uses
>> oslo_service.service.launch() which has:
>>
>>    if workers is None or workers == 1:
>>        launcher = ServiceLauncher(conf, restart_method=restart_method)
>>    else:
>>        launcher = ProcessLauncher(conf, restart_method=restart_method)
>>
>> However, I'm not an expert in oslo.service or oslo.messaging, and one
>> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
>> other projects start the task before calling wait() on the launcher,
>> so I thought I'd check here whether that is the correct fix, or
>> whether there's something else odd going on.
>>
>> Any oslo gurus able to shed light on this?
>
> As far as an oslo.messaging server is concerned, the order of operations is:
>
> server.start()
> # do stuff until ready to stop the server...
> server.stop()
> server.wait()
>
> The final wait blocks until all requests that are in progress when stop()
> is called finish and clean up.


Thanks - that makes sense.  So the question is, why would
barbican-worker only hang on shutdown when there are multiple workers?
Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
and it's not calling start() correctly?



Re: [openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?

2017-07-31 Thread Ken Giusti
On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers  wrote:

> Hi all,
>
> I recently discovered a bug where barbican-worker would hang on
> shutdown if queue.asynchronous_workers was changed from 1 to 2:
>
>    https://bugs.launchpad.net/barbican/+bug/1705543
>
> resulting in a warning like this:
>
>    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for
>    start to complete
>
> I found a similar bug in Sahara:
>
>    https://bugs.launchpad.net/sahara/+bug/1546119
>
> where the fix was to call start() on the RPC service before making the
> launcher wait() on it, so I ported the fix to Barbican, and it seems
> to work fine:
>
>    https://review.openstack.org/#/c/485755
>
> I noticed that both projects use ProcessLauncher; barbican uses
> oslo_service.service.launch() which has:
>
>    if workers is None or workers == 1:
>        launcher = ServiceLauncher(conf, restart_method=restart_method)
>    else:
>        launcher = ProcessLauncher(conf, restart_method=restart_method)
>
> However, I'm not an expert in oslo.service or oslo.messaging, and one
> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
> other projects start the task before calling wait() on the launcher,
> so I thought I'd check here whether that is the correct fix, or
> whether there's something else odd going on.
>
> Any oslo gurus able to shed light on this?
>
> Thanks!
> Adam
>


As far as an oslo.messaging server is concerned, the order of operations is:

server.start()
# do stuff until ready to stop the server...
server.stop()
server.wait()

The final wait blocks until all requests that are in progress when stop()
is called finish and clean up.
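
To make the sequence concrete, here is a minimal sketch using a toy
stand-in.  ToyServer is a hypothetical class written for this example,
not the oslo.messaging API; it only mimics the semantics described
above, where stop() refuses new work without blocking and wait() blocks
until requests already in progress have finished.

```python
import threading
import time


class ToyServer:
    """Toy stand-in that mimics the start/stop/wait semantics."""

    def __init__(self):
        self._workers = []
        self._stopping = threading.Event()
        self.finished = []

    def start(self):
        self._stopping.clear()

    def handle(self, name, seconds):
        # Accept a request unless stop() has already been called.
        if self._stopping.is_set():
            raise RuntimeError("server is stopping")
        worker = threading.Thread(target=self._run, args=(name, seconds))
        self._workers.append(worker)
        worker.start()

    def _run(self, name, seconds):
        time.sleep(seconds)  # simulate work on the request
        self.finished.append(name)

    def stop(self):
        # Stop taking new requests; does not block.
        self._stopping.set()

    def wait(self):
        # Block until the requests already in progress have finished.
        for worker in self._workers:
            worker.join()


server = ToyServer()
server.start()
server.handle("request-1", 0.05)  # still running when stop() is called
server.stop()
server.wait()
print(server.finished)  # ['request-1']
```

Calling wait() before stop() in this model would return as soon as the
current requests finish while the server is still accepting new ones,
which is why the order matters.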

-K


-- 
Ken Giusti  (kgiu...@gmail.com)


[openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?

2017-07-31 Thread Adam Spiers

Hi all,

I recently discovered a bug where barbican-worker would hang on
shutdown if queue.asynchronous_workers was changed from 1 to 2:

   https://bugs.launchpad.net/barbican/+bug/1705543

resulting in a warning like this:

   WARNING oslo_messaging.server [-] Possible hang: stop is waiting for start to complete

I found a similar bug in Sahara:

   https://bugs.launchpad.net/sahara/+bug/1546119

where the fix was to call start() on the RPC service before making the
launcher wait() on it, so I ported the fix to Barbican, and it seems
to work fine:

   https://review.openstack.org/#/c/485755

I noticed that both projects use ProcessLauncher; barbican uses
oslo_service.service.launch() which has:

   if workers is None or workers == 1:
       launcher = ServiceLauncher(conf, restart_method=restart_method)
   else:
       launcher = ProcessLauncher(conf, restart_method=restart_method)

However, I'm not an expert in oslo.service or oslo.messaging, and one
of Barbican's core reviewers (thanks Kaitlin!) noted that not many
other projects start the task before calling wait() on the launcher,
so I thought I'd check here whether that is the correct fix, or
whether there's something else odd going on.

Any oslo gurus able to shed light on this?

Thanks!
Adam
