Re: [openstack-dev] [oslo] debugging the failures in oslo.messaging gate

2015-08-17 Thread Doug Hellmann
Excerpts from Davanum Srinivas (dims)'s message of 2015-08-16 17:40:16 -0400:
 Doug,
 
 I've filed https://review.openstack.org/213542 to log error messages and will
 work with the oslo.messaging folks over the next few days.

Thanks, Dims!

 
 Thanks,
 Dims
 


Re: [openstack-dev] [oslo] debugging the failures in oslo.messaging gate

2015-08-16 Thread Davanum Srinivas
Doug,

I've filed https://review.openstack.org/213542 to log error messages and will
work with the oslo.messaging folks over the next few days.

Thanks,
Dims

-- 
Davanum Srinivas :: https://twitter.com/dims
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [oslo] debugging the failures in oslo.messaging gate

2015-08-14 Thread Doug Hellmann
All patches to oslo.messaging are currently failing the 
gate-tempest-dsvm-neutron-src-oslo.messaging job because the neutron service 
dies. amuller, kevinbenton, and I spent a bunch of time looking at it today, 
and I think we have an issue introduced by some asymmetric gating between the 
two projects.

Neutron has 2 different modes for starting the RPC service, depending on the 
number of workers requested. The problem comes up with rpc_workers=0, which is 
the new default. In that mode, rather than using the ProcessLauncher, the RPC 
server is started directly in the current process. That results in wait() being 
called in a way that violates the new constraints being enforced within 
oslo.messaging after [1] landed. That patch is unreleased, so the only project 
seeing the problem is oslo.messaging. I’ve proposed a revert in [2], which 
passes the gate tests.
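
To make the two modes concrete, here is roughly the shape of that dispatch.
The function and variable names are placeholders rather than neutron’s actual
code, and the detail of wait() being driven from another thread is inferred
from the error message rather than traced through neutron’s code:

    from oslo_service import service


    def serve_rpc(conf, rpc_service, rpc_workers):
        # Rough sketch of the two startup modes; not neutron's actual code.
        if rpc_workers == 0:
            # New default: no separate RPC workers.  The server starts
            # directly in the current (API) process, and wait()/stop() end
            # up being driven from a different thread later, which is the
            # pattern the new oslo.messaging constraint rejects.
            rpc_service.start()
            return rpc_service
        # Previous behaviour: fork dedicated RPC workers; each child calls
        # start()/wait()/stop() itself, all from a single thread.
        launcher = service.ProcessLauncher(conf)
        launcher.launch_service(rpc_service, workers=rpc_workers)
        return launcher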

I have also added [3] to neutron to see if we can get the gate job to show the 
same error messages I was seeing locally (part of the trouble we’ve had with 
debugging this is that the process exits quickly enough that some of the log 
messages never get written). I’m using [4], an oslo.messaging patch that was 
failing before, to trigger the job and collect the necessary log. That patch 
should *not* be landed: I don’t think the change it reverts is related to the 
problem; it was just handy for debugging.

The error message I see locally, “start/stop/wait must be called in the same 
thread”, is visible in this log snippet [5].
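
For anyone who has not read [1]: the check behind that message is, in spirit,
a guard that remembers which thread called start() and objects when stop() or
wait() arrive from any other thread. A toy illustration of that shape (not
oslo.messaging’s actual implementation):

    import threading


    class SingleThreadGuard(object):
        # Toy stand-in for the kind of check [1] introduces; this is not
        # oslo.messaging's actual implementation.

        def __init__(self):
            self._owner = None

        def _check(self, op):
            current = threading.current_thread()
            if self._owner is None:
                # The first call (normally start()) claims ownership.
                self._owner = current
            elif current is not self._owner:
                raise RuntimeError(
                    '%s: start/stop/wait must be called in the same '
                    'thread' % op)

        def start(self):
            self._check('start')

        def stop(self):
            self._check('stop')

        def wait(self):
            self._check('wait')

Call start() on the main thread and then let a helper thread call wait(), and
you get exactly the message quoted above.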

It’s not clear what the best path forward is. Obviously neutron is doing 
something with the RPC server that oslo.messaging doesn’t expect/want/like, but 
also obviously we can’t release oslo.messaging in its current state and break 
neutron. Someone with a better understanding of both neutron and oslo.messaging 
may be able to fix neutron’s use of the RPC code to avoid this case. There may 
be other users of oslo.messaging with the same ‘broken’ pattern, but IIRC 
neutron is unique in the way it runs both RPC and API services in the same 
process. To be safe, though, it may be better to log error messages instead of 
doing whatever we’re doing now to cause the process to exit. We can then set up 
a Logstash search for the error message and find other applications that would 
be broken, fix them, and then switch oslo.messaging back to throwing an 
exception.
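
Concretely, that log-now, raise-later approach could look something like the
sketch below; the flag and helper names are hypothetical, not the actual
oslo.messaging change:

    import logging
    import threading

    LOG = logging.getLogger(__name__)

    # Hypothetical switch: ship in log-only mode first, then flip it once a
    # Logstash search for the message below stops returning hits.
    ENFORCE_SAME_THREAD = False


    def check_same_thread(owner_thread, op):
        # Complain when op runs outside the thread that owns the server.
        if threading.current_thread() is owner_thread:
            return
        msg = '%s: start/stop/wait must be called in the same thread' % op
        if ENFORCE_SAME_THREAD:
            raise RuntimeError(msg)
        # Log-only mode: leave a searchable trail without killing the
        # consuming service.
        LOG.error(msg)

Once the gate logs stop turning up that string, the flag flips and the
exception comes back.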

I’m going to be at the Ops summit next week, so I need to hand off debugging 
and fixing the issue to someone else on the Oslo team. We created an etherpad 
to track progress and make notes today, and all of these links are referenced 
there, too [6].

Thanks again to amuller and kevinbenton for the time they spent helping with 
debugging today!

Doug

[1] https://review.openstack.org/#/c/209043/
[2] https://review.openstack.org/#/c/213299/
[3] https://review.openstack.org/#/c/213360/
[4] https://review.openstack.org/#/c/213297/
[5] http://paste.openstack.org/show/415030/
[6] https://etherpad.openstack.org/p/wm2D6UGZbf


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev