[Yahoo-eng-team] [Bug 1800599] [NEW] Neutron API server: unexpected behavior with multiple long live clients

Le, Huifeng Tue, 30 Oct 2018 00:51:40 -0700

Public bug reported:

High level description:
The current OpenStack API server uses the eventlet.wsgi.server implementation. 
By default, eventlet.wsgi.server calls accept() before it knows whether a 
greenthread is available in the pool to service that socket. If all 
connections are short-lived, this is not an issue, because a greenthread will 
eventually become available and the request will be serviced (hopefully before 
the client times out waiting).


But in some real-world scenarios, such as the deployment stage of a large
system, many compute nodes open many long-lived connections from
nova-compute to the neutron API. This causes the unexpected behavior
described below:

1. Single neutron server case:
If the neutron server has all of its greenthreads tied up on open sockets when 
one more connection request arrives, the server calls accept() but never hands 
the connection to a worker greenthread, and the client times out only after a 
long wait (e.g. CONF.client_socket_timeout).

Expected behavior: the client gets a quick TCP connect timeout if no
processing thread is available.

2. Multiple neutron server case (e.g. cfg.CONF.api_workers>1 or 
cpu_count>1):
In this case, multiple neutron server child processes wait for client requests 
(e.g. by calling accept() on the same socket). If the Linux kernel wakes one 
child's accept() to take a client connection but all of that child's 
greenthreads are tied up on open sockets, the client times out after a long 
wait. At that moment, other neutron child processes may still have free 
greenthreads, but they never get the chance to process the request because it 
has already been accepted by the first child process.

Expected behavior: the request is processed if any neutron server
process has an available greenthread, or the client gets a quick TCP
connect timeout if no processing thread is available.

Version: latest devstack

Potential solution: implement a custom pool for wsgi.server that blocks
(e.g. with sem.acquire()) so that accept() is not called until a worker
greenthread is available.
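
The idea behind the potential solution can be sketched as follows. This is
only an illustration under stated assumptions, not the neutron or eventlet
code: it uses plain OS threads and threading.Semaphore from the standard
library instead of greenthreads, and the names GatedAcceptLoop and
try_serve_one are hypothetical. The point it shows is the ordering: a worker
slot is reserved first, and accept() is called only after that succeeds.

```python
import threading


class GatedAcceptLoop:
    """Toy accept loop that reserves a worker slot BEFORE calling accept().

    Hypothetical illustration only: threads stand in for greenthreads.
    """

    def __init__(self, listener, handler, pool_size):
        self._listener = listener   # any object with a blocking accept()
        self._handler = handler     # callable(conn), run in a worker thread
        self._slots = threading.Semaphore(pool_size)

    def _run(self, conn):
        try:
            self._handler(conn)
        finally:
            # Free the slot once the handler is done with the connection.
            self._slots.release()

    def try_serve_one(self, timeout):
        """Accept and dispatch one connection.

        Returns False, without calling accept(), if no worker slot
        becomes free within `timeout` seconds.
        """
        if not self._slots.acquire(timeout=timeout):
            return False
        try:
            conn, _addr = self._listener.accept()
        except BaseException:
            # Do not leak the slot if accept() itself fails.
            self._slots.release()
            raise
        threading.Thread(target=self._run, args=(conn,), daemon=True).start()
        return True
```

Because a saturated process never calls accept() in this arrangement, a
pending connection stays in the kernel's listen queue, where another child
process sharing the same socket could still pick it up.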

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1800599

Title:
  Neutron API server: unexpected behavior with multiple long live
  clients

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1800599/+subscriptions
