Reviewed:  https://review.openstack.org/596275
Committed: 
https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=d37c74d6382690a05ba3ca10edd7f2acb0fbbb2e
Submitter: Zuul
Branch:    master

commit d37c74d6382690a05ba3ca10edd7f2acb0fbbb2e
Author: Bogdan Dobrelya <bdobr...@redhat.com>
Date:   Fri Aug 24 12:15:51 2018 +0200

    Fix postrotate to notify holders of rotated logs
    
    Lsof +L1 locates unlinked and open files and does not work for
    logrotate, neither with copyteuncate not w/o that option.
    
    Instead, find *.X (X - number) files held and notify the processes
    owning those to make an apropriate actions and reopen new log files to
    stop writing to the rotated files.
    
    The actions to be taken by such processes are:
    
    * For httpd processes, use USR1 to gracefully reload
    * For neutron-server, restart the container as it cannot process
      HUP signal well (LP bug #1276694, LP bug #1780139).
    * For nova-compute, restart the container as it cannot process
      HUP signal well (LP bug #1276694, LP bug #1715374).
    * For other processes, use HUP to reload
    
    This also fixes the filter to match logfiles ending with *err,
    like rabbitmq startup errors log.
    
    Closes-Bug: #1780139
    Closes-Bug: #1785659
    Closes-Bug: #1715374
    
    Change-Id: I5110426aa26e5fce7ebb4d80d8a2082cbf80519c
    Signed-off-by: Bogdan Dobrelya <bdobr...@redhat.com>


** Changed in: tripleo
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1780139

Title:
  Sending SIGHUP to neutron-server process causes it to hang

Status in neutron:
  Triaged
Status in tripleo:
  Fix Released

Bug description:
  * High level description

  When sending SIGHUP to the neutron-server process in a neutron_api container, 
it looks like the main process locks up in a tight loop. strace output shows 
that it's waiting for a process that doesn't exist:
  wait4(0, 0x7ffe97e025a4, WNOHANG, NULL) = -1 ECHILD (No child processes)

  This is problematic, because logrotate uses SIGHUP in the
  containerized environment. It doesn't happen always: it might take one
  or two signals reasonably interspersed, to trigger this.

  * Pre-conditions:

  I'm using CentOS 7 + Queens RDO
  "rdo_version": "c9fd24040454913b4a325741094285676fb7e7bc_a0a28280"

  I first noticed the issue when the neutron_api docker would stop
  working on the control nodes, eventually it was traced back to the
  logrotate_crond container sending SIGHUP to all the processes owning
  log files in /var/log/containers. This doesn't happen every time, but
  it's pretty easy to trigger on my system.

  * Step-by-step reproduction steps:

  # Start with a clean container

  docker restart neutron_api

  # Identify the neutron-server PID: (613782 in this case) and send
  SIGHUP

  kill -HUP 613782

  # the relevant log files generally look clean the first time:

  2018-07-04 16:50:34.730 7 INFO oslo_service.service [-] Caught SIGHUP, 
stopping children
  2018-07-04 16:50:34.739 7 INFO neutron.common.config [-] Logging enabled!
  2018-07-04 16:50:34.740 7 INFO neutron.common.config [-] 
/usr/bin/neutron-server version 12.0.3.dev17
  2018-07-04 16:50:34.761 33 INFO neutron.wsgi [-] (33) wsgi exited, 
is_accepting=True
  2018-07-04 16:50:34.761 27 INFO neutron.wsgi [-] (27) wsgi exited, 
is_accepting=True
  2018-07-04 16:50:34.761 28 INFO neutron.wsgi [-] (28) wsgi exited, 
is_accepting=True
  2018-07-04 16:50:34.761 30 INFO neutron.wsgi [-] (30) wsgi exited, 
is_accepting=True
  2018-07-04 16:50:34.761 7 INFO oslo_service.service [-] Caught SIGHUP, 
stopping children
  2018-07-04 16:50:34.761 32 INFO neutron.wsgi [-] (32) wsgi exited, 
is_accepting=True
  2018-07-04 16:50:34.761 34 INFO neutron.wsgi [-] (34) wsgi exited, 
is_accepting=True
  2018-07-04 16:50:34.761 29 INFO neutron.wsgi [-] (29) wsgi exited, 
is_accepting=True
  2018-07-04 16:50:34.761 31 INFO neutron.wsgi [-] (31) wsgi exited, 
is_accepting=True
  2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] Logging enabled!
  2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] 
/usr/bin/neutron-server version 12.0.3.dev17
  2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] Logging enabled!
  2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] 
/usr/bin/neutron-server version 12.0.3.dev17
  2018-07-04 16:50:34.807 7 INFO oslo_service.service [-] Child 27 exited with 
status 0
  2018-07-04 16:50:34.807 7 WARNING oslo_service.service [-] pid 27 not in 
child list
  2018-07-04 16:50:35.761 7 INFO oslo_service.service [-] Child 28 exited with 
status 0
  2018-07-04 16:50:35.764 7 INFO oslo_service.service [-] Child 29 exited with 
status 0
  2018-07-04 16:50:35.767 7 INFO oslo_service.service [-] Child 30 exited with 
status 0
  2018-07-04 16:50:35.768 78 INFO neutron.wsgi [-] (78) wsgi starting up on 
http://10.0.105.101:9696
  2018-07-04 16:50:35.771 79 INFO neutron.wsgi [-] (79) wsgi starting up on 
http://10.0.105.101:9696
  2018-07-04 16:50:35.770 7 INFO oslo_service.service [-] Child 31 exited with 
status 0
  2018-07-04 16:50:35.773 7 INFO oslo_service.service [-] Child 32 exited with 
status 0
  2018-07-04 16:50:35.774 80 INFO neutron.wsgi [-] (80) wsgi starting up on 
http://10.0.105.101:9696
  2018-07-04 16:50:35.776 7 INFO oslo_service.service [-] Child 33 exited with 
status 0
  2018-07-04 16:50:35.777 81 INFO neutron.wsgi [-] (81) wsgi starting up on 
http://10.0.105.101:9696
  2018-07-04 16:50:35.780 82 INFO neutron.wsgi [-] (82) wsgi starting up on 
http://10.0.105.101:9696
  2018-07-04 16:50:35.779 7 INFO oslo_service.service [-] Child 34 exited with 
status 0
  2018-07-04 16:50:35.782 7 INFO oslo_service.service [-] Child 43 exited with 
status 0
  2018-07-04 16:50:35.783 83 INFO neutron.wsgi [-] (83) wsgi starting up on 
http://10.0.105.101:9696
  2018-07-04 16:50:35.783 7 WARNING oslo_service.service [-] pid 43 not in 
child list
  2018-07-04 16:50:35.786 84 INFO neutron.wsgi [-] (84) wsgi starting up on 
http://10.0.105.101:9696

  # But on the second SIGHUP, the following happened:

  2018-07-04 16:52:08.821 7 INFO oslo_service.service [-] Caught SIGHUP, 
stopping children
  2018-07-04 16:52:08.830 7 INFO neutron.common.config [-] Logging enabled!
  2018-07-04 16:52:08.831 7 INFO neutron.common.config [-] 
/usr/bin/neutron-server version 12.0.3.dev17
  2018-07-04 16:52:08.847 7 INFO oslo_service.service [-] Wait called after 
thread killed. Cleaning up.
  2018-07-04 16:52:08.847 79 INFO neutron.wsgi [-] (79) wsgi exited, 
is_accepting=True
  2018-07-04 16:52:08.847 82 INFO neutron.wsgi [-] (82) wsgi exited, 
is_accepting=True
  2018-07-04 16:52:08.847 78 INFO neutron.wsgi [-] (78) wsgi exited, 
is_accepting=True
  2018-07-04 16:52:08.847 84 INFO neutron.wsgi [-] (84) wsgi exited, 
is_accepting=True
  2018-07-04 16:52:08.847 81 INFO neutron.wsgi [-] (81) wsgi exited, 
is_accepting=True
  2018-07-04 16:52:08.847 80 INFO neutron.wsgi [-] (80) wsgi exited, 
is_accepting=True
  2018-07-04 16:52:08.847 83 INFO neutron.wsgi [-] (83) wsgi exited, 
is_accepting=True
  2018-07-04 16:52:08.848 7 INFO oslo_service.service [-] Waiting on 10 
children to exit
  2018-07-04 16:52:08.852 7 INFO oslo_service.service [-] Child 78 exited with 
status 0
  2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 78 not in 
child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 79 exited with 
status 0
  2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 79 not in 
child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 44 killed by 
signal 15
  2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 80 exited with 
status 0
  2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 80 not in 
child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 81 exited with 
status 0
  2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 81 not in 
child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 82 exited with 
status 0
  2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 82 not in 
child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 83 exited with 
status 0
  2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 83 not in 
child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 84 exited with 
status 0
  2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 84 not in 
child list: OSError: [Errno 3] No such process
  2018-07-04 16:52:15.039 7 INFO oslo_service.service [-] Child 98 exited with 
status 0
  2018-07-04 16:52:30.025 7 INFO oslo_service.service [-] Child 97 exited with 
status 0
  2018-07-04 16:52:38.017 7 INFO oslo_service.service [-] Child 96 exited with 
status 0

  * Version

  I'm running Centos7 RDO TripleO (commit tag 
c9fd24040454913b4a325741094285676fb7e7bc_a0a28280), the OVS ML2 plugin and 3 
controllers in a HA setup. tripleo templates and deployment config
  available on request.

  This corresponds to OpenStack Queens.

  I have not fully investigated all other services, but they seem to be
  fine with SIGHUP. In particular nova-api-metadata seems to be OK and
  doing what's expected.

  * Perceived severity

  This blocks our private production stack from running: neutron api
  breaking has a big knock-on effect, unfortunately.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1780139/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

Reply via email to