Re: [openstack-dev] dhcp 'Address already in use' errors when trying to start a dnsmasq
Kevin Benton wrote:
> There is no side effect other than log noise and a delayed reload? I don't
> see why a revert would be appropriate. I looked at the logs and the issue
> seems to be that the process isn't tracked correctly the first time it
> starts. grep for the following: ea141299-ce07-4ff7-9a03-7a1b7a75a371', 'dnsmasq' in
> http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz
> The first time dnsmasq is called it gives a 0 return code but the agent
> doesn't seem to get a pid for it. So the next time it is called it
> conflicts with the running proc.

If you mean those log messages:

2016-09-27 12:21:24.760 13751 DEBUG neutron.agent.linux.utils [req-128c3e79-151a-4f57-9dbc-053ff0999679 - -] Unable to access /opt/stack/data/neutron/external/pids/ea141299-ce07-4ff7-9a03-7a1b7a75a371.pid get_value_from_file /opt/stack/new/neutron/neutron/agent/linux/utils.py:204
2016-09-27 12:21:24.760 13751 DEBUG neutron.agent.linux.utils [req-128c3e79-151a-4f57-9dbc-053ff0999679 - -] Unable to access /opt/stack/data/neutron/external/pids/ea141299-ce07-4ff7-9a03-7a1b7a75a371.pid get_value_from_file /opt/stack/new/neutron/neutron/agent/linux/utils.py:204
2016-09-27 12:21:24.761 13751 DEBUG neutron.agent.linux.external_process [req-128c3e79-151a-4f57-9dbc-053ff0999679 - -] No process started for ea141299-ce07-4ff7-9a03-7a1b7a75a371 disable /opt/stack/new/neutron/neutron/agent/linux/external_process.py:123

then I don’t think that’s a correct interpretation of the log messages. Notice that the pid files named there are not in the dnsmasq network dir but in external/pids/. Those pid files are not dnsmasq ones but potentially belong to metadata proxies managed by the agent. The agent attempts to disable the proxy because it’s not needed (as per the logic in configure_dhcp_for_network). Since the network does not have a proxy process running, it can’t find the pid file and hence cannot disable the proxy process.
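For reference, the lookup behind those DEBUG lines can be sketched roughly like this (a simplified reconstruction, not the actual neutron code; only the log messages are from the source, while the function signatures and the pid_dir parameter are assumptions for illustration):

```python
# Rough sketch of the pid-file lookup behind the DEBUG lines above
# (simplified; not the exact neutron.agent.linux code).
import logging
import os
import signal

LOG = logging.getLogger(__name__)


def get_value_from_file(filename, converter=None):
    """Return the (converted) file contents, or None if the file is unreadable."""
    try:
        with open(filename) as f:
            data = f.read().strip()
        return converter(data) if converter else data
    except IOError:
        # -> "Unable to access /opt/stack/data/neutron/external/pids/<uuid>.pid"
        LOG.debug('Unable to access %s', filename)
        return None


def disable(uuid, pid_dir):
    """Kill the tracked process for a network, if a pid file exists for it."""
    pid = get_value_from_file(os.path.join(pid_dir, uuid + '.pid'), int)
    if pid is None:
        # -> "No process started for <uuid>": nothing was ever spawned for
        # this network (here: no metadata proxy), so there is nothing to kill.
        LOG.debug('No process started for %s', uuid)
        return
    os.kill(pid, signal.SIGKILL)
```

The point being made in the thread: hitting the `pid is None` branch for a never-started metadata proxy is harmless log noise, not a sign that the dnsmasq pid went untracked.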
The agent then completes the configuration process; the failed pid-file lookup should not influence the flow of the program.

To prove that dnsmasq is properly tracked, also see that later, when we restart the process for the network, we correctly extract the PID from the file and use it for the kill -9 call:

http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz#_2016-09-27_12_21_24_878

You can check for yourself that the same PID was actually used by the dnsmasq process started the first time. It’s logged in syslog.

Ihar

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] dhcp 'Address already in use' errors when trying to start a dnsmasq
There is no side effect other than log noise and a delayed reload? I don't see why a revert would be appropriate. I looked at the logs and the issue seems to be that the process isn't tracked correctly the first time it starts. grep for the following: ea141299-ce07-4ff7-9a03-7a1b7a75a371', 'dnsmasq' in
http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz
The first time dnsmasq is called it gives a 0 return code but the agent doesn't seem to get a pid for it. So the next time it is called it conflicts with the running proc.

On Sep 27, 2016 11:22, "Ihar Hrachyshka" wrote:
> Hi all,
>
> so we started getting ‘Address already in use’ when trying to start
> dnsmasq after the previous instance of the process is killed with kill -9.
> [...]
Re: [openstack-dev] dhcp 'Address already in use' errors when trying to start a dnsmasq
Ack, and thanks for the summary Ihar, I will have a look on it tomorrow morning, please update this thread with any progress.

On Tue, Sep 27, 2016 at 8:22 PM, Ihar Hrachyshka wrote:
> Hi all,
>
> so we started getting ‘Address already in use’ when trying to start dnsmasq
> after the previous instance of the process is killed with kill -9.
> [...]
Re: [openstack-dev] dhcp 'Address already in use' errors when trying to start a dnsmasq
On 27 September 2016 at 11:29, Miguel Angel Ajo Pelayo wrote:
> Ack, and thanks for the summary Ihar,
>
> I will have a look on it tomorrow morning, please update this thread
> with any progress.
> [...]

Personally I feel there is no time left for us to do anything about this in RC2. Nothing at this point is going to guarantee that another patch is not gonna lead us to new potential ripple effects. I am personally okay to cut RC2 as it stands, and let downstream players have some time vetting the build and give us a chance to fix one more last minute "disaster". Rest assured we'll learn from this mistake.

A.
[openstack-dev] dhcp 'Address already in use' errors when trying to start a dnsmasq
Hi all,

so we started getting ‘Address already in use’ when trying to start dnsmasq after the previous instance of the process is killed with kill -9. Armando spotted it today in logs for: https://review.openstack.org/#/c/377626/ but as per logstash it seems like an error we saw before (the earliest I see is 9/20), e.g.:

http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz

Assuming I understand the flow of the failure, it runs as follows:

- sync_state starts dnsmasq per network;
- after the agent lock is freed, some other notification event (port_update/subnet_update/...) triggers a restart for one of the processes;
- the restart is done not via reload_allocations (-SIGHUP) but through restart/disable (kill -9);
- once the old dnsmasq is killed with -9, we attempt to start a new process with newly generated config files and fail with: “dnsmasq: failed to create listening socket for 10.1.15.242: Address already in use”;
- surprisingly, after several failing attempts to start the process, it succeeds in starting it after a bunch of seconds and runs fine.

It looks like once we kill the process with -9, it may hold on to the socket resource for some time and clash with the new process we try to spawn. It’s a bit weird because dnsmasq should have set REUSEADDR for the socket, so a new process should have started just fine.

Lately, we landed several patches that touched the reload logic for the DHCP agent on notifications. Among those suspicious in this context are:

- https://review.openstack.org/#/c/372595/ - note it requests ‘disable’ (-9) where it was using ‘reload_allocations’ (-SIGHUP) before, and it also does not unplug the port on lease release (maybe after we rip off the device, the address clash with the old dnsmasq state is gone even though the ’new’ port will use the same address?).
- https://review.openstack.org/#/c/372236/6 - we were requesting reload_allocations in some cases before, and now we put the network into the resync queue.

There were other related changes lately; you can check the history of Kevin’s changes for the branch, it should capture most of them.

I wonder whether we hit some long-standing restart issue with dnsmasq here that was just never triggered before, because we were not calling kill -9 as eagerly as we do now.

Note: Jakub Libosvar validated that 'kill -9 && dnsmasq’ in a loop does NOT result in the failure we see in gate logs.

We need to understand what’s going on with the failure and come up with some plan for Newton. We either revert the suspected patches, as I believe Armando proposed before (but then it’s not clear until which point to do it), or we come up with some smart fix that I don’t immediately grasp.

I will be on vacation tomorrow, though I will check the email thread to see if we have a plan to act on. I really hope folks give the issue priority, since it seems like we buried ourselves under a pile of interleaved patches and now we don’t have a clear view of how to get out of the pile.

Cheers,
Ihar
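The ‘Address already in use’ symptom corresponds to bind() failing with EADDRINUSE: while one socket still holds a port, a second socket cannot bind the same address. A minimal sketch of the clash (generic socket code, not dnsmasq’s actual implementation):

```python
# Minimal demo of the EADDRINUSE clash (generic socket code, not dnsmasq's):
# while one UDP socket holds a port, a second bind to the same address fails.
import errno
import socket

old = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
old.bind(('127.0.0.1', 0))          # stand-in for the old dnsmasq's socket
port = old.getsockname()[1]

new = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
clash_errno = None
try:
    new.bind(('127.0.0.1', port))   # the new process trying the same port
except OSError as e:
    clash_errno = e.errno           # EADDRINUSE: "Address already in use"

old.close()                         # once the old socket is truly gone...
new.bind(('127.0.0.1', port))       # ...the rebind succeeds immediately
new.close()
```

Since the kernel closes a process's sockets as soon as it exits, a clash that persists for seconds after kill -9 suggests something was still holding the address when the new dnsmasq started, rather than a lingering socket of the dead process.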
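Jakub’s negative result (plain 'kill -9 && dnsmasq’ in a loop does not reproduce the clash) can be approximated outside the gate with a stand-in listener (assumption: a tiny UDP server substitutes for the real daemon, and the port is an arbitrary free one):

```python
# Stand-in reproduction of the 'kill -9 && restart' loop (assumption: a tiny
# UDP listener substitutes for the real dnsmasq daemon). Each cycle binds the
# port in a child process, SIGKILLs it, and immediately rebinds: the kernel
# frees the socket as soon as the process dies, so no cycle should hit
# EADDRINUSE -- matching Jakub's negative result.
import signal
import socket
import subprocess
import sys

# pick a free UDP port
probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
probe.bind(('127.0.0.1', 0))
port = probe.getsockname()[1]
probe.close()

child_src = (
    "import signal, socket\n"
    "s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)\n"
    "s.bind(('127.0.0.1', %d))\n"
    "print('ready', flush=True)\n"
    "signal.pause()\n" % port
)

cycles_ok = 0
for _ in range(5):
    child = subprocess.Popen([sys.executable, '-c', child_src],
                             stdout=subprocess.PIPE)
    child.stdout.readline()            # wait until the child has bound the port
    child.send_signal(signal.SIGKILL)  # the agent's 'disable' path: kill -9
    child.wait()
    # immediate rebind, as when the agent spawns a fresh dnsmasq
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(('127.0.0.1', port))
    s.close()
    cycles_ok += 1

print('no address clash in %d cycles' % cycles_ok)
```

That the gate does hit the clash while a loop like this does not points at something beyond plain kill-then-restart, e.g. another process or namespace device still holding the address, which is consistent with the suspicion about the unplugged port above.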