Re: [openstack-dev] dhcp 'Address already in use' errors when trying to start a dnsmasq

2016-09-27 Thread Ihar Hrachyshka

Kevin Benton  wrote:

There is no side effect other than log noise and a delayed reload? I  
don't see why a revert would be appropriate.


I looked at the logs and the issue seems to be that the process isn't  
tracked correctly the first time it starts.


grep for the following:

ea141299-ce07-4ff7-9a03-7a1b7a75a371', 'dnsmasq'

in
http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz

The first time dnsmasq is called it gives a 0 return code but the agent  
doesn't seem to get a pid for it. So the next time it is called it  
conflicts with the running proc.


Id you mean those log messages:

2016-09-27 12:21:24.760
 13751 DEBUG neutron.agent.linux.utils 
[req-128c3e79-151a-4f57-9dbc-053ff0999679 - -] Unable to access 
/opt/stack/data/neutron/external/pids/ea141299-ce07-4ff7-9a03-7a1b7a75a371.pid 
get_value_from_file /opt/stack/new/neutron/neutron/agent/linux/utils.py:204

2016-09-27 12:21:24.760
 13751 DEBUG neutron.agent.linux.utils 
[req-128c3e79-151a-4f57-9dbc-053ff0999679 - -] Unable to access 
/opt/stack/data/neutron/external/pids/ea141299-ce07-4ff7-9a03-7a1b7a75a371.pid 
get_value_from_file /opt/stack/new/neutron/neutron/agent/linux/utils.py:204

2016-09-27 12:21:24.761
 13751 DEBUG neutron.agent.linux.external_process 
[req-128c3e79-151a-4f57-9dbc-053ff0999679 - -] No process started for 
ea141299-ce07-4ff7-9a03-7a1b7a75a371 disable 
/opt/stack/new/neutron/neutron/agent/linux/external_process.py:123

then I don’t think that’s correct interpretation of the log messages.  
Notice that the pid file names there are not in dnsmasq network dir, but in  
external/.pid. Those pid files are not dnsmasq ones but potentially  
belong to metadata proxies managed by the agent. The agent attempts to  
disable proxy because it’s not needed (as per logic in  
configure_dhcp_for_network). Since the network does not have a proxy  
process running, it can’t find the pid file and hence cannot disable the  
proxy process. Then it completes configuration process.


It should not influence the flow of the program.

To prove that dnsmasq is properly tracked, also see that later when we  
restart the process for the network, we correctly extract PID from the file  
and use it for kill -9 call:


http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz#_2016-09-27_12_21_24_878

You can check for yourself that the same PID was actually used by the  
dnsmasq process started the first time. It’s logged in syslog.


Ihar

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] dhcp 'Address already in use' errors when trying to start a dnsmasq

2016-09-27 Thread Kevin Benton
There is no side effect other than log noise and a delayed reload? I don't
see why a revert would be appropriate.

I looked at the logs and the issue seems to be that the process isn't
tracked correctly the first time it starts.

grep for the following:

ea141299-ce07-4ff7-9a03-7a1b7a75a371', 'dnsmasq'

in
http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz

The first time dnsmasq is called it gives a 0 return code but the agent
doesn't seem to get a pid for it. So the next time it is called it
conflicts with the running proc.

On Sep 27, 2016 11:22, "Ihar Hrachyshka"  wrote:

> Hi all,
>
> so we started getting ‘Address already in use’ when trying to start
> dnsmasq after the previous instance of the process is killed with kill -9.
> Armando spotted it today in logs for: https://review.openstack.org/#
> /c/377626/ but as per logstash it seems like an error we saw before (the
> earliest I see is 9/20), f.e.:
>
> http://logs.openstack.org/26/377626/1/check/gate-tempest-dsv
> m-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz
>
> Assuming I understand the flow of the failure, it runs as follows:
>
> - sync_state starts dnsmasq per network;
> - after agent lock is freed, some other notification event
> (port_update/subnet_update/...) triggers restart for one of the processes;
> - the restart is done not via reload_allocations (-SIGHUP) but thru
> restart/disable (kill -9);
> - once the old dnsmasq is killed with -9, we attempt to start a new
> process with new config files generated and fail with: “dnsmasq: failed to
> create listening socket for 10.1.15.242: Address already in use”
> - surprisingly, after several failing attempts to start the process, it
> succeeds to start it after a bunch of seconds and runs fine.
>
> It looks like once we kill the process with -9, it may hold for the socket
> resource for some time and may clash with the new process we try to spawn.
> It’s a bit weird because dnsmasq should have set REUSEADDR for the socket,
> so a new process should have started just fine.
>
> Lately, we landed several patches that touched reload logic for DHCP agent
> on notifications. Among those suspicious in the context are:
>
> - https://review.openstack.org/#/c/372595/ - note it requests ‘disable’
> (-9) where it was using ‘reload_allocations’ (-SIGHUP) before, and it also
> does not unplug the port on lease release (maybe after we rip of the
> device, the address clash with the old dnsmasq state is gone even though
> the ’new’ port will use the same address?).
> - https://review.openstack.org/#/c/372236/6 - we were requesting
> reload_allocations in some cases before, and now we put the network into
> resync queue
>
> There were other related changes lately, you can check history of Kevin’s
> changes for the branch, it should capture most of them.
>
> I wonder whether we hit some long standing restart issue with dnsmasq here
> that was just never triggered before because we were not calling kill -9 so
> eagerly as we do now.
>
> Note: Jakub Libosvar validated that 'kill -9 && dnsmasq’ in loop does NOT
> result in the failure we see in gate logs.
>
> We need to understand what’s going with the failure, and come up with some
> plan for Newton. We either revert suspected patches as I believe Armando
> proposed before, but then it’s not clear until which point to do it; or we
> come up with some smart fix for that, that I don’t immediately grasp.
>
> I will be on vacation tomorrow, though I will check the email thread to
> see if we have a plan to act on. I really hope folks give the issue a
> priority since it seems like we buried ourselves under a pile of
> interleaved patches and now we don’t have a clear view of how to get out of
> the pile.
>
> Cheers,
> Ihar
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] dhcp 'Address already in use' errors when trying to start a dnsmasq

2016-09-27 Thread Miguel Angel Ajo Pelayo
Ack, and thanks for the summary Ihar,

I will have a look on it tomorrow morning, please update this thread
with any progress.



On Tue, Sep 27, 2016 at 8:22 PM, Ihar Hrachyshka  wrote:
> Hi all,
>
> so we started getting ‘Address already in use’ when trying to start dnsmasq
> after the previous instance of the process is killed with kill -9. Armando
> spotted it today in logs for: https://review.openstack.org/#/c/377626/ but
> as per logstash it seems like an error we saw before (the earliest I see is
> 9/20), f.e.:
>
> http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz
>
> Assuming I understand the flow of the failure, it runs as follows:
>
> - sync_state starts dnsmasq per network;
> - after agent lock is freed, some other notification event
> (port_update/subnet_update/...) triggers restart for one of the processes;
> - the restart is done not via reload_allocations (-SIGHUP) but thru
> restart/disable (kill -9);
> - once the old dnsmasq is killed with -9, we attempt to start a new process
> with new config files generated and fail with: “dnsmasq: failed to create
> listening socket for 10.1.15.242: Address already in use”
> - surprisingly, after several failing attempts to start the process, it
> succeeds to start it after a bunch of seconds and runs fine.
>
> It looks like once we kill the process with -9, it may hold for the socket
> resource for some time and may clash with the new process we try to spawn.
> It’s a bit weird because dnsmasq should have set REUSEADDR for the socket,
> so a new process should have started just fine.
>
> Lately, we landed several patches that touched reload logic for DHCP agent
> on notifications. Among those suspicious in the context are:
>
> - https://review.openstack.org/#/c/372595/ - note it requests ‘disable’ (-9)
> where it was using ‘reload_allocations’ (-SIGHUP) before, and it also does
> not unplug the port on lease release (maybe after we rip of the device, the
> address clash with the old dnsmasq state is gone even though the ’new’ port
> will use the same address?).
> - https://review.openstack.org/#/c/372236/6 - we were requesting
> reload_allocations in some cases before, and now we put the network into
> resync queue
>
> There were other related changes lately, you can check history of Kevin’s
> changes for the branch, it should capture most of them.
>
> I wonder whether we hit some long standing restart issue with dnsmasq here
> that was just never triggered before because we were not calling kill -9 so
> eagerly as we do now.
>
> Note: Jakub Libosvar validated that 'kill -9 && dnsmasq’ in loop does NOT
> result in the failure we see in gate logs.
>
> We need to understand what’s going with the failure, and come up with some
> plan for Newton. We either revert suspected patches as I believe Armando
> proposed before, but then it’s not clear until which point to do it; or we
> come up with some smart fix for that, that I don’t immediately grasp.
>
> I will be on vacation tomorrow, though I will check the email thread to see
> if we have a plan to act on. I really hope folks give the issue a priority
> since it seems like we buried ourselves under a pile of interleaved patches
> and now we don’t have a clear view of how to get out of the pile.
>
> Cheers,
> Ihar
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] dhcp 'Address already in use' errors when trying to start a dnsmasq

2016-09-27 Thread Armando M.
On 27 September 2016 at 11:29, Miguel Angel Ajo Pelayo 
wrote:

> Ack, and thanks for the summary Ihar,
>
> I will have a look on it tomorrow morning, please update this thread
> with any progress.
>
>
>
> On Tue, Sep 27, 2016 at 8:22 PM, Ihar Hrachyshka 
> wrote:
> > Hi all,
> >
> > so we started getting ‘Address already in use’ when trying to start
> dnsmasq
> > after the previous instance of the process is killed with kill -9.
> Armando
> > spotted it today in logs for: https://review.openstack.org/#/c/377626/
> but
> > as per logstash it seems like an error we saw before (the earliest I see
> is
> > 9/20), f.e.:
> >
> > http://logs.openstack.org/26/377626/1/check/gate-tempest-dsv
> m-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz
> >
> > Assuming I understand the flow of the failure, it runs as follows:
> >
> > - sync_state starts dnsmasq per network;
> > - after agent lock is freed, some other notification event
> > (port_update/subnet_update/...) triggers restart for one of the
> processes;
> > - the restart is done not via reload_allocations (-SIGHUP) but thru
> > restart/disable (kill -9);
> > - once the old dnsmasq is killed with -9, we attempt to start a new
> process
> > with new config files generated and fail with: “dnsmasq: failed to create
> > listening socket for 10.1.15.242: Address already in use”
> > - surprisingly, after several failing attempts to start the process, it
> > succeeds to start it after a bunch of seconds and runs fine.
> >
> > It looks like once we kill the process with -9, it may hold for the
> socket
> > resource for some time and may clash with the new process we try to
> spawn.
> > It’s a bit weird because dnsmasq should have set REUSEADDR for the
> socket,
> > so a new process should have started just fine.
> >
> > Lately, we landed several patches that touched reload logic for DHCP
> agent
> > on notifications. Among those suspicious in the context are:
> >
> > - https://review.openstack.org/#/c/372595/ - note it requests ‘disable’
> (-9)
> > where it was using ‘reload_allocations’ (-SIGHUP) before, and it also
> does
> > not unplug the port on lease release (maybe after we rip of the device,
> the
> > address clash with the old dnsmasq state is gone even though the ’new’
> port
> > will use the same address?).
> > - https://review.openstack.org/#/c/372236/6 - we were requesting
> > reload_allocations in some cases before, and now we put the network into
> > resync queue
> >
> > There were other related changes lately, you can check history of Kevin’s
> > changes for the branch, it should capture most of them.
> >
> > I wonder whether we hit some long standing restart issue with dnsmasq
> here
> > that was just never triggered before because we were not calling kill -9
> so
> > eagerly as we do now.
> >
> > Note: Jakub Libosvar validated that 'kill -9 && dnsmasq’ in loop does NOT
> > result in the failure we see in gate logs.
> >
> > We need to understand what’s going with the failure, and come up with
> some
> > plan for Newton. We either revert suspected patches as I believe Armando
> > proposed before, but then it’s not clear until which point to do it; or
> we
> > come up with some smart fix for that, that I don’t immediately grasp.
> >
> > I will be on vacation tomorrow, though I will check the email thread to
> see
> > if we have a plan to act on. I really hope folks give the issue a
> priority
> > since it seems like we buried ourselves under a pile of interleaved
> patches
> > and now we don’t have a clear view of how to get out of the pile.
>

Personally I feel there is no time left for us to do anything about this in
RC2. Nothing at this point is going to guarantee that another patch is not
gonna lead us to new potential ripple effects. I am personally okay to cut
RC2 as it stands, and let downstream players have some time vetting the
build and give us a chance to fix one more last minute "disaster".

Rest assured we'll learn from this mistake.

A.

>
> > Cheers,
> > Ihar
> >
> > 
> __
> > OpenStack Development Mailing List (not for usage questions)
> > Unsubscribe: openstack-dev-requ...@lists.op
> enstack.org?subject:unsubscribe
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] dhcp 'Address already in use' errors when trying to start a dnsmasq

2016-09-27 Thread Ihar Hrachyshka

Hi all,

so we started getting ‘Address already in use’ when trying to start dnsmasq  
after the previous instance of the process is killed with kill -9. Armando  
spotted it today in logs for: https://review.openstack.org/#/c/377626/ but  
as per logstash it seems like an error we saw before (the earliest I see is  
9/20), f.e.:


http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz

Assuming I understand the flow of the failure, it runs as follows:

- sync_state starts dnsmasq per network;
- after agent lock is freed, some other notification event  
(port_update/subnet_update/...) triggers restart for one of the processes;
- the restart is done not via reload_allocations (-SIGHUP) but thru  
restart/disable (kill -9);
- once the old dnsmasq is killed with -9, we attempt to start a new process  
with new config files generated and fail with: “dnsmasq: failed to create  
listening socket for 10.1.15.242: Address already in use”
- surprisingly, after several failing attempts to start the process, it  
succeeds to start it after a bunch of seconds and runs fine.


It looks like once we kill the process with -9, it may hold for the socket  
resource for some time and may clash with the new process we try to spawn.  
It’s a bit weird because dnsmasq should have set REUSEADDR for the socket,  
so a new process should have started just fine.


Lately, we landed several patches that touched reload logic for DHCP agent  
on notifications. Among those suspicious in the context are:


- https://review.openstack.org/#/c/372595/ - note it requests ‘disable’  
(-9) where it was using ‘reload_allocations’ (-SIGHUP) before, and it also  
does not unplug the port on lease release (maybe after we rip of the  
device, the address clash with the old dnsmasq state is gone even though  
the ’new’ port will use the same address?).
- https://review.openstack.org/#/c/372236/6 - we were requesting  
reload_allocations in some cases before, and now we put the network into  
resync queue


There were other related changes lately, you can check history of Kevin’s  
changes for the branch, it should capture most of them.


I wonder whether we hit some long standing restart issue with dnsmasq here  
that was just never triggered before because we were not calling kill -9 so  
eagerly as we do now.


Note: Jakub Libosvar validated that 'kill -9 && dnsmasq’ in loop does NOT  
result in the failure we see in gate logs.


We need to understand what’s going with the failure, and come up with some  
plan for Newton. We either revert suspected patches as I believe Armando  
proposed before, but then it’s not clear until which point to do it; or we  
come up with some smart fix for that, that I don’t immediately grasp.


I will be on vacation tomorrow, though I will check the email thread to see  
if we have a plan to act on. I really hope folks give the issue a priority  
since it seems like we buried ourselves under a pile of interleaved patches  
and now we don’t have a clear view of how to get out of the pile.


Cheers,
Ihar

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev