Re: [openstack-dev] [octavia] Sometimes amphoras are not re-created if they are not reached for more than heartbeat_timeout

2018-05-04 Thread Michael Johnson
I have commented on both of those stories. Thank you for submitting them.

As for the values, this is hard, as those settings depend on a lot of
factors. The default values are targeted towards developers and likely
need to be adjusted for production. We have not yet put together our
deployment guide, where we would cover this type of tuning. Sigh, so
much to do and not enough team members.

Here are some comments I can give on those settings:
[health_manager]
failover_threads - This is the maximum number of parallel failovers
each instance (process) of the octavia-healthmanager can process at
the same time. Beyond this number they queue until a thread becomes
available. If your cloud is fairly stable and you have few health
managers, this can be a reasonably low number. Consider the maximum
number of amphora you would have on a single compute host should it
fail. Also take into account the CPU power available on the health
manager host.
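
For example (the number is purely illustrative, not an official
recommendation), if your largest compute host might carry a few dozen
amphorae, a starting point in octavia.conf could look like:

[health_manager]
# Parallel failovers per octavia-healthmanager process; sized so the
# amphorae from a single failed compute host can be recovered promptly.
failover_threads = 10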

status_update_threads - This is the maximum number of health heartbeat
messages each instance (process) of the octavia-healthmanager can
process at the same time.  The more octavia-healthmanagers you have,
the lower this can be. The upper limit on this is related to how fast
your database is processing the updates. Should this number be too
low, the health manager will start logging warnings that you need more
health managers.
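
An illustrative sketch (the value is a guess that should be validated
against your database throughput, not a recommendation):

[health_manager]
# Heartbeat messages processed in parallel per process; the database
# update rate is usually the real ceiling here.
status_update_threads = 50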

[haproxy_amphora]
build_rate_limit
build_active_retries
These two settings are only used if build rate limiting is enabled (it
is not enabled by default). You would enable it if your Nova
infrastructure cannot handle the rate of instance builds Octavia is
asking of it. It prioritizes instance builds based on need and limits
the rate of instance builds Octavia asks Nova for.  The only impact to
the Octavia controllers is increased memory utilization if there are a
large number of builds queued waiting for Nova.
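
As an illustration only (both values below are hypothetical, not
recommendations), enabling rate limiting might look like:

[haproxy_amphora]
# Allow at most 5 amphora builds in flight against Nova at a time.
build_rate_limit = 5
# How many times a queued build will retry waiting for a build slot.
build_active_retries = 120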

You missed these two:
connection_max_retries
connection_retry_interval
These values are typically adjusted in production environments, as the
defaults are tuned for exceedingly slow development systems (VirtualBox,
etc.) where booting instances can take up to twenty minutes. Together
they control how long we wait between Nova declaring the instance
"ACTIVE" and the kernel finishing booting inside the instance with the
amphora agent running. The default is to wait up to 25 minutes. In
production you would expect to drop this number significantly.  On a
typical cloud this should take less than thirty seconds, but you should
give it some buffer in case a host is especially busy.  Again, this
depends on the performance of your cloud.
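
As a sketch of the arithmetic (the values are illustrative, not a
recommendation): the total wait is roughly connection_max_retries *
connection_retry_interval, so a cloud whose amphorae boot in well under
a minute might use something like:

[haproxy_amphora]
# 60 retries * 5 seconds = give up after about 5 minutes instead of
# the 25 minute developer-oriented default.
connection_max_retries = 60
connection_retry_interval = 5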


[controller_worker]
workers - This is the number of worker threads pulling user requests
from the oslo messaging queue for each instance of the octavia-worker
process.  This number would be tuned depending on the number of worker
controllers you have in your cloud and the rate of user requests
(create, update, delete) that need to be serviced by a worker. GET
calls do not require a worker. This will also be limited by the
controller host CPU and RAM capacities.
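
For example (an illustrative value only), a deployment with a couple of
octavia-worker controller hosts and a modest request rate might run
each process with:

[controller_worker]
# Threads per octavia-worker process pulling requests off the
# oslo.messaging queue; bounded by controller host CPU and RAM.
workers = 4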

amp_active_retries
amp_active_wait_sec
Both of these values depend on the performance of your Nova
environment. This is how many times and how often we check Nova to see
if a requested instance has become "ACTIVE". Unless your Nova
environment is unusually slow, you should not need to change these
values.
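
As a hypothetical example of the trade-off, polling Nova every 10
seconds for up to 30 attempts waits at most 5 minutes for the instance
to go "ACTIVE":

[controller_worker]
# 30 checks * 10 seconds = wait up to 5 minutes for Nova to report
# the amphora instance as ACTIVE before treating the build as failed.
amp_active_retries = 30
amp_active_wait_sec = 10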


[task_flow]
max_workers - This value limits the parallelism inside the TaskFlow
flows used by the controllers. Currently there is little reason to
adjust this value, as the degree of parallelism in our flows is not
higher than this value. However, when we release Active-Active load
balancers, this value will control the number of parallel amphora
builds, up to the build rate limit above.
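
If you do want to set it explicitly, it is a single line (the value is
illustrative):

[task_flow]
# Upper bound on parallelism inside a single TaskFlow flow.
max_workers = 5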

Michael

On Thu, May 3, 2018 at 1:51 AM,   wrote:
> Hi Michael,
>
> I built a new amphora image with the latest patches and I reproduced two 
> different bugs that I see in my environment. One of them is similar to the 
> one initially described in this thread. I opened two stories as you advised:
>
> https://storyboard.openstack.org/#!/story/2001960
> https://storyboard.openstack.org/#!/story/2001955
>
> Meanwhile, can you provide some recommended values for the following 
> parameters (maybe in relation to the number of workers, cores, computes, etc.)?
>
> [health_manager]
> failover_threads
> status_update_threads
>
> [haproxy_amphora]
> build_rate_limit
> build_active_retries
>
> [controller_worker]
> workers
> amp_active_retries
> amp_active_wait_sec
>
> [task_flow]
> max_workers
>
> Thank you for your help,
> Mihaela Balas
>
> -Original Message-
> From: Michael Johnson [mailto:johnso...@gmail.com]
> Sent: Friday, April 27, 2018 8:24 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] [octavia] Sometimes amphoras are not re-created 
> if they are not reached for more than heartbeat_timeout
>
> Hi Mihaela,
>
> I am sorry to 

Re: [openstack-dev] [octavia] Sometimes amphoras are not re-created if they are not reached for more than heartbeat_timeout

2018-05-03 Thread mihaela.balas
Hi Michael,

I built a new amphora image with the latest patches and I reproduced two 
different bugs that I see in my environment. One of them is similar to the one 
initially described in this thread. I opened two stories as you advised:

https://storyboard.openstack.org/#!/story/2001960
https://storyboard.openstack.org/#!/story/2001955

Meanwhile, can you provide some recommended values for the following 
parameters (maybe in relation to the number of workers, cores, computes, etc.)?

[health_manager]
failover_threads
status_update_threads

[haproxy_amphora]
build_rate_limit
build_active_retries

[controller_worker]
workers
amp_active_retries
amp_active_wait_sec

[task_flow]
max_workers

Thank you for your help,
Mihaela Balas

-Original Message-
From: Michael Johnson [mailto:johnso...@gmail.com] 
Sent: Friday, April 27, 2018 8:24 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [octavia] Sometimes amphoras are not re-created if 
they are not reached for more than heartbeat_timeout

Hi Mihaela,

I am sorry to hear you are having trouble with the queens release of Octavia.  
It is true that a lot of work has gone into the failover capability, 
specifically working around a python threading issue and making it more 
resistant to certain neutron failure situations (missing ports, etc.).

I know of one open bug against the failover flows, 
https://storyboard.openstack.org/#!/story/2001481, "failover breaks in 
Active/Standby mode if both amphroae are down".

Unfortunately the log snippet above does not give me enough information about 
the problem to help with this issue. From the snippet it looks like the 
failovers were initiated, but the controllers are unable to reach the 
amphora-agent on the replacement amphora. It will continue those retry 
attempts, but eventually will fail the amphora into ERROR if it doesn't succeed.

One thought I have: if you created your amphora image in the last two weeks, 
you may have built an amphora using the master branch of octavia, which had a 
bug that impacted active/standby images. The bug was introduced while working 
around the new pip 10 issues and has since been fixed:
https://review.openstack.org/#/c/564371/

If neither of these situations match your environment, please open a story 
(https://storyboard.openstack.org/#!/dashboard/stories) for us and include the 
health manager logs from the point you delete the amphora up until it starts 
these connection attempts.  We will dig through those logs to see what the 
issue might be.

Michael (johnsom)

On Wed, Apr 25, 2018 at 4:07 AM,   wrote:
> Hello,
>
>
>
> I am testing Octavia Queens and I see that the failover behavior is 
> very different from the one in Ocata (this is the version we are 
> currently running in production).
>
> One example of such behavior is:
>
>
>
> I create 4 load balancers and, after the creation is successful, I shut 
> off all 8 amphoras. Sometimes, even though the health-manager agent 
> cannot reach the amphoras, they are not deleted and re-created. The logs 
> look like the ones shown below even when the heartbeat timeout has long 
> passed. Sometimes the amphoras are deleted and re-created. Sometimes they 
> are only partially re-created - some of them remain shut off.
>
> Heartbeat_timeout is set to 60 seconds.
>
>
>
>
>
>
>
> [octavia-health-manager-3662231220-nxnt3] 2018-04-25 10:57:26.244 11 
> WARNING octavia.amphorae.drivers.haproxy.rest_api_driver
> [req-339b54a7-ab0c-422a-832f-a444cd710497 - 
> a5f15235c0714365b98a50a11ec956e7
> - - -] Could not connect to instance. Retrying.: ConnectionError:
> HTTPSConnectionPool(host='192.168.0.15', port=9443): Max retries 
> exceeded with url:
> /0.5/listeners/285ad342-5582-423e-b654-1f0b50d91fb2/certificates/octav
> iasrv2.orange.com.pem (Caused by 
> NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f559862c710>: Failed to establish a new connection: 
> [Errno 113] No route to host',))
>
> [octavia-health-manager-3662231220-3lssd] 2018-04-25 10:57:26.464 13 
> WARNING octavia.amphorae.drivers.haproxy.rest_api_driver
> [req-a63b795a-4b4f-4b90-a201-a4c9f49ac68b - 
> a5f15235c0714365b98a50a11ec956e7
> - - -] Could not connect to instance. Retrying.: ConnectionError:
> HTTPSConnectionPool(host='192.168.0.14', port=9443): Max retries 
> exceeded with url:
> /0.5/listeners/a45bdef3-e7da-4a18-9f1f-53d5651efe0f/1615c1ec-249e-4fa8
> -9d73-2397e281712c/haproxy (Caused by 
> NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f8a0de95e10>: Failed to establish a new connection: 
> [Errno 113] No route to host',))
>
> [octavia-health-manager-3662231220-nxnt3] 2018-04-25 10:57:27.772 11 
> WARNING octavia.amphorae.drivers.haproxy.rest_api_driver
> [req-10febb10-85ea-4082-9df7-daa48894b004 - 
> a5f15235c0714365b98a50a11ec956e7
> - - -] Could not connect to instance. Retrying.: ConnectionError:
> HTTPSConnectionPool(host='192.168.0.19', port=9443): Max retries 
> exceeded with url:
> 

Re: [openstack-dev] [octavia] Sometimes amphoras are not re-created if they are not reached for more than heartbeat_timeout

2018-04-27 Thread Michael Johnson
Hi Mihaela,

I am sorry to hear you are having trouble with the queens release of
Octavia.  It is true that a lot of work has gone into the failover
capability, specifically working around a python threading issue and
making it more resistant to certain neutron failure situations
(missing ports, etc.).

I know of one open bug against the failover flows,
https://storyboard.openstack.org/#!/story/2001481, "failover breaks in
Active/Standby mode if both amphroae are down".

Unfortunately the log snippet above does not give me enough
information about the problem to help with this issue. From the
snippet it looks like the failovers were initiated, but the
controllers are unable to reach the amphora-agent on the replacement
amphora. It will continue those retry attempts, but eventually will
fail the amphora into ERROR if it doesn't succeed.

One thought I have: if you created your amphora image in the last two
weeks, you may have built an amphora using the master branch of
octavia, which had a bug that impacted active/standby images. The bug
was introduced while working around the new pip 10 issues and has
since been fixed: https://review.openstack.org/#/c/564371/

If neither of these situations match your environment, please open a
story (https://storyboard.openstack.org/#!/dashboard/stories) for us
and include the health manager logs from the point you delete the
amphora up until it starts these connection attempts.  We will dig
through those logs to see what the issue might be.

Michael (johnsom)

On Wed, Apr 25, 2018 at 4:07 AM,   wrote:
> Hello,
>
>
>
> I am testing Octavia Queens and I see that the failover behavior is very
> different from the one in Ocata (this is the version we are currently
> running in production).
>
> One example of such behavior is:
>
>
>
> I create 4 load balancers and, after the creation is successful, I shut off
> all 8 amphoras. Sometimes, even though the health-manager agent cannot reach
> the amphoras, they are not deleted and re-created. The logs look like the
> ones shown below even when the heartbeat timeout has long passed. Sometimes
> the amphoras are deleted and re-created. Sometimes they are only partially
> re-created - some of them remain shut off.
>
> Heartbeat_timeout is set to 60 seconds.
>
>
>
>
>
>
>
> [octavia-health-manager-3662231220-nxnt3] 2018-04-25 10:57:26.244 11 WARNING
> octavia.amphorae.drivers.haproxy.rest_api_driver
> [req-339b54a7-ab0c-422a-832f-a444cd710497 - a5f15235c0714365b98a50a11ec956e7
> - - -] Could not connect to instance. Retrying.: ConnectionError:
> HTTPSConnectionPool(host='192.168.0.15', port=9443): Max retries exceeded
> with url:
> /0.5/listeners/285ad342-5582-423e-b654-1f0b50d91fb2/certificates/octaviasrv2.orange.com.pem
> (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f559862c710>: Failed to establish a new connection: [Errno 113]
> No route to host',))
>
> [octavia-health-manager-3662231220-3lssd] 2018-04-25 10:57:26.464 13 WARNING
> octavia.amphorae.drivers.haproxy.rest_api_driver
> [req-a63b795a-4b4f-4b90-a201-a4c9f49ac68b - a5f15235c0714365b98a50a11ec956e7
> - - -] Could not connect to instance. Retrying.: ConnectionError:
> HTTPSConnectionPool(host='192.168.0.14', port=9443): Max retries exceeded
> with url:
> /0.5/listeners/a45bdef3-e7da-4a18-9f1f-53d5651efe0f/1615c1ec-249e-4fa8-9d73-2397e281712c/haproxy
> (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f8a0de95e10>: Failed to establish a new connection: [Errno 113]
> No route to host',))
>
> [octavia-health-manager-3662231220-nxnt3] 2018-04-25 10:57:27.772 11 WARNING
> octavia.amphorae.drivers.haproxy.rest_api_driver
> [req-10febb10-85ea-4082-9df7-daa48894b004 - a5f15235c0714365b98a50a11ec956e7
> - - -] Could not connect to instance. Retrying.: ConnectionError:
> HTTPSConnectionPool(host='192.168.0.19', port=9443): Max retries exceeded
> with url:
> /0.5/listeners/96ce5862-d944-46cb-8809-e1e328268a66/fc5b7940-3527-4e9b-b93f-1da3957a5b71/haproxy
> (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f5598491c90>: Failed to establish a new connection: [Errno 113]
> No route to host',))
>
> [octavia-health-manager-3662231220-nxnt3] 2018-04-25 10:57:34.252 11 WARNING
> octavia.amphorae.drivers.haproxy.rest_api_driver
> [req-339b54a7-ab0c-422a-832f-a444cd710497 - a5f15235c0714365b98a50a11ec956e7
> - - -] Could not connect to instance. Retrying.: ConnectionError:
> HTTPSConnectionPool(host='192.168.0.15', port=9443): Max retries exceeded
> with url:
> /0.5/listeners/285ad342-5582-423e-b654-1f0b50d91fb2/certificates/octaviasrv2.orange.com.pem
> (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f5598520790>: Failed to establish a new connection: [Errno 113]
> No route to host',))
>
> [octavia-health-manager-3662231220-3lssd] 2018-04-25 10:57:34.476 13 WARNING
> octavia.amphorae.drivers.haproxy.rest_api_driver
> [req-a63b795a-4b4f-4b90-a201-a4c9f49ac68b - a5f15235c0714365b98a50a11ec956e7
> - - -] Could not connect to instance. Retrying.: ConnectionError:
> 

[openstack-dev] [octavia] Sometimes amphoras are not re-created if they are not reached for more than heartbeat_timeout

2018-04-25 Thread mihaela.balas
Hello,

I am testing Octavia Queens and I see that the failover behavior is very 
different from the one in Ocata (this is the version we are currently running 
in production).
One example of such behavior is:

I create 4 load balancers and, after the creation is successful, I shut off all 
8 amphoras. Sometimes, even though the health-manager agent cannot reach the 
amphoras, they are not deleted and re-created. The logs look like the ones 
shown below even when the heartbeat timeout has long passed. Sometimes the 
amphoras are deleted and re-created. Sometimes they are only partially 
re-created - some of them remain shut off.
Heartbeat_timeout is set to 60 seconds.



[octavia-health-manager-3662231220-nxnt3] 2018-04-25 10:57:26.244 11 WARNING 
octavia.amphorae.drivers.haproxy.rest_api_driver 
[req-339b54a7-ab0c-422a-832f-a444cd710497 - a5f15235c0714365b98a50a11ec956e7 - 
- -] Could not connect to instance. Retrying.: ConnectionError: 
HTTPSConnectionPool(host='192.168.0.15', port=9443): Max retries exceeded with 
url: 
/0.5/listeners/285ad342-5582-423e-b654-1f0b50d91fb2/certificates/octaviasrv2.orange.com.pem
 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f559862c710>: Failed to establish a new connection: [Errno 113] No 
route to host',))
[octavia-health-manager-3662231220-3lssd] 2018-04-25 10:57:26.464 13 WARNING 
octavia.amphorae.drivers.haproxy.rest_api_driver 
[req-a63b795a-4b4f-4b90-a201-a4c9f49ac68b - a5f15235c0714365b98a50a11ec956e7 - 
- -] Could not connect to instance. Retrying.: ConnectionError: 
HTTPSConnectionPool(host='192.168.0.14', port=9443): Max retries exceeded with 
url: 
/0.5/listeners/a45bdef3-e7da-4a18-9f1f-53d5651efe0f/1615c1ec-249e-4fa8-9d73-2397e281712c/haproxy
 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f8a0de95e10>: Failed to establish a new connection: [Errno 113] No 
route to host',))
[octavia-health-manager-3662231220-nxnt3] 2018-04-25 10:57:27.772 11 WARNING 
octavia.amphorae.drivers.haproxy.rest_api_driver 
[req-10febb10-85ea-4082-9df7-daa48894b004 - a5f15235c0714365b98a50a11ec956e7 - 
- -] Could not connect to instance. Retrying.: ConnectionError: 
HTTPSConnectionPool(host='192.168.0.19', port=9443): Max retries exceeded with 
url: 
/0.5/listeners/96ce5862-d944-46cb-8809-e1e328268a66/fc5b7940-3527-4e9b-b93f-1da3957a5b71/haproxy
 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f5598491c90>: Failed to establish a new connection: [Errno 113] No 
route to host',))
[octavia-health-manager-3662231220-nxnt3] 2018-04-25 10:57:34.252 11 WARNING 
octavia.amphorae.drivers.haproxy.rest_api_driver 
[req-339b54a7-ab0c-422a-832f-a444cd710497 - a5f15235c0714365b98a50a11ec956e7 - 
- -] Could not connect to instance. Retrying.: ConnectionError: 
HTTPSConnectionPool(host='192.168.0.15', port=9443): Max retries exceeded with 
url: 
/0.5/listeners/285ad342-5582-423e-b654-1f0b50d91fb2/certificates/octaviasrv2.orange.com.pem
 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f5598520790>: Failed to establish a new connection: [Errno 113] No 
route to host',))
[octavia-health-manager-3662231220-3lssd] 2018-04-25 10:57:34.476 13 WARNING 
octavia.amphorae.drivers.haproxy.rest_api_driver 
[req-a63b795a-4b4f-4b90-a201-a4c9f49ac68b - a5f15235c0714365b98a50a11ec956e7 - 
- -] Could not connect to instance. Retrying.: ConnectionError: 
HTTPSConnectionPool(host='192.168.0.14', port=9443): Max retries exceeded with 
url: 
/0.5/listeners/a45bdef3-e7da-4a18-9f1f-53d5651efe0f/1615c1ec-249e-4fa8-9d73-2397e281712c/haproxy
 (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 113] No 
route to host',))
[octavia-health-manager-3662231220-nxnt3] 2018-04-25 10:57:35.780 11 WARNING 
octavia.amphorae.drivers.haproxy.rest_api_driver 
[req-10febb10-85ea-4082-9df7-daa48894b004 - a5f15235c0714365b98a50a11ec956e7 - 
- -] Could not connect to instance. Retrying.: ConnectionError: 
HTTPSConnectionPool(host='192.168.0.19', port=9443): Max retries exceeded with 
url: 
/0.5/listeners/96ce5862-d944-46cb-8809-e1e328268a66/fc5b7940-3527-4e9b-b93f-1da3957a5b71/haproxy
 (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 113] No 
route to host',))

Thank you,
Mihaela Balas

_

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,
Orange decline toute responsabilite si ce message a ete altere, deforme ou 
falsifie. Merci.

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, Orange is not liable for messages