Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-08-16 Thread Jake Yip
Hi Matt,

We seem to be doing OK with 3.6.3. IIRC 3.6.2 was causing the stats DB to
fall over every now and then, causing huge problems.

Regards,
Jake

Jake Yip,
DevOps Engineer,
Core Services, NeCTAR Research Cloud,
The University of Melbourne

On Tue, Aug 16, 2016 at 6:34 AM, Matt Fischer <m...@mattfischer.com> wrote:

> Has anyone had any luck improving the statsdb issue by upgrading rabbit to
> 3.6.3 or newer? We're at 3.5.6 now and 3.6.2 has parallelized stats
> processing, then 3.6.3 has additional memory leak fixes for it. What we've
> been seeing is that we occasionally get slow & steady climbs of rabbit
> memory usage until the cluster falls over when it hits the memory limit.
> The last one built up over 12 hours, as we saw once we went back and looked
> at the charts.
>
> I'm hoping to try 3.6.5, but we have no way to repro this outside of
> production, and even there, short of bouncing neutron and all the agents over
> and over, I'm not sure I could recreate it.
>
> Note - we already have the collect interval set to 30k, per the recommendation
> from the Rabbit Ops talk in Tokyo, but no other optimizations for the
> statsdb. Some folks here are considering a cron job to bounce it every few
> hours.
>
>
> On Thu, Jul 28, 2016 at 9:10 AM, Kris G. Lindgren <klindg...@godaddy.com>
> wrote:
>
>> We also believe the change from auto-delete queues to 10min expiration
>> queues was the cause of our rabbit woes a month or so ago, where we had
>> rabbitmq servers filling their stats DB and consuming 20+ GB of RAM before
>> hitting the rabbitmq mem high watermark.  We were running for 6+ months
>> without issue under kilo, and when we moved to Liberty rabbit consistently
>> started falling on its face.  We eventually turned down the stats
>> collection interval, but I would imagine keeping stats around for 10 minutes
>> for queues that were used for a single RPC message, when we are passing
>> 1500+ messages per second, wasn’t helping anything.  We haven’t tried
>> changing the timeout values to be lower, to see if that made things
>> better.  But we did identify this change as something that could contribute
>> to our rabbitmq issues.
>>
>>
>>
>>
>>
>> ___
>>
>> Kris Lindgren
>>
>> Senior Linux Systems Engineer
>>
>> GoDaddy
>>
>>
>>
>> *From: *Dmitry Mescheryakov <dmescherya...@mirantis.com>
>> *Date: *Thursday, July 28, 2016 at 6:17 AM
>> *To: *Sam Morrison <sorri...@gmail.com>
>> *Cc: *OpenStack Operators <openstack-operators@lists.openstack.org>
>> *Subject: *Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues
>> moving to Liberty
>>
>>
>>
>>
>>
>>
>>
>> 2016-07-27 2:20 GMT+03:00 Sam Morrison <sorri...@gmail.com>:
>>
>>
>>
>> On 27 Jul 2016, at 4:05 AM, Dmitry Mescheryakov <
>> dmescherya...@mirantis.com> wrote:
>>
>>
>>
>>
>>
>>
>>
>> 2016-07-26 2:15 GMT+03:00 Sam Morrison <sorri...@gmail.com>:
>>
>> The queue TTL happens on reply queues and fanout queues. I don’t think it
>> should happen on fanout queues. They should auto delete. I can understand
>> the reason for having them on reply queues, though, so maybe that would be a
>> way forward?
>>
>>
>>
>> Or am I missing something and it is needed on fanout queues too?
>>
>>
>>
>> I would say we do need fanout queues to expire for the very same reason
>> we want reply queues to expire instead of auto delete. In case of broken
>> connection, the expiration provides client time to reconnect and continue
>> consuming from the queue. In case of auto-delete queues, it was a frequent
>> case that RabbitMQ deleted the queue before client reconnects ... along
>> with all non-consumed messages in it.
>>
>>
>>
>> But in the case of fanout queues, if there is a broken connection can’t
>> the service just recreate the queue if it doesn’t exist? I guess that means
>> it needs to store the state of what the queue name is though?
>>
>>
>>
>> Yes, they could lose messages directed at them, but all the services I
>> know that consume on fanout queues have re-sync functionality for this
>> very case.
>>
>>
>>
>> If the connection is broken will oslo messaging know how to connect to
>> the same queue again anyway? I would’ve thought it would handle the
>> disconnect and then reconnect, either with the same queue name or a new
>> queue all together?
>>

Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-08-15 Thread Matt Fischer
Has anyone had any luck improving the statsdb issue by upgrading rabbit to
3.6.3 or newer? We're at 3.5.6 now and 3.6.2 has parallelized stats
processing, then 3.6.3 has additional memory leak fixes for it. What we've
been seeing is that we occasionally get slow & steady climbs of rabbit
memory usage until the cluster falls over when it hits the memory limit.
The last one built up over 12 hours, as we saw once we went back and looked
at the charts.

I'm hoping to try 3.6.5, but we have no way to repro this outside of
production, and even there, short of bouncing neutron and all the agents over
and over, I'm not sure I could recreate it.

Note - we already have the collect interval set to 30k, per the recommendation
from the Rabbit Ops talk in Tokyo, but no other optimizations for the
statsdb. Some folks here are considering a cron job to bounce it every few
hours.
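
For anyone wanting to try the same tweak: assuming the 30k refers to RabbitMQ's
collect_statistics_interval (which is in milliseconds, so 30k = 30 seconds), a
classic rabbitmq.config would carry something roughly like the snippet below.
Exact file layout and the rest of the settings will obviously differ per
deployment.

    [
      {rabbit, [
        %% emit per-object stats every 30s instead of the much more frequent default
        {collect_statistics_interval, 30000}
      ]}
    ].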


On Thu, Jul 28, 2016 at 9:10 AM, Kris G. Lindgren <klindg...@godaddy.com>
wrote:

> We also believe the change from auto-delete queues to 10min expiration
> queues was the cause of our rabbit woes a month or so ago, where we had
> rabbitmq servers filling their stats DB and consuming 20+ GB of RAM before
> hitting the rabbitmq mem high watermark.  We were running for 6+ months
> without issue under kilo, and when we moved to Liberty rabbit consistently
> started falling on its face.  We eventually turned down the stats
> collection interval, but I would imagine keeping stats around for 10 minutes
> for queues that were used for a single RPC message, when we are passing
> 1500+ messages per second, wasn’t helping anything.  We haven’t tried
> changing the timeout values to be lower, to see if that made things
> better.  But we did identify this change as something that could contribute
> to our rabbitmq issues.
>
>
>
>
>
> ___
>
> Kris Lindgren
>
> Senior Linux Systems Engineer
>
> GoDaddy
>
>
>
> *From: *Dmitry Mescheryakov <dmescherya...@mirantis.com>
> *Date: *Thursday, July 28, 2016 at 6:17 AM
> *To: *Sam Morrison <sorri...@gmail.com>
> *Cc: *OpenStack Operators <openstack-operators@lists.openstack.org>
> *Subject: *Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues
> moving to Liberty
>
>
>
>
>
>
>
> 2016-07-27 2:20 GMT+03:00 Sam Morrison <sorri...@gmail.com>:
>
>
>
> On 27 Jul 2016, at 4:05 AM, Dmitry Mescheryakov <
> dmescherya...@mirantis.com> wrote:
>
>
>
>
>
>
>
> 2016-07-26 2:15 GMT+03:00 Sam Morrison <sorri...@gmail.com>:
>
> The queue TTL happens on reply queues and fanout queues. I don’t think it
> should happen on fanout queues. They should auto delete. I can understand
> the reason for having them on reply queues, though, so maybe that would be a
> way forward?
>
>
>
> Or am I missing something and it is needed on fanout queues too?
>
>
>
> I would say we do need fanout queues to expire for the very same reason we
> want reply queues to expire instead of auto delete. In case of broken
> connection, the expiration provides client time to reconnect and continue
> consuming from the queue. In case of auto-delete queues, it was a frequent
> case that RabbitMQ deleted the queue before client reconnects ... along
> with all non-consumed messages in it.
>
>
>
> But in the case of fanout queues, if there is a broken connection can’t
> the service just recreate the queue if it doesn’t exist? I guess that means
> it needs to store the state of what the queue name is though?
>
>
>
> Yes, they could lose messages directed at them, but all the services I know
> that consume on fanout queues have re-sync functionality for this very
> case.
>
>
>
> If the connection is broken will oslo messaging know how to connect to the
> same queue again anyway? I would’ve thought it would handle the disconnect
> and then reconnect, either with the same queue name or a new queue
> altogether?
>
>
>
> oslo.messaging handles reconnect perfectly - on connect it just
> unconditionally declares the queue and starts consuming from it. If queue
> already existed, the declaration operation will just be ignored by RabbitMQ.
>
>
>
> For your earlier point that services re-sync and hence messages lost in
> fanout are not that important, I can't comment on that. But after some
> thinking I do agree that having a big expiration time for fanouts is
> inadequate for big deployments anyway. How about we split
> rabbit_transient_queues_ttl into two parameters - one for reply queues and
> one for fanout ones? In that case people concerned with messages piling up
> in fanouts might set it to 1, which will virtually make these queues behave
> like auto-delete ones (though I strongly recommend leaving it at least at 20
> seconds, to give the service a chance to reconnect).

Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-28 Thread Sam Morrison

> On 28 Jul 2016, at 10:17 PM, Dmitry Mescheryakov  
> wrote:
> 
> 
> 
> 2016-07-27 2:20 GMT+03:00 Sam Morrison:
> 
>> On 27 Jul 2016, at 4:05 AM, Dmitry Mescheryakov wrote:
>> 
>> 
>> 
>> 2016-07-26 2:15 GMT+03:00 Sam Morrison:
>> The queue TTL happens on reply queues and fanout queues. I don’t think it 
>> should happen on fanout queues. They should auto delete. I can understand 
>> the reason for having them on reply queues, though, so maybe that would be a 
>> way forward?
>> 
>> Or am I missing something and it is needed on fanout queues too?
>> 
>> I would say we do need fanout queues to expire for the very same reason we 
>> want reply queues to expire instead of auto delete. In case of broken 
>> connection, the expiration provides client time to reconnect and continue 
>> consuming from the queue. In case of auto-delete queues, it was a frequent 
>> case that RabbitMQ deleted the queue before client reconnects ... along with 
>> all non-consumed messages in it.
> 
> But in the case of fanout queues, if there is a broken connection can’t the 
> service just recreate the queue if it doesn’t exist? I guess that means it 
> needs to store the state of what the queue name is though?
> 
> Yes, they could lose messages directed at them, but all the services I know 
> that consume on fanout queues have re-sync functionality for this very case.
> 
> If the connection is broken will oslo messaging know how to connect to the 
> same queue again anyway? I would’ve thought it would handle the disconnect 
> and then reconnect, either with the same queue name or a new queue 
> altogether?
> 
> oslo.messaging handles reconnect perfectly - on connect it just 
> unconditionally declares the queue and starts consuming from it. If queue 
> already existed, the declaration operation will just be ignored by RabbitMQ.
> 
> For your earlier point that services re-sync and hence messages lost in 
> fanout are not that important, I can't comment on that. But after some 
> thinking I do agree that having a big expiration time for fanouts is 
> inadequate for big deployments anyway. How about we split 
> rabbit_transient_queues_ttl into two parameters - one for reply queues and one 
> for fanout ones? In that case people concerned with messages piling up in 
> fanouts might set it to 1, which will virtually make these queues behave like 
> auto-delete ones (though I strongly recommend leaving it at least at 20 
> seconds, to give the service a chance to reconnect).

Hi Dmitry,

Splitting out the config options would be great; I think that would solve our 
issues. 

Thanks,
Sam


> 
> Thanks,
> 
> Dmitry
> 
>  
> 
> Sam

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-28 Thread Kris G. Lindgren
We also believe the change from auto-delete queues to 10min expiration queues 
was the cause of our rabbit woes a month or so ago, where we had rabbitmq 
servers filling their stats DB and consuming 20+ GB of RAM before hitting the 
rabbitmq mem high watermark.  We were running for 6+ months without issue under 
kilo, and when we moved to Liberty rabbit consistently started falling on its 
face.  We eventually turned down the stats collection interval, but I would 
imagine keeping stats around for 10 minutes for queues that were used for a 
single RPC message, when we are passing 1500+ messages per second, wasn’t helping 
anything.  We haven’t tried changing the timeout values to be lower, to see if 
that made things better.  But we did identify this change as something that 
could contribute to our rabbitmq issues.


___
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: Dmitry Mescheryakov <dmescherya...@mirantis.com>
Date: Thursday, July 28, 2016 at 6:17 AM
To: Sam Morrison <sorri...@gmail.com>
Cc: OpenStack Operators <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to 
Liberty



2016-07-27 2:20 GMT+03:00 Sam Morrison 
<sorri...@gmail.com<mailto:sorri...@gmail.com>>:

On 27 Jul 2016, at 4:05 AM, Dmitry Mescheryakov 
<dmescherya...@mirantis.com<mailto:dmescherya...@mirantis.com>> wrote:



2016-07-26 2:15 GMT+03:00 Sam Morrison 
<sorri...@gmail.com<mailto:sorri...@gmail.com>>:
The queue TTL happens on reply queues and fanout queues. I don’t think it 
should happen on fanout queues. They should auto delete. I can understand the 
reason for having them on reply queues, though, so maybe that would be a way 
forward?

Or am I missing something and it is needed on fanout queues too?

I would say we do need fanout queues to expire for the very same reason we want 
reply queues to expire instead of auto delete. In case of broken connection, 
the expiration provides client time to reconnect and continue consuming from 
the queue. In case of auto-delete queues, it was a frequent case that RabbitMQ 
deleted the queue before client reconnects ... along with all non-consumed 
messages in it.

But in the case of fanout queues, if there is a broken connection can’t the 
service just recreate the queue if it doesn’t exist? I guess that means it 
needs to store the state of what the queue name is though?

Yes, they could lose messages directed at them, but all the services I know that 
consume on fanout queues have re-sync functionality for this very case.

If the connection is broken will oslo messaging know how to connect to the same 
queue again anyway? I would’ve thought it would handle the disconnect and then 
reconnect, either with the same queue name or a new queue altogether?

oslo.messaging handles reconnect perfectly - on connect it just unconditionally 
declares the queue and starts consuming from it. If queue already existed, the 
declaration operation will just be ignored by RabbitMQ.

For your earlier point that services re-sync and hence messages lost in fanout 
are not that important, I can't comment on that. But after some thinking I do 
agree that having a big expiration time for fanouts is inadequate for big 
deployments anyway. How about we split rabbit_transient_queues_ttl into two 
parameters - one for reply queues and one for fanout ones? In that case people 
concerned with messages piling up in fanouts might set it to 1, which will 
virtually make these queues behave like auto-delete ones (though I strongly 
recommend leaving it at least at 20 seconds, to give the service a chance to 
reconnect).

Thanks,

Dmitry



Sam



___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-28 Thread Davanum Srinivas
Dima, Kevin,

There are PreStop hooks that can be used to gracefully bring down
stuff running in containers:
http://kubernetes.io/docs/user-guide/container-environment/
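
Wired into a pod spec it looks roughly like the snippet below. The container
name and the script are made up; the point is just that whatever the hook (or
the SIGTERM that follows it) triggers gets a chance to stop the RPC server
cleanly before the container goes away.

    containers:
    - name: neutron-openvswitch-agent        # example container
      image: "..."                           # whatever image the deployment uses
      lifecycle:
        preStop:
          exec:
            command: ["/usr/local/bin/graceful-stop.sh"]   # made-up helper script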

-- Dims

On Thu, Jul 28, 2016 at 8:22 AM, Dmitry Mescheryakov
<dmescherya...@mirantis.com> wrote:
>
> 2016-07-26 21:20 GMT+03:00 Fox, Kevin M <kevin@pnnl.gov>:
>>
>> It only relates to Kubernetes in that Kubernetes can do automatic rolling
>> upgrades by destroying/replacing a service. If the services don't clean up
>> after themselves, then performing a rolling upgrade will break things.
>>
>> So, what do you think is the best approach to ensuring all the services
>> shut things down properly? Seems like it's a cross-project issue? Should a
>> spec be submitted?
>
>
> I think that it would be fair if Kubernetes sends a sigterm to the OpenStack
> service in a container, then waits for the service to shut down and only then
> destroys the container.
>
> It might not be very important for our case though, if we agree to split
> expiration time for fanout and reply queues. And I don't know of any other
> case where an OpenStack service needs to clean up on shutdown in some
> external place.
>
> Thanks,
>
> Dmitry
>
>>
>> Thanks,
>> Kevin
>> 
>> From: Dmitry Mescheryakov [dmescherya...@mirantis.com]
>> Sent: Tuesday, July 26, 2016 11:01 AM
>> To: Fox, Kevin M
>> Cc: Sam Morrison; OpenStack Operators
>>
>> Subject: Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving
>> to Liberty
>>
>>
>>
>> 2016-07-25 18:47 GMT+03:00 Fox, Kevin M <kevin@pnnl.gov>:
>>>
>>> Ah. Interesting.
>>>
>>> The graceful shutdown would really help the Kubernetes situation too.
>>> Kubernetes can do easy rolling upgrades and having the processes being able
>>> to clean up after themselves as they are upgraded is important. Is this
>>> something that needs to go into oslo.messaging or does it have to be added
>>> to all projects using it?
>>
>>
>> It both needs to be fixed on oslo.messaging side (delete fanout queue on
>> RPC server stop, which is done by Kirill's CR) and on side of projects using
>> it, as they need to actually stop RPC server before shutting down. As I
>> wrote earlier, among Neutron processes right now only openvswitch and
>> metadata agents do not stop RPC server.
>>
>> I am not sure how that relates to Kubernetes, as I am not much familiar with
>> it.
>>
>> Thanks,
>>
>> Dmitry
>>
>>>
>>>
>>> Thanks,
>>> Kevin
>>> 
>>> From: Dmitry Mescheryakov [dmescherya...@mirantis.com]
>>> Sent: Monday, July 25, 2016 3:47 AM
>>> To: Sam Morrison
>>> Cc: OpenStack Operators
>>> Subject: Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues
>>> moving to Liberty
>>>
>>> Sam,
>>>
>>> For your case I would suggest to lower rabbit_transient_queues_ttl until
>>> you are comfortable with volume of messages which comes during that time.
>>> Setting the parameter to 1 will essentially replicate behaviour of
>>> auto_delete queues. But I would suggest not to set it that low, as otherwise
>>> your OpenStack will suffer from the original bug. Probably a value like 20
>>> seconds should work in most cases.
>>>
>>> I think that there is a space for improvement here - we can delete reply
>>> and fanout queues on graceful shutdown. But I am not sure if it will be easy
>>> to implement, as it requires services (Nova, Neutron, etc.) to stop RPC
>>> server on sigint and I don't know if they do it right now.
>>>
>>> I don't think we can make case with sigkill any better. Other than that,
>>> the issue could be investigated on Neutron side, maybe number of messages
>>> could be reduced there.
>>>
>>> Thanks,
>>>
>>> Dmitry
>>>
>>> 2016-07-25 9:27 GMT+03:00 Sam Morrison <sorri...@gmail.com>:
>>>>
>>>> We recently upgraded to Liberty and have come across some issues with
>>>> queue build ups.
>>>>
>>>> This is due to changes in rabbit to set queue expiries as opposed to
>>>> queue auto delete.
>>>> See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more
>>>> information.
>>>>
>>>> The fix for this bug is in liberty and it does fix an issue however it
>>>> causes another one.

Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-28 Thread Dmitry Mescheryakov
2016-07-26 21:20 GMT+03:00 Fox, Kevin M <kevin@pnnl.gov>:

> It only relates to Kubernetes in that Kubernetes can do automatic rolling
> upgrades by destroying/replacing a service. If the services don't clean up
> after themselves, then performing a rolling upgrade will break things.
>
> So, what do you think is the best approach to ensuring all the services
> shut things down properly? Seems like it's a cross-project issue? Should a
> spec be submitted?
>

I think that it would be fair if Kubernetes sends a sigterm to the OpenStack
service in a container, then waits for the service to shut down and only
then destroys the container.

It might not be very important for our case though, if we agree to split
expiration time for fanout and reply queues. And I don't know of any other
case where an OpenStack service needs to clean up on shutdown in some
external place.

Thanks,

Dmitry


> Thanks,
> Kevin
> --
> *From:* Dmitry Mescheryakov [dmescherya...@mirantis.com]
> *Sent:* Tuesday, July 26, 2016 11:01 AM
> *To:* Fox, Kevin M
> *Cc:* Sam Morrison; OpenStack Operators
>
> *Subject:* Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues
> moving to Liberty
>
>
>
> 2016-07-25 18:47 GMT+03:00 Fox, Kevin M <kevin@pnnl.gov>:
>
>> Ah. Interesting.
>>
>> The graceful shutdown would really help the Kubernetes situation too.
>> Kubernetes can do easy rolling upgrades and having the processes being able
>> to clean up after themselves as they are upgraded is important. Is this
>> something that needs to go into oslo.messaging or does it have to be added
>> to all projects using it?
>>
>
> It both needs to be fixed on oslo.messaging side (delete fanout queue on
> RPC server stop, which is done by Kirill's CR) and on side of projects
> using it, as they need to actually stop RPC server before shutting down. As
> I wrote earlier, among Neutron processes right now only openvswitch and
> metadata agents do not stop RPC server.
>
> I am not sure how that relates to Kubernetes, as I am not much familiar with
> it.
>
> Thanks,
>
> Dmitry
>
>
>>
>> Thanks,
>> Kevin
>> ------
>> *From:* Dmitry Mescheryakov [dmescherya...@mirantis.com]
>> *Sent:* Monday, July 25, 2016 3:47 AM
>> *To:* Sam Morrison
>> *Cc:* OpenStack Operators
>> *Subject:* Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues
>> moving to Liberty
>>
>> Sam,
>>
>> For your case I would suggest to lower rabbit_transient_queues_ttl until
>> you are comfortable with volume of messages which comes during that time.
>> Setting the parameter to 1 will essentially replicate behaviour of
>> auto_delete queues. But I would suggest not to set it that low, as
>> otherwise your OpenStack will suffer from the original bug. Probably a
>> value like 20 seconds should work in most cases.
>>
>> I think that there is a space for improvement here - we can delete reply
>> and fanout queues on graceful shutdown. But I am not sure if it will be
>> easy to implement, as it requires services (Nova, Neutron, etc.) to stop
>> RPC server on sigint and I don't know if they do it right now.
>>
>> I don't think we can make case with sigkill any better. Other than that,
>> the issue could be investigated on Neutron side, maybe number of messages
>> could be reduced there.
>>
>> Thanks,
>>
>> Dmitry
>>
>> 2016-07-25 9:27 GMT+03:00 Sam Morrison <sorri...@gmail.com>:
>>
>>> We recently upgraded to Liberty and have come across some issues with
>>> queue build ups.
>>>
>>> This is due to changes in rabbit to set queue expiries as opposed to
>>> queue auto delete.
>>> See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more
>>> information.
>>>
>>> The fix for this bug is in liberty and it does fix an issue however it
>>> causes another one.
>>>
>>> Every time you restart something that has a fanout queue, e.g.
>>> cinder-scheduler or the neutron agents, you will have
>>> a queue in rabbit that is still bound to the rabbitmq exchange (and so
>>> still getting messages in) but no consumers.
>>>
>>> These messages in these queues are basically rubbish and don’t need to
>>> exist. Rabbit will delete these queues after 10 mins (although the default
>>> in master is now changed to 30 mins)
>>>
>>> During this time the queue will grow and grow with messages. This sets
>>> off our nagios alerts and our ops guys have to deal with something that
>>> isn’t really an issue. They basically delete the queue.

Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-28 Thread Dmitry Mescheryakov
2016-07-27 2:20 GMT+03:00 Sam Morrison :

>
> On 27 Jul 2016, at 4:05 AM, Dmitry Mescheryakov <
> dmescherya...@mirantis.com> wrote:
>
>
>
> 2016-07-26 2:15 GMT+03:00 Sam Morrison :
>
>> The queue TTL happens on reply queues and fanout queues. I don’t think it
>> should happen on fanout queues. They should auto delete. I can understand
> the reason for having them on reply queues, though, so maybe that would be a
> way forward?
>>
>> Or am I missing something and it is needed on fanout queues too?
>>
>
> I would say we do need fanout queues to expire for the very same reason we
> want reply queues to expire instead of auto delete. In case of broken
> connection, the expiration provides client time to reconnect and continue
> consuming from the queue. In case of auto-delete queues, it was a frequent
> case that RabbitMQ deleted the queue before client reconnects ... along
> with all non-consumed messages in it.
>
>
> But in the case of fanout queues, if there is a broken connection can’t
> the service just recreate the queue if it doesn’t exist? I guess that means
> it needs to store the state of what the queue name is though?
>
> Yes, they could lose messages directed at them, but all the services I know
> that consume on fanout queues have re-sync functionality for this very
> case.
>
> If the connection is broken will oslo messaging know how to connect to the
> same queue again anyway? I would’ve thought it would handle the disconnect
> and then reconnect, either with the same queue name or a new queue
> altogether?
>

oslo.messaging handles reconnect perfectly - on connect it just
unconditionally declares the queue and starts consuming from it. If queue
already existed, the declaration operation will just be ignored by RabbitMQ.

For your earlier point that services re-sync and hence messages lost in
fanout are not that important, I can't comment on that. But after some
thinking I do agree that having a big expiration time for fanouts is
inadequate for big deployments anyway. How about we split
rabbit_transient_queues_ttl into two parameters - one for reply queues and
one for fanout ones? In that case people concerned with messages piling up
in fanouts might set it to 1, which will virtually make these queues behave
like auto-delete ones (though I strongly recommend leaving it at least at 20
seconds, to give the service a chance to reconnect).
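
Purely to illustrate the idea - these option names do not exist today, they are
just a sketch of what the split might look like in a service's config:

    [oslo_messaging_rabbit]
    # hypothetical options, not an existing oslo.messaging interface
    rabbit_reply_queues_ttl = 1800    # keep reply queues around for reconnects
    rabbit_fanout_queues_ttl = 20     # let abandoned fanout queues go away quickly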

Thanks,

Dmitry



>
> Sam
>
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-26 Thread Sam Morrison

> On 27 Jul 2016, at 4:05 AM, Dmitry Mescheryakov  
> wrote:
> 
> 
> 
> 2016-07-26 2:15 GMT+03:00 Sam Morrison  >:
> The queue TTL happens on reply queues and fanout queues. I don’t think it 
> should happen on fanout queues. They should auto delete. I can understand the 
> reason for having them on reply queues, though, so maybe that would be a way 
> forward?
> 
> Or am I missing something and it is needed on fanout queues too?
> 
> I would say we do need fanout queues to expire for the very same reason we 
> want reply queues to expire instead of auto delete. In case of broken 
> connection, the expiration provides client time to reconnect and continue 
> consuming from the queue. In case of auto-delete queues, it was a frequent 
> case that RabbitMQ deleted the queue before client reconnects ... along with 
> all non-consumed messages in it.

But in the case of fanout queues, if there is a broken connection can’t the 
service just recreate the queue if it doesn’t exist? I guess that means it 
needs to store the state of what the queue name is though?

Yes, they could lose messages directed at them, but all the services I know that 
consume on fanout queues have re-sync functionality for this very case.

If the connection is broken will oslo messaging know how to connect to the same 
queue again anyway? I would’ve thought it would handle the disconnect and then 
reconnect, either with the same queue name or a new queue altogether?

Sam


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-26 Thread Fox, Kevin M
It only relates to Kubernetes in that Kubernetes can do automatic rolling 
upgrades by destroying/replacing a service. If the services don't clean up 
after themselves, then performing a rolling upgrade will break things.

So, what do you think is the best approach to ensuring all the services shut 
things down properly? Seems like it's a cross-project issue? Should a spec be 
submitted?

Thanks,
Kevin

From: Dmitry Mescheryakov [dmescherya...@mirantis.com]
Sent: Tuesday, July 26, 2016 11:01 AM
To: Fox, Kevin M
Cc: Sam Morrison; OpenStack Operators
Subject: Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to 
Liberty



2016-07-25 18:47 GMT+03:00 Fox, Kevin M 
<kevin@pnnl.gov<mailto:kevin@pnnl.gov>>:
Ah. Interesting.

The graceful shutdown would really help the Kubernetes situation too. 
Kubernetes can do easy rolling upgrades and having the processes being able to 
clean up after themselves as they are upgraded is important. Is this something 
that needs to go into oslo.messaging or does it have to be added to all 
projects using it?

It both needs to be fixed on oslo.messaging side (delete fanout queue on RPC 
server stop, which is done by Kirill's CR) and on side of projects using it, as 
they need to actually stop RPC server before shutting down. As I wrote earlier, 
among Neutron processes right now only openvswitch and metadata agents do not 
stop RPC server.

I am not sure how that relates to Kubernetes, as I am not much familiar with it.

Thanks,

Dmitry


Thanks,
Kevin

From: Dmitry Mescheryakov 
[dmescherya...@mirantis.com<mailto:dmescherya...@mirantis.com>]
Sent: Monday, July 25, 2016 3:47 AM
To: Sam Morrison
Cc: OpenStack Operators
Subject: Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to 
Liberty

Sam,

For your case I would suggest to lower rabbit_transient_queues_ttl until you 
are comfortable with volume of messages which comes during that time. Setting 
the parameter to 1 will essentially replicate behaviour of auto_delete queues. 
But I would suggest not to set it that low, as otherwise your OpenStack will 
suffer from the original bug. Probably a value like 20 seconds should work in 
most cases.

I think that there is a space for improvement here - we can delete reply and 
fanout queues on graceful shutdown. But I am not sure if it will be easy to 
implement, as it requires services (Nova, Neutron, etc.) to stop RPC server on 
sigint and I don't know if they do it right now.

I don't think we can make case with sigkill any better. Other than that, the 
issue could be investigated on Neutron side, maybe number of messages could be 
reduced there.

Thanks,

Dmitry

2016-07-25 9:27 GMT+03:00 Sam Morrison 
<sorri...@gmail.com<mailto:sorri...@gmail.com>>:
We recently upgraded to Liberty and have come across some issues with queue 
build ups.

This is due to changes in rabbit to set queue expiries as opposed to queue auto 
delete.
See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more information.

The fix for this bug is in liberty and it does fix an issue however it causes 
another one.

Every time you restart something that has a fanout queue, e.g. cinder-scheduler 
or the neutron agents, you will have
a queue in rabbit that is still bound to the rabbitmq exchange (and so still 
getting messages in) but no consumers.

These messages in these queues are basically rubbish and don’t need to exist. 
Rabbit will delete these queues after 10 mins (although the default in master 
is now changed to 30 mins)

During this time the queue will grow and grow with messages. This sets off our 
nagios alerts and our ops guys have to deal with something that isn’t really an 
issue. They basically delete the queue.

A bad scenario is when you make a change to your cloud that means all your 1000 
neutron agents are restarted, this causes a couple of dead queues per agent to 
hang around. (port updates and security group updates) We get around 25 
messages / second on these queues and so you can see after 10 minutes we have a 
ton of messages in these queues.

1000 x 2 x 25 x 600 = 30,000,000 messages in 10 minutes to be precise.

Has anyone else been suffering from this before I raise a bug?

Cheers,
Sam


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org<mailto:OpenStack-operators@lists.openstack.org>
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-26 Thread Dmitry Mescheryakov
2016-07-26 2:15 GMT+03:00 Sam Morrison :

> The queue TTL happens on reply queues and fanout queues. I don’t think it
> should happen on fanout queues. They should auto delete. I can understand
> the reason for having them on reply queues, though, so maybe that would be a
> way forward?
>
> Or am I missing something and it is needed on fanout queues too?
>

I would say we do need fanout queues to expire for the very same reason we
want reply queues to expire instead of auto delete. In case of broken
connection, the expiration provides client time to reconnect and continue
consuming from the queue. In case of auto-delete queues, it was a frequent
case that RabbitMQ deleted the queue before client reconnects ... along
with all non-consumed messages in it.

Thanks,

Dmitry


>
> Cheers,
> Sam
>
>
>
> On 25 Jul 2016, at 8:47 PM, Dmitry Mescheryakov <
> dmescherya...@mirantis.com> wrote:
>
> Sam,
>
> For your case I would suggest to lower rabbit_transient_queues_ttl until
> you are comfortable with volume of messages which comes during that time.
> Setting the parameter to 1 will essentially replicate behaviour of
> auto_delete queues. But I would suggest not to set it that low, as
> otherwise your OpenStack will suffer from the original bug. Probably a
> value like 20 seconds should work in most cases.
>
> I think that there is a space for improvement here - we can delete reply
> and fanout queues on graceful shutdown. But I am not sure if it will be
> easy to implement, as it requires services (Nova, Neutron, etc.) to stop
> RPC server on sigint and I don't know if they do it right now.
>
> I don't think we can make case with sigkill any better. Other than that,
> the issue could be investigated on Neutron side, maybe number of messages
> could be reduced there.
>
> Thanks,
>
> Dmitry
>
> 2016-07-25 9:27 GMT+03:00 Sam Morrison :
>
>> We recently upgraded to Liberty and have come across some issues with
>> queue build ups.
>>
>> This is due to changes in rabbit to set queue expiries as opposed to
>> queue auto delete.
>> See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more
>> information.
>>
>> The fix for this bug is in liberty and it does fix an issue however it
>> causes another one.
>>
>> Every time you restart something that has a fanout queue, e.g.
>> cinder-scheduler or the neutron agents, you will have
>> a queue in rabbit that is still bound to the rabbitmq exchange (and so
>> still getting messages in) but no consumers.
>>
>> These messages in these queues are basically rubbish and don’t need to
>> exist. Rabbit will delete these queues after 10 mins (although the default
>> in master is now changed to 30 mins)
>>
>> During this time the queue will grow and grow with messages. This sets
>> off our nagios alerts and our ops guys have to deal with something that
>> isn’t really an issue. They basically delete the queue.
>>
>> A bad scenario is when you make a change to your cloud that means all
>> your 1000 neutron agents are restarted, this causes a couple of dead queues
>> per agent to hang around. (port updates and security group updates) We get
>> around 25 messages / second on these queues and so you can see after 10
>> minutes we have a ton of messages in these queues.
>>
>> 1000 x 2 x 25 x 600 = 30,000,000 messages in 10 minutes to be precise.
>>
>> Has anyone else been suffering from this before I raise a bug?
>>
>> Cheers,
>> Sam
>>
>>
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-26 Thread Dmitry Mescheryakov
2016-07-25 18:47 GMT+03:00 Fox, Kevin M <kevin@pnnl.gov>:

> Ah. Interesting.
>
> The graceful shutdown would really help the Kubernetes situation too.
> Kubernetes can do easy rolling upgrades and having the processes being able
> to clean up after themselves as they are upgraded is important. Is this
> something that needs to go into oslo.messaging or does it have to be added
> to all projects using it?
>

It both needs to be fixed on the oslo.messaging side (delete the fanout queue on
RPC server stop, which is done by Kirill's CR) and on the side of projects
using it, as they need to actually stop the RPC server before shutting down. As
I wrote earlier, among Neutron processes right now only the openvswitch and
metadata agents do not stop the RPC server.

I am not sure how that relates to Kubernetes, as I am not much familiar with
it.

Thanks,

Dmitry


>
> Thanks,
> Kevin
> --
> *From:* Dmitry Mescheryakov [dmescherya...@mirantis.com]
> *Sent:* Monday, July 25, 2016 3:47 AM
> *To:* Sam Morrison
> *Cc:* OpenStack Operators
> *Subject:* Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues
> moving to Liberty
>
> Sam,
>
> For your case I would suggest to lower rabbit_transient_queues_ttl until
> you are comfortable with volume of messages which comes during that time.
> Setting the parameter to 1 will essentially replicate behaviour of
> auto_delete queues. But I would suggest not to set it that low, as
> otherwise your OpenStack will suffer from the original bug. Probably a
> value like 20 seconds should work in most cases.
>
> I think that there is a space for improvement here - we can delete reply
> and fanout queues on graceful shutdown. But I am not sure if it will be
> easy to implement, as it requires services (Nova, Neutron, etc.) to stop
> RPC server on sigint and I don't know if they do it right now.
>
> I don't think we can make case with sigkill any better. Other than that,
> the issue could be investigated on Neutron side, maybe number of messages
> could be reduced there.
>
> Thanks,
>
> Dmitry
>
> 2016-07-25 9:27 GMT+03:00 Sam Morrison <sorri...@gmail.com>:
>
>> We recently upgraded to Liberty and have come across some issues with
>> queue build ups.
>>
>> This is due to changes in rabbit to set queue expiries as opposed to
>> queue auto delete.
>> See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more
>> information.
>>
>> The fix for this bug is in liberty and it does fix an issue however it
>> causes another one.
>>
>> Every time you restart something that has a fanout queue, e.g.
>> cinder-scheduler or the neutron agents, you will have
>> a queue in rabbit that is still bound to the rabbitmq exchange (and so
>> still getting messages in) but no consumers.
>>
>> These messages in these queues are basically rubbish and don’t need to
>> exist. Rabbit will delete these queues after 10 mins (although the default
>> in master is now changed to 30 mins)
>>
>> During this time the queue will grow and grow with messages. This sets
>> off our nagios alerts and our ops guys have to deal with something that
>> isn’t really an issue. They basically delete the queue.
>>
>> A bad scenario is when you make a change to your cloud that means all
>> your 1000 neutron agents are restarted, this causes a couple of dead queues
>> per agent to hang around. (port updates and security group updates) We get
>> around 25 messages / second on these queues and so you can see after 10
>> minutes we have a ton of messages in these queues.
>>
>> 1000 x 2 x 25 x 600 = 30,000,000 messages in 10 minutes to be precise.
>>
>> Has anyone else been suffering from this before I raise a bug?
>>
>> Cheers,
>> Sam
>>
>>
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-25 Thread Sam Morrison
The queue TTL happens on reply queues and fanout queues. I don’t think it 
should happen on fanout queues. They should auto delete. I can understand the 
reason for having them on reply queues, though, so maybe that would be a way 
forward?

Or am I missing something and it is needed on fanout queues too?

Cheers,
Sam



> On 25 Jul 2016, at 8:47 PM, Dmitry Mescheryakov  
> wrote:
> 
> Sam,
> 
> For your case I would suggest to lower rabbit_transient_queues_ttl until you 
> are comfortable with volume of messages which comes during that time. Setting 
> the parameter to 1 will essentially replicate behaviour of auto_delete
> queues. But I would suggest not to set it that low, as otherwise your 
> OpenStack will suffer from the original bug. Probably a value like 20 seconds 
> should work in most cases.
> 
> I think that there is a space for improvement here - we can delete reply and 
> fanout queues on graceful shutdown. But I am not sure if it will be easy to 
> implement, as it requires services (Nova, Neutron, etc.) to stop RPC server 
> on sigint and I don't know if they do it right now.
> 
> I don't think we can make case with sigkill any better. Other than that, the 
> issue could be investigated on Neutron side, maybe number of messages could 
> be reduced there.
> 
> Thanks,
> 
> Dmitry
> 
> 2016-07-25 9:27 GMT+03:00 Sam Morrison:
> We recently upgraded to Liberty and have come across some issues with queue 
> build ups.
> 
> This is due to changes in rabbit to set queue expiries as opposed to queue 
> auto delete.
> See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more information.
> 
> The fix for this bug is in liberty and it does fix an issue however it causes 
> another one.
> 
> Every time you restart something that has a fanout queue, e.g. 
> cinder-scheduler or the neutron agents, you will have
> a queue in rabbit that is still bound to the rabbitmq exchange (and so still 
> getting messages in) but no consumers.
> 
> These messages in these queues are basically rubbish and don’t need to exist. 
> Rabbit will delete these queues after 10 mins (although the default in master 
> is now changed to 30 mins)
> 
> During this time the queue will grow and grow with messages. This sets off 
> our nagios alerts and our ops guys have to deal with something that isn’t 
> really an issue. They basically delete the queue.
> 
> A bad scenario is when you make a change to your cloud that means all your 
> 1000 neutron agents are restarted, this causes a couple of dead queues per 
> agent to hang around. (port updates and security group updates) We get around 
> 25 messages / second on these queues and so you can see after 10 minutes we 
> have a ton of messages in these queues.
> 
> 1000 x 2 x 25 x 600 = 30,000,000 messages in 10 minutes to be precise.
> 
> Has anyone else been suffering from this before I raise a bug?
> 
> Cheers,
> Sam
> 
> 
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org 
> 
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators 
> 
> 

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-25 Thread Fox, Kevin M
Yeah, we've experienced it but haven't had time yet to really dig in like this, 
or gotten a good workaround. If you file a bug, please let me know the number.

Thanks,
Kevin

From: Sam Morrison [sorri...@gmail.com]
Sent: Sunday, July 24, 2016 11:27 PM
To: OpenStack Operators
Subject: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to   
Liberty

We recently upgraded to Liberty and have come across some issues with queue 
build ups.

This is due to changes in rabbit to set queue expiries as opposed to queue auto 
delete.
See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more information.

The fix for this bug is in liberty and it does fix an issue however it causes 
another one.

Every time you restart something that has a fanout queue, e.g. cinder-scheduler 
or the neutron agents, you will have
a queue in rabbit that is still bound to the rabbitmq exchange (and so still 
getting messages in) but no consumers.

These messages in these queues are basically rubbish and don’t need to exist. 
Rabbit will delete these queues after 10 mins (although the default in master 
is now changed to 30 mins)

During this time the queue will grow and grow with messages. This sets off our 
nagios alerts and our ops guys have to deal with something that isn’t really an 
issue. They basically delete the queue.

A bad scenario is when you make a change to your cloud that means all your 1000 
neutron agents are restarted, this causes a couple of dead queues per agent to 
hang around. (port updates and security group updates) We get around 25 
messages / second on these queues and so you can see after 10 minutes we have a 
ton of messages in these queues.

1000 x 2 x 25 x 600 = 30,000,000 messages in 10 minutes to be precise.

Has anyone else been suffering from this before I raise a bug?

Cheers,
Sam


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-25 Thread Fox, Kevin M
Ah. Interesting.

The graceful shutdown would really help the Kubernetes situation too. 
Kubernetes can do easy rolling upgrades and having the processes being able to 
clean up after themselves as they are upgraded is important. Is this something 
that needs to go into oslo.messaging or does it have to be added to all 
projects using it?

Thanks,
Kevin

From: Dmitry Mescheryakov [dmescherya...@mirantis.com]
Sent: Monday, July 25, 2016 3:47 AM
To: Sam Morrison
Cc: OpenStack Operators
Subject: Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to 
Liberty

Sam,

For your case I would suggest to lower rabbit_transient_queues_ttl until you 
are comfortable with volume of messages which comes during that time. Setting 
the parameter to 1 will essentially replicate behaviour of auto_delete queues. 
But I would suggest not to set it that low, as otherwise your OpenStack will 
suffer from the original bug. Probably a value like 20 seconds should work in 
most cases.

I think that there is a space for improvement here - we can delete reply and 
fanout queues on graceful shutdown. But I am not sure if it will be easy to 
implement, as it requires services (Nova, Neutron, etc.) to stop RPC server on 
sigint and I don't know if they do it right now.

I don't think we can make case with sigkill any better. Other than that, the 
issue could be investigated on Neutron side, maybe number of messages could be 
reduced there.

Thanks,

Dmitry

2016-07-25 9:27 GMT+03:00 Sam Morrison 
<sorri...@gmail.com<mailto:sorri...@gmail.com>>:
We recently upgraded to Liberty and have come across some issues with queue 
build ups.

This is due to changes in rabbit to set queue expiries as opposed to queue auto 
delete.
See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more information.

The fix for this bug is in liberty and it does fix an issue however it causes 
another one.

Every time you restart something that has a fanout queue, e.g. cinder-scheduler 
or the neutron agents, you will have
a queue in rabbit that is still bound to the rabbitmq exchange (and so still 
getting messages in) but no consumers.

These messages in these queues are basically rubbish and don’t need to exist. 
Rabbit will delete these queues after 10 mins (although the default in master 
is now changed to 30 mins)

During this time the queue will grow and grow with messages. This sets off our 
nagios alerts and our ops guys have to deal with something that isn’t really an 
issue. They basically delete the queue.

A bad scenario is when you make a change to your cloud that means all your 1000 
neutron agents are restarted, this causes a couple of dead queues per agent to 
hang around. (port updates and security group updates) We get around 25 
messages / second on these queues and so you can see after 10 minutes we have a 
ton of messages in these queues.

1000 x 2 x 25 x 600 = 30,000,000 messages in 10 minutes to be precise.

Has anyone else been suffering from this before I raise a bug?

Cheers,
Sam


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org<mailto:OpenStack-operators@lists.openstack.org>
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-25 Thread Dmitry Mescheryakov
I have filed a bug in oslo.messaging to track the issue [1] and my
colleague Kirill Bespalov posted a fix for it [2].

We have checked the fix and it is working for neutron-server, l3-agent and
dhcp-agent. It does not work for openvswitch-agent and metadata-agent,
meaning they do not stop the RPC server on shutdown.

But I would expect that the absolute majority of fanout messages come from the
l3 agent and we can neglect these two. Does it coincide with your observations?

Thanks,

Dmitry

[1] https://bugs.launchpad.net/oslo.messaging/+bug/1606213
[2] https://review.openstack.org/#/c/346732/

2016-07-25 13:47 GMT+03:00 Dmitry Mescheryakov :

> Sam,
>
> For your case I would suggest to lower rabbit_transient_queues_ttl until
> you are comfortable with volume of messages which comes during that time.
> Setting the parameter to 1 will essentially replicate behaviour of
> auto_delete queues. But I would suggest not to set it that low, as
> otherwise your OpenStack will suffer from the original bug. Probably a
> value like 20 seconds should work in most cases.
>
> I think that there is a space for improvement here - we can delete reply
> and fanout queues on graceful shutdown. But I am not sure if it will be
> easy to implement, as it requires services (Nova, Neutron, etc.) to stop
> RPC server on sigint and I don't know if they do it right now.
>
> I don't think we can make case with sigkill any better. Other than that,
> the issue could be investigated on Neutron side, maybe number of messages
> could be reduced there.
>
> Thanks,
>
> Dmitry
>
> 2016-07-25 9:27 GMT+03:00 Sam Morrison :
>
>> We recently upgraded to Liberty and have come across some issues with
>> queue build ups.
>>
>> This is due to changes in rabbit to set queue expiries as opposed to
>> queue auto delete.
>> See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more
>> information.
>>
>> The fix for this bug is in liberty and it does fix an issue however it
>> causes another one.
>>
>> Every time you restart something that has a fanout queue, e.g.
>> cinder-scheduler or the neutron agents, you will have
>> a queue in rabbit that is still bound to the rabbitmq exchange (and so
>> still getting messages in) but no consumers.
>>
>> These messages in these queues are basically rubbish and don’t need to
>> exist. Rabbit will delete these queues after 10 mins (although the default
>> in master is now changed to 30 mins)
>>
>> During this time the queue will grow and grow with messages. This sets
>> off our nagios alerts and our ops guys have to deal with something that
>> isn’t really an issue. They basically delete the queue.
>>
>> A bad scenario is when you make a change to your cloud that means all
>> your 1000 neutron agents are restarted, this causes a couple of dead queues
>> per agent to hang around. (port updates and security group updates) We get
>> around 25 messages / second on these queues and so you can see after 10
>> minutes we have a ton of messages in these queues.
>>
>> 1000 x 2 x 25 x 600 = 30,000,000 messages in 10 minutes to be precise.
>>
>> Has anyone else been suffering from this before I raise a bug?
>>
>> Cheers,
>> Sam
>>
>>
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-25 Thread Dmitry Mescheryakov
Sam,

For your case I would suggest lowering rabbit_transient_queues_ttl until
you are comfortable with the volume of messages that comes in during that time.
Setting the parameter to 1 will essentially replicate the behaviour of
auto_delete queues. But I would suggest not setting it that low, as
otherwise your OpenStack will suffer from the original bug. Probably a
value like 20 seconds should work in most cases.
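
To be concrete, assuming the option is set in the usual [oslo_messaging_rabbit]
section of each service's config (nova.conf, neutron.conf and so on), the change
would look roughly like:

    [oslo_messaging_rabbit]
    # TTL, in seconds, for transient reply/fanout queues left without a consumer
    rabbit_transient_queues_ttl = 20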

I think that there is space for improvement here - we can delete reply
and fanout queues on graceful shutdown. But I am not sure if it will be
easy to implement, as it requires services (Nova, Neutron, etc.) to stop
the RPC server on sigint, and I don't know if they do that right now.
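
To sketch what I mean - this is just the generic oslo.messaging RPC server
pattern, not code taken from any of the services, and the endpoint, topic and
server name below are made up:

    import signal
    import time

    from oslo_config import cfg
    import oslo_messaging


    class DemoEndpoint(object):
        # Made-up endpoint so the server has something to dispatch to.
        def ping(self, ctx, arg):
            return arg


    def main():
        transport = oslo_messaging.get_transport(cfg.CONF)
        target = oslo_messaging.Target(topic='demo_topic', server='demo_host')
        server = oslo_messaging.get_rpc_server(transport, target,
                                               [DemoEndpoint()],
                                               executor='blocking')

        stop = []
        signal.signal(signal.SIGTERM, lambda *args: stop.append(True))
        signal.signal(signal.SIGINT, lambda *args: stop.append(True))

        server.start()
        while not stop:
            time.sleep(1)

        # Stopping the server cleanly (instead of being killed) is what gives
        # oslo.messaging the chance to tear down its transient queues.
        server.stop()
        server.wait()


    if __name__ == '__main__':
        main()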

I don't think we can make the sigkill case any better. Other than that,
the issue could be investigated on the Neutron side; maybe the number of
messages could be reduced there.

Thanks,

Dmitry

2016-07-25 9:27 GMT+03:00 Sam Morrison :

> We recently upgraded to Liberty and have come across some issues with
> queue build ups.
>
> This is due to changes in rabbit to set queue expiries as opposed to queue
> auto delete.
> See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more
> information.
>
> The fix for this bug is in liberty and it does fix an issue however it
> causes another one.
>
> Every time you restart something that has a fanout queue, e.g.
> cinder-scheduler or the neutron agents, you will have
> a queue in rabbit that is still bound to the rabbitmq exchange (and so
> still getting messages in) but no consumers.
>
> These messages in these queues are basically rubbish and don’t need to
> exist. Rabbit will delete these queues after 10 mins (although the default
> in master is now changed to 30 mins)
>
> During this time the queue will grow and grow with messages. This sets off
> our nagios alerts and our ops guys have to deal with something that isn’t
> really an issue. They basically delete the queue.
>
> A bad scenario is when you make a change to your cloud that means all your
> 1000 neutron agents are restarted, this causes a couple of dead queues per
> agent to hang around. (port updates and security group updates) We get
> around 25 messages / second on these queues and so you can see after 10
> minutes we have a ton of messages in these queues.
>
> 1000 x 2 x 25 x 600 = 30,000,000 messages in 10 minutes to be precise.
>
> Has anyone else been suffering from this before I raise a bug?
>
> Cheers,
> Sam
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

2016-07-25 Thread Sam Morrison
We recently upgraded to Liberty and have come across some issues with queue 
build ups.

This is due to changes in rabbit to set queue expiries as opposed to queue auto 
delete. 
See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more information.

The fix for this bug is in liberty and it does fix an issue however it causes 
another one.

Every time you restart something that has a fanout queue, e.g. cinder-scheduler 
or the neutron agents, you will have 
a queue in rabbit that is still bound to the rabbitmq exchange (and so still 
getting messages in) but no consumers.

These messages in these queues are basically rubbish and don’t need to exist. 
Rabbit will delete these queues after 10 mins (although the default in master 
is now changed to 30 mins)

During this time the queue will grow and grow with messages. This sets off our 
nagios alerts and our ops guys have to deal with something that isn’t really an 
issue. They basically delete the queue.

A bad scenario is when you make a change to your cloud that means all your 1000 
neutron agents are restarted; this causes a couple of dead queues per agent 
(port updates and security group updates) to hang around. We get around 25 
messages / second on these queues, and so you can see that after 10 minutes we 
have a ton of messages in these queues.

1000 agents x 2 queues x 25 messages/second x 600 seconds = 30,000,000 messages 
in 10 minutes, to be precise.

Has anyone else been suffering from this before I raise a bug?

Cheers,
Sam


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators