Re: [openstack-dev] [Openstack-operators] [nova] Rabbit-mq 3.4 crashing (anyone else seen this?)

2016-07-06 Thread Rochelle Grober
The repository is:  http://git.openstack.org/cgit/openstack/osops-tools-contrib/

FYI, there are also:  osops-tools-generic, osops-tools-logging, 
osops-tools-monitoring, osops-example-configs and osops-coda

Wish I could help more,

--Rocky

-Original Message-
From: Joshua Harlow [mailto:harlo...@fastmail.com] 
Sent: Tuesday, July 05, 2016 10:44 AM
To: Matt Fischer
Cc: openstack-dev@lists.openstack.org; OpenStack Operators
Subject: Re: [openstack-dev] [Openstack-operators] [nova] Rabbit-mq 3.4 
crashing (anyone else seen this?)

Ah, those sets of commands sound pretty nice to run periodically.

Sounds like a useful script that could be placed in the ops tools repo 
(I forget where that repo lives, but I'm pretty sure it does exist?).
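Something along these lines might be a starting point (an untested sketch: the
2 GiB threshold is just a placeholder, and it wraps the rabbitmqctl commands
from Matt's notes quoted below):

```
#!/bin/bash
# Untested sketch: if rabbit_mgmt_db has grown past a limit, bounce just the
# management application using the commands from Matt's runbook notes below.
set -eu

THRESHOLD_BYTES=$((2 * 1024 * 1024 * 1024))  # placeholder limit, tune per deployment

# How many bytes the rabbit_mgmt_db process is currently holding.
current=$(rabbitmqctl -q eval \
  'element(2, erlang:process_info(global:whereis_name(rabbit_mgmt_db), memory)).' \
  | tr -dc '0-9')

if [ "${current:-0}" -gt "$THRESHOLD_BYTES" ]; then
    echo "rabbit_mgmt_db using ${current} bytes; restarting rabbitmq_management"
    rabbitmqctl eval 'application:stop(rabbitmq_management).'
    rabbitmqctl eval 'application:start(rabbitmq_management).'
fi
```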

Some other oddness though is that this issue seems to go away when we 
don't run cross-release; do you see that also?

Another hypothesis was that the following fix may be triggering part of 
this @ https://bugs.launchpad.net/oslo.messaging/+bug/1495568

So if we have some queues being set up as auto-delete and some being set 
up with expiry, perhaps the combination of these causes more work for the 
management database (and therefore it eventually falls behind and falls 
over).
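One way to sanity-check that mix (a sketch; it assumes the default vhost and 
that queue names contain no spaces) would be to tally auto-delete queues 
against queues declared with an x-expires argument:

```
# Count queues by their auto_delete flag and by the presence of an x-expires
# argument. name, auto_delete and arguments are standard rabbitmqctl
# queueinfoitems.
rabbitmqctl -q list_queues -p / name auto_delete arguments | awk '
  $2 == "true"               { autodel++ }
  index($0, "x-expires") > 0 { expiring++ }
  END { printf "auto_delete: %d, x-expires: %d\n", autodel, expiring }'
```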

Matt Fischer wrote:
> Yes! This happens often but I'd not call it a crash; the mgmt db just
> gets behind and then eats all the memory. We've started monitoring it and
> have runbooks on how to bounce just the mgmt db. Here are my notes on that:
>
> restart rabbitmq mgmt server - this seems to clear the memory usage.
>
> rabbitmqctl eval 'application:stop(rabbitmq_management).'
> rabbitmqctl eval 'application:start(rabbitmq_management).'
>
> run GC on rabbit_mgmt_db:
> rabbitmqctl eval
> '(erlang:garbage_collect(global:whereis_name(rabbit_mgmt_db)))'
>
> status of rabbit_mgmt_db:
> rabbitmqctl eval 'sys:get_status(global:whereis_name(rabbit_mgmt_db)).'
>
> Check how much memory the RabbitMQ mgmt DB is using:
> /usr/sbin/rabbitmqctl status | grep mgmt_db
>
> Unfortunately I didn't see that an upgrade would fix it for sure, and any
> settings changes to reduce the number of monitored events also require a
> restart of the cluster. The other issue with an upgrade for us is the
> ancient version of Erlang shipped with trusty. When we upgrade to Xenial
> we'll upgrade Erlang and rabbit and hope it goes away. I'll also
> probably tweak the settings on retention of events then too.
>
> Also for the record the GC doesn't seem to help at all.
>
> On Jul 5, 2016 11:05 AM, "Joshua Harlow" <harlo...@fastmail.com> wrote:
>
> Hi ops and dev-folks,
>
> We over at godaddy (running rabbitmq with openstack) have been
> hitting an issue that has been causing the `rabbit_mgmt_db` to consume
> nearly all of the process's memory (after a given amount of time),
>
> We've been thinking that this bug (or bugs?) may have existed for a
> while and our dual-version-path (where we upgrade the control plane
> and then slowly/eventually upgrade the compute nodes to the same
> version) has somehow triggered this memory leaking bug/issue since
> it has happened most prominently on our cloud which was running
> nova-compute at kilo and the other services at liberty (thus using
> the versioned objects code path more frequently due to needing
> translations of objects).
>
> The rabbit we are running is 3.4.0 on CentOS Linux release 7.2.1511
> with kernel 3.10.0-327.4.4.el7.x86_64 (do note that upgrading to
> 3.6.2 seems to make the issue go away),
>
> # rpm -qa | grep rabbit
>
> rabbitmq-server-3.4.0-1.noarch
>
> The logs that seem relevant:
>
> ```
> **
> *** Publishers will be blocked until this alarm clears ***
> **
>
> =INFO REPORT 1-Jul-2016::16:37:46 ===
> accepting AMQP connection <0.23638.342> (127.0.0.1:51932 -> 127.0.0.1:5671)
>
> =INFO REPORT 1-Jul-2016::16:37:47 ===
> vm_memory_high_watermark clear. Memory used:29910180640
> allowed:47126781542
> ```
>
> This happens quite often; the crashes have been affecting our cloud
> over the weekend (which made some dev/ops not so happy, especially
> due to the July 4th mini-vacation),
>
> Looking to see if anyone else has seen anything similar?
>
> For those interested this is the upstream bug/mail that I'm also
> seeing about getting confirmation from the upstream users/devs
> (which also has erlang crash dumps attached/linked),
>
> https://groups.google.com/forum/#!topic/rabbitmq-users/FeBK7iXUcLg

Re: [openstack-dev] [Openstack-operators] [nova] Rabbit-mq 3.4 crashing (anyone else seen this?)

2016-07-05 Thread Joshua Harlow

Ah, those sets of commands sound pretty nice to run periodically.

Sounds like a useful script that could be placed in the ops tools repo 
(I forget where that repo lives, but I'm pretty sure it does exist?).


Some other oddness though is that this issue seems to go away when we 
don't run cross-release; do you see that also?


Another hypothesis was that the following fix may be triggering part of 
this @ https://bugs.launchpad.net/oslo.messaging/+bug/1495568


So if we have some queues being set up as auto-delete and some being set 
up with expiry, perhaps the combination of these causes more work for the 
management database (and therefore it eventually falls behind and falls 
over).


Matt Fischer wrote:

Yes! This happens often but I'd not call it a crash; the mgmt db just
gets behind and then eats all the memory. We've started monitoring it and
have runbooks on how to bounce just the mgmt db. Here are my notes on that:

restart rabbitmq mgmt server - this seems to clear the memory usage.

rabbitmqctl eval 'application:stop(rabbitmq_management).'
rabbitmqctl eval 'application:start(rabbitmq_management).'

run GC on rabbit_mgmt_db:
rabbitmqctl eval
'(erlang:garbage_collect(global:whereis_name(rabbit_mgmt_db)))'

status of rabbit_mgmt_db:
rabbitmqctl eval 'sys:get_status(global:whereis_name(rabbit_mgmt_db)).'

Check how much memory the RabbitMQ mgmt DB is using:
/usr/sbin/rabbitmqctl status | grep mgmt_db

Unfortunately I didn't see that an upgrade would fix it for sure, and any
settings changes to reduce the number of monitored events also require a
restart of the cluster. The other issue with an upgrade for us is the
ancient version of Erlang shipped with trusty. When we upgrade to Xenial
we'll upgrade Erlang and rabbit and hope it goes away. I'll also
probably tweak the settings on retention of events then too.

Also for the record the GC doesn't seem to help at all.

On Jul 5, 2016 11:05 AM, "Joshua Harlow" wrote:

Hi ops and dev-folks,

We over at godaddy (running rabbitmq with openstack) have been
hitting an issue that has been causing the `rabbit_mgmt_db` to consume
nearly all of the process's memory (after a given amount of time),

We've been thinking that this bug (or bugs?) may have existed for a
while and our dual-version-path (where we upgrade the control plane
and then slowly/eventually upgrade the compute nodes to the same
version) has somehow triggered this memory leaking bug/issue since
it has happened most prominently on our cloud which was running
nova-compute at kilo and the other services at liberty (thus using
the versioned objects code path more frequently due to needing
translations of objects).

The rabbit we are running is 3.4.0 on CentOS Linux release 7.2.1511
with kernel 3.10.0-327.4.4.el7.x86_64 (do note that upgrading to
3.6.2 seems to make the issue go away),

# rpm -qa | grep rabbit

rabbitmq-server-3.4.0-1.noarch

The logs that seem relevant:

```
**
*** Publishers will be blocked until this alarm clears ***
**

=INFO REPORT 1-Jul-2016::16:37:46 ===
accepting AMQP connection <0.23638.342> (127.0.0.1:51932 -> 127.0.0.1:5671)

=INFO REPORT 1-Jul-2016::16:37:47 ===
vm_memory_high_watermark clear. Memory used:29910180640
allowed:47126781542
```

This happens quite often; the crashes have been affecting our cloud
over the weekend (which made some dev/ops not so happy, especially
due to the July 4th mini-vacation),

Looking to see if anyone else has seen anything similar?

For those interested this is the upstream bug/mail that I'm also
seeing about getting confirmation from the upstream users/devs
(which also has erlang crash dumps attached/linked),

https://groups.google.com/forum/#!topic/rabbitmq-users/FeBK7iXUcLg

Thanks,

-Josh



Re: [openstack-dev] [Openstack-operators] [nova] Rabbit-mq 3.4 crashing (anyone else seen this?)

2016-07-05 Thread Kris G. Lindgren
We tried some of these (well, I did last night), but the issue was that 
eventually rabbitmq actually died.  I was trying some of the eval commands to 
try to get at what was in the mgmt_db, but any get-status call eventually led to 
a timeout error.  Part of the problem is that we can go from a warning to a 
zomg out-of-memory in under 2 minutes.  Last night it was taking only 2 hours 
to chew through 40GB of RAM.  Messaging rates were in the 150-300/s range, which is not 
all that high (another cell is doing a constant 1k-2k).
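A lighter-weight peek that might still answer while the process is swamped 
(just a sketch, not something tried here) is process_info, which reads the 
counters directly instead of asking rabbit_mgmt_db to handle a message:

```
# Dump a few counters for the mgmt DB process without going through
# sys:get_status (which needs the process to respond).
rabbitmqctl eval \
  'erlang:process_info(global:whereis_name(rabbit_mgmt_db),
                       [memory, message_queue_len, heap_size]).'
```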

___
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: Matt Fischer
Date: Tuesday, July 5, 2016 at 11:25 AM
To: Joshua Harlow
Cc: "openstack-dev@lists.openstack.org", OpenStack Operators
Subject: Re: [Openstack-operators] [nova] Rabbit-mq 3.4 crashing (anyone else seen this?)


Yes! This happens often but I'd not call it a crash; the mgmt db just gets 
behind and then eats all the memory. We've started monitoring it and have runbooks 
on how to bounce just the mgmt db. Here are my notes on that:

restart rabbitmq mgmt server - this seems to clear the memory usage.

rabbitmqctl eval 'application:stop(rabbitmq_management).'
rabbitmqctl eval 'application:start(rabbitmq_management).'

run GC on rabbit_mgmt_db:
rabbitmqctl eval '(erlang:garbage_collect(global:whereis_name(rabbit_mgmt_db)))'

status of rabbit_mgmt_db:
rabbitmqctl eval 'sys:get_status(global:whereis_name(rabbit_mgmt_db)).'

Check how much memory the RabbitMQ mgmt DB is using:
/usr/sbin/rabbitmqctl status | grep mgmt_db

Unfortunately I didn't see that an upgrade would fix it for sure, and any settings 
changes to reduce the number of monitored events also require a restart of the 
cluster. The other issue with an upgrade for us is the ancient version of 
Erlang shipped with trusty. When we upgrade to Xenial we'll upgrade Erlang and 
rabbit and hope it goes away. I'll also probably tweak the settings on 
retention of events then too.

Also for the record the GC doesn't seem to help at all.

On Jul 5, 2016 11:05 AM, "Joshua Harlow" wrote:
Hi ops and dev-folks,

We over at godaddy (running rabbitmq with openstack) have been hitting an issue 
that has been causing the `rabbit_mgmt_db` to consume nearly all of the process's 
memory (after a given amount of time),

We've been thinking that this bug (or bugs?) may have existed for a while and 
our dual-version-path (where we upgrade the control plane and then 
slowly/eventually upgrade the compute nodes to the same version) has somehow 
triggered this memory leaking bug/issue since it has happened most prominently 
on our cloud which was running nova-compute at kilo and the other services at 
liberty (thus using the versioned objects code path more frequently due to 
needing translations of objects).

The rabbit we are running is 3.4.0 on CentOS Linux release 7.2.1511 with kernel 
3.10.0-327.4.4.el7.x86_64 (do note that upgrading to 3.6.2 seems to make the 
issue go away),

# rpm -qa | grep rabbit

rabbitmq-server-3.4.0-1.noarch

The logs that seem relevant:

```
**
*** Publishers will be blocked until this alarm clears ***
**

=INFO REPORT 1-Jul-2016::16:37:46 ===
accepting AMQP connection <0.23638.342> 
(127.0.0.1:51932 -> 
127.0.0.1:5671)

=INFO REPORT 1-Jul-2016::16:37:47 ===
vm_memory_high_watermark clear. Memory used:29910180640 allowed:47126781542
```

This happens quite often; the crashes have been affecting our cloud over the 
weekend (which made some dev/ops not so happy, especially due to the July 4th 
mini-vacation),

Looking to see if anyone else has seen anything similar?

For those interested this is the upstream bug/mail that I'm also seeing about 
getting confirmation from the upstream users/devs (which also has erlang crash 
dumps attached/linked),

https://groups.google.com/forum/#!topic/rabbitmq-users/FeBK7iXUcLg

Thanks,

-Josh



Re: [openstack-dev] [Openstack-operators] [nova] Rabbit-mq 3.4 crashing (anyone else seen this?)

2016-07-05 Thread Matt Fischer
For the record we're on 3.5.6-1.
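For anyone wanting to try what Mike describes below, the change would look 
roughly like this (a sketch; keeping rabbitmq_management_agent enabled is an 
extra assumption for clusters where another node still runs the management UI):

```
# Turn off the management plugin (UI, HTTP API and the rabbit_mgmt_db stats
# collector) on this node.
rabbitmq-plugins disable rabbitmq_management

# Optionally keep feeding stats to a management node elsewhere in the cluster.
rabbitmq-plugins enable rabbitmq_management_agent
```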
On Jul 5, 2016 11:27 AM, "Mike Lowe"  wrote:

> I was having just this problem last week.  We updated to 3.6.2 from 3.5.4
> on ubuntu and started seeing crashes due to excessive memory usage. I ran
> 'rabbitmq-plugins disable rabbitmq_management' on each node of my rabbit
> cluster and haven’t had any problems since.  From what I could gather from
> the rabbitmq mailing lists, the stats collection part of the management
> console is single-threaded and can’t keep up, hence the ever-growing memory
> usage from the ever-growing backlog of stats to be processed.
>
>
> > On Jul 5, 2016, at 1:02 PM, Joshua Harlow  wrote:
> >
> > Hi ops and dev-folks,
> >
> > We over at godaddy (running rabbitmq with openstack) have been hitting an
> issue that has been causing the `rabbit_mgmt_db` to consume nearly all of the
> process's memory (after a given amount of time),
> >
> > We've been thinking that this bug (or bugs?) may have existed for a
> while and our dual-version-path (where we upgrade the control plane and
> then slowly/eventually upgrade the compute nodes to the same version) has
> somehow triggered this memory leaking bug/issue since it has happened most
> prominently on our cloud which was running nova-compute at kilo and the
> other services at liberty (thus using the versioned objects code path more
> frequently due to needing translations of objects).
> >
> > The rabbit we are running is 3.4.0 on CentOS Linux release 7.2.1511 with
> kernel 3.10.0-327.4.4.el7.x86_64 (do note that upgrading to 3.6.2 seems to
> make the issue go away),
> >
> > # rpm -qa | grep rabbit
> >
> > rabbitmq-server-3.4.0-1.noarch
> >
> > The logs that seem relevant:
> >
> > ```
> > **
> > *** Publishers will be blocked until this alarm clears ***
> > **
> >
> > =INFO REPORT 1-Jul-2016::16:37:46 ===
> > accepting AMQP connection <0.23638.342> (127.0.0.1:51932 ->
> 127.0.0.1:5671)
> >
> > =INFO REPORT 1-Jul-2016::16:37:47 ===
> > vm_memory_high_watermark clear. Memory used:29910180640
> allowed:47126781542
> > ```
> >
> > This happens quite often; the crashes have been affecting our cloud over
> the weekend (which made some dev/ops not so happy, especially due to the
> July 4th mini-vacation),
> >
> > Looking to see if anyone else has seen anything similar?
> >
> > For those interested this is the upstream bug/mail that I'm also seeing
> about getting confirmation from the upstream users/devs (which also has
> erlang crash dumps attached/linked),
> >
> > https://groups.google.com/forum/#!topic/rabbitmq-users/FeBK7iXUcLg
> >
> > Thanks,
> >
> > -Josh
> >


Re: [openstack-dev] [Openstack-operators] [nova] Rabbit-mq 3.4 crashing (anyone else seen this?)

2016-07-05 Thread Matt Fischer
Yes! This happens often but I'd not call it a crash; the mgmt db just gets
behind and then eats all the memory. We've started monitoring it and have
runbooks on how to bounce just the mgmt db. Here are my notes on that:

restart rabbitmq mgmt server - this seems to clear the memory usage.

rabbitmqctl eval 'application:stop(rabbitmq_management).'
rabbitmqctl eval 'application:start(rabbitmq_management).'

run GC on rabbit_mgmt_db:
rabbitmqctl eval
'(erlang:garbage_collect(global:whereis_name(rabbit_mgmt_db)))'

status of rabbit_mgmt_db:
rabbitmqctl eval 'sys:get_status(global:whereis_name(rabbit_mgmt_db)).'

Check how much memory the RabbitMQ mgmt DB is using:
/usr/sbin/rabbitmqctl status | grep mgmt_db

Unfortunately I didn't see that an upgrade would fix it for sure, and any
settings changes to reduce the number of monitored events also require a
restart of the cluster. The other issue with an upgrade for us is the
ancient version of Erlang shipped with trusty. When we upgrade to Xenial
we'll upgrade Erlang and rabbit and hope it goes away. I'll also probably
tweak the settings on retention of events then too.

Also for the record the GC doesn't seem to help at all.
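The retention and stats settings mentioned above map roughly to these keys in 
the classic rabbitmq.config format (a sketch; the values are illustrative 
assumptions, and as noted they still require a restart to take effect):

```
%% Illustrative rabbitmq.config snippet for reducing management-stats load.
[
  {rabbit, [
    %% Emit stats events less often (default is 5000 ms).
    {collect_statistics_interval, 30000}
  ]},
  {rabbitmq_management, [
    %% Drop message-rate graphs entirely.
    {rates_mode, none},
    %% Keep fewer, coarser samples: {MaxAgeSeconds, SampleEverySeconds}.
    {sample_retention_policies, [
      {global,   [{605, 5}, {3600, 60}]},
      {basic,    [{605, 5}, {3600, 60}]},
      {detailed, [{10, 5}]}
    ]}
  ]}
].
```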
On Jul 5, 2016 11:05 AM, "Joshua Harlow"  wrote:

> Hi ops and dev-folks,
>
> We over at godaddy (running rabbitmq with openstack) have been hitting an
> issue that has been causing the `rabbit_mgmt_db` to consume nearly all of the
> process's memory (after a given amount of time),
>
> We've been thinking that this bug (or bugs?) may have existed for a while
> and our dual-version-path (where we upgrade the control plane and then
> slowly/eventually upgrade the compute nodes to the same version) has
> somehow triggered this memory leaking bug/issue since it has happened most
> prominently on our cloud which was running nova-compute at kilo and the
> other services at liberty (thus using the versioned objects code path more
> frequently due to needing translations of objects).
>
> The rabbit we are running is 3.4.0 on CentOS Linux release 7.2.1511 with
> kernel 3.10.0-327.4.4.el7.x86_64 (do note that upgrading to 3.6.2 seems to
> make the issue go away),
>
> # rpm -qa | grep rabbit
>
> rabbitmq-server-3.4.0-1.noarch
>
> The logs that seem relevant:
>
> ```
> **
> *** Publishers will be blocked until this alarm clears ***
> **
>
> =INFO REPORT 1-Jul-2016::16:37:46 ===
> accepting AMQP connection <0.23638.342> (127.0.0.1:51932 -> 127.0.0.1:5671)
>
> =INFO REPORT 1-Jul-2016::16:37:47 ===
> vm_memory_high_watermark clear. Memory used:29910180640 allowed:47126781542
> ```
>
> This happens quite often; the crashes have been affecting our cloud over
> the weekend (which made some dev/ops not so happy, especially due to the
> July 4th mini-vacation),
>
> Looking to see if anyone else has seen anything similar?
>
> For those interested this is the upstream bug/mail that I'm also seeing
> about getting confirmation from the upstream users/devs (which also has
> erlang crash dumps attached/linked),
>
> https://groups.google.com/forum/#!topic/rabbitmq-users/FeBK7iXUcLg
>
> Thanks,
>
> -Josh
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev