Last night my central server decided to crap out for one reason or another.
After bringing it back up, I found a few issues with zenoss. I believe one of
them to be a cache issue. I'm including as much detail as possible in the hope
that it has already been discovered and a patch is available.
Layout Overview:
1 central server, 2 remote collectors. One of the remote performance collectors
is in a separate data center. All are doing 1-minute polling.
The central server collects 650 devices; the remote collectors handle one
device (currently for testing).
On the central server, I am waiting for SAN disk allocation, as the I/O
generated is too much for the local disk. In the interim, since the central
server has enough memory, I created an in-memory filesystem for the rrds only
($ZENHOME/perf). This is periodically tarred and compressed to local disk in
case of a failure, and it has been working fine.
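Something along these lines (a minimal sketch of the idea only; the backup
destination, file naming, and ZENHOME default below are placeholders, not my
exact script):

#!/usr/bin/env python
# Periodic backup of the in-memory rrd tree to local disk (run from cron).
import os
import tarfile
import time

ZENHOME = os.environ.get("ZENHOME", "/usr/local/zenoss/zenoss")  # placeholder default
PERF_DIR = os.path.join(ZENHOME, "perf")        # the in-memory filesystem
BACKUP_DIR = "/var/backups/zenoss-perf"         # placeholder local-disk destination

def backup_perf():
    if not os.path.isdir(BACKUP_DIR):
        os.makedirs(BACKUP_DIR)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = os.path.join(BACKUP_DIR, "perf-%s.tar.gz" % stamp)
    # gzip-compressed tar of the whole perf tree, so the rrds can be
    # restored after a reboot wipes the in-memory filesystem
    tar = tarfile.open(target, "w:gz")
    try:
        tar.add(PERF_DIR, arcname="perf")
    finally:
        tar.close()
    return target

if __name__ == "__main__":
    print(backup_perf())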
I have also increased the zodb cache-size to 15000 and the zeoclient cache-size
to 40MB in the zope.conf file. Host edits were painfully slow.
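For context, the relevant section of zope.conf looks roughly like this (only
the two cache-size lines were changed; the zeoclient server/storage/name/var
lines are whatever the install already had and are shown here only as an
example):

<zodb_db main>
    mount-point /
    # number of objects kept in the per-connection ZODB cache
    cache-size 15000
    <zeoclient>
        server localhost:8100
        storage 1
        name zeostorage
        var $INSTANCE/var
        # ZEO client cache size
        cache-size 40MB
    </zeoclient>
</zodb_db>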
Events are in chronological order:
- Yesterday I added an additional remote collector (let's call it R2). To test
the new collector, I wanted to move the machine being monitored by the existing
remote collector (let's call it R1) over to R2. Let's call the monitored
machine M1.
- I went to the Edit tab of device M1 and changed the performance monitor to
R2. R2 picked this up fine. However, R1 continued to monitor M1. I figured I
would leave this overnight to see if it worked itself out.
- Machine crashes hard overnight.
- After bringing up the machine, zenoss starts up as usual. But since the perf
directory is in memory, it is now empty.
- The remote collectors reconnected ok after the machine rebooted.
- I stopped all of zenoss so the perf directory could be restored.
- Restarted zenoss.
- R1 is on the same network segment as the central server and reconnects to
zenhub ok. But it is still collecting M1. (Problem 1)
- R2 never reconnects. It is stuck somewhere (Problem 2); strace shows it
looping:
select(6, [5], [], [], {42, 22000}) = 0 (Timeout)
gettimeofday({1194534992, 620061}, NULL) = 0
futex(0x9267940, FUTEX_WAKE, 1) = 0
gettimeofday({1194534992, 620246}, NULL) = 0
futex(0x9267940, FUTEX_WAKE, 1) = 0
select(6, [5], [], [], {0, 7790}) = 0 (Timeout)
gettimeofday({1194534992, 627997}, NULL) = 0
futex(0x9267940, FUTEX_WAKE, 1) = 0
gettimeofday({1194534992, 628153}, NULL) = 0
futex(0x9267940, FUTEX_WAKE, 1) = 0
select(6, [5], [], [], {0, 0}) = 0 (Timeout)
Now there are 2 connections from R2 to the central server for zenhub. I thought
this was strange.
- Restarted zenoss on R1, zenperfsnmp discovers it should no longer collect M1.
- Restarted zenoss on R2, zenperfsnmp reconnects to zenhub ok and continues
collecting M1.
So problem 1 & 2 are cleared now but it is going to be troublesome if
zenperfsnmp needs to be restarted when moving devices around and that it can
get stuck if the master server goes down. This prevents any kind of FT from
being provided. FT/DR should be a component of an enterprise system, IMHO.
- Now I take a look at the central server after zenoss has been up and running
for a few minutes. Hmmm... zenperfsnmp is reporting ~205 out of 650 hosts bad,
and it has been stuck in this state for ~1 hour now. (Problem 3). Usually after
a restart it takes 2-3 performance cycles to get everything normalized.
- For each of these hosts, zenperfsnmp was reporting that the snmp agent was down.
- I noticed that these hosts were pretty much the same hosts that had their
community string changed ~1 week ago. The community was reset via the drop-down
menu for each device and then pushed.
- I did an snmpwalk on the command line and the agents were fine.
- I checked the Edit tab of a few devices and the community string was the
newest one. (See the zendmd snippet after this list for a quick way to check
them all.)
- I did an snmpwalk via the gui. Info came back fine.
- I confirmed with a packet capture that zenperfsnmp was using the old
community string! (Still problem 3 here.)
- So I restarted zenperfsnmp; all hosts came back after 2-3 performance cycles,
but 9 hosts were still reported as bad (1 of them was legitimately down).
- Now 100 hosts need to be removed, as they are being retired.
- I go to the group they are in under /System, select the hosts, and choose
"Delete devices...".
- ok, I see them removed from the screen.
- Now one of the 9 bad hosts is part of the 100 I just deleted, but it still
continues to be reported as bad (Problem 4).
- Restart zenperfsnmp again.
- zenperfsnmp says 650 out of 650 hosts configured (strange, since I just
deleted 100).
- Recovers after 2-3 iterations (normal) but 9 hosts still bad. (Problem 3)
- Deleted host is still there! (Same problem 4).
- Restarted all of zenoss this time.
- Now only 2 hosts are bad (1 legitimately bad, 1 that should have been deleted).
- Note: R1 and R2 reconnected to zenhub without a problem.
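(Re the community-string check above: a quick way to see what each device
object actually has stored, i.e. what zenhub should be handing to zenperfsnmp,
is to run this from zendmd on the central server. A minimal sketch;
zSnmpCommunity is the device property, the output format is just an example.)

# list every device with its currently stored community string
for d in dmd.Devices.getSubDevices():
    print d.id, d.zSnmpCommunity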
Now it turns out that the hosts I deleted were not deleted. They were only
removed from the /System group I had them in. This is very misleading. I then
deleted them from the "Device List" menu and they were gone permanently.
The full restart seemed to flush or refresh the cache, but zenoss had also been
restarted after the community string change, so I'm not sure what happened
here. This is a major problem if multiple restarts are needed to get the
correct community string to be sent.
I'm hoping that moving a device between performance collectors doesn't really
require a restart of the collector.
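For what it's worth, the move itself was done from the GUI; the zendmd
equivalent would be roughly the following (an assumption on my part -- I have
not checked that this is exactly what the Edit tab does under the hood):

# run inside zendmd on the central server; 'M1' and 'R2' are the
# placeholder names used above
d = dmd.Devices.findDevice('M1')
d.setPerformanceMonitor('R2')
commit()

Either way, the open question is whether R1 notices the change without a
zenperfsnmp restart.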
Anyone seen Problems 1-4? Ideas? I believe that Problem 3 is a bug in the cache.
If you read this all, thanx.
-Paul