Last night my central server decided to crap out for one reason or another.
After bringing it back up, I found a few issues with zenoss. I believe one of
them to be a cache issue. I'm including as much detail as possible in the hope
that it has already been discovered and a patch is available.
Layout Overview:
1 central server, 2 remote collectors. One of the remote performance collectors
is in a separate data center. All are doing 1-minute polling.
The central server collects 650 devices; the remote collectors handle one
device (currently for testing).
On the central server, I am waiting for SAN disk allocation, as the I/O
generated is too much for the local disk. In the interim, since the central
server has enough memory, I created an in-memory filesystem for the rrds only
($ZENHOME/perf). This is periodically tarred and compressed to local disk in
case of a failure, and it has been working fine.
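Something along these lines (a minimal sketch of the idea only; the backup
destination, file naming, and ZENHOME default below are placeholders, not my
exact script):

#!/usr/bin/env python
# Periodic backup of the in-memory rrd tree to local disk (run from cron).
import os
import tarfile
import time

ZENHOME = os.environ.get("ZENHOME", "/usr/local/zenoss/zenoss")  # placeholder default
PERF_DIR = os.path.join(ZENHOME, "perf")        # the in-memory filesystem
BACKUP_DIR = "/var/backups/zenoss-perf"         # placeholder local-disk destination

def backup_perf():
    if not os.path.isdir(BACKUP_DIR):
        os.makedirs(BACKUP_DIR)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = os.path.join(BACKUP_DIR, "perf-%s.tar.gz" % stamp)
    # gzip-compressed tar of the whole perf tree, so the rrds can be
    # restored after a reboot wipes the in-memory filesystem
    tar = tarfile.open(target, "w:gz")
    try:
        tar.add(PERF_DIR, arcname="perf")
    finally:
        tar.close()
    return target

if __name__ == "__main__":
    print(backup_perf())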
I have also increased the zodb cache-size to 15000 and the zeoclient cache-size
to 40MB in the zope.conf file. Host edits were painfully slow.
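For context, the relevant section of zope.conf looks roughly like this (only
the two cache-size lines were changed; the zeoclient server/storage/name/var
lines are whatever the install already had and are shown here only as an
example):

<zodb_db main>
    mount-point /
    # number of objects kept in the per-connection ZODB cache
    cache-size 15000
    <zeoclient>
        server localhost:8100
        storage 1
        name zeostorage
        var $INSTANCE/var
        # ZEO client cache size
        cache-size 40MB
    </zeoclient>
</zodb_db>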
Events are in chronological order:
- Yesterday I added an additional remote collector (let's call it R2). To test
the new collector, I wanted to move the machine being monitored by the existing
remote collector (let's call it R1) over to R2. Let's call the monitored
machine M1.
- I went to the Edit tab of device M1 and changed the performance monitor to
R2. R2 picked this up fine. However, R1 continued to monitor M1. I figured I
would leave this overnight to see if it worked itself out.
- Machine crashes hard overnight.
- After bringing up the machine, zenoss starts up as usual. But since the perf
directory is in memory, it is now empty.
- The remote collectors reconnected ok after the machine rebooted.
- I stopped all of zenoss so the perf directory could be restored.
- Restarted zenoss.
- R1 is on the same network segment as the central server and reconnects to
zenhub ok. But it is still collecting M1. (Problem 1)
- R2 never reconnects. It is stuck somewhere (Problem 2); strace shows it
looping:
select(6, [5], [], [], {42, 22000}) = 0 (Timeout)
gettimeofday({1194534992, 620061}, NULL) = 0
futex(0x9267940, FUTEX_WAKE, 1) = 0
gettimeofday({1194534992, 620246}, NULL) = 0
futex(0x9267940, FUTEX_WAKE, 1) = 0
select(6, [5], [], [], {0, 7790}) = 0 (Timeout)
gettimeofday({1194534992, 627997}, NULL) = 0
futex(0x9267940, FUTEX_WAKE, 1) = 0
gettimeofday({1194534992, 628153}, NULL) = 0
futex(0x9267940, FUTEX_WAKE, 1) = 0
select(6, [5], [], [], {0, 0}) = 0 (Timeout)
Now there are 2 connections from R2 to the central server for zenhub. I thought
this was strange.
- Restarted zenoss on R1, zenperfsnmp discovers it should no longer collect M1.
- Restarted zenoss on R2, zenperfsnmp reconnects to zenhub ok and continues
collecting M1.
So problem 1 & 2 are cleared now but it is going to be troublesome if
zenperfsnmp needs to be restarted when moving devices around and that it can
get stuck if the master server goes down. This prevents any kind of FT from
being provided. FT/DR should be a component of an enterprise system, IMHO.
- Now I take a look at the central server after zenoss has been up and running
for a few minutes. Hmmm... zenperfsnmp is reporting ~205 out of 650 hosts bad,
and it has been stuck in this state for ~1 hour now. (Problem 3). Usually after
a restart it takes 2-3 performance cycles to get everything normalized.
- For each of these hosts, zenperfsnmp was reporting that the snmp agent was down.
- I noticed that these hosts were pretty much the same hosts that had their
community string changed ~1 week ago. The community was reset via the drop-down
menu for each device and then pushed.
- I did an snmpwalk on the command line and the agents were fine.
- I checked the Edit tab of a few devices and the community string was the
newest one. (See the zendmd snippet after this list for a quick way to check
them all.)
- I did an snmpwalk via the gui. Info came back fine.
- I confirmed with a packet capture that zenperfsnmp was using the old
community string! (Still problem 3 here.)
- So I restarted zenperfsnmp; all hosts came back after 2-3 performance cycles,
but 9 hosts were still reported as bad (1 of them was legitimately down).
- Now 100 hosts need to be removed, as they are being retired.
- I go to the group they are in under /System, select the hosts, and choose
"Delete devices...".
- ok, I see them removed from the screen.
- Now one of the 9 bad hosts is part of the 100 I just deleted, but it still
continues to be reported as bad (Problem 4).
- Restart zenperfsnmp again.
- zenperfsnmp says 650 out of 650 hosts configured (strange, since I just
deleted 100).
- Recovers after 2-3 iterations (normal) but 9 hosts still bad. (Problem 3)
- Deleted host is still there! (Same problem 4).
- Restarted all of zenoss this time.
- Now only 2 hosts are bad (1 legitimately bad, 1 that should have been deleted).
- Note: R1 and R2 reconnected to zenhub without a problem.
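(Re the community-string check above: a quick way to see what each device
object actually has stored, i.e. what zenhub should be handing to zenperfsnmp,
is to run this from zendmd on the central server. A minimal sketch;
zSnmpCommunity is the device property, the output format is just an example.)

# list every device with its currently stored community string
for d in dmd.Devices.getSubDevices():
    print d.id, d.zSnmpCommunity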
Now it turns out that the hosts I deleted were not deleted. They were only
removed from the /System group I had them in. This is very misleading. I then
deleted them from the "Device List" menu and they were gone permanently.
The full restart seemed to flush or refresh the cache, but zenoss had also been
restarted after the community string change, so I'm not sure what happened
here. This is a major problem if multiple restarts are needed to get the
correct community string to be sent.
I'm hoping that moving a device between performance collectors doesn't really
require a restart of the collector.
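For what it's worth, the move itself was done from the GUI; the zendmd
equivalent would be roughly the following (an assumption on my part -- I have
not checked that this is exactly what the Edit tab does under the hood):

# run inside zendmd on the central server; 'M1' and 'R2' are the
# placeholder names used above
d = dmd.Devices.findDevice('M1')
d.setPerformanceMonitor('R2')
commit()

Either way, the open question is whether R1 notices the change without a
zenperfsnmp restart.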
Anyone seen Problems 1-4? Ideas? I believe that Problem 3 is a bug in the cache.
If you read this all, thanx.
-Paul