I know that 2.2.0 has been out for a while now, but it only became generally
available for Enterprise customers this past week in the form of 2.2.1. After
reading the many fixes and enhancements in this release, I was eager to get it
installed and try it out. Let's just say that this upgrade has made me
seriously consider bringing up a test VM the next time I upgrade so I don't
mess with the production instance. (Yes, yes, I know I should be doing this
anyway.) 2.2.1 works, it definitely works; it's just a case of that old saying,
"the devil you know...".
Over the last couple of days I have been writing down things I have seen
with this update, and I am presenting them here for others to comment on. Note
that I have not yet searched the archives for any of these issues; this is
solely a first impressions post. Since I am an Enterprise customer, some of
these issues may end up as support cases if I can't find a resolution on the
forums--hopefully the issues I am seeing are all easy fixes, though. That
would be nice. :-)
Ok, here's the list, straight off my wiki article about the upgrade:
== Issues After Upgrade ==
* Upgrading the Zenoss Enterprise ZenPack changes the zMonitor property of
  many Windows services from false to true (kdc, ntfrs, exchange*, sql*), even
  though I had explicitly disabled monitoring on those services (from the last
  time I upgraded the Enterprise ZenPack--sigh). This even affects servers
  that are *not* running any of the services listed above. For example, Zenoss
  will show that the ntfrs service is down on all servers, even though it is
  only installed and running on the domain controllers and two or three
  others. Also, after changing all the relevant services back to
  "zMonitor = false", Zenoss still reports that these services are down on a
  few servers (yet not all of them--curious). I had to go into each affected
  server, find the service, change the zMonitor property from false to true,
  save, and then switch it back to false (and save again) in order to
  completely fix the problem.
* "Threshold of zenwin cycle time exceeded" and "zenwin heartbeat failure"
  issues. Could these be contributing to the next issue below? Update: this
  is still happening two days after the upgrade.
* ~1000 events regarding "Wmi communication failure during connect"--and
  others--after the first zenmodeler (?) poll of Windows servers. LOTS of
  email alerts generated (I had paging disabled, though). Even more "Wmi
  communication failure during connect" events the day after the upgrade.
  Continuing to monitor this issue. Update: two days after the upgrade there
  are 23 of these errors in the event console. Either 2.2.1 broke something,
  or I didn't know the problem existed before because of poor reporting.
  Also, if these errors are transient, then shouldn't they be warnings by
  default? That way they'll go away after a few hours and it won't look like
  the entire datacenter blew up in a sea of orange. I assume they are
  transient because the count on all of them is only 1. Or are these more
  errors like RPC_S_CALL_FAILED that must be cleared before the servers will
  be monitored again?
* The zenwinmodeler daemon is no longer listed under the Daemons tab (and
  doesn't get started with a 'zenoss start' command), yet
  $ZENHOME/bin/zenwinmodeler still exists. Is it still used, or should it be
  deleted? Also, zenwinmodeler shows up as a component on certain events. If
  it's no longer used, why are there still references to it?
* I didn't see anything in the install that said I needed to reset the Page
  Command to something useful. More of a documentation issue, really.
* Why do the daemon "threshold" alerts insist on setting the device name to
  "localhost", even after I changed the "Hostname" value (found at
  /Monitors/Hub/localhost/localhost) to the actual name of the server? Is
  there somewhere else to change this value?
* On a good note, I haven't yet seen any RPC_S_CALL_FAILED errors. Usually I
  get at least one per day (at 10:26am, if you can believe the regularity).
  The fact that I haven't seen one yet makes me happy.
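For anyone who wants to test the "count is only 1, so it is probably transient" reasoning against an exported event list, here is a minimal sketch in plain Python. The event records and their field names (device, summary, count) are made up for illustration and are not the Zenoss event schema; it only shows the idea of separating one-off events from repeating ones.

```python
# Hypothetical event records; the field names and values are illustrative
# only and are not taken from the Zenoss event schema.
events = [
    {"device": "srv-a", "summary": "Wmi communication failure during connect", "count": 1},
    {"device": "srv-b", "summary": "Wmi communication failure during connect", "count": 1},
    {"device": "srv-c", "summary": "Wmi communication failure during connect", "count": 7},
    {"device": "srv-a", "summary": "zenwin heartbeat failure", "count": 1},
]


def split_by_repeat(events, summary_prefix):
    """Split events whose summary starts with summary_prefix into
    (one_off, repeating) lists based on their count field."""
    matching = [e for e in events if e["summary"].startswith(summary_prefix)]
    one_off = [e for e in matching if e["count"] == 1]
    repeating = [e for e in matching if e["count"] > 1]
    return one_off, repeating


one_off, repeating = split_by_repeat(events, "Wmi communication failure")
print("one-off: %d, repeating: %d" % (len(one_off), len(repeating)))
```

A count of 1 only says the event has not recurred so far; it does not rule out the RPC_S_CALL_FAILED style of error that stays until cleared.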
=== Modifications after upgrade ===
* Changed "Process Parallel Jobs" from 10 to 20 to try to get a little better
  performance. (8 cores, and 7 are bored all day long. Hello, parallelism?)
  Need to read up on this to make sure it does what I think it does.
* Changed "Windows Modeler Cycle Interval" from 60s to 120s to try to
  alleviate the "cycle time exceeded" error. (As noted above, this didn't
  seem to help. Interestingly, yesterday I saw one cycle take 84 seconds and
  generate an "exceeded" error, even though the cycle time was 120. WTF?)
* Changed "Page Command" value; substituted "snpp.metrocall.com" in place of
  "localhost".
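To spell out why the 84-second "exceeded" error looks wrong, here is a tiny sketch of the comparison I would expect. It assumes that "exceeded" simply means the cycle duration is greater than the configured interval; that is my guess at the semantics, not something confirmed from the Zenoss code.

```python
def exceeds(cycle_seconds, interval_seconds):
    """Assumed rule: a cycle is 'exceeded' when it runs longer than the
    configured interval. This is a guess, not Zenoss's actual logic."""
    return cycle_seconds > interval_seconds


observed = 84  # the cycle time reported in the event
for interval in (60, 120):  # old and new "Windows Modeler Cycle Interval"
    print("interval=%ds exceeded=%s" % (interval, exceeds(observed, interval)))
```

Under that assumption an 84-second cycle exceeds the old 60s value but not the new 120s one, which would be consistent with the threshold still using the old interval. That is only one possible explanation.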
And there you go. If anyone has any comments on anything above, please let me
know! Like I said, I'll be trawling through the forums over the next few days
to see if any of these issues have been fixed. For those that I do find fixes
for, and any other issues that crop up and/or fix themselves, I'll update
this thread as well.
Finally: a big thank you to the Zenoss team for this release! Other than the
issues I outlined above, this looks like it will be a good step forward from
2.1.3.
--
seth wright ([EMAIL PROTECTED])
windows engineer
540.568.2912 (office)
james madison university