Hi LOPSA,
I need some advice on how to get everyone in an operations team to clearly
document the changes they make and the actions they take.
I will probably cross post in my local SAGE mailing list as well.
In short:
- I'm a slightly inexperienced young Sysadmin who just started with the
company last week
- From what I can gather I was hired largely on the basis of my reputation at
my previous company for establishing and following robust processes
(Operations process design/maintenance, Server provisioning and handling of
Level 2/3 escalated support requests are part of my job description)
- Operations Team consists of < 5 people
- The company I work for has recently taken over another one. We are aiming to
slowly transition the taken-over company's customers over to our systems. The
old-company network is complex and the current state of that network is
virtually undocumented. The existing company network is relatively new and
relies on some parts of the taken-over company's infrastructure. We want to be
running our own stuff and be completely independent of taken-over-company
systems.
- Two people in my Operations Team are very smart and very good technically at
what they do but do not see the need to document actions taken to resolve
a problem, or infrastructure configuration changes that are performed
- The company has a colossal legacy web app (designed by one of the Operations
Team) which appears to be a one-stop place for service creation, DNS changes
(to Bind), customer ticket creation (to RT) and monitoring (with Nagios), but
the GUI is not fantastic and it appears no one other than the person who coded
it likes to use it. It records no detail of what was actually changed, only
who made the last change.
Any advice on how I get people to change the way they do things, or for that
matter any advice on how to go about such a large infrastructure transition
would be appreciated. Preferably, I'd like to not come across as some
know-it-all punk who's asking for things to be implemented simply to create
electronic paperwork.
Thanks LOPSA,
Ben S
In more detail:
- While the legacy web app creates tickets for Request Tracker, there is no
documentation of what happens during L2/3 escalation (communication is through
side channels like a direct e-mail to L1 or a phone call)
- We have a lot of infrastructure in multiple remote areas that fails due to
circumstances beyond our control (weather, upstream provider problems etc).
There appears to be some auto-acknowledging of some Nagios alerts and
rate-limiting of e-mails due to what I think is a bad legacy Nagios
configuration, which the legacy web app generates
- Surprisingly, neither of those two people in my Operations Team is from the
taken-over company
- Partial knowledge of the complete network remains in the heads of those two
people in my Operations Team, undocumented anywhere
- The web app appears to have been developed with an emphasis on allowing Techs
to add services quickly, but the web GUI is both information overload at times
and complex due to the non-standard terminology used
I seem to have hit a brick wall trying to convince them of a need to track
changes/actions.
Argument 1) Me: Don't you think the fact that you had to revive the
taken-over-company systems after an outage should be documented?
Operations Team: The old-company systems are going away. We know how to fix
this common problem. The old-company systems are going to be blown away
anyway. Why bother documenting what was performed?
Argument 2) Me: Don't you think the fact that you changed network routes to
work around an upstream problem should be documented somewhere? How do other
members of the Operations Team know that you have already done so? How do you
know what other Operations Team members have already done to work around the
problem?
Operations Team: We're already in constant phone contact with each other when
such a problem happens, why should what has been performed need to be
documented?
Argument 3) Me: Even a one-liner of what was performed, would you be prepared
to do that?
Operations Team: No, I don't have time for that. I've got far too much to do.
(It is apparent that all Operations staff have a lot to get done daily)
Argument 4) Me: Shouldn't item (X) be documented?
Operations Team: You don't need to know about this particular component, you
won't be administering it anyway
Argument 5) Me: The fact that company techs had to go onsite to replace a
component that died, fixing an issue - do you think that should be documented?
Operations Team: I guess...it should. The bean-counters would probably want to
know about it....
My proposed plan is to:
- Get clarification of my role from the boss
- Get everyone to use RT properly for any kind of request (at present, e-mails
are deliberately not even sent when a new request is made)
- Get started performing some kind of documentation of the taken-over
infrastructure and the current infrastructure using something like Racktables.
Non-config technical descriptions will go into Sharepoint (I would like to use
a wiki, but sadly cannot given big bucks have already been paid)
- See if I can get the underlying config files for a service checked into
Subversion every time they are changed, with a diff sent to the Operations
Team. Longer term I am thinking of transitioning some of the service config
changes performed by the web app over to a manual config change process. This
may allow things to be tracked properly with a Puppet+Subversion solution,
although it sounds terrible as it will mean reduced automation.
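
To make the Subversion idea above a bit more concrete, here is a rough sketch
in Python (standard library only) of what "check in the config and mail a diff"
could look like. Everything in it is an assumption for illustration: the
function names, the file path and the e-mail addresses are made up, and nothing
here reflects how the existing web app actually works. It only builds the diff,
the `svn commit` argument list and the mail message; actually running the
command and sending the mail are left out.

```python
import difflib
from email.message import EmailMessage


def config_diff(old_text, new_text, path):
    """Return a unified diff between two versions of a config file."""
    lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=path + " (previous)",
        tofile=path + " (current)",
    )
    return "".join(lines)


def svn_commit_command(path, who, summary):
    """Build the argument list for committing one changed file.

    The one-line summary becomes the commit message, so the
    "even a one-liner" record lives next to the change itself.
    """
    return ["svn", "commit", "-m", f"{who}: {summary}", path]


def diff_mail(diff, path, who, summary, sender, recipient):
    """Compose (but do not send) the notification for the Operations Team."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"[config] {path} changed by {who}: {summary}"
    msg.set_content(diff or "(no textual changes)")
    return msg


if __name__ == "__main__":
    # Hypothetical example data, not taken from any real system.
    old = "zone example.com\nttl 3600\n"
    new = "zone example.com\nttl 300\n"
    path = "bind/example.com.conf"

    diff = config_diff(old, new, path)
    print(" ".join(svn_commit_command(path, "ben", "lower TTL before cutover")))
    print(diff_mail(diff, path, "ben", "lower TTL before cutover",
                    "ops-changes@example.org", "ops-team@example.org"))
```

If the web app were to call something like this whenever it rewrites a config
file, the existing automation could stay in place and the change history would
come along for free, which might avoid the move to manual changes altogether.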
_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/