Hi LOPSA,

I need some advice on how to get everyone in an operations team to clearly
document their changes and actions, and on how to make sure it actually happens.
I will probably cross-post to my local SAGE mailing list as well.

In short:
- I'm a slightly inexperienced young sysadmin who started with the company
last week
- From what I can gather, I was hired largely on the basis of my reputation at
my previous company for establishing and following robust processes
(operations process design/maintenance, server provisioning and handling of
Level 2/3 escalated support requests are part of my job description)
- Operations Team consists of < 5 people
- The company I work for has recently taken over another one. We are aiming to
slowly transition the taken-over company's customers over to our systems. The
old company's network is complex, and its current state is virtually
undocumented. Our existing company network is relatively new but relies on
some parts of the taken-over company's infrastructure. We want to be running
our own stuff and be completely independent of the taken-over company's
systems.
- Two people in my Operations Team are very smart and technically very good at
what they do, but do not see the need to document the actions they take to
resolve a problem or the infrastructure configuration changes they make
- The company has a colossal legacy web app (designed by one of the Operations
Team) which appears to be a one-stop place for service creation, DNS changes
(to BIND), customer ticket creation (in RT) and monitoring (with Nagios), but
the GUI is not fantastic and it appears no one other than the person who coded
it likes to use it. It records who made the last change, but not what was
actually changed.

Any advice on how I get people to change the way they do things, or for that
matter any advice on how to go about such a large infrastructure transition,
would be appreciated. Preferably, I'd like not to come across as some
know-it-all punk who's asking for things to be implemented simply to create
electronic paperwork.

Thanks LOPSA,

Ben S

In more detail:
- While the legacy web app creates tickets for Request Tracker, there is no 
documentation of what happens during L2/3 escalation (communication is through 
side channels like a direct e-mail to L1 or a phone call)
- A lot of our infrastructure in multiple remote areas fails due to
circumstances beyond our control (weather, upstream provider problems etc).
There appears to be some auto-acknowledging of Nagios alerts and
rate-limiting of e-mails, due to what I think is a bad legacy Nagios
configuration generated by the legacy web app
- Surprisingly, neither of those two people in my Operations Team is from the
taken-over company
- Partial knowledge of the complete network remains in the heads of 2 people in
my Operations Team and is not documented anywhere
- The web app appears to have been developed with an emphasis on allowing techs
to add services quickly, but the web GUI is at times information overload, and
complex because of the non-standard terminology it uses

I seem to have hit a brick wall trying to convince them of the need to track
changes and actions.
Argument 1) Me: Don't you think the fact that you had to revive the
taken-over company's systems after an outage should be documented?
Operations Team: The old-company systems are going away. We know how to fix
this common problem. The old-company systems are going to be blown away
anyway. Why bother documenting what was performed?

Argument 2) Me: Don't you think the fact that you changed network routes to
work around an upstream problem should be documented somewhere? How do other
members of the Operations Team know that you have already done so? How do you
know what other Operations Team members have already done to work around the
problem?
Operations Team: We're already in constant phone contact with each other when
such a problem happens. Why does what has been done need to be documented?

Argument 3) Me: Would you be prepared to write even a one-liner about what was
performed?
Operations Team: No, I don't have time for that. I've got far too much to do.
(It is apparent that all Operations staff have a lot to get done daily.)

Argument 4) Me: Shouldn't item (X) be documented?
Operations Team: You don't need to know about this particular component; you
won't be administering it anyway.

Argument 5) Me: Company techs had to go onsite to replace a component that
died in order to fix an issue. Do you think that should be documented?
Operations Team: I guess...it should. The bean-counters would probably want to
know about it...

My proposed plan is to:
- Get clarification of my role from the boss
- Get everyone to use RT properly for every kind of request (at the moment,
even e-mails are deliberately not sent when a new request is made)
- Start documenting both the taken-over infrastructure and the current
infrastructure, using something like RackTables. Non-config technical
descriptions will go into SharePoint (I would like to use a wiki, but sadly
cannot, since big bucks have already been paid for SharePoint)
- See if I can get a service's underlying config files checked into Subversion
every time they are changed, with a diff sent to the Operations Team. Longer
term, I am thinking of moving some of the service config changes performed by
the web app over to a manual config change process. That might allow changes
to be tracked properly with a Puppet+Subversion solution, but it sounds
terrible because it would mean reduced automation.


_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
