Re: NOC Best Practices
On Fri, Jul 16, 2010 at 09:34:53PM +0300, Kasper Adel wrote: Thanks to all the people that replied off list, asking me to send them the responses I will get. [snip] Which is useful, but I am looking for more from the best people that run the best NOCs in the world. So I'm throwing this out again. I am looking for pointers, suggestions, URLs, documents, donations on what a professional NOC would have on the below topics:

A lot, as others have said, depending on the business, staffing, goals, SLA, contracts, etc.

1) Briefly, how they handle their own tickets with vendors or internal

Run a proper ticketing system over which you have control (RT and friends, rather than locking yourself into something you have to pay for changes to). Don't judge just by ticket closure rate; judge by successfully resolving problems. Encourage folks to use the system for tracking projects and keeping notes on work in progress rather than private datastores. Inculcate a culture of open exploration to solve problems rather than rote memorization. This gets you a large way to #2.

2) How they create a learning environment for their people (documenting syslog, lessons learned from problems, etc.)

Mentoring, shoulder surfing. Keep your senior people in the mix of triage response so they don't get dull, and cross-pollinate skills. When someone is new, have their probationary period be shadowing the primary on-call the entire time. Your third shift [or whatever spans your maintenance windows] should be the folks who actually wind up executing well-specified maintenances (with guidance as needed) and be the breeding ground of some of your better hands-on folks.

3) Shift to shift handover procedures

This will depend on your systems for tickets, logbooks, etc. Solve that first and this should become evident.

4) Manual tests they start their day with and what they automate (common stuff)

This will vary with the business and what's on-site; I can't advise you to always include the genset if you don't have one.

5) Change management best practices and working with operations/engineering when a change will be implemented

Standing maintenance windows (of varying severity, if that matters to your business), and a clear definition of what must be done only during those windows and what can be done anytime [hint: policy tuning shouldn't be restricted to them, and you shouldn't make it so urgent things like a BGP leak can't be fixed]. Linear rather than parallel workflows for approval, and not too many approval stages, else your staff will spend their time getting things through the administrative stages instead of doing actual work. Very simply, have a standard for specifying what needs to be done, the minimal tests needed to verify success, and how you fall back if you fail the tests. If someone can't specify it and insists on frobbing around, they likely don't understand the problem or the needed work. Cheers, Joe -- RSUC / GweepNet / Spunk / FnB / Usenix / SAGE
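To make point 5 concrete, here is a minimal sketch in Python of what a machine-checkable change specification might look like; the field names are illustrative, not from any particular ticketing tool. A plan is refused unless it names its steps, its verification tests, and its fallback.

# Minimal sketch of a change-plan standard: every maintenance must
# specify the work, the tests that prove success, and the fallback.
# Field names here are hypothetical.
REQUIRED_FIELDS = ("summary", "steps", "verify_tests", "fallback")

def validate_plan(plan: dict) -> list:
    """Return a list of problems; an empty list means the plan is acceptable."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if not plan.get(f)]
    # "Frobbing around" shows up as a plan with steps but no way to
    # tell whether it worked or how to back it out.
    if plan.get("steps") and not plan.get("verify_tests"):
        problems.append("steps given but no verification tests")
    if plan.get("steps") and not plan.get("fallback"):
        problems.append("steps given but no fallback procedure")
    return problems

plan = {
    "summary": "Swap linecard in core2",
    "steps": ["drain traffic", "swap card", "restore traffic"],
    "verify_tests": ["BGP sessions re-established", "no interface errors"],
    "fallback": "reinsert old card and restore previous config",
}
assert validate_plan(plan) == []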
Re: NOC Best Practices
I have to agree that this is all good information. Your question on ITIL: my personal opinion is that ITIL best practices are great to apply to all environments. They make sense, specifically in change control. However, as stated, it's also highly dependent on how many devices are being managed/monitored. I come from a NOC managing 8600+ network devices across 190+ countries, with strict change management policies, windows, and approvers, all depending on times relative to the operations in different countries. We were growing so rapidly that we kept purchasing companies and bringing over their infrastructure, each time inheriting new ticket systems, etc.

NNM is by far one of my favorite choices for network monitoring. The issue with it is really the views and getting them organized in an easily viewable fashion. RT is a great ticketing tool for specific needs: it allows for approvers and approval tracking of tickets. However, it isn't extremely robust. I would recommend something like HP ServiceCenter, since it can integrate and automate the alert output directly into tickets. This also allows you to use AlarmPoint for automated paging of your on-calls based on their schedules, by device, etc. Not to say that I'm a complete HP fanboy, but I will say that it works extremely well. Easy to use, and simplicity is the key to fewer mistakes. All of our equipment was 99% Cisco, so the combination worked extremely well.

Turnover: I firmly believe shift changes should be verbally handed off. Build a template for the day's top items or most critical issues. List out the ongoing issues and any tickets being carried over, with their status. Allot 15 minutes for the team to sit down with the printout and review it.

Contracts/SLAs: We placed all of our systems under a bulk 99.999% uptime critical SLA. However, this was a mistake on our part, born of the lack of time to plan well while adapting to an ever-changing environment. It is best to set up your appliances/hardware in your ticket system and monitoring tool based on the SLA you intend to apply to each. Also ensure you include all hardware information: supply vendor, support vendor, support coverage, ETR from vendor, replacement time.

There are many tools that do automated discovery on your network and monitor changes to it. This is key if you have a changing environment. The more devices you have, the more difficult it is to pinpoint what a failed router or switch ACTUALLY affects upstream or downstream. If you have the chance, take the opportunity to map your hardware/software dependencies (a small sketch of such an impact walk appears below). If a switch fails and it provides service to, for example, db01, and db01 drives the service in another location, then you should know that the failure is there. It's far too common for companies to get so large they have no idea what the impact of one port failure in xyz does to the entire infrastructure.

Next: build your monitoring infrastructure completely separately from the rest of the network. If you don't do switch redundancy (active/passive) on all of your systems, or NIC teaming (active/passive), then ensure you do it at least on your monitoring systems. Build your logging out in a PCI/SOX fashion: remote logging on everything, log retention based on your needs, and Tripwire with approved reports sent weekly for the systems requiring PCI/SOX monitoring. Remember, if your monitoring systems go down, your NOC is blind. It's highly recommended that the NOC have gateway/jump box systems available to all parts of the network.
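On the dependency-mapping point above, a minimal sketch in Python of walking a dependency graph to answer "what does this failure actually affect downstream?"; the device names are made up.

# Minimal dependency-impact sketch. The graph maps each device to the
# things that depend on it; names are illustrative only.
DEPENDS_ON_ME = {
    "sw-xyz-01": ["db01"],
    "db01": ["app01", "reporting"],
    "app01": ["customer-portal"],
}

def impact(device: str, seen=None) -> set:
    """Everything directly or transitively affected if `device` fails."""
    seen = set() if seen is None else seen
    for child in DEPENDS_ON_ME.get(device, []):
        if child not in seen:
            seen.add(child)
            impact(child, seen)
    return seen

print(sorted(impact("sw-xyz-01")))
# -> ['app01', 'customer-portal', 'db01', 'reporting']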
Run the management network completely on RFC 1918 space for security. Ensure all on-calls have access; use a VPN solution that requires a password plus a VPN key token. Utilize TACACS+/LDAP as much as you can. Tighten everything. Log everything; I can't say that enough. Enforce password changes every 89 days, require strong, non-dictionary passwords, etc.

Build an internal site, use a wiki-based format, and allow the team the ability to add/modify with approval. Build a FAQ/knowledge base. Possibly create a forum so your team can post extra tips, notes, and one-offs: anything that may help new members, or people who run across something in the middle of the night that they may have never seen. This keeps you from waking up your lead staff in the middle of the night.

On-calls: always have a primary/secondary with a clear on-call procedure, 'documented'. Example (critical):
1. Issue occurs
2. Page on-call within 10 minutes
3. Allow 10 minutes for return call
4. Page again
5. Allow 5 minutes
6. Page secondary
Etc.

Ensure the staff documents every step they take, and that they copy/paste every page they send into the ticket system. Build templated paging formats. Understand that most txt messages with several carriers have hard limits. Use something like: Time InitialsOfNOCPerson SystemAlerting Error CallbackNumber (i.e. 14:05 KH nycgw01 System reports down 555-555- xt103). Use a paging internal website/software or, as mentioned,
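Riffing on the templated-page format above, a minimal sketch in Python of a pager message builder that enforces a hard length cap; the 160-character limit is the usual single-SMS size rather than any particular carrier's rule, and the callback extension is made up.

# Minimal page-formatting sketch following the template above:
# Time Initials System Error Callback. Truncates the free-text error
# field so the whole page fits in one 160-character SMS.
SMS_LIMIT = 160

def format_page(time_hm: str, initials: str, system: str,
                error: str, callback: str) -> str:
    fixed = f"{time_hm} {initials} {system} "
    tail = f" {callback}"
    room = SMS_LIMIT - len(fixed) - len(tail)
    return fixed + error[:max(room, 0)] + tail

page = format_page("14:05", "KH", "nycgw01", "System reports down", "x103")
assert len(page) <= SMS_LIMIT
print(page)  # 14:05 KH nycgw01 System reports down x103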
Re: NOC Best Practices
What about e-TOM? Is it better than ITIL V3? Regards, Xavier Telconet S.A

-Original Message- From: Joe Provo nanog-p...@rsuc.gweep.net Date: Sat, 17 Jul 2010 14:56:04 To: Kasper Adel karim.a...@gmail.com Reply-To: nanog-p...@rsuc.gweep.net Cc: NANOG list nanog@nanog.org Subject: Re: NOC Best Practices [quoted message snipped; reproduced in full earlier in this thread]
Re: NOC Best Practices
eTOM is best regarded as a companion to ITIL practices. It has additional layers not covered by ITIL, and vice versa. I think a combination of practices from both is the best method. -Kevin

-Original Message- From: Xavier Banchon xbanc...@telconet.net Date: Sat, 17 Jul 2010 20:20:26 To: nanog-p...@rsuc.gweep.net; Kasper Adel karim.a...@gmail.com Reply-To: xbanc...@telconet.net Cc: NANOG list nanog@nanog.org Subject: Re: NOC Best Practices

What about e-TOM? Is it better than ITIL V3? Regards, Xavier Telconet S.A

-Original Message- From: Joe Provo nanog-p...@rsuc.gweep.net Subject: Re: NOC Best Practices [quoted message snipped; reproduced in full earlier in this thread]
Re: Vyatta as a BRAS
On Wed, 14 Jul 2010 14:12:07 + Dobbins, Roland rdobb...@arbor.net wrote:

On Jul 14, 2010, at 8:48 PM, Florian Weimer wrote: From or to your customers? Both. Stopping customer-sourced attacks is probably a good thing for the Internet at large.

Concur 100%. And you can't combat attacks targeted at customers within your own network unless you've got very large WAN pipes, moving you into the realm of special-purpose hardware for other reasons.

Sure, you can, via S/RTBH, IDMS, et al. While DNS reflection/amplification attacks are used to create crushing volumes of attack traffic, and even smallish botnets can create high-volume attacks, most packet-flooding attacks are predicated on throughput - i.e., pps - rather than bandwidth, and tend to use small packets. Of course, they can use *lots and lots* of small packets, and often do, but one can drop these packets via the various mechanisms one has available, then reach out to the global opsec community for filtering closer to the sources.

The thing is, with many DDoS attacks, the pps/bps/cps/tps required to disrupt the targets can be quite small, due to the unpreparedness of the defenders. Many high-profile attacks discussed in the press, such as the Mafiaboy attacks, the Estonian attacks, the Russian/Georgian/Azerbaijan attacks, the China DNS meltdown, and the RoK/USA DDoS attacks, were all a) low-volume, b) low-throughput, c) exceedingly unsophisticated, and d) eminently avoidable via sound architecture, deployment of BCPs, and sound operational practices. In fact, many DDoS attacks are quite simplistic in nature, and many are low in bandwidth/throughput; the miscreants use only the resources necessary to achieve their goals, and due to the unpreparedness of defenders, they have no need for overwhelming and/or complex attack methodologies. This doesn't mean that high-bandwidth, high-throughput, and/or complex DDoS attacks don't occur, or that folks shouldn't be prepared to handle them; quite the opposite, we see a steady increase in attack volume, throughput, and sophistication at the high end. But the fact of the matter is that many DDoS targets - and associated network infrastructure, and services such as DNS - are surprisingly fragile, and thus vulnerable to surprisingly simple/small attacks, or even inadvertent/accidental attacks.

Previously, this was really a no-brainer because you couldn't get PCI cards with the required interfaces, but with Ethernet everywhere, the bandwidths you can handle on commodity hardware will keep increasing.

Concur 100%. Eventually, you'll need special-purpose hardware only for a smallish portion at the top of the router market, or if you can't get the software with the required protocol support on other devices.

I believe that the days of software-based routers are numbered, period, due to the factors you describe. Of course, the 'top of the router market' seems to keep moving upwards, despite many predictions to the contrary.

Since specific routers have been mentioned, care to comment on the Cisco ASR? If the days of software-based routers are numbered, I'm sure Cisco would have recognised that, and not gone and developed it (or rather, bought the company that did). It seems to me that three key factors that haven't been discussed in this thread are the chances of failure, the types of failure triggers, and the consequences of failure. The discussion seems to have been binary: hardware = no failure, software = failure.
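As an aside on the S/RTBH mechanism mentioned above: the usual trigger is a BGP announcement whose next-hop resolves to a discard route preconfigured on the edge. Below is a minimal sketch in Python that emits trigger commands in ExaBGP's text API form; the prefixes are illustrative, 192.0.2.1 is the conventional discard next-hop (statically routed to Null0 on the edge routers), and 65535:666 is the BLACKHOLE community from RFC 7999. Treat this as an assumed setup, not a drop-in config.

# Illustrative S/RTBH trigger sketch. Prints announce/withdraw commands
# in ExaBGP's text API form; an ExaBGP process would read these on stdin.
DISCARD_NEXT_HOP = "192.0.2.1"      # routed to Null0/discard on the edge
BLACKHOLE_COMMUNITY = "65535:666"   # RFC 7999 BLACKHOLE

def blackhole(prefix: str, withdraw: bool = False) -> str:
    verb = "withdraw" if withdraw else "announce"
    return (f"{verb} route {prefix} next-hop {DISCARD_NEXT_HOP} "
            f"community [{BLACKHOLE_COMMUNITY}]")

# Destination-based RTBH: drop traffic *to* the victim.
print(blackhole("203.0.113.10/32"))
# Source-based RTBH: announce the attack source; loose-mode uRPF on the
# edge then drops traffic *from* it.
print(blackhole("198.51.100.0/24"))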
If you put large amounts of traffic on a single router, you're likely to need a hardware router, driving up the cost, sacrificing flexibility and re-deployability, and impacting very large numbers of network users if it fails. You may not be vulnerable, or as vulnerable, to a DoS (the software punt path mentioned), but DoSes aren't the only type of failure you can suffer from. Software faults on these high-end platforms can be a far more common issue within the first few years of release, because they're less widely deployed. Hardware forwarding doesn't protect you from protocol or protocol-implementation vulnerabilities on the control plane, and since these are big boxes with big consequences if they fail, they're a much larger target to aim for.

OTOH, if you have options to divide the traffic load across a number of smaller routers, then you may gain the cost-effectiveness of more commodity platforms (the ultimate commodity platform being a PC acting as a router), more robustness, because the platform is being used by far more people in far more environments, and less of a consequence when failures occur (DoS or not).

I don't think the hardware/software argument is as simple as it is being made out to be. It is completely context dependent. Cost, availability, scalability, and flexibility all need to be considered. I personally put a fair bit of weight on flexibility, because I can't tell