Re: NOC Best Practices

2010-07-17 Thread Joe Provo
On Fri, Jul 16, 2010 at 09:34:53PM +0300, Kasper Adel wrote:
 Thanks to all the people who replied off list asking me to send them the
 responses I will get.
[snip]
 Which is useful, but I am looking for more material from the best people who
 run the best NOCs in the world.
 
 So I'm throwing this out again.
 
 I am looking for pointers, suggestions, URLs, documents, or donations on what a
 professional NOC would have for the topics below:

A lot, as others have said, depending on the business, staffing, 
goals, SLA, contracts, etc.

 1) Briefly, how they handle their own tickets with vendors or internal

Run a proper ticketing system over which you have control (RT and 
friends, rather than something that locks you in and that you have to 
pay for changes to).  Don't judge just by ticket closure rate; judge by 
successfully resolving problems. Encourage folks to use the system for 
tracking projects and keeping notes on work in progress rather than 
private datastores. Inculcate a culture of open exploration to solve 
problems rather than rote memorization. This gets you a large way 
toward #2.
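
A minimal sketch in Python of why closure rate alone can mislead (the ticket 
fields and numbers here are made up, not tied to RT or any other system): a 
ticket that is closed and later reopened still counts toward closures, but it 
didn't resolve the problem.

    # Illustrative only: hypothetical ticket fields, no real ticketing system assumed.
    from dataclasses import dataclass

    @dataclass
    class Ticket:
        closed: bool
        reopened: bool  # closed once, but the problem came back

    def queue_metrics(tickets):
        closed = [t for t in tickets if t.closed]
        resolved = [t for t in closed if not t.reopened]  # closed and stayed fixed
        return len(closed) / len(tickets), len(resolved) / len(tickets)

    sample = [Ticket(True, False), Ticket(True, True), Ticket(False, False)]
    closure_rate, resolution_rate = queue_metrics(sample)
    print(closure_rate, resolution_rate)  # ~0.67 vs ~0.33: closures alone flatter the queue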

 2) How they create a learning environment for their people (Documenting
 Syslog, lessons learned from problems...etc)

Mentoring, shoulder surfing. Keep your senior people in the mix 
of triage and response so they don't get dull and to cross-pollinate 
skills.  When someone is new, have their probationary period be 
shadowing the primary on-call the entire time.  Your third shift 
[or whatever spans your maintenance windows] should be the folks 
who actually wind up executing well-specified maintenances (with 
guidance as needed) and should be the breeding ground for some of 
your better hands-on folks.

 3) Shift to Shift hand over procedures

This will depend on your systems for tickets, logbooks, etc. 
Solve that first and this should become evident.

 4) Manual tests they start their day with and what they automate (common
 stuff)

This will vary with the business and what's on-site; I can't 
advise you to always include the genset if you don't have 
one.

 5) Change management best practices and working with operations/engineering
 when a change will be implemented

Standing maintenance windows (of varying severity if that 
matters to your business), clear definition of what needs 
to be done only during those and what can be done anytime 
[hint: policy tuning shouldn't be restricted to them, and 
you shouldn't make it so that urgent things like a BGP leak 
can't be fixed].  Linear rather than parallel workflows 
for approval, and not too many approval stages, else your 
staff will spend their time trying to get things through 
the administrative stages instead of doing actual work.  Very 
simply, have a standard for specifying what needs to be 
done, the minimal tests needed to verify success, and how 
you fall back if you fail the tests.  If someone can't 
specify it and insists on frobbing around, they likely don't 
understand the problem or the needed work.
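
As a concrete illustration of that standard, here is a minimal Python sketch 
of what a change specification could capture: the steps, the tests that prove 
success, and the fallback if those tests fail. The field names and example 
change are hypothetical, not any particular tool's schema.

    # Hypothetical sketch; field names are illustrative, not a real tool's schema.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ChangeSpec:
        summary: str
        steps: List[str]          # exactly what will be done
        verify_tests: List[str]   # the minimal tests that prove success
        fallback: List[str]       # how to back out if a test fails

        def is_well_specified(self) -> bool:
            # If any of these is empty, the change isn't ready to schedule.
            return bool(self.steps and self.verify_tests and self.fallback)

    change = ChangeSpec(
        summary="Turn up new transit session",
        steps=["configure BGP session", "apply standard ingress policy"],
        verify_tests=["session established", "prefix count within expected range"],
        fallback=["shut the new session", "confirm routes withdrawn"],
    )
    assert change.is_well_specified()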

Cheers,

Joe
-- 
 RSUC / GweepNet / Spunk / FnB / Usenix / SAGE



Re: NOC Best Practices

2010-07-17 Thread khatfield
I have to agree that this is all good information.

Your question on ITIL: my personal opinion is that ITIL best practices are 
great to apply to all environments. It makes sense, especially in the change 
control systems.

However, as stated, it's also highly dependent on how many devices are being 
managed/monitored. I come from a NOC that managed 8600+ network devices across 
190+ countries.

Strict change management policies, windows, and approvers, all depending on 
the times relative to operations in the different countries.

We were growing so rapidly that we kept purchasing companies and bringing 
over their infrastructure, each time inheriting new ticket systems, etc.

NNM is by far one of my favorite choices for network monitoring. The issue with 
it is really the views and getting them organized in an easily viewable fashion.

RT is a great ticketing tool for specific needs. It allows for approvers and 
approval tracking of tickets. However, it isn't extremely robust.

I would recommend something like HP ServiceCenter since it can integrate with 
your monitoring and turn alert output directly into tickets automatically. 
This also lets you use Alarmpoint for automated paging of your on-calls based 
on their schedules, by device, etc.

Not to say that I'm a complete HP fanboy, but I will say that it works 
extremely well. It's easy to use, and simplicity is the key to fewer mistakes.

Our equipment was 99% Cisco, so the combination worked extremely well.

Turnover: I firmly believe shift changes should be handed off verbally. Build 
a template for the day's top items or most critical issues. List out the 
ongoing issues and any tickets being carried over, with their status. Allot 
15 minutes for the team to sit down with the printout and review it.
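
As a small illustration, the printout could be generated from a simple 
template along these lines (a Python sketch; the shift names, fields, and 
ticket IDs are made up):

    # Hypothetical shift-handover summary; fields and examples are illustrative.
    def render_handover(date, outgoing, incoming, top_items, ongoing, carried):
        def section(title, items):
            lines = [f"  - {i}" for i in items] or ["  - none"]
            return [title + ":"] + lines
        report = [f"Shift handover {date} ({outgoing} -> {incoming})", ""]
        report += section("Top items / critical issues", top_items) + [""]
        report += section("Ongoing issues", ongoing) + [""]
        report += section("Tickets carried over (with status)", carried)
        return "\n".join(report)

    print(render_handover("2010-07-17", "day shift", "night shift",
                          ["core-rtr02 flapping"], [],
                          ["T-1234: awaiting vendor RMA"]))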

Contracts/SLAs:
 We placed all of our systems under a blanket 99.999% uptime critical SLA. 
However, this was a mistake on our part, driven by a lack of time to plan 
well while adapting to an ever-changing environment.

It would be best to set up your appliances/hardware in your ticket system and 
monitoring tool based on the SLA you intend to apply to them. Also ensure you 
include all hardware information: Supply Vendor, Support Vendor, Support 
coverage, ETR from Vendor, Replacement time.

There are many tools that do automated discovery on your network and monitor 
changes to it. This is key if you have a changing environment. The more 
devices you have, the more difficult it is to pinpoint what a failed router 
or switch ACTUALLY affects upstream or downstream.

If you have the chance, take the opportunity to map your hardware/software 
dependencies. For example, if a switch that provides service to db01 fails, 
and db01 drives a service in another location, then you should know that the 
failure reaches that far. It's far too common for companies to get so large 
that they have no idea what the impact of one port failure in xyz does to the 
entire infrastructure.
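
A minimal Python sketch of that idea (the device names and adjacency are made 
up for illustration; a real map would come from your discovery tool): given a 
map of what depends on what, walk downstream from a failed device to see 
everything it actually affects.

    # Hypothetical dependency map: device -> things that depend on it directly.
    deps = {
        "switch-xyz-01": ["db01"],
        "db01": ["billing-app", "reporting-app"],
        "billing-app": [],
        "reporting-app": [],
    }

    def downstream_impact(failed, deps):
        """Return everything affected, directly or indirectly, by a failure."""
        affected, stack = set(), [failed]
        while stack:
            for child in deps.get(stack.pop(), []):
                if child not in affected:
                    affected.add(child)
                    stack.append(child)
        return affected

    print(downstream_impact("switch-xyz-01", deps))
    # -> {'db01', 'billing-app', 'reporting-app'} (set order may vary)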

Next: Build your monitoring infrastructure completely separate from the rest 
of the network. If you don't do switch redundancy (active/passive) on all of 
your systems or NIC teaming (active/passive), then ensure you do it at least 
on your monitoring systems.

Build your logging out in a PCI/SOX fashion. Ensure you have remote logging 
on everything, with log retention based on your needs, and Tripwire with 
approved reports sent weekly for the systems requiring PCI/SOX monitoring.
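
On the remote-logging point, here is a minimal Python sketch of an application 
keeping a local log and shipping a copy to a central collector. The collector 
address (192.0.2.10, a documentation-range placeholder) and the logger name 
are assumptions; network gear would do the equivalent via its own syslog 
configuration.

    import logging
    import logging.handlers

    logger = logging.getLogger("noc-app")   # hypothetical application name
    logger.setLevel(logging.INFO)

    # Local copy for the on-box trail.
    logger.addHandler(logging.FileHandler("noc-app.log"))

    # Remote copy to the central collector (plain UDP syslog on port 514).
    logger.addHandler(logging.handlers.SysLogHandler(address=("192.0.2.10", 514)))

    logger.info("config change applied: ticket 12345")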

Remember, if your monitoring systems go down, your NOC is blind. It's highly 
recommended that the NOC have gateway/jump-box systems available to all parts 
of the network. Run the management network entirely on RFC1918 space for 
security.

Ensure all on-calls have access; use a VPN solution that requires a password 
plus a VPN key token. Utilize TACACS/LDAP as much as you can. Tighten 
everything. Log everything. I can't say that enough.

Enforce password changes every 89 days, require strong, non-dictionary passwords, etc.

Build an internal site, use a wiki-based format, and allow the team the 
ability to add/modify with approval. Build a FAQ/knowledge base. Possibly 
create a forum so your team can post extra tips, notes, and one-offs; 
anything that may help new members or people who run across something in the 
middle of the night that they may never have seen. This keeps you from waking 
up your lead staff in the middle of the night.

On-calls: Always have a primary/secondary with a clear on-call procedure 
'documented'.
Example (critical):
1. Issue occurs.
2. Page the primary on-call within 10 minutes.
3. Allow 10 minutes for a return call.
4. Page again.
5. Allow 5 minutes.
6. Page the secondary.
Etc.
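
A minimal Python sketch of that escalation logic (the page() stub, contact 
names, and acknowledgement check are hypothetical; a real rotation would come 
from your scheduling tool):

    import time

    def page(contact, message):
        # Placeholder: a real version would call the paging gateway or SMS system.
        print(f"PAGE {contact}: {message}")

    def escalate(primary, secondary, message, acked=lambda: False):
        """Follow the example above: page, wait 10 min, re-page, wait 5, escalate."""
        page(primary, message)
        time.sleep(10 * 60)        # allow 10 minutes for a return call
        if acked():
            return
        page(primary, message)     # page again
        time.sleep(5 * 60)         # allow 5 minutes
        if acked():
            return
        page(secondary, message)   # page the secondary

    # escalate("primary-oncall", "secondary-oncall", "nycgw01 reports down")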

Ensure the staff document every step they take and copy/paste every page they 
send into the ticket system.

Build templated paging formats. Understand that text messages with many 
carriers have hard length limits. Use something like:
Time InitialsofNOCPerson SystemAlerting Error CallbackNumber

(e.g. 14:05 KH nycgw01 System reports down 555-555- xt103)
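
A small Python sketch of that template with a hard length cap (the 
160-character limit is illustrative; check your carriers' actual limits):

    from datetime import datetime

    SMS_LIMIT = 160  # illustrative cap; real carrier limits vary

    def format_page(initials, system, error, callback, when=None):
        """Build a page like: 14:05 KH nycgw01 System reports down 555-555- xt103"""
        when = when or datetime.now()
        msg = f"{when:%H:%M} {initials} {system} {error} {callback}"
        return msg[:SMS_LIMIT]  # truncate rather than let the carrier drop it

    print(format_page("KH", "nycgw01", "System reports down", "555-555- xt103"))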

Use an internal paging website/software or, as mentioned, 

Re: NOC Best Practices

2010-07-17 Thread Xavier Banchon
What about e-TOM?  Is it better than ITIL V3?

Regards,

Xavier


Telconet S.A



Re: NOC Best Practices

2010-07-17 Thread khatfield
eTOM is best regarded as a companion to ITIL practices. It has additional 
layers not covered by ITIL, and vice versa.

I think a combination of practices from both is the best method.

-Kevin


Re: Vyatta as a BRAS

2010-07-17 Thread Mark Smith
On Wed, 14 Jul 2010 14:12:07 +
Dobbins, Roland rdobb...@arbor.net wrote:

 
 On Jul 14, 2010, at 8:48 PM, Florian Weimer wrote:
 
  From or to your customers?
 
 Both.
 
  Stopping customer-sourced attacks is probably a good thing for the Internet 
  at large.
 
 Concur 100%.
 
   And you can't combat attacks targeted at customers within your own network 
  unless you've got very large WAN
  pipes, moving you into the realm of special-purpose hardware for other 
  reasons.
 
 Sure, you can, via S/RTBH, IDMS, et al.  While DNS reflection/amplification 
 attacks are used to create crushing volumes of attack traffic, and even 
 smallish botnets can create high-volume attacks, most packet-flooding attacks 
 are predicated on throughput - i.e., pps - rather than bandwidth, and tend to 
 use small packets.  Of course, they can use *lots and lots* of small packets, 
 and often do, but one can drop these packets via the various mechanisms one 
 has available, then reach out to the global opsec community for filtering 
 closer to the sources.
 
 The thing is, with many DDoS attacks, the pps/bps/cps/tps required to disrupt 
 the targets can be quite small, due to the unpreparedness of the defenders.  
 Many high-profile attacks discussed in the press such as the Mafiaboy 
 attacks, the Estonian attacks, the Russian/Georgian/Azerbaijan attacks, the 
 China DNS meltdown, and the RoK/USA DDoS attacks were all a) low-volume, b) 
 low-throughput, c) exceedingly unsophisticated, and d) eminently avoidable 
 via sound architecture, deployment of BCPs, and sound operational practices.
 
 In fact, many DDoS attacks are quite simplistic in nature and many are low in 
 bandwidth/throughput; the miscreants only use the resources necessary to 
 achieve their goals, and due to the unpreparedness of defenders, they don't 
 have a need to make use of overwhelming and/or complex attack methodologies.
 
 This doesn't mean that high-bandwidth, high-throughput, and/or complex DDoS 
 attacks don't occur, or that folks shouldn't be prepared to handle them; 
 quite the opposite, we see a steady increase in attack volume, throughput and 
 sophistication at the high end.  But the fact of the matter is that many DDoS 
 targets - and associated network infrastructure, and services such as DNS - 
 are surprisingly fragile, and thus are vulnerable to surprisingly 
 simple/small attacks, or even inadvertent/accidental attacks.
 
  Previously, this was really a no-brainer because you couldn't get PCI
  cards with the required interfaces, but with Ethernet everywhere, the
  bandwidths you can handle on commodity hardware will keep increasing.
 
 Concur 100%.
 
  Eventually, you'll need special-purpose hardware only for a smallish
  portion at the top of the router market, or if you can't get the
  software with the required protocol support on other devices.
 
 I believe that the days of software-based routers are numbered, period, due 
 to the factors you describe.  Of course, the 'top of the router market' seems 
 to keep moving upwards, despite many predictions to the contrary.
 

Since specific routers have been mentioned, care to comment on the Cisco
ASR? If the days of software-based routers are numbered, I'm sure
Cisco would have recognised that, and not gone and developed it (or
rather, bought the company that did).

It seems to me that three key factors that haven't been discussed in
this thread are the chances of failure, the types of failure triggers, and
the consequences of failure. The discussion seems to have been a binary
hardware = no failure, software = failure.

If you put large amounts of traffic on a single router, you're likely
to need a hardware router, driving up the cost, sacrificing flexibility
and re-deployability, and impacting very large numbers of network users
if it fails. You may not be vulnerable, or as vulnerable, to a DoS (the
software punt mentioned earlier), but DoS attacks aren't the only type of
failure you can suffer from. Software faults on these high-end platforms
can be a far more common issue within the first few years of release,
because they're less widely deployed. Hardware forwarding doesn't protect
you from protocol or protocol-implementation vulnerabilities on the
control plane, and since these are big boxes with big consequences if
they fail, they're a much larger target to aim for.

OTOH, if you have options to divide the traffic load across a number of
smaller routers, then you may gain the cost effectiveness of more
commodity platforms (with the ultimate commodity platform being a PC
acting as a router), more robustness because the platform is being used
by far more people in far more environments, and less of a consequence
when failures occur (DoS or not).

I don't think the hardware/software argument is as simple as it is being
made out to be. It is completely context dependent. Cost, availability,
scalability and flexibility all need to be considered. I personally put
a fair bit of weight on flexibility, because I can't tell