ALT-DB Question

2010-12-08 Thread Chadwick Sorrell
Hello,

I'm sending a new MAINT-AS object to the db-ad...@altdb.net, but it
doesn't appear to be in the database after a few weeks.  Are there any
requirements that I may be missing on my new request, or some sort of
way I can help get it processed?

Basically wondering if I'm just not waiting long enough, or if I've
done something wrong.

Thanks,
-chad



Niksun Probe

2010-04-19 Thread Chadwick Sorrell
Hello Nanog,

Looking for information on a Niksun probe, http://www.niksun.com/.
Does anyone have any experience, good or bad, with them?

Thanks!



Re: Alcatel-Lucent

2010-03-04 Thread Chadwick Sorrell
I'll have to second everything everyone is saying.  Absolutely pleased
with everything about them.  Just wish I had more 7750s instead of
7450s.

On Thu, Mar 4, 2010 at 5:59 PM, Craig cvulja...@gmail.com wrote:
 Very good routers. We have been using them for several years now. Very solid
 product, and very easy to set up services, e.g. VPRN, VPLS, epipe, etc.

 The qos on the box is very scalable. I could talk more about them off line
 with you or discuss more over phone.





 On Mar 4, 2010, at 5:22 PM, Scott Weeks sur...@mauigateway.com wrote:



 --- li...@iamchriswallace.com wrote:
 I am hoping to get some people's opinions on Alcatel-Lucent routers.  We
 are looking at the 7750 SR line and the 7450 ESS line.  We are currently a
 Cisco shop but these would be deployed in a completely new network
 delivering mostly MPLS based services and DIA.  Any comments are welcome,
  good and bad.
 ---


 We deploy these.  They are very different from Cisco (so there will be a
 big learning curve) and kick ass.  Be sure to go to 7.something, as cflowd
 (their NetFlow) does not report things like ASN correctly before that.

 scott






Revelation Networks

2010-02-26 Thread Chadwick Sorrell
Does anyone have any experiences with Revelation Networks?  They're
AS26821, and I'm looking for good or bad experiences with their
services.  Prefer an off-list reply.

Thanks



Re: Mitigating human error in the SP

2010-02-02 Thread Chadwick Sorrell
On Tue, Feb 2, 2010 at 9:09 AM, Paul Corrao pcor...@voxeo.com wrote:
 Humans make errors.

 For your upper management to think they can build a foundation of
 reliability on the theory that humans won't make errors is self-deceiving.

 But that isn't where the story ends.  That's where it begins.  Your 
 infrastructure, processes and tools should all be designed with that in mind 
 so as to reduce or eliminate the impact that human error will have on the 
 reliability of the service you provide to your customers.

 So, for the example you gave there are a few things that could be put in 
 place.  The first one, already mentioned by Chad, is that mission critical 
 services should not be designed with single points of failure - that 
 situation should be remediated.

Agreed.

 Another question to be asked - since this was provisioning work being done,
 and it was apparently being done on production equipment, could the work have 
 been done at a time of day (or night) when an error would not have been as 
 much of a problem?

As it stands now, businesses want to turn their services up when they
are in the office.  We do all new turn-ups during the day; anything
requiring a roll or maintenance window is scheduled in the middle of
the night.

 You don't say how long the outage lasted, but given the reaction by your 
 upper management, I would infer that it lasted for a while.  That raises the 
 next question.  Who besides the engineer making the mistake was aware of the 
 fact that work on production equipment was occurring?  The reason this is 
 important is because having the NOC know that work is occurring would give 
 them a leg up on locating where the problem is once they get the trouble 
 notification.

The actual error happened when someone was troubleshooting a turn-up,
where in the past the customer in question has had their ethertype set
wrong.  It wasn't a provisioning problem as much as someone
troubleshooting why it didn't come up with the customer.  Ironically,
the NOC was on the phone when it happened.  The switch was rebooted
almost immediately, and the outage lasted 5 minutes.

Chad
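
The wrong-port change described above could, in principle, be caught by a pre-change check that compares the target port against inventory data before anything is pushed. The following is only a minimal Python sketch of that idea. The `Port` record, its fields, and the `check_change` function are hypothetical names introduced for illustration; they do not correspond to any vendor tool or to the poster's actual environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """Hypothetical inventory record for one switch port."""
    name: str
    customer: str             # customer the inventory says owns this port
    carries_management: bool  # True if in-band management rides this port


def check_change(port: Port, ticket_customer: str) -> list[str]:
    """Return reasons to block the change; an empty list means proceed."""
    problems = []
    if port.carries_management:
        problems.append(f"{port.name} carries management traffic")
    if port.customer != ticket_customer:
        problems.append(
            f"{port.name} belongs to {port.customer!r}, not {ticket_customer!r}"
        )
    return problems


if __name__ == "__main__":
    uplink = Port("1/1/1", "internal", carries_management=True)
    for reason in check_change(uplink, "ExampleCo"):
        print("BLOCKED:", reason)
```

A check like this does not remove human error; it only adds one more point where a mismatch between the ticket and the device can be noticed before the change is applied.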



Re: Mitigating human error in the SP

2010-02-02 Thread Chadwick Sorrell
On Tue, Feb 2, 2010 at 12:45 PM, James Downs james.do...@egontech.com wrote:

 On Feb 2, 2010, at 9:33 AM, Jared Mauch wrote:

 We have solved 98% of this with standard configurations and templates.

 To deviate from this requires management approval/exception approval after
 an evaluation of the business risks.

 I would also point Chad to this book: http://bit.ly/cShEIo (Amazon Link to
 Visual Ops).
 It's very useful to have your management read it.  You may or may not be
 able to or want to use a full ITIL process, but understanding how these
 policies and procedures can/should work, and using the ones that apply makes
 sense.
 Change control, tracking, and configuration management are going to be key
 to avoiding mistakes, and being able to rapidly repair when one is made.
 Unfortunately, most management that demands No Tolerance, Zero Error from
 operations won't read the book.
 Good luck.. I'd bet most of the people on this list have been there one time
 or another.
 Cheers,
 -j

Interesting book, maybe I'll bring that to the next meeting.  Thanks
for the heads up on that.
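
The "standard configurations and templates" approach quoted above can be sketched roughly as follows: the engineer supplies a few parameters, the tool validates them against an approved list, and only then renders the configuration. This is a minimal Python illustration under stated assumptions. The template text is vendor-neutral pseudo-configuration, and `PORT_TEMPLATE`, `ALLOWED_ETHERTYPES`, and `render_port_config` are hypothetical names, not anyone's production tooling.

```python
from string import Template

# Hypothetical, vendor-neutral port stanza; the syntax is illustrative only.
PORT_TEMPLATE = Template(
    "port $port\n"
    "    description \"$customer\"\n"
    "    ethertype $ethertype\n"
    "    no shutdown\n"
)

# Assumed set of ethertype values the local standard permits.
ALLOWED_ETHERTYPES = {"0x8100", "0x88a8", "0x9100"}


def render_port_config(port: str, customer: str, ethertype: str) -> str:
    """Render a port stanza, rejecting values outside the standard."""
    if ethertype not in ALLOWED_ETHERTYPES:
        raise ValueError(f"ethertype {ethertype!r} is not in the approved list")
    if not customer.strip():
        raise ValueError("customer name is required")
    return PORT_TEMPLATE.substitute(
        port=port, customer=customer, ethertype=ethertype
    )


if __name__ == "__main__":
    print(render_port_config("1/1/3", "ExampleCo", "0x8100"))
```

Anything that does not fit the template raises an error instead of producing configuration, which is where an exception-approval step like the one described above would come in.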



Re: Mitigating human error in the SP

2010-02-02 Thread Chadwick Sorrell
Thanks for all the comments!

On Tue, Feb 2, 2010 at 1:01 PM, JC Dill jcdill.li...@gmail.com wrote:
 Chadwick Sorrell wrote:

 This outage, of a high profile customer, triggered upper management to
 react by calling a meeting just days after.  Put bluntly, we've been
 told "Human errors are unacceptable, and they will be completely
 eliminated.  One is too many."

 Good, Fast, Cheap - pick any two.  No, you can't have all three.

 Here, Good is defined by your pointy-haired bosses as an
 impossible-to-achieve zero error rate.[1]  Attempting to achieve this is
 either going to cost $$$, or your operations speed (how long it takes people
 to do things) is going to drop like a rock.  Your first action should be to
 make sure upper management understands this so they can set the appropriate
 priorities on Good, Fast, and Cheap, and make the appropriate budget
 changes.

 It's going to cost $$$ to hire enough people to have the staff necessary to
 double-check things in a timely manner, OR things are going to slow way down
 as the existing staff is burdened by necessary double-checking of everything
 and triple-checking of some things required to try to achieve a zero error
 rate.  They will also need to spend $$$ on software (to automate as much as
 possible) and testing equipment.  They will also never actually achieve a
 zero error rate as this is an impossible task that no organization has ever
 achieved, no matter how much emphasis or money they pour into it (e.g.
 Windows vulnerabilities) or how important (see Challenger, Columbia, and the
 Mars Climate Orbiter incidents).

 When you put a $$$ cost on trying to achieve a zero error rate,
 pointy-haired bosses are usually willing to accept a normal error rate.  Of
 course, they want you to try to avoid errors, and there are a lot of simple
 steps you can take in that effort (basic checklists, automation, testing)
 which have been mentioned elsewhere in this thread that will cost some money
 but not the $$$ that is required to try to achieve a zero error rate.  Make
 sure they understand that the budget they allocate for these changes will be
 strongly correlated to how Good (zero error rate) and Fast (quick
 operational responses to turn-ups and problems) the outcome of this
 initiative is.

 jc

 [1]  http://www.godlessgeeks.com/LINKS/DilbertQuotes.htm

 2. What I need is a list of specific unknown problems we will encounter.
 (Lykes Lines Shipping)

 6. Doing it right is no excuse for not meeting the schedule. (R&D
 Supervisor, Minnesota Mining & Manufacturing/3M Corp.)







Mitigating human error in the SP

2010-02-01 Thread Chadwick Sorrell
Hello NANOG,

Long time listener, first time caller.

A recent organizational change at my company has put someone in charge
who is determined to make things perfect.  We are a service provider,
not an enterprise company, and our business is doing provisioning work
during the day.  We recently experienced an outage when an engineer,
troubleshooting a failed turn-up, changed the ethertype on the wrong
port losing both management and customer data on said device.  This
isn't a common occurrence, and the engineer in question has a pristine
track record.

This outage, of a high profile customer, triggered upper management to
react by calling a meeting just days after.  Put bluntly, we've been
told "Human errors are unacceptable, and they will be completely
eliminated.  One is too many."

I am asking the respectable NANOG engineers:

What measures have you taken to mitigate human mistakes?

Have they been successful?

Any other comments on the subject would be appreciated; we would like
to come to our next meeting armed and dangerous.

Thanks!
Chad